Spark 2.1: Table Schemas Become More Static

In Spark SQL, the origins of table schemas can be classified into two groups.

Group A. Users specify the schema. There are two cases. Case 1 CREATE TABLE AS SELECT: the schema is determined by the result schema of the SELECT clause. Case 2 CREATE TABLE: users explicitly specify the schema. These two cases are applicable to both Hive tables and Data Source tables. For example,

Group B. The schemas are inferred at runtime There are also two cases. Case 1 CREATE DATA SOURCE TABLE without SCHEMA: Users do not specify the schema but the path to the file location. The created table is EXTERNAL. Case 2 CREATE HIVE TABLE without SCHEMA: Some Hive serde libraries can infer the schema and Hive metastore automatically generates and records the schema. For example,

Prior to Spark 2.1, Spark SQL does not store the inferred schema in the external catalog for the Case 1 in Group B. When users refreshing the metadata cache, accessing the table at the first time after (re-)starting Spark, Spark SQL will infer the schema and store the info in the metadata cache for improving the performance of subsequent metadata requests. However, the runtime schema inference could cause undesirable schema changes after each reboot of Spark.

Since Spark 2.1, we infer the schema and store it in the external catalog when creating the table. This becomes more consistent with Hive. The schema will not be changed, even if the metadata cache is refreshed. If the schema is changed, they need to recreate a new table using the file location. Now, the concept of table schema is very similar to the one in traditional RDBMS. 🙂

Leave a Reply

Your email address will not be published. Required fields are marked *