In Spark SQL, the origins of table schemas can be classified into two groups.
Group A. Users specify the schema. There are two cases. Case 1, CREATE TABLE AS SELECT: the schema is determined by the result schema of the SELECT clause. Case 2, CREATE TABLE: users explicitly specify the schema. Both cases apply to Hive tables and Data Source tables. For example,
-- CREATE HIVE TABLE AS SELECT
CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * FROM input;
-- CREATE DATA SOURCE TABLE
CREATE TABLE jsonTable (_1 string, _2 string)
USING json;
Group B. The schemas are inferred at runtime. There are also two cases. Case 1, CREATE DATA SOURCE TABLE without SCHEMA: users do not specify the schema but instead provide the path to the file location; the created table is EXTERNAL. Case 2, CREATE HIVE TABLE without SCHEMA: some Hive serde libraries can infer the schema, and the Hive metastore automatically generates and records it. For example,
-- CREATE DATA SOURCE TABLE with location
CREATE TABLE jsonTable
USING json
OPTIONS (path '/tmp/inputFiles');
-- CREATE HIVE TABLE with serde
CREATE TABLE avroTable
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
Prior to Spark 2.1, Spark SQL did not store the inferred schema in the external catalog for Case 1 in Group B. When users refreshed the metadata cache, or accessed the table for the first time after (re)starting Spark, Spark SQL would infer the schema and store it in the metadata cache to speed up subsequent metadata requests. However, this runtime schema inference could cause undesirable schema changes after each restart of Spark.
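To illustrate, a refresh on such a table discards its cached metadata, so the next access triggers schema inference again (a sketch using the jsonTable example above; pre-2.1 behavior):

-- REFRESH TABLE invalidates the cached metadata for the table;
-- before Spark 2.1, the next query on jsonTable re-infers the
-- schema from the files at the table's location
REFRESH TABLE jsonTable;
SELECT * FROM jsonTable;

If the files at the location changed between refreshes, the re-inferred schema could silently differ from the one seen before.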
Since Spark 2.1, we infer the schema and store it in the external catalog when creating the table. This is more consistent with Hive. The schema will not change, even if the metadata cache is refreshed. If users want the table to pick up a changed schema, they need to recreate the table using the file location. Now, the concept of a table schema is very similar to the one in a traditional RDBMS. 🙂
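Since the schema is now persisted in the external catalog, picking up a new schema means recreating the table over the same location; a sketch, reusing the jsonTable example from above:

-- DESCRIBE shows the schema persisted in the external catalog
DESCRIBE jsonTable;
-- to pick up a changed schema, drop the table and recreate it
-- over the same file location, triggering inference once more
DROP TABLE jsonTable;
CREATE TABLE jsonTable
USING json
OPTIONS (path '/tmp/inputFiles');

Dropping the table only removes the catalog entry here, since the table is EXTERNAL; the underlying files at the path are untouched.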