Spark 2.1: Table Schemas Become More Static

In Spark SQL, the origins of table schemas can be classified into two groups.

Group A. Users specify the schema. There are two cases. Case 1 CREATE TABLE AS SELECT: the schema is determined by the result schema of the SELECT clause. Case 2 CREATE TABLE: users explicitly specify the schema. These two cases are applicable to both Hive tables and Data Source tables. For example,

-- CREATE HIVE TABLE AS SELECT
CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * from input;
-- CREATE DATA SOURCE TABLE
CREATE TABLE jsonTable (_1 string, _2 string)
USING org.apache.spark.sql.json;

-- CREATE HIVE TABLE AS SELECT

CREATE TABLE tab STORED AS TEXTFILE

AS SELECT * from input;

-- CREATE DATA SOURCE TABLE

CREATE TABLE jsonTable (_1 string, _2 string)

USING org.apache.spark.sql.json;

Group B. The schemas are inferred at runtime There are also two cases. Case 1 CREATE DATA SOURCE TABLE without SCHEMA: Users do not specify the schema but the path to the file location. The created table is EXTERNAL. Case 2 CREATE HIVE TABLE without SCHEMA: Some Hive serde libraries can infer the schema and Hive metastore automatically generates and records the schema. For example,

-- CREATE DATA SOURCE TABLE with location
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (path '/tmp/inputFiles')
-- CREATE HIVE TABLE with serde
CREATE TABLE avroTable 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');

-- CREATE DATA SOURCE TABLE with location

CREATE TABLE jsonTable

USING org.apache.spark.sql.json

OPTIONS (path '/tmp/inputFiles')

-- CREATE HIVE TABLE with serde

CREATE TABLE avroTable

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'

STORED AS AVRO

TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');

Prior to Spark 2.1, Spark SQL does not store the inferred schema in the external catalog for the Case 1 in Group B. When users refreshing the metadata cache, accessing the table at the first time after (re-)starting Spark, Spark SQL will infer the schema and store the info in the metadata cache for improving the performance of subsequent metadata requests. However, the runtime schema inference could cause undesirable schema changes after each reboot of Spark.

Since Spark 2.1, we infer the schema and store it in the external catalog when creating the table. This becomes more consistent with Hive. The schema will not be changed, even if the metadata cache is refreshed. If the schema is changed, they need to recreate a new table using the file location. Now, the concept of table schema is very similar to the one in traditional RDBMS. 🙂

Developer Diaries of gatorsmile | Apache Spark

By a smiling gator – Xiao Li

Spark 2.1: Table Schemas Become More Static

Leave a Reply Cancel reply