There are three types of non-temporary cataloged tables in Spark: EXTERNAL, MANAGED, and VIEW. VIEW is used for persistent views; EXTERNAL and MANAGED are used for tables. Like Hive, when dropping an EXTERNAL table, Spark only drops the metadata and keeps the data files intact; when dropping a MANAGED table, Spark removes both the metadata and the data files. You can find out the table type through the SparkSession API spark.catalog.getTable (added in Spark 2.1) or through the DDL commands DESC EXTENDED / DESC FORMATTED.
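For example, here is a minimal sketch of both approaches, assuming a table named t1 already exists in the current database (t1 is created in the first snippet below):

// Catalog API (Spark 2.1+): tableType is "MANAGED", "EXTERNAL", or "VIEW"
println(spark.catalog.getTable("t1").tableType)

// DDL commands: the detailed table information in the output includes the table type
spark.sql("DESC EXTENDED t1").show(100, false)
spark.sql("DESC FORMATTED t1").show(100, false)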
Native DDL support, introduced in Spark 2.0, brought a few behavior changes. In Hive, users are allowed to specify a LOCATION for storing/locating the table data, whether the table is EXTERNAL or MANAGED. In Spark, however, LOCATION is mandatory for EXTERNAL tables: when users create a table with a specified LOCATION, the table type will be EXTERNAL even if they do not specify the EXTERNAL keyword, and creating a MANAGED table with a user-supplied LOCATION is not allowed. The data files of a MANAGED table are stored in a newly created directory under the location defined by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse under the current directory in which the Spark application is started. The examples below make this clear:
// The created table is MANAGED.
// The location of t1's data files is {current_working_directory}/spark-warehouse/t1
spark.sql("CREATE TABLE t1 (i INT)")

// The created tables are EXTERNAL
spark.sql("CREATE TABLE t2 (i INT) LOCATION '/tmp/tables/t2'")
spark.sql("CREATE EXTERNAL TABLE t3 (i INT) LOCATION '/tmp/tables/t3'")

// Fails: an EXTERNAL table requires a LOCATION
spark.sql("CREATE EXTERNAL TABLE t4 (i INT)")
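If you do not want the default spark-warehouse location, the warehouse directory can be configured when the SparkSession is built. A minimal sketch, where /tmp/my-warehouse is just a made-up example path:

import org.apache.spark.sql.SparkSession

// /tmp/my-warehouse is a placeholder; point it at any writable directory.
// MANAGED tables created through this session get their data files under
// /tmp/my-warehouse/<table_name>.
val spark = SparkSession.builder()
  .appName("table-type-examples")
  .config("spark.sql.warehouse.dir", "/tmp/my-warehouse")
  .enableHiveSupport()   // needed for the Hive serde tables shown above
  .getOrCreate()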
The examples above are for creating Hive serde tables. When creating data source tables, users are not allowed to specify the EXTERNAL keyword at all, but the table type is still determined by whether a table location is provided. However, in Spark 2.1 the SQL syntax for creating data source tables does not have a LOCATION clause; instead, users need to provide a path key in the OPTIONS clause. (Tip: Spark 2.2 will unify the CREATE TABLE statements for Hive serde tables and data source tables. We will explain it in a separate post.)
// The created table is MANAGED.
// The location of t5's data files is {current_working_directory}/spark-warehouse/t5
spark.sql("CREATE TABLE t5 (i INT) USING PARQUET")

// The created table is EXTERNAL
spark.sql("CREATE TABLE t6 (i INT) USING PARQUET OPTIONS('path'='/tmp/tables/t6')")

// Failed to create tables: the EXTERNAL keyword is not allowed here
spark.sql("CREATE EXTERNAL TABLE t7 (i INT)")
spark.sql("CREATE EXTERNAL TABLE t8 (i INT) USING PARQUET OPTIONS('path'='/tmp/tables/t8')")
Spark 2.1 and prior 2.x versions do not allow users to create a Hive serde table using the DataFrameWriter APIs. (Tip: this restriction will be lifted in Spark 2.2, which will introduce a new source format hive.) With these APIs, users can still create either EXTERNAL or MANAGED tables, as shown below.
import spark.implicits._   // for toDF()

val df = Seq(1, 2).toDF()

// The created table is MANAGED.
df.write.saveAsTable("t10")

// The created table is EXTERNAL
df.write.option("path", "/tmp/tables/t9").saveAsTable("t9")
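To see the dropping behavior described at the beginning of the post, you can drop one table of each type; a rough sketch, assuming the t9 and t10 tables from the snippet above were created successfully:

// Dropping the MANAGED table removes both the metadata and the data files
// under {current_working_directory}/spark-warehouse/t10.
spark.sql("DROP TABLE t10")

// Dropping the EXTERNAL table removes only the metadata; the Parquet files
// under /tmp/tables/t9 stay intact and can still be read directly.
spark.sql("DROP TABLE t9")
spark.read.parquet("/tmp/tables/t9").show()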
It all looks pretty confusing. Fortunately, Spark 2.2 will resolve these inconsistencies. :)