Table Types in Spark: External or Managed?

There exist three types of non-temporary cataloged tables in Spark: EXTERNAL, MANAGED, and VIEW. VIEW is used for persistent views; EXTERNAL and MANAGED are used for tables. Like Hive, when dropping an EXTERNAL table, Spark only drops the metadata but keeps the data files intact. When dropping a MANAGED table, Spark removes both metadata and data files. You can find out the table type by the SparkSession API spark.catalog.getTable (added in Spark 2.1) or the DDL command DESC EXTENDED / DESC FORMATTED

Since starting native DDL supports (in Spark 2.0), we introduced the behavior changes. In Hive, users are allowed to specify the LOCATION for storing/locating the table data, which can be either EXTERNAL or MANAGED. However, in Spark, LOCATION is mandatory for EXTERNAL tables. When users creating a table with the specified LOCATION, the table type will be EXTERNAL even if users do not specify the EXTERNAL keyword. We do not allow users to create a MANAGED table with the users supplied LOCATION. The data files are stored in a newly created directory under the location defined by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. The examples can make it clear:

Above is the examples for creating Hive serde tables. When creating data source tables, we do not allow users to specify the EXTERNAL keyword at all. The table type is still determined by whether users provide the table location. However, in Spark 2.1, the LOCATION clause is not provided in the SQL syntax of creating data source tables. Users need to do it by providing the path key map in the OPTIONS clause. (TIPs: Spark 2.2 will unify the CREATE TABLE statements for both Hive serde tables and data source tables. We will explain it in a separate post)

Spark 2.1 and prior 2.x versions do not allow users to create a Hive serde table using DataFrameWriter APIs. (TIPs: this restriction will be lifted in Spark 2.2. We will introduce a new source format hive). Users can create either EXTERNAL or MANAGED tables, as shown below.

It looks pretty confusing. Fortunately, Spark 2.2 will resolves all of them. : )

Leave a Reply

Your email address will not be published. Required fields are marked *