Spark 2.1 introduces catalog-based partition handling, as explained in a Databricks blog post. This addresses a well-known performance issue for data source tables with tens of thousands of partitions: during the initial schema discovery, recursively scanning the file system could take tens of minutes, especially on cloud object storage services such as Amazon S3 and OpenStack Swift. The improvement greatly reduces the schema-discovery time for partitioned data source tables when a query needs only a small fraction of the table partitions.
In Spark 2.0 and earlier, unlike Hive serde tables, data source tables did not persistently store partition-level metadata in the catalog; the metadata were discovered and loaded into the metadata cache on first access. Since 2.1, partition-level metadata are persisted for both Hive serde tables and data source tables. As a result, partition-specific DDL statements, such as ALTER TABLE ... RENAME PARTITION, are now supported for data source tables as well.
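To illustrate, partitions of a data source table can now be manipulated directly through DDL. A minimal sketch (the table name `logs`, partition column `ds`, and paths are hypothetical):

```scala
// Assuming a partitioned data source table `logs` with partition column `ds`.

// Rename an existing partition in place:
spark.sql("ALTER TABLE logs PARTITION (ds='2017-01-01') RENAME TO PARTITION (ds='2017-01-02')")

// Register a new partition whose data lives at an explicit location:
spark.sql("ALTER TABLE logs ADD PARTITION (ds='2017-01-03') LOCATION '/data/logs/ds=2017-01-03'")

// Remove a partition from the catalog:
spark.sql("ALTER TABLE logs DROP PARTITION (ds='2017-01-03')")
```

Because the partition metadata now live in the catalog, these statements take effect without rescanning the whole table directory.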
However, even after the Spark 2.1 release, differences still exist between Hive serde tables and data source tables. : ) We still maintain a metadata cache for data source tables to store the file status info of the leaf files. This information is not recorded in the catalog. Having such a cache is critical for the performance of repeated queries.
For data source tables, state now lives in three places: the data files, the local in-memory metadata cache, and the global metadata in the catalog. Inconsistencies can arise when the data files or the metadata catalog are shared and modified externally. To resolve the inconsistency,
REFRESH TABLE refreshes the local metadata cache, and
MSCK REPAIR TABLE or
ALTER TABLE ... RECOVER PARTITIONS refreshes both the local in-memory metadata cache and the global metadata in the catalog.
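As a sketch of how these repair commands are issued (the table name `tab` is hypothetical):

```scala
// Refresh only the local in-memory metadata cache, e.g. after the
// underlying files were rewritten by another application:
spark.sql("REFRESH TABLE tab")

// Re-discover partitions from the file system and update the catalog too;
// the two statements below are equivalent:
spark.sql("MSCK REPAIR TABLE tab")
spark.sql("ALTER TABLE tab RECOVER PARTITIONS")
```

REFRESH TABLE is the cheap option when only the cached file listing is stale; the recovery commands are needed when partitions were added or removed outside of Spark.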
Note that this feature changes external behavior. Below is a typical case of creating an external partitioned data source table in Spark 2.1.
// Create an external data source table. The location '/temp/test' already contains data.
// (USING specifies the data source format; parquet is used here as an example.)
spark.sql("""
  CREATE TABLE tab1 (fieldOne LONG, partCol INT)
  USING parquet
  OPTIONS (path '/temp/test')
  PARTITIONED BY (partCol)
""")
// The data in the path '/temp/test' are visible in Spark 2.0 but invisible in Spark 2.1,
// because the partitions are not yet registered in the catalog.
// If the external table is not partitioned, this issue does not arise.

// Discover the partitions and refresh the catalog and the cache.
spark.sql("ALTER TABLE tab1 RECOVER PARTITIONS")
// After the recovery, the data are visible!
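One way to confirm the recovery worked, assuming the table created above:

```scala
// List the partitions now registered in the catalog:
spark.sql("SHOW PARTITIONS tab1").show()

// The previously invisible rows can now be queried:
spark.sql("SELECT * FROM tab1").show()
```

If SHOW PARTITIONS returns an empty result, the partitions were not recovered and the table will still appear empty to queries.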