
Introduction to Catalog-based Partition Handling for Data Source Tables

Spark 2.1 introduces catalog-based partition handling, as explained in a Databricks blog post. It addresses a well-known performance issue for data source tables with tens of thousands of partitions: during the initial schema discovery, recursively scanning the file system could take tens of minutes, especially on cloud object storage services like Amazon S3 and OpenStack Swift. This improvement greatly reduces the time spent on schema discovery of partitioned data source tables when a query only needs a small fraction of the table's partitions.

In Spark 2.0 and earlier, unlike Hive serde tables, data source tables did not persistently store partition-level metadata in the catalog. The metadata was discovered and loaded into the metadata cache the first time the table was accessed. Since 2.1, we persist the partition-level metadata for both Hive serde tables and data source tables. As a result, partition-specific DDL statements are now supported for data source tables, including SHOW PARTITIONS, ADD PARTITION, DROP PARTITION, RECOVER PARTITIONS, RENAME PARTITION and SET LOCATION.
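
To make these statements concrete, here is a minimal sketch issued through the SQL interface; the table name sales, the partition column dt, and the paths are hypothetical and only illustrate the shape of the commands.

// Minimal sketch, assuming a partitioned data source table `sales`
// with a string partition column `dt`; all names and paths are hypothetical.
spark.sql("SHOW PARTITIONS sales").show()
spark.sql("ALTER TABLE sales ADD PARTITION (dt='2017-01-01') LOCATION '/data/sales/dt=2017-01-01'")
spark.sql("ALTER TABLE sales PARTITION (dt='2017-01-01') RENAME TO PARTITION (dt='2017-01-02')")
spark.sql("ALTER TABLE sales PARTITION (dt='2017-01-02') SET LOCATION '/archive/sales/dt=2017-01-02'")
spark.sql("ALTER TABLE sales DROP PARTITION (dt='2017-01-02')")
spark.sql("ALTER TABLE sales RECOVER PARTITIONS")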

However, even after the Spark 2.1 release, differences still exist between Hive serde tables and data source tables. : ) We still maintain a metadata cache for data source tables to store the file status information of the leaf files; this information is not recorded in the catalog. Having such a cache is critical for repetitive queries.

For data source tables, we therefore have three layers: the data files, the local in-memory metadata cache, and the global metadata in the catalog. Inconsistencies can arise when the data files or the metadata catalog are shared and modified by other applications. To resolve them, REFRESH TABLE refreshes the local metadata cache, while MSCK REPAIR TABLE or ALTER TABLE RECOVER PARTITIONS refreshes both the local in-memory metadata cache and the global metadata in the catalog.
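
As a rough sketch of how these commands are used (the table name events is hypothetical), assume another application has just modified the table's files or the shared metastore:

// Another application added or deleted data files under the table's
// directories: refresh only the local in-memory metadata/file-status cache.
spark.sql("REFRESH TABLE events")

// New partition directories were written directly to the file system, or the
// shared metastore changed: sync both the local cache and the catalog.
spark.sql("MSCK REPAIR TABLE events")
// Equivalent alternative syntax:
spark.sql("ALTER TABLE events RECOVER PARTITIONS")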

Note that this feature changes the external behavior. Below is a typical case when you create an external partitioned data source table in Spark 2.1.
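
The sketch below illustrates that case; the table name logs, the schema, and the path are made up, and it assumes partitioned Parquet data already exists under the given location. A partitioned data source table created over existing data appears empty until its partitions are recovered into the catalog.

// Minimal sketch, assuming partitioned Parquet data already exists under
// /data/logs/dt=.../; the table name, schema, and path are hypothetical.
spark.sql("""
  CREATE TABLE logs (id BIGINT, msg STRING, dt STRING)
  USING parquet
  OPTIONS (path '/data/logs')
  PARTITIONED BY (dt)
""")

// In Spark 2.1 the partitions of this new table are not discovered
// automatically, so the table initially looks empty.
spark.sql("SELECT COUNT(*) FROM logs").show()   // expected: 0

// Recover the partitions into the catalog; afterwards queries see the data.
spark.sql("MSCK REPAIR TABLE logs")
spark.sql("SELECT COUNT(*) FROM logs").show()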

What is Spark SQL?

The name Spark SQL is confusing. Many articles view it as a pure SQL query engine and compare it with other SQL engines. However, this is inaccurate: SQL is just one of the interfaces of Spark SQL, which also offers the SparkSession and DataFrame/Dataset APIs. Like the other Spark modules, Spark SQL supports four languages: Scala, Python, Java and R. Currently, the Python and R APIs are just wrappers around the internal code written in Scala, so native Spark users still tend to prefer the Scala interface.
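
To make the distinction concrete, here is a small, self-contained Scala sketch (the data and names are made up) that expresses the same query once through the SQL interface and once through the DataFrame API.

import org.apache.spark.sql.SparkSession

object InterfacesSketch {
  def main(args: Array[String]): Unit = {
    // Local session just for this illustration.
    val spark = SparkSession.builder()
      .appName("spark-sql-interfaces-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A tiny in-memory DataFrame, registered as a temporary view.
    val people = Seq(("Alice", 29), ("Bob", 35)).toDF("name", "age")
    people.createOrReplaceTempView("people")

    // The SQL interface ...
    val viaSql = spark.sql("SELECT name FROM people WHERE age > 30")

    // ... and the DataFrame API express the same query; both are planned and
    // executed by the same engine underneath.
    val viaApi = people.filter($"age" > 30).select("name")

    viaSql.show()
    viaApi.show()

    spark.stop()
  }
}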

Especially since the Spark 2.0 release, Spark SQL has become part of the new Spark core. Compared with the RDD APIs, the Spark SQL APIs are user friendly and high level. Internally, Spark SQL contains the Catalyst optimizer, the Tungsten execution engine, and data source integration support. It now powers the next-generation streaming engine (i.e., Structured Streaming) and advanced analytics components (the machine learning library MLlib, the graph-parallel computation library GraphFrames, and the TensorFlow binding library TensorFrames). Therefore, our improvements to Spark SQL can benefit all of these user-facing components. That's why Spark SQL is the most active component in Spark.