Spark 2.1: Lessons Learned when Upgrading to Parquet 1.8.1

Parquet is the default file format of Apache Spark, and most Spark performance measurements use Parquet as the format of the test data files. Thus, Parquet is pretty important to Spark. However, Parquet filter pushdown for string and binary columns has been disabled since Spark 1.5.2, due to a bug in Parquet 1.6.0 (PARQUET-251: Binary column statistics errors).
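To see whether a predicate is actually pushed down, you can inspect the physical plan: when pushdown applies, the predicate shows up under "PushedFilters" in the explain() output. Below is a minimal sketch in Scala, assuming a local session and an illustrative /tmp path; on the affected Spark versions, string/binary predicates are quietly left out of the pushed filters even though the pushdown flag is on.

```scala
import org.apache.spark.sql.SparkSession

object ParquetPushdownDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-pushdown-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Pushdown is governed by this flag; it is on by default.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")

    // Write a tiny Parquet file with a string column.
    Seq(("alice", 1), ("bob", 2)).toDF("name", "id")
      .write.mode("overwrite").parquet("/tmp/pushdown-demo")

    // When pushdown applies, the predicate appears under "PushedFilters"
    // in the physical plan. On Spark 1.5.2 through 2.1, string/binary
    // predicates are excluded because of the Parquet bugs described here.
    spark.read.parquet("/tmp/pushdown-demo")
      .filter($"name" === "alice")
      .explain()

    spark.stop()
  }
}
```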

In Spark 2.0, we tried to upgrade Parquet to 1.8.1 and thought we could re-enable the filter pushdown. However, we hit another Parquet bug, PARQUET-686, which caused incorrect results (for details, see SPARK-17213). Although Spark 2.1 eventually upgraded to Parquet 1.8.1, we are still unable to push down string/binary filters. Actually, this is not the only bug in Parquet 1.8.1. Fortunately, after many cross-community discussions, the Parquet community decided to alleviate our pain and provide a 1.8.2 release. Recently, the vote for the Parquet 1.8.2 release started. We really appreciate their help. 🙂
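Until a fixed Parquet release is picked up, the coarse user-level workaround is simply to turn pushdown off, so no row group is ever skipped based on untrustworthy min/max statistics and filters are re-evaluated on the Spark side. A minimal sketch, reusing the session and implicits from the snippet above; the path and column name are illustrative. (Spark itself already guards internally by not generating string/binary pushdown filters; the flag below disables pushdown for all column types.)

```scala
// Coarse workaround: disable Parquet filter pushdown entirely, so no
// row group is skipped based on possibly corrupt min/max statistics.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")

// The predicate is now evaluated row by row inside Spark instead of
// being translated into a Parquet-level filter.
// "/path/to/suspect-data" and "category" are hypothetical names.
spark.read.parquet("/path/to/suspect-data")
  .filter($"category" > "m")
  .show()
```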

Depending on external libraries is painful; we hit even more bugs through the Hive dependency. Thus, Spark 2.0 introduced native DDL/DML processing and native view support, though we still rely on the Hive metastore for persistent metadata storage. Of course, we are still experiencing various bugs caused by the Hive metastore APIs, and not all of them can be resolved. In future releases, the Spark community is trying to completely decouple Hive support from the catalog. More changes are coming.
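For illustration, here is a minimal sketch of how those pieces fit together as of Spark 2.x: the DDL and the view run through Spark's native code paths, while enableHiveSupport() keeps the Hive metastore as the persistent metadata store. The table and view names are hypothetical, and the snippet assumes a Spark build with Hive support on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Hive support here only means "use the Hive metastore for persistent
// metadata"; the DDL below runs through Spark's native code path.
val spark = SparkSession.builder()
  .appName("native-ddl-demo")
  .enableHiveSupport() // requires a Spark build with Hive support
  .getOrCreate()

// Handled natively since Spark 2.0, not forwarded to Hive.
spark.sql("CREATE TABLE IF NOT EXISTS demo_tbl (id INT, name STRING) USING parquet")
spark.sql("CREATE VIEW demo_view AS SELECT id FROM demo_tbl")

// The catalog API reads the same metadata back from the metastore.
spark.catalog.listTables().show()
```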