In Spark 2.1 and prior, Spark writes out a single file per task, so the number of saved files equals the number of partitions of the RDD being saved. When a dataset has only a few large partitions, this can produce excessively large files, which is a nuisance for end users. To avoid generating huge files, the RDD has to be repartitioned first to control the number of output files.
```scala
val df = spark.read.parquet(inputDirectory)
df.repartition(NUM_PARTITIONS).write.mode("overwrite").parquet(outputDirectory)
```
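To make the cost of this workaround concrete, here is a rough sketch of how `NUM_PARTITIONS` might be derived by hand (the `targetRecordsPerFile` value and the extra `count()` are illustrative assumptions, not part of the original example): the caller has to know the data size up front and run an extra job just to pick a partition count.

```scala
// Sketch only: cap each output file at roughly targetRecordsPerFile records
// by counting the data and deriving the partition count from the total.
val targetRecordsPerFile = 10000L
val totalRecords = df.count()  // extra job just to size the output
val numPartitions =
  math.max(1L, math.ceil(totalRecords.toDouble / targetRecordsPerFile).toLong).toInt

df.repartition(numPartitions)  // full shuffle across the network
  .write.mode("overwrite")
  .parquet(outputDirectory)
```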
This extra step is annoying, and repartition is expensive because it shuffles data across the network. Limiting the maximum number of records written per file is highly desirable: it avoids generating huge files without a manual repartition. In the next release (Spark 2.2), Spark provides two ways for users to set this limit.
```scala
val df = spark.read.parquet(inputDirectory)

// Method 1: specify the limit as an option on the DataFrameWriter API.
df.write.option("maxRecordsPerFile", 10000)
  .mode("overwrite").parquet(outputDirectory)

// Method 2: specify the limit via the session-scoped SQLConf configuration.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000)
df.write.mode("overwrite").parquet(outputDirectory)
```
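As a quick sanity check (a sketch only, assuming the `spark` session and the `outputDirectory` from the examples above), the output can be read back and the part files listed: with a limit of 10000, the directory should contain at least `ceil(totalRecords / 10000)` parquet files.

```scala
import org.apache.hadoop.fs.Path

// Sketch only: compare the total record count with the number of parquet
// part files actually written under outputDirectory.
val totalRecords = spark.read.parquet(outputDirectory).count()
val fs = new Path(outputDirectory)
  .getFileSystem(spark.sparkContext.hadoopConfiguration)
val partFiles = fs.listStatus(new Path(outputDirectory))
  .map(_.getPath.getName)
  .count(_.endsWith(".parquet"))

println(s"$totalRecords records across $partFiles files " +
  s"(expect at least ${math.ceil(totalRecords.toDouble / 10000).toInt})")
```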