Anticipated Feature in Spark 2.2: Max Records Written Per File

In Spark 2.1 and earlier, each task writes out a single file, so the number of saved files equals the number of partitions of the RDD being saved. A partition holding a large number of records therefore produces one huge output file, which is a real annoyance for end users. To avoid generating huge files, the RDD has to be repartitioned first to control the number (and thus the size) of the output files.
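The pre-2.2 workaround looks roughly like this; a minimal Scala sketch, assuming an existing DataFrame `df` (the partition count 200 and the output path are arbitrary placeholders):

```scala
// Workaround in Spark 2.1 and earlier: shuffle into more partitions
// before writing, so each task (and thus each file) holds fewer records.
df.repartition(200)      // 200 partitions => 200 output files
  .write
  .parquet("/tmp/output")
```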

This extra step is annoying, and repartition is expensive because it shuffles the data across the network. Limiting the maximum number of records written per file is highly desirable: it avoids generating huge files without paying for an extra shuffle. In the next release, Spark provides two ways for users to set this limit.
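The two expected knobs are a session-wide SQL config, `spark.sql.files.maxRecordsPerFile`, and a per-write `maxRecordsPerFile` option on `DataFrameWriter`; a hedged Scala sketch, assuming an existing SparkSession `spark` and DataFrame `df` (paths and the 10000 limit are placeholders):

```scala
// Method 1: session-wide SQL config; applies to all subsequent writes.
spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000)
df.write.parquet("/tmp/output1")

// Method 2: per-write option on DataFrameWriter; overrides only this write.
df.write
  .option("maxRecordsPerFile", 10000)
  .parquet("/tmp/output2")
```

With either setting, a task whose partition exceeds the limit rolls over to a new file once it has written the configured number of records, so a single partition can now yield multiple files.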
