Spark 2.1: Built-in Data Source Option Names Are Case Insensitive

Below is a (non-exhaustive) list of the options supported by the built-in data sources in Spark 2.1.

  • JDBC’s options:
    user, password, url, dbtable, driver, partitionColumn, lowerBound, upperBound, numPartitions, fetchsize, truncate, createTableOptions, batchsize, and isolationLevel.
  • CSV’s options:
    path, sep, delimiter, mode, encoding, charset, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, compression, codec, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, escapeQuotes, and quoteAll.
  • JSON’s options:
    path, samplingRatio, primitivesAsString, prefersDecimal, allowComments, allowUnquotedFieldNames, allowSingleQuotes, allowNumericLeadingZeros, allowNonNumericNumbers, allowBackslashEscapingAnyCharacter, compression, mode, columnNameOfCorruptRecord, dateFormat, and timestampFormat.
  • Parquet’s options:
    path, compression, and mergeSchema.
  • ORC’s options:
    path, compression, and orc.compress.
  • FileStream’s options:
    path, maxFilesPerTrigger, maxFileAge, and latestFirst.
  • Text’s options:
    path and compression.
  • LibSVM’s options:
    path, vectorType, and numFeatures.

Prior to Spark 2.1, all of these option names were case sensitive. Note that some of them do not follow the lower-camel-case naming convention. If your input did not exactly match the expected name, the option was very likely ignored silently, without any warning or error message. Obviously, that behavior is not user friendly. Since Spark 2.1, the option names above are case insensitive, except for the Text and LibSVM formats. These two will be fixed in Spark 2.2 🙂
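The idea behind the fix can be sketched with a small case-insensitive lookup. The class below is a hypothetical illustration of the mechanism, not Spark's actual internal code (internally, Spark wraps the options in a Scala case-insensitive map):

```python
# A minimal sketch of case-insensitive option matching. This is an
# illustration of the idea only, not Spark's real implementation.

class CaseInsensitiveOptions:
    """Stores data source options and looks them up ignoring case."""

    def __init__(self, options):
        # Index every option by its lower-cased key, so any casing of
        # the same name resolves to the same value.
        self._options = {key.lower(): value for key, value in options.items()}

    def get(self, name, default=None):
        return self._options.get(name.lower(), default)


# With exact-match (pre-2.1) semantics, writing "fetchSize" instead of the
# expected "fetchsize" would be silently ignored. Case-insensitive lookup
# makes either spelling work.
opts = CaseInsensitiveOptions({"fetchSize": "100", "numPartitions": "4"})
print(opts.get("fetchsize"))       # → "100"
print(opts.get("NUMPARTITIONS"))   # → "4"
```

The same principle applies whether the options arrive via `option(key, value)` calls or an options map: the reader normalizes the key before matching it against the supported names.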
