Problem Description:

Structured file ingestion fails with a java.lang.IllegalArgumentException: Delimiter cannot be more than one character error. A sample stack trace looks like the one below:

20/08/10 10:34:02 ERROR FileFormatWriter: Aborting job 4c9f9c40-fe2d-41c1-8c57-d90064af1218.
java.lang.IllegalArgumentException: Delimiter cannot be more than one character: @|#
    at org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:118)
    at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:88)
    at org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:41)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:105)
    at org.apache.spark.sql.execution.datasources.FileFormat$class.buildReaderWithPartitionValues(FileFormat.scala:131)
    at org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162)
    at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:456)
    at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:450)
    at org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:477)
    at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:46)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:631)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:146)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:134)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:187)


Root cause:

Spark 2 does not support multi-character delimiters during a CSV read. The Databricks runtime version (5.5) we use by default to submit the job runs Spark 2.x, so files with a multi-character delimiter fail with the error shown above.
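
For reference, a minimal read that reproduces the failure on Spark 2.x is sketched below; the file path and header option are hypothetical stand-ins for the actual source file.

    // A Spark 2.x read of a file whose fields are separated by the
    // multi-character delimiter "@|#". The path is a hypothetical example.
    val df = spark.read
      .option("sep", "@|#")       // multi-character delimiter
      .option("header", "true")
      .csv("/mnt/data/sample_file.txt")

    // On Spark 2.x this fails while building the CSV reader with:
    // java.lang.IllegalArgumentException: Delimiter cannot be more than one character: @|#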


Solution:

Spark 3 can handle multi-character delimiters, so submitting the job with Databricks runtime 7.2.x avoids the above error while crawling the data. Below is the advanced configuration that needs to be set at the table or source level to run an ingestion job on a runtime other than the default one.

Key: databricks_spark_runtime
Value: 7.2.x-scala2.12
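
Databricks runtime 7.2.x ships Spark 3.0, which passes multi-character delimiters through to its CSV parser, so the same read succeeds there. As a quick sanity check, one can verify the Spark version from a notebook attached to the new cluster and rerun the read; the path below is the same hypothetical example as above.

    // Confirm the cluster runs Spark 3 before rerunning the ingestion job.
    println(spark.version)        // expected: 3.0.x on Databricks runtime 7.2

    val df = spark.read
      .option("sep", "@|#")       // now accepted as a multi-character delimiter
      .option("header", "true")
      .csv("/mnt/data/sample_file.txt")

    df.show(5)                    // parsed rows instead of an IllegalArgumentException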


Applicable IWX versions:

IWX 4.2