Problem Statement: A Spark pipeline build job whose source tables are stored in Parquet format fails with the error message below:

ERROR FileFormatWriter: Aborting job 2393319a-d298-423b-903a-6dc6b0053412.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4.0 (TID 317, ip-10-45-32-67.aws.nonprod.xxx.com, executor 1): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file s3://xxx



Root cause:
This issue occurs when decimal columns in Parquet files written by Spark cannot be read by Hive, because Hive and Spark use different physical representations for the decimal type.


Hive writes decimals as fixed-length byte arrays (FIXED_LEN_BYTE_ARRAY), whereas Spark 1.4 and later chooses the physical type dynamically based on precision: INT32 for 1 <= precision <= 9 and INT64 for 10 <= precision <= 18. The error occurs when these two representations do not match for the same column.
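
As an illustration, the following sketch (column names, precisions, and output path are assumed for the example) writes two decimal columns that Spark, under the default non-legacy format, encodes as INT32 and INT64 respectively; a Hive reader expecting fixed-length byte arrays for these columns then fails with the ParquetDecodingException shown above.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.DecimalType

    val spark = SparkSession.builder()
      .appName("decimal-encoding-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "100.25", "123456789012345.67"))
      .toDF("id", "amount_small", "amount_large")
      .withColumn("amount_small", col("amount_small").cast(DecimalType(9, 2)))
      .withColumn("amount_large", col("amount_large").cast(DecimalType(18, 2)))

    // With the default spark.sql.parquet.writeLegacyFormat=false, Spark writes
    // amount_small using Parquet INT32 (precision <= 9) and amount_large using
    // INT64 (precision <= 18); Hive expects FIXED_LEN_BYTE_ARRAY for both.
    df.write.mode("overwrite").parquet("/tmp/decimal_demo")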

Solution:

Set the following advanced configuration to ensure that both Hive and Spark use the same representation (see the example sketch below):

Key: spark.sql.parquet.writeLegacyFormat

Value: true
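
A minimal sketch of applying the setting when building the SparkSession (the application name is illustrative); the same value can also be passed at submit time with --conf spark.sql.parquet.writeLegacyFormat=true.

    import org.apache.spark.sql.SparkSession

    // Enable the legacy Parquet format so decimals are written as fixed-length
    // byte arrays that Hive can read.
    val spark = SparkSession.builder()
      .appName("pipeline-build-job")
      .config("spark.sql.parquet.writeLegacyFormat", "true")
      .getOrCreate()

    // The setting can also be applied on an existing session before any writes:
    spark.conf.set("spark.sql.parquet.writeLegacyFormat", "true")

Note that the setting affects only files written after it is enabled; Parquet files already written with the int-based decimal encoding would need to be rewritten for Hive to read them.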

Applicable IWX Versions

3.x