Description


Infoworks can ingest data into Hive in Parquet format. The ingested data may be stored in recursively nested HDFS directories, and if you try to read such a table through the Spark shell, the query returns no results.
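For example, querying such a table from a Spark shell launched without the settings below typically comes back empty. The database and table names here are hypothetical, used only for illustration:

// Hypothetical Hive table whose Parquet files sit in nested HDFS subdirectories
spark.sql("SELECT * FROM iw_db.parquet_table").show()   // prints an empty result set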


Solution: To read Hive table data stored in recursive HDFS directories through the Spark shell, set the configuration below in the df_spark-defaults.conf file in the $IW_HOME/conf directory, and then run the Spark shell command with that file.


spark.sql.hive.convertMetastoreParquet false
spark.sql.parquet.writeLegacyFormat true
spark.mapreduce.input.fileinputformat.input.dir.recursive true
spark.hive.mapred.supports.subdirectories true
spark.mapred.input.dir.recursive true
spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive true
spark.sql.crossJoin.enabled true
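These lines go into df_spark-defaults.conf verbatim (spark-defaults format: key and value separated by whitespace). If you are running a standalone Spark application rather than the shell, the same properties can, as a sketch, be applied when building the SparkSession; the application name is hypothetical, and the keys simply mirror the file above:

import org.apache.spark.sql.SparkSession

// Sketch: apply the same properties programmatically in a standalone app
val spark = SparkSession.builder()
  .appName("read-nested-parquet")  // hypothetical app name
  .config("spark.sql.hive.convertMetastoreParquet", "false")
  .config("spark.sql.parquet.writeLegacyFormat", "true")
  .config("spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive", "true")
  .config("spark.hive.mapred.supports.subdirectories", "true")
  .enableHiveSupport()             // required to read Hive metastore tables
  .getOrCreate()

For the interactive shell, use the command below.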


---Spark Shell command---


spark-shell --properties-file <absolute_path_to_df_spark-defaults.conf>
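After the shell starts, you can confirm the properties took effect and re-run the query; the table name is again hypothetical:

scala> spark.conf.get("spark.sql.hive.convertMetastoreParquet")        // should print "false"
scala> spark.sql("SELECT COUNT(*) FROM iw_db.parquet_table").show()    // now returns the actual row count

Alternatively, the same properties can be passed inline with repeated --conf key=value flags instead of a properties file.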