Description:


During a data crawl from a Teradata JDBC source, the MapReduce job that crawls the data can take a long time when the table has a large number of rows (around 11 million) and no partition columns are specified in the table configuration.


Root cause:


This issue occurs when a primary partition is not specified in the table configuration during the full load. When a primary partition is specified, a folder is created on HDFS for each value of the primary partition column, which improves Hive query performance.


Solution:


Enable the Partition Hive table option in the table configuration and select the column by which the table should be partitioned on HDFS. This helps when crawling larger tables and allows the data to be processed in parallel.


Number of partitions created = number of distinct values in the primary partition column
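
To illustrate the rule above, here is a minimal Python sketch that counts the distinct values of a candidate partition column and lists the folders that would result. The sample rows, the column name region, and the path /warehouse/orders are hypothetical, and the column=value folder naming follows the usual Hive convention; the exact paths in your environment may differ.

```python
from collections import Counter

def partition_layout(rows, partition_column, table_path):
    """Map each distinct partition value to a Hive-style folder and its row count."""
    counts = Counter(row[partition_column] for row in rows)
    return {
        f"{table_path}/{partition_column}={value}": row_count
        for value, row_count in sorted(counts.items())
    }

# Hypothetical sample rows standing in for the source table.
rows = [
    {"order_id": 1, "region": "EAST"},
    {"order_id": 2, "region": "WEST"},
    {"order_id": 3, "region": "EAST"},
    {"order_id": 4, "region": "NORTH"},
]

layout = partition_layout(rows, "region", "/warehouse/orders")
print("Partitions:", len(layout))
for folder, row_count in layout.items():
    print(folder, row_count)
```

Because one folder is created per distinct value, it is worth checking the distinct count of a column before selecting it: a column with very many distinct values would produce an equally large number of folders.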





--Aditya