Problem Description:


Incremental ingestion fails with the error "Path /<target_hdfs_path>/<table_id>/merged/orc does not exist". A sample stack trace is shown below:


[INFO] 2021-03-17 07:05:41,745 [pool-5-thread-2] infoworks.tools.hadoop.hdfs.HDFSUtils:629 :: Creating hdfs directory /data/PROD/core/infoworks/prod_core_db_infoworks_dhub_life70/5fbf7adbafba099dae5901f3//cdc//orc/
[ERROR] 2021-03-17 07:05:41,770 [pool-5-thread-2] infoworks.discovery.dbcrawler.rdbms.utils.CrawlWorkerThread:314 :: Error during table crawl due to Path /data/PROD/core/infoworks/prod_core_db_infoworks_dhub_life70/5fbf7adbafba099dae5901f3/merged/orc does not exist
java.io.FileNotFoundException: Path /data/PROD/core/infoworks/prod_core_db_infoworks_dhub_life70/5fbf7adbafba099dae5901f3/merged/orc does not exist
    at infoworks.tools.hadoop.hdfs.HDFSUtils.recusiveFirstFileSearch(HDFSUtils.java:336)
    at infoworks.tools.format.OrcUtils.getHiveSchema(OrcUtils.java:216)
    at infoworks.tools.format.OrcUtils.getHiveSchema(OrcUtils.java:212)
    at infoworks.discovery.dbcrawler.rdbms.utils.CrawlWorkerThread.addNewPartitionsPostCDC(CrawlWorkerThread.java:529)
    at infoworks.discovery.dbcrawler.rdbms.utils.CrawlWorkerThread.postCrawlData(CrawlWorkerThread.java:509)
    at infoworks.discovery.dbcrawler.rdbms.utils.CrawlWorkerThread.crawlData(CrawlWorkerThread.java:683)
    at infoworks.discovery.dbcrawler.rdbms.utils.CrawlWorkerThread.call(CrawlWorkerThread.java:257)
    at infoworks.discovery.dbcrawler.rdbms.utils.CrawlWorkerThread.call(CrawlWorkerThread.java:75)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)


Root cause:


This error occurs when the target HDFS path is deleted manually. Infoworks maintains a directory structure for each ingestion job, and the final data set is stored inside the /merged directory at the end of each job. If this directory is deleted, subsequent incremental jobs will fail with the above-mentioned error.
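Before re-running the ingestion, you can confirm this failure mode by checking whether the table's merged/orc directory still exists. The sketch below is illustrative only: TABLE_PATH is a hypothetical example, and the DFS_TEST variable defaults to plain test(1) so the check can be dry-run on a local filesystem; against a real cluster you would set it to the HDFS CLI instead.

```shell
# Hedged sketch: verify that a table's merged/orc directory exists before
# launching an incremental job. TABLE_PATH is a hypothetical example path.
TABLE_PATH="${TABLE_PATH:-/tmp/iwx_demo_table}"

# For this local demo we pre-create the directory and use plain test(1);
# on a real cluster, drop the mkdir and set DFS_TEST="hdfs dfs -test".
DFS_TEST="${DFS_TEST:-test}"
mkdir -p "$TABLE_PATH/merged/orc"   # simulate a healthy target path locally

if $DFS_TEST -d "$TABLE_PATH/merged/orc"; then
  echo "OK: $TABLE_PATH/merged/orc exists; incremental ingestion can proceed"
else
  echo "MISSING: $TABLE_PATH/merged/orc; re-run the table as initialize and ingest (full load)"
fi
```

If the check reports the directory as missing, that matches the FileNotFoundException in the stack trace above, and the table must be re-ingested as described in the Solution section.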


Solution:


To fix this issue, run the ingestion job as initialize and ingest (full load). This repopulates the directory structure in the underlying storage location.


Applicable IWX versions:


IWX 2.x, 3.x