Description and Root cause:

The root cause of the problem is a stale entry in the Postgres table "task_instance". When a task is removed from a workflow and the workflow is then triggered, Airflow diffs the old and new DAGs and marks the removed task's task_instance row as "removed" instead of deleting it from the Airflow database. The first step after triggering is to mark the workflow_run as RUNNING and update the status of its tasks: we read the task instances from Postgres and match each one to a task in our workflow_run by its task_id parameter. The rows from Postgres still include the removed task, which matches no task in our workflow_run, and the lookup throws an exception.

Because this hook is the one that should move the workflow_run to RUNNING, the workflow_run stays in PENDING. The failed hook also fails the Airflow task and, with it, the entire DAG run. The exception-catching block in our task code (which would normally update the task status) is never reached, because execution failed before entering our task code at all. Finally, the hook that is supposed to set the workflow_run state to FAILED uses the same matching function and fails in the same way. As a result, the workflow_run is stuck perpetually in the PENDING state even though Airflow has failed the DAG in the backend.
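
For illustration only, the Python sketch below shows the general shape of the matching logic and how tolerating "removed" task instances avoids the failure. The names used here (update_workflow_run_tasks, TaskInstanceRow, workflow_run_tasks) are hypothetical stand-ins, not the actual Infoworks code in iw_airflow_engine_utils.py.

from collections import namedtuple

# Simplified stand-in for a row read from the Airflow "task_instance" table.
TaskInstanceRow = namedtuple("TaskInstanceRow", ["task_id", "state"])

def update_workflow_run_tasks(workflow_run_tasks, task_instance_rows):
    """Copy Airflow task states onto the workflow_run's tasks.

    workflow_run_tasks: dict mapping task_id -> task document in our store.
    task_instance_rows: rows read from the Postgres "task_instance" table.
    """
    for row in task_instance_rows:
        if row.state == "removed":
            # The task was deleted from the DAG; Airflow keeps the stale row
            # and marks it "removed". Skip it instead of raising, so the
            # workflow_run can still be moved to RUNNING / FAILED.
            continue
        task = workflow_run_tasks.get(row.task_id)
        if task is None:
            # Defensive: an unmatched task_id should not abort the whole sync.
            continue
        task["status"] = row.state
    return workflow_run_tasks

# Example: "old_task" was removed from the DAG and no longer exists in the
# workflow_run, so its row is ignored instead of raising an exception.
run_tasks = {"task_a": {"status": "PENDING"}}
rows = [TaskInstanceRow("task_a", "running"),
        TaskInstanceRow("old_task", "removed")]
print(update_workflow_run_tasks(run_tasks, rows))
# {'task_a': {'status': 'running'}}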

JIRA: https://infoworks.atlassian.net/browse/IPD-15181


Steps to apply the fix:

cd /opt/infoworks/apricot-meteor/infoworks_python/infoworks/orchestrator/core/components/airflow
mv iw_airflow_component_utils.py /tmp/iw_airflow_component_utils.py.bkp
mv {DOWNLOAD_FOLDER}/iw_airflow_component_utils.py .  (expected cksum output: 450619210 16210 iw_airflow_component_utils.py)
cd /opt/infoworks/apricot-meteor/infoworks_python/infoworks/orchestrator/core/engine/airflow
mv iw_airflow_engine_utils.py /tmp/iw_airflow_engine_utils.py.bkp
mv {DOWNLOAD_FOLDER}/iw_airflow_engine_utils.py .     (expected cksum output: 1537420746 89673 iw_airflow_engine_utils.py)
Restart the orchestrator service


Applicable IWX versions:

IWX 3.3