Problem:
I would like to use a custom autoscaling policy for the ephemeral Dataproc clusters that Infoworks launches for jobs, or I would like to add secondary worker nodes to those clusters.
Solution:
Infoworks provides a pre-ingestion job hook that runs a bash script before an ingestion job begins.
The steps below use this hook to replace the default autoscaling policy on the ephemeral cluster with a user-defined custom autoscaling policy.
Steps:
1. Create a custom autoscaling policy on the GCP console and note its autoscaling policy ID (an example of creating a policy through the gcloud CLI is sketched after this step).
https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
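If you prefer the gcloud CLI to the console, the policy can also be defined in a YAML file and imported. The sketch below is illustrative only: the policy ID autoscale-custom-policy, the instance counts, the timings, and the region us-central1 are assumed values that you should adjust to your workload. The secondaryWorkerConfig section is what allows the cluster to scale out with secondary (preemptible) worker nodes.

# policy.yaml -- example autoscaling policy (assumed values)
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:
  # Allow up to 20 secondary (preemptible) workers
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 120s
  yarnConfig:
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 3600s

# Import the policy; the policy ID used here is what the script in step 2 references
gcloud dataproc autoscaling-policies import autoscale-custom-policy \
    --source=policy.yaml \
    --region=us-central1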
2. Create a bash script like the one below:
#!/bin/bash
# Update the autoscaling policy only on ephemeral job clusters; the persistent
# (interactive) cluster has "interactive" in its hostname and is skipped.
if ! grep -q interactive "/proc/sys/kernel/hostname"
then
    # The master node's hostname is "<cluster-name>-m"; strip the trailing "-m".
    master_node=$(cat /proc/sys/kernel/hostname)
    cluster_name=${master_node::-2}
    gcloud dataproc clusters update $cluster_name \
        --autoscaling-policy=autoscale-015a243a33b20d5eba4e5e98 \
        --region=us-central1
    echo "Autoscaling Policy Updated"
else
    echo "Interactive Cluster, not updating autoscaling policy"
fi
In the script above, replace the following values:
--autoscaling-policy=autoscale-015a243a33b20d5eba4e5e98 : replace with your actual autoscaling policy ID from step 1.
--region=us-central1 : replace with the actual region of your Dataproc cluster.
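To verify that the policy was applied to a job's ephemeral cluster, you can describe the cluster with the gcloud CLI; the cluster name below is a placeholder, and us-central1 should be replaced with your region.

# Prints the URI of the autoscaling policy attached to the cluster
gcloud dataproc clusters describe <ephemeral-cluster-name> \
    --region=us-central1 \
    --format="value(config.autoscalingConfig.policyUri)"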
3. Create a pre-ingestion job hook and upload the bash script.
https://docs.infoworks.io/infoworks-5.1.2/admin-and-operations/extensions#managing-job-hooks
4. Add the pre-ingestion job hook to the Infoworks source that should use the custom autoscaling policy.
https://docs.infoworks.io/infoworks-5.1.2/admin-and-operations/extensions#using-a-job-hook
Note:
1. The above script updates the autoscaling policy only for ephemeral clusters.
2. A pre-ingestion job hook applies to all tables in the source and cannot be applied to individual tables.
Affects Version:
Infoworks 5.0, 5.1.X