Problem:


I would like to use a custom autoscaling policy for the ephemeral Dataproc clusters that run my jobs, or I would like to use secondary worker nodes on the Dataproc cluster.



Solution:


Infoworks provides a pre-ingestion job hook that can be used to run a bash script before an ingestion job begins.

In the steps below, we leverage the pre-ingestion job hook to replace the default autoscaling policy with a user-defined custom autoscaling policy.



Steps:


1. Create a custom autoscaling policy in the GCP console and take note of the autoscaling policy ID

   https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/autoscaling
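
Optionally, the policy can also be created from the command line instead of the console. The snippet below is a minimal sketch using gcloud; the policy ID (my-custom-autoscaling-policy), region, and instance limits are example values, and the secondaryWorkerConfig section is where secondary worker nodes would be allowed to scale.

# Minimal sketch: define and import a custom autoscaling policy with gcloud.
# The policy ID, region, and instance counts below are example values.
cat > autoscaling-policy.yaml <<'EOF'
workerConfig:
  minInstances: 2
  maxInstances: 10
secondaryWorkerConfig:
  # Allow the cluster to scale out with secondary worker nodes
  minInstances: 0
  maxInstances: 20
basicAlgorithm:
  cooldownPeriod: 2m
  yarnConfig:
    scaleUpFactor: 0.5
    scaleDownFactor: 1.0
    gracefulDecommissionTimeout: 1h
EOF

gcloud dataproc autoscaling-policies import my-custom-autoscaling-policy \
    --source=autoscaling-policy.yaml \
    --region=us-central1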


2. Create a bash script like the one below:


#!/bin/bash
# Skip interactive (persistent) clusters; only ephemeral job clusters are updated.
if ! grep -q interactive "/proc/sys/kernel/hostname"
then
 # Dataproc master hostnames end in "-m"; strip the suffix to get the cluster name
 master_node=$(cat /proc/sys/kernel/hostname)
 cluster_name=${master_node::-2}
 gcloud dataproc clusters update "$cluster_name" \
    --autoscaling-policy=autoscale-015a243a33b20d5eba4e5e98 \
    --region=us-central1
 echo "Autoscaling Policy Updated"
else
 echo "Interactive Cluster, not updating autoscaling policy"
fi
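
The cluster name derivation relies on the Dataproc convention that the master node's hostname is the cluster name followed by "-m" (standard, non-HA clusters); ${master_node::-2} simply strips that suffix. A quick illustration with an example hostname:

master_node="iw-ephemeral-abc123-m"   # example master node hostname
echo "${master_node::-2}"             # prints: iw-ephemeral-abc123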



--autoscaling-policy=autoscale-015a243a33b20d5eba4e5e98

Replace with the actual autoscaling policy ID from step 1.


--region=us-central1

Replace with the actual region of your Dataproc cluster.
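
To confirm that the hook took effect, you can describe the ephemeral cluster while the job is running and check which policy is attached. A minimal sketch; the cluster name and region are example values:

gcloud dataproc clusters describe iw-ephemeral-abc123 \
    --region=us-central1 \
    --format="value(config.autoscalingConfig.policyUri)"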



3. Create a pre-ingestion job hook and upload the bash script.

https://docs.infoworks.io/infoworks-5.1.2/admin-and-operations/extensions#managing-job-hooks


4. Add the pre-ingestion job hook to the Infoworks source where you would like to use the custom autoscaling policy.

     https://docs.infoworks.io/infoworks-5.1.2/admin-and-operations/extensions#using-a-job-hook


Note:

1. The above script updates the autoscaling policy only for ephemeral clusters; interactive clusters are left unchanged.

2. A pre-ingestion job hook applies to all tables in the source and cannot be applied to individual tables.



Affects Version:


Infoworks 5.0, 5.1.X