Creating a custom Dataproc Image
Google provides an option to launch a Dataproc cluster with a custom machine image.
A custom machine image reduces cluster provisioning time because it comes pre-installed with user-defined scripts and packages, providing a consistent Dataproc cluster.
Limitations
A custom Dataproc image expires 60 days after its creation date.
New images announced in Dataproc Release Notes are not available for use as the base for custom images until one week from their announcement date.
Prerequisites
Enable the Dataproc, Compute Engine, and Cloud Storage APIs.
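The required APIs can be enabled from the command line; the service names below are the standard Google Cloud API identifiers for these three services:

```shell
# Enable the APIs required for the custom image build
gcloud services enable \
    dataproc.googleapis.com \
    compute.googleapis.com \
    storage.googleapis.com
```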
Clone the Dataproc custom-images repository to the machine that will run the image build script.
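For example, the repository can be cloned with git (the URL is the one listed in the References section):

```shell
# Clone the official Dataproc custom-images repository and enter it
git clone https://github.com/GoogleCloudDataproc/custom-images.git
cd custom-images
```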
Compose a customization script that includes the required init scripts, custom packages, self-signed certificates, etc.
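A minimal sketch of such a customization script is shown below; the package names and certificate path are placeholders, not part of the original document. The script runs as root on the temporary build VM, so anything it installs is baked into the final image:

```shell
#!/usr/bin/env bash
# Hypothetical customization script: replace the packages and the
# certificate below with the ones your clusters actually need.
set -euxo pipefail

# Install extra OS packages (placeholder package names)
apt-get update
apt-get install -y jq unzip

# Trust a self-signed certificate (placeholder path)
# cp /tmp/my-ca.crt /usr/local/share/ca-certificates/my-ca.crt
# update-ca-certificates
```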
Install and initialize the Google Cloud CLI. Ensure the initialized IAM user or service account has permission to create a VM instance and a Dataproc cluster, and has access to the target Cloud Storage bucket used to store the Dataproc image build output.
If you use a custom image hosted in another project, the Dataproc Service Agent service account in your project must have compute.images.get permission on the image in the host project. You can do this by granting the roles/compute.imageUser role on the hosted image to your project's Dataproc Service Agent service account (see Sharing custom images within an organization)
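A sketch of that grant using gcloud, where the image name, host project, and project number are all placeholders:

```shell
# Allow the consuming project's Dataproc Service Agent to read the
# image hosted in another project (all names are placeholders)
gcloud compute images add-iam-policy-binding iwx-custom \
    --project=image-host-project \
    --member="serviceAccount:service-PROJECT_NUMBER@dataproc-accounts.iam.gserviceaccount.com" \
    --role="roles/compute.imageUser"
```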
Steps
Use gcloud config set project <your-project> to specify which project to use to create and save your custom image.
Run the following Python command:
python generate_custom_image.py \
    --image-name '<new_custom_image_name>' \
    --dataproc-version '<dataproc_version>' \
    --customization-script '<custom_script_to_install_custom_packages>' \
    --zone '<zone_to_create_instance_to_build_custom_image>' \
    --gcs-bucket '<gcs_bucket_to_write_logs>'
If there is no default network present in the GCP project, you must also provide the --network and --subnetwork flags (--subnetwork requires a full subnetwork URL).
Example:
python generate_custom_image.py \
    --image-name iwx-custom \
    --dataproc-version 1.5.75-ubuntu18 \
    --customization-script /home/scripts/init_tpt_install_client.sh \
    --zone us-central1-a \
    --gcs-bucket gs://staging_bucket_supp/custom_image \
    --network gcp-cs-vpc \
    --subnetwork projects/iw-gcp-cs-host/regions/us-central1/subnetworks/gcp-cs-vpc-subnet-1
The script performs the following steps:
1. Get the user's gcloud project ID.
2. Get Dataproc's base image name with the Dataproc version.
3. Run the Shell script to create a custom Dataproc image.
a. Create a disk with Dataproc's base image.
b. Create a GCE instance with the disk.
c. Run custom install packages script to install custom packages.
d. Shutdown instance.
e. Create a custom Dataproc image from the disk.
4. Set the custom image label (required for launching custom Dataproc image).
5. Run a Dataproc workflow to smoke-test the custom image.
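If you want to confirm step 4 manually, the image's labels can be inspected with gcloud; the build script is expected to set a goog-dataproc-version label on the image (the image name below is the example used earlier):

```shell
# Inspect the labels on the finished image; Dataproc requires the
# goog-dataproc-version label that the build script sets
gcloud compute images describe iwx-custom --format='value(labels)'
```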
Once the script completes successfully, the output shows that the smoke test passed and reports the expiry date for the custom image. The smoke test launches a Dataproc cluster with the custom image that was created.
You can also navigate to https://console.cloud.google.com/compute/images to look at the custom image that was created.
Please make a note of the imageUri value, which must be passed as a parameter to the Infoworks cluster template.
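One way to retrieve that value from the command line, assuming the example image name used earlier:

```shell
# Print the full image URI (selfLink) for the custom image
gcloud compute images describe iwx-custom --format='value(selfLink)'
```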
Configuring cluster template on Infoworks with custom image
Click Admin > Manage Environments > select the required Dataproc environment > click Compute Template.
Create a new compute template and configure the required cluster details, or clone an existing cluster template.
As shown in the screenshot below, choose the custom Image URI option and fill in the Image URI captured earlier.
Example: https://compute.googleapis.com/compute/v1/projects/iw-gcp-cs-host/global/images/iwx-custom
Once the cluster is launched, you can log in to one of the nodes to verify that the packages configured in the custom image were installed. You can also verify the cluster configuration on the GCP console.
Note: Because the custom image expires after 60 days, it is the end user's responsibility to ensure a valid cluster image is always available at the Image URI declared in the cluster template.
References:
https://docs.infoworks.io/getting-started/configuring-gcp-dataproc#configuring-environment
https://github.com/GoogleCloudDataproc/custom-images
https://cloud.google.com/dataproc/docs/guides/dataproc-images
Applicable Infoworks Versions: 5.x.x
Author: Anirudh Chekuri