Creating a custom Dataproc Image


  • Google provides an option to launch a Dataproc cluster with a custom machine image

  • A custom machine image reduces cluster provisioning time, since user-defined scripts and packages come pre-installed, and it keeps Dataproc clusters consistent across launches



Limitations

  • The custom Dataproc image expires 60 days after it is created

  • New images announced in the Dataproc Release Notes are not available for use as the base for custom images until one week after their announcement date.

Prerequisites

  • Enable the Dataproc, Compute Engine, and Cloud Storage APIs (a setup sketch with these commands follows this list)

  • Clone the Dataproc custom-images repository to the machine that will run the image-build script (the clone command is included in the setup sketch below)

  • Compose a customization script that installs the required init scripts, custom packages, self-signed certificates, etc. (a sample script follows this list)

  • Install and initialize the Google Cloud CLI, and ensure the initialized IAM user/service account can create a VM instance and a Dataproc cluster and has access to the target GCS bucket used by the image build

  • If you use a custom image hosted in another project, the Dataproc Service Agent service account in your project must have the compute.images.get permission on the image in the host project. You can grant this by assigning the roles/compute.imageUser role on the hosted image to your project's Dataproc Service Agent service account (see Sharing custom images within an organization; a sample grant command follows this list)
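For reference, a minimal setup sketch to run on the build machine, covering the API enablement and repository clone named above:

# Enable the required APIs
gcloud services enable dataproc.googleapis.com compute.googleapis.com storage.googleapis.com

# Clone the custom-images repository (see References) and enter it
git clone https://github.com/GoogleCloudDataproc/custom-images.git
cd custom-images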

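A minimal sketch of a customization script, assuming a Debian/Ubuntu base image such as the 1.5.75-ubuntu18 version used in the example below; the package names and certificate path are illustrative only, not part of the original setup:

#!/usr/bin/env bash
# Illustrative customization script: generate_custom_image.py runs this as
# root on the temporary VM it creates to build the image.
set -euxo pipefail

# Install example OS packages (replace with the packages you actually need)
apt-get update
apt-get install -y unzip jq

# Example: trust a self-signed certificate (path is hypothetical)
# cp /tmp/my-ca.crt /usr/local/share/ca-certificates/my-ca.crt
# update-ca-certificates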

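And a sample grant for the cross-project case; every angle-bracketed value is a placeholder you must replace:

gcloud compute images add-iam-policy-binding <custom_image_name> \
    --project <host_project> \
    --member 'serviceAccount:service-<project_number>@dataproc-accounts.iam.gserviceaccount.com' \
    --role 'roles/compute.imageUser'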

Steps 

  1. Use gcloud config set project <your-project> to specify which project to use to create and save your custom image.

  2. From the root of the cloned custom-images repository, run generate_custom_image.py:

python generate_custom_image.py \

    --image-name '<new_custom_image_name>' \

    --dataproc-version '<dataproc_version>' \

    --customization-script '<custom_script_to_install_custom_packages>' \

    --zone '<zone_to_create_instance_to_build_custom_image>' \

    --gcs-bucket '<gcs_bucket_to_write_logs>'



  • If there is no default network in the GCP project, you must also pass the --network and --subnetwork flags (--subnetwork requires a full subnetwork URL), as shown in the example below


Example:


python generate_custom_image.py \

    --image-name iwx-custom \

    --dataproc-version 1.5.75-ubuntu18 \

    --customization-script /home/scripts/init_tpt_install_client.sh \

    --zone us-central1-a \

    --gcs-bucket gs://staging_bucket_supp/custom_image \

    --network gcp-cs-vpc \

    --subnetwork projects/iw-gcp-cs-host/regions/us-central1/subnetworks/gcp-cs-vpc-subnet-1



The script performs the following steps:


1. Get the user's gcloud project ID.

2. Get Dataproc's base image name with the Dataproc version.

3. Run the shell script to create a custom Dataproc image.

    a. Create a disk with Dataproc's base image.

    b. Create a GCE instance with the disk.

    c. Run custom install packages script to install custom packages.

    d. Shutdown instance.

    e. Create a custom Dataproc image from the disk.

4. Set the custom image label (required for launching custom Dataproc image).

5. Run a Dataproc workflow to smoke-test the custom image.



  3. Once the script finishes successfully, its output shows that the smoke test completed and reports the expiry date of the custom image. The smoke test launches a Dataproc cluster with the custom image that was created



  4. You can also navigate to https://console.cloud.google.com/compute/images to view the custom image that was created.

  5. Make a note of the imageUri value, which must be passed as a parameter to the Infoworks cluster template (a command to retrieve it follows this list)
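To retrieve the imageUri, and optionally the labels and creation timestamp, you can describe the image; iwx-custom is the example image name from the build above:

gcloud compute images describe iwx-custom --format='value(selfLink)'

gcloud compute images describe iwx-custom --format='value(labels,creationTimestamp)'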



Configuring cluster template on Infoworks with custom image

  1. Click Admin > Manage Environments > select the required Dataproc environment > click Compute Template

  2. Create a new compute template and configure the required cluster details, or clone an existing cluster template

  3. Choose the custom Image URI option in the compute template and fill in the Image URI captured in Step 5


 Example: https://compute.googleapis.com/compute/v1/projects/iw-gcp-cs-host/global/images/iwx-custom



  4. Once the cluster is launched, you can log in to one of the nodes to verify that the packages required by the custom image were installed (a sketch follows). You can also verify the cluster configuration in the GCP console.
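A quick verification sketch; the cluster name, zone, and package name are placeholders, and dpkg applies to Debian/Ubuntu-based images:

# SSH into the cluster's master node
gcloud compute ssh <cluster-name>-m --zone <zone>

# On the node, confirm a package from the customization script is installed
dpkg -l | grep <custom_package>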






Note: Because the custom image expires after 60 days, the end user is responsible for ensuring that a valid image is available at all times at the Image URI declared in the cluster template




References: 


https://docs.infoworks.io/getting-started/configuring-gcp-dataproc#configuring-environment

https://github.com/GoogleCloudDataproc/custom-images

https://cloud.google.com/dataproc/docs/guides/dataproc-images






Applicable Infoworks Versions: 5.x.x




Author: Anirudh Chekuri