forked from GoogleCloudPlatform/professional-services
Merge pull request GoogleCloudPlatform#439 from rilkeanheart/master
Adding example files/scripts to support shutdown of dataproc clusters…
Showing 4 changed files with 399 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,80 @@ | ||
# Custom Dataproc Scheduled Cluster Deletion | ||
This repository provides scripts for launching a Google Cloud Dataproc cluster with a maximum idle time, after which the cluster is deleted. The scripts treat active SSH sessions and YARN-based jobs as cluster activity. An optional parameter accepts a list of additional processes that should also keep the cluster active. The scripts also detect when the cluster is in an ERROR state, caused by one or more initialization actions returning a non-zero result, and delete the cluster in that case. | |
|
||
**Note:** if you are currently utilizing the [Dataproc Jobs API](https://cloud.google.com/dataproc/docs/concepts/jobs/life-of-a-job), then you should be using the [Cluster Scheduled Deletion](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/scheduled-deletion) feature as is. | ||
|
||
## Solution Architecture | ||
------------------------------------------------------ | ||
![Architecture Diagram](img/idle-script.png) | ||
|
||
1. The cluster is created specifying the initialization action and the location of the idle script stored in Google Cloud Storage | ||
2. The cluster downloads the idle script and schedules it as a cron job to run every 5 minutes | ||
3. The script runs and saves instance level metadata tracking the idle status of the cluster | ||
4. The script logs status of checks and shutdown events via Stackdriver logging | ||
5. If shutdown conditions are met, the script shuts down the cluster | |
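
Step 2 above schedules the idle script through cron. As a minimal sketch (not the code shipped in this commit), the helper below appends a cron line only when that exact line is not already present, so re-running an initialization step does not create duplicate entries. The function name `ensure_cron_entry` and the use of a temporary file in place of `/etc/crontab` are assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: append ENTRY to a crontab-style FILE only if that
# exact line is not already there.
ensure_cron_entry() {
  local crontab_file="$1"
  local entry="$2"
  # -x matches whole lines, -F treats the entry as a fixed string, -q is quiet.
  if ! grep -qxF -- "$entry" "$crontab_file" 2>/dev/null; then
    printf '%s\n' "$entry" >> "$crontab_file"
  fi
}

# Demonstrate against a temporary file rather than the real /etc/crontab.
demo_file="$(mktemp)"
cron_line="*/5 * * * * root /root/DataprocShutdown/idle-check.sh"
ensure_cron_entry "$demo_file" "$cron_line"
ensure_cron_entry "$demo_file" "$cron_line"   # second call is a no-op
cat "$demo_file"
```

The second call leaves the file unchanged because the line already exists.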
|
||
## Components | ||
------------------------------------------------------ | ||
1. create-idlemonitoringjob.sh: This Bash script is an [initialization action](https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions) that copies the cluster idle monitoring script (idle-check.sh) from Cloud Storage and schedules it to run every 5 minutes, so that the cluster is shut down once it has been idle for a specified amount of time. The location of the cluster monitoring script AND the duration of idle time before deletion should be passed as [instance metadata](https://cloud.google.com/compute/docs/storing-retrieving-metadata). | |
|
||
2. idle-check.sh: This script checks for active YARN jobs, recently completed YARN jobs, and active SSH connections. If none are detected, the cluster is considered idle. If the cluster remains idle for more time than was specified at cluster creation, the script will delete the cluster. | ||
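
The deletion rule described for idle-check.sh can be sketched as a small pure function. This is an illustration under stated assumptions, not the contents of idle-check.sh (which is not shown here): the function name `should_delete_cluster`, the `TRUE` idle flag value, and the use of epoch seconds for timestamps are all hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of the idle rule.
# Arguments: IS_IDLE (TRUE/FALSE), IDLE_SINCE (epoch seconds),
#            NOW (epoch seconds), MAX_IDLE_SECONDS.
# Returns 0 (delete) only when the cluster is idle and has been idle for
# strictly longer than the allowed maximum.
should_delete_cluster() {
  local is_idle="$1" idle_since="$2" now="$3" max_idle="$4"
  [[ "$is_idle" == "TRUE" ]] || return 1
  (( now - idle_since > max_idle ))
}

# Example: idle for 3000 seconds with a 2700 second (45m) limit.
if should_delete_cluster "TRUE" 1000 4000 2700; then
  echo "delete"
else
  echo "keep"
fi
```

Keeping the rule free of `gcloud` calls makes it easy to exercise with fixed timestamps before wiring it to real metadata.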
|
||
3. Cluster Instance Metadata Keys: | ||
- [CLUSTER_NAME]_maxIdleSeconds: Holds the value of how long a cluster can be idle before it should be deleted. | ||
- [CLUSTER_NAME]_isIdle: Specifies whether or not the cluster is currently considered to be idle. | ||
- [CLUSTER_NAME]_isIdleStatusSince: Specifies a timestamp corresponding to when the cluster’s status became idle OR was last known to be active. | ||
- [CLUSTER_NAME]_persistDiagnosticTarball: Specifies whether diagnostic logs will be written to the Cloud Storage staging bucket via the [diagnose command](https://cloud.google.com/dataproc/docs/support/diagnose-command) upon cluster deletion. | ||
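
The key names above are the cluster name joined to a fixed suffix with an underscore. A minimal sketch of that naming scheme follows; the helper name `metadata_key` is hypothetical, and `hive-cluster-1` is simply the cluster name used in the usage example of this README.

```shell
#!/bin/bash
# Hypothetical helper: build a metadata key of the form <cluster>_<suffix>.
metadata_key() {
  printf '%s_%s' "$1" "$2"
}

cluster_name="hive-cluster-1"
for suffix in maxIdleSeconds isIdle isIdleStatusSince persistDiagnosticTarball; do
  echo "$(metadata_key "$cluster_name" "$suffix")"
done
```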
|
||
## Usage | ||
------------------------------------------------------ | ||
|
||
### Preparation: Copy scripts to a Cloud Storage bucket | ||
|
||
Download all artifacts from Git: | ||
``` | ||
git clone https://github.com/GoogleCloudPlatform/professional-services.git | ||
``` | ||
Copy all artifacts to Cloud Storage: | ||
``` | ||
gsutil cp ./professional-services/examples/dataproc-idle-shutdown/*.sh gs://<BUCKET> | |
``` | ||
|
||
### Cluster start: Start the cluster specifying key parameters | ||
1. [Mandatory] Specify the location of the create-idlemonitoringjob.sh script as a “--initialization-actions” parameter. | ||
2. [Mandatory] Specify the Cloud Storage location that contains the idle-check.sh script (the bucket or folder, not the file itself) as the value of the metadata key “script_storage_location”. | |
3. [Mandatory] Specify the maximum time the cluster may remain idle as the value of the metadata key “max-idle”. As with the corresponding Scheduled Cluster Deletion parameter, provide the duration in IntegerUnit format, where the unit is one of “s”, “m”, “h”, or “d” (seconds, minutes, hours, days, respectively). Examples: “30m” or “1d” (30 minutes or 1 day from when the cluster becomes idle). | |
4. [Optional] Specify, as the value of the metadata key “key_process_list”, a semicolon-separated list of process names (in addition to YARN jobs and active SSH connections) for which the cluster should be considered active. | |
5. [Optional] Specify whether the cluster should write diagnostic logs to the Cloud Storage staging bucket by setting the metadata key "persist_diagnostic_tarball" to TRUE or FALSE. The default is FALSE. The diagnostic output is saved in a folder specific to the job under which the DIAGNOSE command was run; the easiest way to locate it is "gsutil ls gs://[GCS STAGING BUCKET]/google-cloud-dataproc-metainfo/*/*/diagnostic.tar.gz". | |
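
To make the IntegerUnit format of “max-idle” concrete, here is a minimal sketch that validates a value and converts it to seconds using a Bash regular expression. It is an alternative illustration, not the `sed`-based parsing used by create-idlemonitoringjob.sh, and the function name `max_idle_to_seconds` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: convert an IntegerUnit duration (e.g. 45m, 1d) to
# seconds. Prints the result and returns 0 on success; returns 1 on
# malformed input such as "45", "m", or "1.5h".
max_idle_to_seconds() {
  local value="$1"
  if [[ "$value" =~ ^([0-9]+)([smhd])$ ]]; then
    local amount="${BASH_REMATCH[1]}"
    local unit="${BASH_REMATCH[2]}"
    # 10# forces base 10 so values with leading zeros (e.g. 08m) still parse.
    case "$unit" in
      s) echo $(( 10#$amount )) ;;
      m) echo $(( 10#$amount * 60 )) ;;
      h) echo $(( 10#$amount * 60 * 60 )) ;;
      d) echo $(( 10#$amount * 60 * 60 * 24 )) ;;
    esac
  else
    return 1
  fi
}

max_idle_to_seconds "45m"   # prints 2700
max_idle_to_seconds "1d"    # prints 86400
max_idle_to_seconds "soon" || echo "invalid max-idle value"
```

Requiring at least one digit rejects a bare unit such as "m", which would otherwise reach the arithmetic step with an empty amount.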
|
||
>Note: [Google APIs](https://developers.google.com/identity/protocols/googlescopes) must also be included in scopes in order for the scripts to read and write cluster metadata. | ||
An example of starting a cluster while specifying a maximum idle time of 45 minutes: | ||
``` | ||
gcloud dataproc clusters create hive-cluster-1 \ | ||
--region=us-central1 \ | ||
--master-machine-type n1-standard-1 \ | ||
--worker-machine-type n1-standard-1 \ | ||
--scopes=https://www.googleapis.com/auth/cloud-platform \ | ||
--initialization-actions gs://<BUCKET>/create-idlemonitoringjob.sh \ | ||
--metadata 'script_storage_location=gs://<BUCKET>,key_process_list=python;sed,max-idle=45m,persist_diagnostic_tarball=TRUE' | ||
``` | ||
|
||
Once started, the monitor script periodically checks whether the cluster is idle. The script also checks the cluster status to ensure it is not in an error state (e.g., one or more initialization actions exited with a non-zero result). If the cluster is in an error state or has been idle for longer than the duration specified at cluster start, the script deletes the cluster and the associated project metadata. | |
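
One part of the idle check is the optional “key_process_list” value, a semicolon-separated list of process names. The sketch below shows one way such a list could be split and checked; it is an assumption about the approach rather than the logic in idle-check.sh, the function name `any_key_process_running` is hypothetical, and it relies on `pgrep` being installed.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if any process named in a
# semicolon-separated list is currently running, 1 otherwise.
any_key_process_running() {
  local process_list="$1"
  local -a names
  local name
  # Split on ';' into an array without touching the global IFS.
  IFS=';' read -r -a names <<< "$process_list"
  for name in "${names[@]}"; do
    [[ -n "$name" ]] || continue
    # -x requires the process name to match exactly.
    if pgrep -x -- "$name" > /dev/null 2>&1; then
      return 0
    fi
  done
  return 1
}

if any_key_process_running "python;sed"; then
  echo "cluster considered active"
else
  echo "no key processes found"
fi
```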
|
||
## Logging | ||
|
||
The monitor script continuously logs all idle checks and shutdown events via Stackdriver Logging to a log named "idle-check-log". These messages can be viewed in the Google Cloud Platform console under Logging -> Logs Viewer by applying the filters "global" for resource type and "idle-check-log" for log name: | |
``` | ||
resource.type="global" | ||
logName="projects/[PROJECT_NAME]/logs/idle-check-log" | ||
``` | ||
Alternatively, the `gcloud logging read 'logName="projects/[PROJECT_NAME]/logs/idle-check-log"'` command can be used. | |
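
For scripted access, the two filter clauses shown above can be combined into a single filter string. The sketch below only builds and prints that string; the function name `idle_check_log_filter` and the project ID `my-project` are placeholders, and the `gcloud` call is left commented out because it needs credentials and access to the project.

```shell
#!/bin/bash
# Hypothetical helper: build a Logging filter for the idle-check log of
# the given project.
idle_check_log_filter() {
  local project="$1"
  printf 'resource.type="global" AND logName="projects/%s/logs/idle-check-log"' "$project"
}

echo "$(idle_check_log_filter my-project)"

# With gcloud installed and authenticated, the filter could be used as:
# gcloud logging read "$(idle_check_log_filter my-project)" --limit 20
```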
|
||
## Open Issues & Roadmap | ||
In the future, we hope to add support for other analytical engines and/or processes. Presto, MySQL shell running DML/DDL, and Flink are just a few examples. Feel free to create an issue within the repo with any request and/or submit a pull request with added functionality! | ||
|
||
## License | ||
APACHE LICENSE, VERSION 2.0 | ||
|
||
## Disclaimer | ||
This is not an official Google project. |
119 changes: 119 additions & 0 deletions
examples/dataproc-idle-shutdown/create-idlemonitoringjob.sh
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,119 @@ | ||
#!/bin/bash | ||
# Copyright 2019 Google, Inc. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
readonly MY_CLUSTER_NAME="$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)" | ||
readonly MASTER_INSTANCE_NAME="$(/usr/share/google/get_metadata_value attributes/dataproc-master)" | ||
readonly MASTER_INSTANCE_ZONE="$(/usr/share/google/get_metadata_value zone)" | ||
readonly MAX_IDLE_SECONDS_KEY="${MY_CLUSTER_NAME}_maxIdleSeconds" | ||
readonly DATAPROC_PERSIST_DIAG_TARBALL_KEY="${MY_CLUSTER_NAME}_persistDiagnosticTarball" | ||
readonly KEY_PROCESS_LIST_KEY="${MY_CLUSTER_NAME}_keyProcessList" | ||
readonly MACHINE_ROLE=$(/usr/share/google/get_metadata_value attributes/dataproc-role) | ||
readonly MASTER_NAME="$(/usr/share/google/get_metadata_value attributes/dataproc-master)" | ||
readonly CLUSTER_NAME="$(/usr/share/google/get_metadata_value attributes/dataproc-cluster-name)" | ||
readonly PERSIST_DIAG_TARBALL_TRUE="TRUE" | ||
readonly PERSIST_DIAG_TARBALL_FALSE="FALSE" | ||
readonly SCRIPT_STORAGE_LOCATION=$(/usr/share/google/get_metadata_value attributes/script_storage_location) | ||
readonly MAX_IDLE_PARAMETER=$(/usr/share/google/get_metadata_value attributes/max-idle) | ||
readonly DATAPROC_PERSIST_DIAG_TARBALL=$(/usr/share/google/get_metadata_value attributes/persist_diagnostic_tarball) | ||
readonly KEY_PROCESS_LIST_PARAMETER=$(/usr/share/google/get_metadata_value attributes/key_process_list) | ||
|
||
function checkMaster() { | ||
local isMaster="false" | ||
if [[ "${MACHINE_ROLE}" == 'Master' && "$MASTER_NAME" == "${CLUSTER_NAME}-m" ]]; then | ||
isMaster="true" | ||
fi | ||
echo "$isMaster" | ||
} | ||
|
||
function startIdleJobChecker() { | ||
|
||
gcloud logging write idle-check-log "${MY_CLUSTER_NAME}: attempting to start idle checker using: ${SCRIPT_STORAGE_LOCATION}" --severity=NOTICE | ||
if [[ -n ${SCRIPT_STORAGE_LOCATION} ]]; then | ||
|
||
# Get and validate the max idle parameter. The following line evaluates to nothing if the value is missing or not properly formatted. | |
parsedMaxIdleParameter=$( echo "${MAX_IDLE_PARAMETER}" | sed -n '/^\([0-9][0-9]*\)\(s\|m\|h\|d\)$/p' | sed 's/^\([0-9][0-9]*\)\(s\|m\|h\|d\)$/\1\,\2/' ) | |
if [[ -n ${parsedMaxIdleParameter} ]]; then | ||
idleTimeUnit=${parsedMaxIdleParameter#*,} | ||
idleTimeAmount=${parsedMaxIdleParameter%,*} | ||
idleTimeAmountSeconds=300 | ||
|
||
# Convert max idle to seconds (the unit stored in instance metadata) | |
case "$idleTimeUnit" in | ||
's') | ||
idleTimeAmountSeconds=$(( idleTimeAmount )) | ||
;; | ||
'm') | ||
idleTimeAmountSeconds=$(( idleTimeAmount * 60 )) | ||
;; | ||
'h') | ||
idleTimeAmountSeconds=$(( idleTimeAmount * 60 * 60 )) | ||
;; | ||
'd') | ||
idleTimeAmountSeconds=$(( idleTimeAmount * 60 * 60 * 24 )) | ||
;; | ||
esac | ||
|
||
# Record the max seconds value as instance metadata | ||
gcloud logging write idle-check-log "${MY_CLUSTER_NAME}: Idle time for cluster set to ${idleTimeAmountSeconds} " --severity=NOTICE | ||
gcloud compute instances add-metadata "${MASTER_INSTANCE_NAME}" --zone "${MASTER_INSTANCE_ZONE}" --metadata "${MAX_IDLE_SECONDS_KEY}=${idleTimeAmountSeconds}" & | |
|
||
# Record preference for persisting diagnostic information on cluster shutdown | ||
persistDiagnosticTarball=${PERSIST_DIAG_TARBALL_FALSE} | ||
parsedPersistDiagnosticTarballParameter=$(echo "${DATAPROC_PERSIST_DIAG_TARBALL}" | sed -n '/\(TRUE\|true\)/p' ) | |
if [[ -n ${parsedPersistDiagnosticTarballParameter} ]]; then | |
persistDiagnosticTarball=${PERSIST_DIAG_TARBALL_TRUE} | |
fi | |
gcloud logging write idle-check-log "${MY_CLUSTER_NAME}: Request to persist diagnostic tarball - ${persistDiagnosticTarball} " --severity=NOTICE | |
gcloud compute instances add-metadata "${MASTER_INSTANCE_NAME}" --zone "${MASTER_INSTANCE_ZONE}" --metadata "${DATAPROC_PERSIST_DIAG_TARBALL_KEY}=${persistDiagnosticTarball}" & | ||
|
||
# Record key process list parameter | ||
keyProcessListParam=${KEY_PROCESS_LIST_PARAMETER} | ||
gcloud logging write idle-check-log "${MY_CLUSTER_NAME}: Key non-yarn processes to monitor - ${keyProcessListParam} " --severity=NOTICE | ||
gcloud compute instances add-metadata "${MASTER_INSTANCE_NAME}" --zone "${MASTER_INSTANCE_ZONE}" --metadata "${KEY_PROCESS_LIST_KEY}=${keyProcessListParam}" & | ||
|
||
|
||
gcloud logging write idle-check-log "${MY_CLUSTER_NAME}: Establishing idle-check process to determine when master node can be deleted" --severity=NOTICE | ||
cd /root || exit | ||
mkdir -p DataprocShutdown | |
cd DataprocShutdown || exit | ||
|
||
# copy the script from GCS | ||
gsutil cp "${SCRIPT_STORAGE_LOCATION}/idle-check.sh" . | ||
# make it executable | ||
chmod 700 idle-check.sh | ||
# run the idle-check script once immediately | |
./idle-check.sh | ||
|
||
#sudo bash -c 'echo "" >> /etc/crontab' | ||
sudo bash -c 'echo "*/5 * * * * root /root/DataprocShutdown/idle-check.sh" >> /etc/crontab' & | ||
else | ||
gcloud logging write idle-check-log "${MY_CLUSTER_NAME}: Must provide a valid value for 'max-idle': the duration from the moment the cluster enters the idle state to the moment the cluster starts to delete. Provide the duration in IntegerUnit format, where the unit can be 's, m, h, d' (seconds, minutes, hours, days, respectively)." --severity=NOTICE | |
exit 1; | ||
fi | ||
else | ||
gcloud logging write idle-check-log "${MY_CLUSTER_NAME}: Value for metadata key 'script_storage_location' is required" --severity=NOTICE | |
exit 1; | ||
fi | ||
} | ||
|
||
function main() { | ||
is_master_node=$(checkMaster) | ||
gcloud logging write idle-check-log "${MY_CLUSTER_NAME}: Is master is $is_master_node" --severity=NOTICE | ||
if [[ "$is_master_node" == "true" ]]; then | ||
startIdleJobChecker | ||
fi | ||
} | ||
|
||
main "$@" |