title | description | services | author | ms.reviewer | ms.service | ms.topic | ms.date | ms.author | ROBOTS |
---|---|---|---|---|---|---|---|---|---|
Customize HDInsight Clusters using script actions - Azure | Learn how to customize HDInsight clusters using Script Action. | hdinsight | hrasheed-msft | jasonh | hdinsight | conceptual | 10/05/2016 | hrasheed | NOINDEX |
Script Action can be used to invoke custom scripts during the cluster creation process for installing additional software on a cluster.
The information in this article is specific to Windows-based HDInsight clusters. For Linux-based clusters, see Customize Linux-based HDInsight clusters using Script Action.
Important
Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.
HDInsight clusters can be customized in a variety of other ways as well, such as including additional Azure Storage accounts, changing the Apache Hadoop configuration files (core-site.xml, hive-site.xml, etc.), or adding shared libraries (e.g., Apache Hive, Apache Oozie) into common locations in the cluster. These customizations can be done through Azure PowerShell, the Azure HDInsight .NET SDK, or the Azure portal. For more information, see Create Apache Hadoop clusters in HDInsight.
[!INCLUDE upgrade-powershell]
Script Action is only used while a cluster is in the process of being created. The following diagram illustrates when Script Action is executed during the creation process:
While the script is running, the cluster is in the ClusterCustomization stage. At this stage, the script runs under the system admin account, in parallel on all the specified nodes in the cluster, with full admin privileges on those nodes.
Note
Because you have admin privileges on the cluster nodes during the ClusterCustomization stage, you can use the script to perform operations like stopping and starting services, including Hadoop-related services. So, as part of the script, you must ensure that the Ambari services and other Hadoop-related services are up and running before the script finishes running. These services are required to successfully ascertain the health and state of the cluster while it is being created. If you change any configuration on the cluster that affects these services, you must use the helper functions that are provided. For more information about helper functions, see Develop Script Action scripts for HDInsight.
The output and the error logs for the script are stored in the default Storage account you specified for the cluster. The logs are stored in a table with the name u<cluster-name-fragment><time-stamp>setuplog. These are aggregated logs from the script runs on all the nodes (head node and worker nodes) in the cluster.
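As a rough sketch of how the setup log tables might be located, the following lists tables in the default Storage account whose names follow the pattern described above. It assumes the AzureRM and Azure.Storage PowerShell modules are installed and that you are already signed in. The resource group and storage account names are placeholders, and the wildcard filter is only an approximation of the naming pattern.

```powershell
# Placeholders (assumptions): replace with the values used for your cluster.
$resourceGroupName = "<ResourceGroupName>"
$storageAccountName = "<DefaultStorageAccountName>"

# Get a key for the cluster's default storage account and build a storage context.
$storageAccountKey = (Get-AzureRmStorageAccountKey `
    -ResourceGroupName $resourceGroupName `
    -Name $storageAccountName)[0].Value
$context = New-AzureStorageContext `
    -StorageAccountName $storageAccountName `
    -StorageAccountKey $storageAccountKey

# List tables whose names start with "u" and end with "setuplog".
Get-AzureStorageTable -Context $context |
    Where-Object { $_.Name -like "u*setuplog" } |
    Select-Object -ExpandProperty Name
```

If more than one table is returned, the time stamp in the name should help identify the one that belongs to the cluster creation you are troubleshooting.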
Each cluster can accept multiple script actions that are invoked in the order in which they are specified. A script can be run on the head node, the worker nodes, or both.
HDInsight provides several scripts to install the following components on HDInsight clusters:
Name | Script |
---|---|
Install Apache Spark | https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv03/spark-installer-v03.ps1 . See Install and use Apache Spark on HDInsight clusters. |
Install R | https://hdiconfigactions.blob.core.windows.net/rconfigactionv02/r-installer-v02.ps1 . See Install and use R on HDInsight clusters. |
Install Apache Solr | https://hdiconfigactions.blob.core.windows.net/solrconfigactionv01/solr-installer-v01.ps1 . See Install and use Apache Solr on HDInsight clusters. |
Install Apache Giraph | https://hdiconfigactions.blob.core.windows.net/giraphconfigactionv01/giraph-installer-v01.ps1 . See Install and use Apache Giraph on HDInsight clusters. |
Pre-load Apache Hive libraries | https://hdiconfigactions.blob.core.windows.net/setupcustomhivelibsv01/setup-customhivelibs-v01.ps1 . See Add Apache Hive libraries on HDInsight clusters |
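To illustrate how several of the scripts in the table might be combined, here is a minimal sketch that adds two script actions to one cluster configuration. It assumes the AzureRM.HDInsight module is installed. The node type value WorkerNode is taken from the ClusterNodeType enumeration used in the .NET sample in this article and is an assumption for the PowerShell cmdlet. The choice of Spark followed by R is only an example.

```powershell
# Start from an empty cluster configuration object (no Azure call is made here).
$config = New-AzureRmHDInsightClusterConfig

# First script action: install Spark on the head node.
$config = Add-AzureRmHDInsightScriptAction `
    -Config $config `
    -Name "Install Spark" `
    -NodeType HeadNode `
    -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv03/spark-installer-v03.ps1

# Second script action: install R on both head and worker nodes.
# The cmdlet takes one node type per call, so it is called once per node type.
foreach ($nodeType in "HeadNode", "WorkerNode") {
    $config = Add-AzureRmHDInsightScriptAction `
        -Config $config `
        -Name "Install R" `
        -NodeType $nodeType `
        -Uri https://hdiconfigactions.blob.core.windows.net/rconfigactionv02/r-installer-v02.ps1
}
```

Because script actions are invoked in the order in which they are specified, the Spark installer in this sketch would run before the R installer on the head node.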
From the Azure portal
- Start creating a cluster as described at Create Apache Hadoop clusters in HDInsight.
- Under Optional Configuration, on the Script Actions blade, click add script action and provide the following details about the script action:

Property | Value |
---|---|
Name | Specify a name for the script action. |
Script URI | Specify the URI to the script that is invoked to customize the cluster. |
Head/Worker | Specify the nodes (**Head** or **Worker**) on which the customization script is run. |
Parameters | Specify the parameters, if required by the script. |

Press ENTER to add more than one script action to install multiple components on the cluster.
- Click Select to save the script action configuration and continue with cluster creation.
The following PowerShell script demonstrates how to install Spark on a Windows-based HDInsight cluster.
# Provide values for these variables
$subscriptionID = "<Azure Subscription ID>" # After "Connect-AzureRmAccount", use "Get-AzureRmSubscription" to list IDs.
$nameToken = "<Enter A Name Token>" # The token is used to create Azure service names.
$namePrefix = $nameToken.ToLower() + (Get-Date -Format "MMdd")
$resourceGroupName = $namePrefix + "rg"
$location = "EAST US 2" # used for creating resource group, storage account, and HDInsight cluster.
$hdinsightClusterName = $namePrefix + "spark"
$httpUserName = "admin"
$httpPassword = "<Enter a Password>"
$defaultStorageAccountName = "$namePrefix" + "store"
$defaultBlobContainerName = $hdinsightClusterName
#############################################################
# Connect to Azure
#############################################################
Try{
Get-AzureRmSubscription
}
Catch{
Connect-AzureRmAccount
}
Select-AzureRmSubscription -SubscriptionId $subscriptionID
#############################################################
# Prepare the dependent components
#############################################################
# Create resource group
New-AzureRmResourceGroup -Name $resourceGroupName -Location $location
# Create storage account
New-AzureRmStorageAccount `
-ResourceGroupName $resourceGroupName `
-Name $defaultStorageAccountName `
-Location $location `
-Type Standard_GRS
$defaultStorageAccountKey = (Get-AzureRmStorageAccountKey `
-ResourceGroupName $resourceGroupName `
-Name $defaultStorageAccountName)[0].Value
$defaultStorageAccountContext = New-AzureStorageContext `
-StorageAccountName $defaultStorageAccountName `
-StorageAccountKey $defaultStorageAccountKey
New-AzureStorageContainer `
-Name $defaultBlobContainerName `
-Context $defaultStorageAccountContext
#############################################################
# Create cluster with Apache Spark
#############################################################
# Specify the configuration options
$config = New-AzureRmHDInsightClusterConfig `
-DefaultStorageAccountName "$defaultStorageAccountName.blob.core.windows.net" `
-DefaultStorageAccountKey $defaultStorageAccountKey
# Add a script action to the cluster configuration
$config = Add-AzureRmHDInsightScriptAction `
-Config $config `
-Name "Install Spark" `
-NodeType HeadNode `
-Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv03/spark-installer-v03.ps1
# Start creating a cluster with Spark installed
New-AzureRmHDInsightCluster `
-ResourceGroupName $resourceGroupName `
-ClusterName $hdinsightClusterName `
-Location $location `
-ClusterSizeInNodes 2 `
-ClusterType Hadoop `
-OSType Windows `
-DefaultStorageContainer $defaultBlobContainerName `
-Config $config
To install other software, replace the script URI in the script.
When prompted, enter the credentials for the cluster. It can take several minutes before the cluster is created.
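The script above defines $httpUserName and $httpPassword but never passes them to the cluster, which is why a credential prompt appears. As a hedged alternative, the sketch below builds a credential object from those variables and passes it through the -HttpCredential parameter of New-AzureRmHDInsightCluster. It assumes the variables and $config from the script above are still in scope. Keeping a password in plain text is done here only for illustration.

```powershell
# Values from the script above (repeated so the intent is clear).
$httpUserName = "admin"
$httpPassword = "<Enter a Password>"

# Convert the plain-text password and wrap it in a PSCredential object.
$securePassword = ConvertTo-SecureString $httpPassword -AsPlainText -Force
$httpCredential = New-Object System.Management.Automation.PSCredential($httpUserName, $securePassword)

# Same cluster creation call as above, with the credential supplied explicitly.
New-AzureRmHDInsightCluster `
    -ResourceGroupName $resourceGroupName `
    -ClusterName $hdinsightClusterName `
    -Location $location `
    -ClusterSizeInNodes 2 `
    -ClusterType Hadoop `
    -OSType Windows `
    -DefaultStorageContainer $defaultBlobContainerName `
    -HttpCredential $httpCredential `
    -Config $config
```

For interactive use, calling Get-Credential and passing its result to -HttpCredential avoids storing the password in the script at all.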
The following sample demonstrates how to install Apache Spark on a Windows-based HDInsight cluster. To install other software, replace the script URI in the code.
To create an HDInsight cluster with Spark
- Create a C# console application in Visual Studio.
- From the NuGet Package Manager Console, run the following commands:

Install-Package Microsoft.Rest.ClientRuntime.Azure.Authentication -Pre
Install-Package Microsoft.Azure.Management.ResourceManager -Pre
Install-Package Microsoft.Azure.Management.HDInsight
- Use the following using statements in the Program.cs file:

using System;
using System.Security;
using Microsoft.Azure;
using Microsoft.Azure.Management.HDInsight;
using Microsoft.Azure.Management.HDInsight.Models;
using Microsoft.Azure.Management.ResourceManager;
using Microsoft.IdentityModel.Clients.ActiveDirectory;
using Microsoft.Rest;
using Microsoft.Rest.Azure.Authentication;
- Place the following code in the class:

private static HDInsightManagementClient _hdiManagementClient;

// Replace with your AAD tenant ID if necessary
private const string TenantId = UserTokenProvider.CommonTenantId;
private const string SubscriptionId = "<Your Azure Subscription ID>";
// This is the GUID for the PowerShell client. Used for interactive logins in this example.
private const string ClientId = "1950a258-227b-4e31-a9cf-717495945fc2";
private const string ResourceGroupName = "<ExistingAzureResourceGroupName>";
private const string NewClusterName = "<NewAzureHDInsightClusterName>";
private const int NewClusterNumWorkerNodes = 2;
private const string NewClusterLocation = "East US";
private const string NewClusterVersion = "3.2";
private const string ExistingStorageName = "<ExistingAzureStorageAccountName>";
private const string ExistingStorageKey = "<ExistingAzureStorageAccountKey>";
private const string ExistingContainer = "<ExistingAzureBlobStorageContainer>";
private const string NewClusterType = "Hadoop";
private const OSType NewClusterOSType = OSType.Windows;
private const string NewClusterUsername = "<HttpUserName>";
private const string NewClusterPassword = "<HttpUserPassword>";

static void Main(string[] args)
{
    System.Console.WriteLine("Running");

    // Authenticate and get a token
    var authToken = Authenticate(TenantId, ClientId, SubscriptionId);
    // Flag subscription for HDInsight, if it isn't already.
    EnableHDInsight(authToken);

    // Get an HDInsight management client
    _hdiManagementClient = new HDInsightManagementClient(authToken);

    CreateCluster();
}

private static void CreateCluster()
{
    var parameters = new ClusterCreateParameters
    {
        ClusterSizeInNodes = NewClusterNumWorkerNodes,
        Location = NewClusterLocation,
        ClusterType = NewClusterType,
        OSType = NewClusterOSType,
        Version = NewClusterVersion,
        DefaultStorageInfo = new AzureStorageInfo(ExistingStorageName, ExistingStorageKey, ExistingContainer),
        UserName = NewClusterUsername,
        Password = NewClusterPassword,
    };

    // Run the Spark installer script on both the head node and the worker nodes
    ScriptAction sparkScriptAction = new ScriptAction("Install Spark",
        new Uri("https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv03/spark-installer-v03.ps1"), "");

    parameters.ScriptActions.Add(ClusterNodeType.HeadNode, new System.Collections.Generic.List<ScriptAction> { sparkScriptAction });
    parameters.ScriptActions.Add(ClusterNodeType.WorkerNode, new System.Collections.Generic.List<ScriptAction> { sparkScriptAction });

    _hdiManagementClient.Clusters.Create(ResourceGroupName, NewClusterName, parameters);
}

/// <summary>
/// Authenticate to an Azure subscription and retrieve an authentication token
/// </summary>
/// <param name="TenantId">The AAD tenant ID</param>
/// <param name="ClientId">The AAD client ID</param>
/// <param name="SubscriptionId">The Azure subscription ID</param>
/// <returns></returns>
static TokenCloudCredentials Authenticate(string TenantId, string ClientId, string SubscriptionId)
{
    var authContext = new AuthenticationContext("https://login.microsoftonline.com/" + TenantId);
    var tokenAuthResult = authContext.AcquireToken("https://management.core.windows.net/",
        ClientId,
        new Uri("urn:ietf:wg:oauth:2.0:oob"),
        PromptBehavior.Always,
        UserIdentifier.AnyUser);
    return new TokenCloudCredentials(SubscriptionId, tokenAuthResult.AccessToken);
}

/// <summary>
/// Marks your subscription as one that can use HDInsight, if it has not already been marked as such.
/// </summary>
/// <remarks>This is essentially a one-time action; if you have already done something with HDInsight
/// on your subscription, then this isn't needed at all and will do nothing.</remarks>
/// <param name="authToken">An authentication token for your Azure subscription</param>
static void EnableHDInsight(TokenCloudCredentials authToken)
{
    // Create a client for the Resource manager and set the subscription ID
    var resourceManagementClient = new ResourceManagementClient(new TokenCredentials(authToken.Token));
    resourceManagementClient.SubscriptionId = SubscriptionId;

    // Register the HDInsight provider
    var rpResult = resourceManagementClient.Providers.Register("Microsoft.HDInsight");
}
- Press F5 to run the application.
The Microsoft Azure HDInsight service is a flexible platform that enables you to build big-data applications in the cloud by using an ecosystem of open-source technologies formed around Hadoop. Microsoft Azure provides a general level of support for open-source technologies, as discussed in the Support Scope section of the Azure Support FAQ website. The HDInsight service provides an additional level of support for some of the components, as described below.
There are two types of open-source components that are available in the HDInsight service:
- Built-in components - These components are pre-installed on HDInsight clusters and provide core functionality of the cluster. For example, Apache Hadoop YARN ResourceManager, the Hive query language (HiveQL), and the Apache Mahout library belong to this category. A full list of cluster components is available in What's new in the Hadoop cluster versions provided by HDInsight?.
- Custom components - You, as a user of the cluster, can install or use in your workload any component available in the community or created by you.
Built-in components are fully supported, and Microsoft Support will help to isolate and resolve issues related to these components.
Warning
Components provided with the HDInsight cluster are fully supported and Microsoft Support will help to isolate and resolve issues related to these components.
Custom components receive commercially reasonable support to help you to further troubleshoot the issue. This might result in resolving the issue or in asking you to engage available channels for the open-source technologies where deep expertise for that technology is found. For example, there are many community sites that can be used, such as the MSDN forum for HDInsight and http://stackoverflow.com. Apache projects also have project sites on http://apache.org, for example, Hadoop and Spark.
The HDInsight service provides several ways to use custom components. Regardless of how a component is used or installed on the cluster, the same level of support applies. Below is a list of the most common ways that custom components can be used on HDInsight clusters:
- Job submission - Hadoop or other types of jobs that execute or use custom components can be submitted to the cluster.
- Cluster customization - During cluster creation, you can specify additional settings and custom components that will be installed on the cluster nodes.
- Samples - For popular custom components, Microsoft and others may provide samples of how these components can be used on the HDInsight clusters. These samples are provided without support.
See Develop Script Action scripts for HDInsight.
- Create Apache Hadoop clusters in HDInsight provides instructions on how to create an HDInsight cluster by using other custom options.
- Develop Script Action scripts for HDInsight
- Install and use Apache Spark on HDInsight clusters
- Install and use Apache Solr on HDInsight clusters
- Install and use Apache Giraph on HDInsight clusters