---
title: Customize HDInsight Clusters using script actions - Azure
description: Learn how to customize HDInsight clusters using Script Action.
services: hdinsight
author: hrasheed-msft
ms.reviewer: jasonh
ms.service: hdinsight
ms.topic: conceptual
ms.date: 10/05/2016
ms.author: hrasheed
ROBOTS: NOINDEX
---

# Customize Windows-based HDInsight clusters using Script Action

Script Action can be used to invoke custom scripts during the cluster creation process for installing additional software on a cluster.

The information in this article is specific to Windows-based HDInsight clusters. For Linux-based clusters, see Customize Linux-based HDInsight clusters using Script Action.

> [!IMPORTANT]
> Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.

HDInsight clusters can be customized in a variety of other ways as well, such as including additional Azure Storage accounts, changing the Apache Hadoop configuration files (core-site.xml, hive-site.xml, etc.), or adding shared libraries (e.g., Apache Hive, Apache Oozie) into common locations in the cluster. These customizations can be done through Azure PowerShell, the Azure HDInsight .NET SDK, or the Azure portal. For more information, see Create Apache Hadoop clusters in HDInsight.
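
As a rough sketch of how a couple of these customizations look in Azure PowerShell (using the AzureRM.HDInsight module), the following example adds a second storage account and overrides a few Hadoop configuration values. The account name, key, and setting values are placeholders, not values from this article:

```powershell
# Sketch (not part of this article's sample): add a second storage account and a few
# Hadoop configuration overrides to a cluster configuration. Placeholder values.
$config = New-AzureRmHDInsightClusterConfig

# Attach an additional Azure Storage account to the cluster.
$config = Add-AzureRmHDInsightStorage `
            -Config $config `
            -StorageAccountName "<extra-account>.blob.core.windows.net" `
            -StorageAccountKey "<extra-account-key>"

# Override values in core-site.xml and hive-site.xml.
$config = Add-AzureRmHDInsightConfigValues `
            -Config $config `
            -Core @{ "io.file.buffer.size" = "131072" } `
            -HiveSite @{ "hive.exec.compress.output" = "true" }
```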

[!INCLUDE upgrade-powershell]

## Script Action in the cluster creation process

Script Action is only used while a cluster is in the process of being created. The following diagram illustrates when Script Action is executed during the creation process:

HDInsight cluster customization and stages during cluster creation

When the script is running, the cluster enters the ClusterCustomization stage. At this stage, the script is run under the system admin account, in parallel on all the specified nodes in the cluster, and provides full admin privileges on the nodes.

> [!NOTE]
> Because you have admin privileges on the cluster nodes during the ClusterCustomization stage, you can use the script to perform operations like stopping and starting services, including Hadoop-related services. So, as part of the script, you must ensure that the Ambari services and other Hadoop-related services are up and running before the script finishes running. These services are required to successfully ascertain the health and state of the cluster while it is being created. If you change any configuration on the cluster that affects these services, you must use the helper functions that are provided. For more information about helper functions, see Develop Script Action scripts for HDInsight.
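
As a rough illustration of that requirement, a customization script fragment could verify that a service it stopped is running again before the script exits. The service name below is a placeholder, not a real HDInsight service name:

```powershell
# Hypothetical fragment of a customization script run on a cluster node.
# After making changes, confirm that Hadoop-related services are running again
# before the script exits, so the cluster health checks can succeed.
$serviceName = "<hadoop-related-service-name>"   # placeholder

$svc = Get-Service -Name $serviceName -ErrorAction SilentlyContinue
if ($svc -ne $null -and $svc.Status -ne "Running") {
    Start-Service -Name $serviceName
    # Block until the service reports Running (up to five minutes).
    $svc.WaitForStatus("Running", [TimeSpan]::FromMinutes(5))
}
```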

The output and the error logs for the script are stored in the default Storage account you specified for the cluster. The logs are stored in a table with the name u&lt;cluster-name-fragment&gt;&lt;time-stamp&gt;setuplog. These are aggregate logs from the script run on all the nodes (head node and worker nodes) in the cluster.
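
If you want to inspect those logs, a minimal sketch using the Azure Storage cmdlets might look like the following, assuming you substitute your own storage account name and key:

```powershell
# Sketch: list the script action setup log tables in the cluster's default storage account.
# Replace <account-name> and <account-key> with your values.
$ctx = New-AzureStorageContext `
          -StorageAccountName "<account-name>" `
          -StorageAccountKey "<account-key>"

# Setup log table names end in "setuplog".
Get-AzureStorageTable -Context $ctx | Where-Object { $_.Name -like "*setuplog" }
```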

Each cluster can accept multiple script actions that are invoked in the order in which they are specified. A script can be run on the head node, the worker nodes, or both, as sketched below.
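
For example, when you build the cluster configuration in Azure PowerShell (the full Spark example appears later in this article), each Add-AzureRmHDInsightScriptAction call appends one script action, and -NodeType selects which nodes run it. The URIs below are placeholders for your own scripts:

```powershell
# Sketch: two script actions added to a cluster configuration. They run in the
# order they are added; -NodeType controls which nodes run each script.
$config = New-AzureRmHDInsightClusterConfig

# Runs first, on the head nodes.
$config = Add-AzureRmHDInsightScriptAction `
            -Config $config `
            -Name "First script" `
            -NodeType HeadNode `
            -Uri "https://<your-storage>/first-script.ps1"

# Runs second, on the worker nodes. To run a script on both node types,
# add it once for HeadNode and once for WorkerNode.
$config = Add-AzureRmHDInsightScriptAction `
            -Config $config `
            -Name "Second script" `
            -NodeType WorkerNode `
            -Uri "https://<your-storage>/second-script.ps1"
```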

HDInsight provides several scripts to install the following components on HDInsight clusters:

| Name | Script |
| --- | --- |
| Install Apache Spark | https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv03/spark-installer-v03.ps1. See Install and use Apache Spark on HDInsight clusters. |
| Install R | https://hdiconfigactions.blob.core.windows.net/rconfigactionv02/r-installer-v02.ps1. See Install and use R on HDInsight clusters. |
| Install Apache Solr | https://hdiconfigactions.blob.core.windows.net/solrconfigactionv01/solr-installer-v01.ps1. See Install and use Apache Solr on HDInsight clusters. |
| Install Apache Giraph | https://hdiconfigactions.blob.core.windows.net/giraphconfigactionv01/giraph-installer-v01.ps1. See Install and use Apache Giraph on HDInsight clusters. |
| Pre-load Apache Hive libraries | https://hdiconfigactions.blob.core.windows.net/setupcustomhivelibsv01/setup-customhivelibs-v01.ps1. See Add Apache Hive libraries on HDInsight clusters. |

## Call scripts using the Azure portal

From the Azure portal

  1. Start creating a cluster as described at Create Apache Hadoop clusters in HDInsight.

  2. Under Optional Configuration, for the Script Actions blade, click add script action to provide details about the script action, as shown below:

    Use Script Action to customize a cluster

    | Property | Value |
    | --- | --- |
    | Name | Specify a name for the script action. |
    | Script URI | Specify the URI to the script that is invoked to customize the cluster. |
    | Head/Worker | Specify the nodes (**Head** or **Worker**) on which the customization script is run. |
    | Parameters | Specify the parameters, if required by the script. |

    Press ENTER to add more than one script action to install multiple components on the cluster.

  3. Click Select to save the script action configuration and continue with cluster creation.

## Call scripts using Azure PowerShell

The following PowerShell script demonstrates how to install Spark on a Windows-based HDInsight cluster.

# Provide values for these variables
$subscriptionID = "<Azure Subscription ID>" # After "Connect-AzureRmAccount", use "Get-AzureRmSubscription" to list IDs.

$nameToken = "<Enter A Name Token>"  # The token is used to create Azure service names.
$namePrefix = $nameToken.ToLower() + (Get-Date -Format "MMdd")

$resourceGroupName = $namePrefix + "rg"
$location = "EAST US 2" # used for creating resource group, storage account, and HDInsight cluster.

$hdinsightClusterName = $namePrefix + "spark"
$httpUserName = "admin"
$httpPassword = "<Enter a Password>"

$defaultStorageAccountName = "$namePrefix" + "store"
$defaultBlobContainerName = $hdinsightClusterName

#############################################################
# Connect to Azure
#############################################################

Try{
    Get-AzureRmSubscription
}
Catch{
    Connect-AzureRmAccount
}
Select-AzureRmSubscription -SubscriptionId $subscriptionID

#############################################################
# Prepare the dependent components
#############################################################

# Create resource group
New-AzureRmResourceGroup -Name $resourceGroupName -Location $location

# Create storage account
New-AzureRmStorageAccount `
    -ResourceGroupName $resourceGroupName `
    -Name $defaultStorageAccountName `
    -Location $location `
    -Type Standard_GRS
$defaultStorageAccountKey = (Get-AzureRmStorageAccountKey `
                                -ResourceGroupName $resourceGroupName `
                                -Name $defaultStorageAccountName)[0].Value
$defaultStorageAccountContext = New-AzureStorageContext `
                                -StorageAccountName $defaultStorageAccountName `
                                -StorageAccountKey $defaultStorageAccountKey
New-AzureStorageContainer `
    -Name $defaultBlobContainerName `
    -Context $defaultStorageAccountContext

#############################################################
# Create cluster with ApacheSpark
#############################################################

# Specify the configuration options
$config = New-AzureRmHDInsightClusterConfig `
            -DefaultStorageAccountName "$defaultStorageAccountName.blob.core.windows.net" `
            -DefaultStorageAccountKey $defaultStorageAccountKey


# Add a script action to the cluster configuration
$config = Add-AzureRmHDInsightScriptAction `
            -Config $config `
            -Name "Install Spark" `
            -NodeType HeadNode `
            -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv03/spark-installer-v03.ps1

# Start creating a cluster with Spark installed
New-AzureRmHDInsightCluster `
        -ResourceGroupName $resourceGroupName `
        -ClusterName $hdinsightClusterName `
        -Location $location `
        -ClusterSizeInNodes 2 `
        -ClusterType Hadoop `
        -OSType Windows `
        -DefaultStorageContainer $defaultBlobContainerName `
        -Config $config

To install other software, replace the script action name and URI in the script, as shown in the sketch below.
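
For example, assuming the rest of the script stays the same, swapping the Spark installer for the R installer (the URI comes from the table earlier in this article) would only change the Add-AzureRmHDInsightScriptAction call:

```powershell
# Install R instead of Spark: change only the script action name and URI.
# Depending on the component, you may also need to add the same script for WorkerNode.
$config = Add-AzureRmHDInsightScriptAction `
            -Config $config `
            -Name "Install R" `
            -NodeType HeadNode `
            -Uri https://hdiconfigactions.blob.core.windows.net/rconfigactionv02/r-installer-v02.ps1
```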

When prompted, enter the credentials for the cluster. It can take several minutes before the cluster is created.

## Call scripts using .NET SDK

The following sample demonstrates how to install Apache Spark on a Windows-based HDInsight cluster. To install other software, you will need to replace the script file in the code.

To create an HDInsight cluster with Spark

  1. Create a C# console application in Visual Studio.

  2. From the NuGet Package Manager Console, run the following commands.

     Install-Package Microsoft.Rest.ClientRuntime.Azure.Authentication -Pre
     Install-Package Microsoft.Azure.Management.ResourceManager -Pre
     Install-Package Microsoft.Azure.Management.HDInsight
    
  3. Use the following using statements in the Program.cs file:

     using System;
     using System.Security;
     using Microsoft.Azure;
     using Microsoft.Azure.Management.HDInsight;
     using Microsoft.Azure.Management.HDInsight.Models;
     using Microsoft.Azure.Management.ResourceManager;
     using Microsoft.IdentityModel.Clients.ActiveDirectory;
     using Microsoft.Rest;
     using Microsoft.Rest.Azure.Authentication;
    
  4. Add the following code to the Program class:

     private static HDInsightManagementClient _hdiManagementClient;
    
     // Replace with your AAD tenant ID if necessary
     private const string TenantId = UserTokenProvider.CommonTenantId;
     private const string SubscriptionId = "<Your Azure Subscription ID>";
     // This is the GUID for the PowerShell client. Used for interactive logins in this example.
     private const string ClientId = "1950a258-227b-4e31-a9cf-717495945fc2";
     private const string ResourceGroupName = "<ExistingAzureResourceGroupName>";
     private const string NewClusterName = "<NewAzureHDInsightClusterName>";
     private const int NewClusterNumWorkerNodes = 2;
     private const string NewClusterLocation = "East US";
     private const string NewClusterVersion = "3.2";
     private const string ExistingStorageName = "<ExistingAzureStorageAccountName>";
     private const string ExistingStorageKey = "<ExistingAzureStorageAccountKey>";
     private const string ExistingContainer = "<ExistingAzureBlobStorageContainer>";
     private const string NewClusterType = "Hadoop";
     private const OSType NewClusterOSType = OSType.Windows;
     private const string NewClusterUsername = "<HttpUserName>";
     private const string NewClusterPassword = "<HttpUserPassword>";
    
     static void Main(string[] args)
     {
         System.Console.WriteLine("Running");
    
         // Authenticate and get a token
         var authToken = Authenticate(TenantId, ClientId, SubscriptionId);
         // Flag subscription for HDInsight, if it isn't already.
         EnableHDInsight(authToken);
         // Get an HDInsight management client
         _hdiManagementClient = new HDInsightManagementClient(authToken);
    
         CreateCluster();
     }
    
     private static void CreateCluster()
     {
         var parameters = new ClusterCreateParameters
         {
             ClusterSizeInNodes = NewClusterNumWorkerNodes,
             Location = NewClusterLocation,
             ClusterType = NewClusterType,
             OSType = NewClusterOSType,
             Version = NewClusterVersion,
             DefaultStorageInfo = new AzureStorageInfo(ExistingStorageName, ExistingStorageKey, ExistingContainer),
             UserName = NewClusterUsername,
             Password = NewClusterPassword,
         };
    
         ScriptAction sparkScriptAction = new ScriptAction("Install Spark",
             new Uri("https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv03/spark-installer-v03.ps1"), "");
    
         parameters.ScriptActions.Add(ClusterNodeType.HeadNode, new System.Collections.Generic.List<ScriptAction> { sparkScriptAction });
         parameters.ScriptActions.Add(ClusterNodeType.WorkerNode, new System.Collections.Generic.List<ScriptAction> { sparkScriptAction });
    
         _hdiManagementClient.Clusters.Create(ResourceGroupName, NewClusterName, parameters);
     }
    
     /// <summary>
     /// Authenticate to an Azure subscription and retrieve an authentication token
     /// </summary>
     /// <param name="TenantId">The AAD tenant ID</param>
     /// <param name="ClientId">The AAD client ID</param>
     /// <param name="SubscriptionId">The Azure subscription ID</param>
     /// <returns></returns>
     static TokenCloudCredentials Authenticate(string TenantId, string ClientId, string SubscriptionId)
     {
         var authContext = new AuthenticationContext("https://login.microsoftonline.com/" + TenantId);
         var tokenAuthResult = authContext.AcquireToken("https://management.core.windows.net/",
             ClientId,
             new Uri("urn:ietf:wg:oauth:2.0:oob"),
             PromptBehavior.Always,
             UserIdentifier.AnyUser);
         return new TokenCloudCredentials(SubscriptionId, tokenAuthResult.AccessToken);
     }
     /// <summary>
     /// Marks your subscription as one that can use HDInsight, if it has not already been marked as such.
     /// </summary>
     /// <remarks>This is essentially a one-time action; if you have already done something with HDInsight
     /// on your subscription, then this isn't needed at all and will do nothing.</remarks>
     /// <param name="authToken">An authentication token for your Azure subscription</param>
     static void EnableHDInsight(TokenCloudCredentials authToken)
     {
         // Create a client for the Resource manager and set the subscription ID
         var resourceManagementClient = new ResourceManagementClient(new TokenCredentials(authToken.Token));
         resourceManagementClient.SubscriptionId = SubscriptionId;
         // Register the HDInsight provider
         var rpResult = resourceManagementClient.Providers.Register("Microsoft.HDInsight");
     }
    
  5. Press F5 to run the application.

## Support for open-source software used on HDInsight clusters

The Microsoft Azure HDInsight service is a flexible platform that enables you to build big-data applications in the cloud by using an ecosystem of open-source technologies formed around Hadoop. Microsoft Azure provides a general level of support for open-source technologies, as discussed in the Support Scope section of the Azure Support FAQ website. The HDInsight service provides an additional level of support for some of the components, as described below.

There are two types of open-source components that are available in the HDInsight service:

  • Built-in components - These components are pre-installed on HDInsight clusters and provide core functionality of the cluster. For example, Apache Hadoop YARN ResourceManager, the Hive query language (HiveQL), and the Apache Mahout library belong to this category. A full list of cluster components is available in What's new in the Hadoop cluster versions provided by HDInsight?.
  • Custom components - You, as a user of the cluster, can install or use in your workload any component available in the community or created by you.

Built-in components are fully supported, and Microsoft Support will help to isolate and resolve issues related to these components.

> [!WARNING]
> Components provided with the HDInsight cluster are fully supported and Microsoft Support will help to isolate and resolve issues related to these components.

Custom components receive commercially reasonable support to help you troubleshoot the issue. Microsoft Support might resolve the issue, or they might ask you to engage the available channels for the open-source technology, where deep expertise for that technology is found. For example, there are many community sites that can be used, such as the MSDN forum for HDInsight and http://stackoverflow.com. Apache projects also have project sites on http://apache.org, for example: Hadoop, Spark.

The HDInsight service provides several ways to use custom components. Regardless of how a component is used or installed on the cluster, the same level of support applies. Below is a list of the most common ways that custom components can be used on HDInsight clusters:

  1. Job submission - Hadoop or other types of jobs that execute or use custom components can be submitted to the cluster. (A sketch follows this list.)
  2. Cluster customization - During cluster creation, you can specify additional settings and custom components that will be installed on the cluster nodes.
  3. Samples - For popular custom components, Microsoft and others may provide samples of how these components can be used on the HDInsight clusters. These samples are provided without support.
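
As a hedged illustration of the first item (job submission), the following Azure PowerShell sketch submits a Hive job that loads a user-provided jar. The cluster name, jar path, and query are placeholders, not values from this article:

```powershell
# Sketch: submit a Hive job that uses a custom (user-provided) jar that you have
# already uploaded to the cluster's default storage. Names and paths are placeholders.
$clusterName = "<your-cluster-name>"
$creds = Get-Credential -Message "Enter the cluster HTTP (admin) credentials"

$hiveJob = New-AzureRmHDInsightHiveJobDefinition -Query @"
ADD JAR wasb:///example/jars/my-custom-udf.jar;
SELECT * FROM hivesampletable LIMIT 10;
"@

$job = Start-AzureRmHDInsightJob `
          -ClusterName $clusterName `
          -JobDefinition $hiveJob `
          -HttpCredential $creds

# Wait for the job to finish.
Wait-AzureRmHDInsightJob -ClusterName $clusterName -JobId $job.JobId -HttpCredential $creds
```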

## Develop Script Action scripts

See Develop Script Action scripts for HDInsight.

## See also