title | description | services | author | ms.reviewer | ms.service | ms.topic | ms.date | ms.author | ROBOTS |
---|---|---|---|---|---|---|---|---|---|
Install and use Giraph on Hadoop clusters in HDInsight - Azure | Learn how to customize HDInsight cluster with Giraph, and how to use Giraph. | hdinsight | hrasheed-msft | jasonh | hdinsight | conceptual | 02/05/2016 | hrasheed | NOINDEX |
Learn how to customize a Windows-based HDInsight cluster with Apache Giraph by using Script Action, and how to use Giraph to process large-scale graphs. For information on using Giraph with a Linux-based cluster, see Install Apache Giraph on HDInsight Hadoop clusters (Linux).
[!IMPORTANT] The steps in this document only work with Windows-based HDInsight clusters. HDInsight is only available on Windows for versions lower than HDInsight 3.4. Linux is the only operating system used on HDInsight version 3.4 or greater. For more information, see HDInsight retirement on Windows.
You can install Giraph on any type of cluster (Hadoop, Storm, HBase, Spark) on Azure HDInsight by using Script Action. A sample script to install Giraph on an HDInsight cluster is available from a read-only Azure storage blob at https://hdiconfigactions.blob.core.windows.net/giraphconfigactionv01/giraph-installer-v01.ps1. The sample script works only with HDInsight cluster version 3.1. For more information on HDInsight cluster versions, see HDInsight cluster versions.
Related articles
- Install Apache Giraph on HDInsight Hadoop clusters (Linux)
- Create Apache Hadoop clusters in HDInsight: general information on creating HDInsight clusters.
- Customize HDInsight cluster using Script Action: general information on customizing HDInsight clusters using Script Action.
- Develop Script Action scripts for HDInsight.
Apache Giraph allows you to perform graph processing by using Hadoop, and can be used with Azure HDInsight. Graphs model relationships between objects, such as the connections between routers on a large network like the Internet, or relationships between people on social networks (sometimes referred to as a social graph). Graph processing allows you to reason about the relationships between objects in a graph, such as:
- Identifying potential friends based on your current relationships.
- Identifying the shortest route between two computers in a network.
- Calculating the page rank of webpages.
- Start creating a cluster by using the CUSTOM CREATE option, as described at Create Hadoop clusters in HDInsight.
- On the Script Actions page of the wizard, click add script action to provide details about the script action:

Property | Value |
---|---|
Name | Specify a name for the script action. For example, Install Giraph. |
Script URI | Specify the Uniform Resource Identifier (URI) to the script that is invoked to customize the cluster. For example, https://hdiconfigactions.blob.core.windows.net/giraphconfigactionv01/giraph-installer-v01.ps1 |
Node Type | Specify the nodes on which the customization script is run. You can choose All nodes, Head nodes only, or Worker nodes only. |
Parameters | Specify the parameters, if required by the script. The script to install Giraph does not require any parameters, so you can leave this blank. |

You can add more than one script action to install multiple components on the cluster. After you have added the scripts, click the checkmark to start creating the cluster.
We use the SimpleShortestPathsComputation example to demonstrate the basic Pregel implementation for finding the shortest path between objects in a graph. Use the following steps to upload the sample data and the sample jar, run a job by using the SimpleShortestPathsComputation example, and then view the results.
- Upload a sample data file to Azure Blob storage. On your local workstation, create a new file named tiny_graph.txt. It should contain the following lines:

    [0,0,[[1,1],[3,3]]]
    [1,0,[[0,1],[2,2],[3,1]]]
    [2,0,[[1,2],[4,4]]]
    [3,0,[[0,3],[1,1],[4,4]]]
    [4,0,[[3,4],[2,4]]]

Upload the tiny_graph.txt file to the primary storage for your HDInsight cluster. For instructions on how to upload data, see Upload data for Apache Hadoop jobs in HDInsight.
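As a rough sketch of this step, the following PowerShell writes the five lines to tiny_graph.txt in the current directory. The `Send-TinyGraph` function is a hypothetical helper introduced here for illustration; it assumes you already have a storage context object (for example, one created with New-AzureStorageContext) and the name of the cluster's default container. The blob path example/data/tiny_graph.txt matches the input path used by the job later in this article.

```powershell
# Write the five vertex lines to tiny_graph.txt in the current directory
$graphLines = @(
    '[0,0,[[1,1],[3,3]]]',
    '[1,0,[[0,1],[2,2],[3,1]]]',
    '[2,0,[[1,2],[4,4]]]',
    '[3,0,[[0,3],[1,1],[4,4]]]',
    '[4,0,[[3,4],[2,4]]]'
)
Set-Content -Path 'tiny_graph.txt' -Value $graphLines -Encoding Ascii

# Hypothetical helper (not part of the original article): uploads the file
# to the blob path that the Giraph job reads as its input.
function Send-TinyGraph {
    param(
        [Parameter(Mandatory)] $StorageContext,          # assumed to come from New-AzureStorageContext
        [Parameter(Mandatory)] [string] $ContainerName   # default container of the cluster
    )
    Set-AzureStorageBlobContent -File 'tiny_graph.txt' -Container $ContainerName `
        -Blob 'example/data/tiny_graph.txt' -Context $StorageContext -Force
}
```

The helper is only defined here, not called; calling it requires the Azure Storage cmdlets and valid storage credentials.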
This data describes the relationships between objects in a directed graph, by using the format [source_id, source_value, [[dest_id, edge_value], ...]]. Each line represents a relationship between a source_id object and one or more dest_id objects. The edge_value (or weight) can be thought of as the strength or distance of the connection between source_id and dest_id.
Drawn out, and using the value (or weight) as the distance between objects, the above data forms a graph of five objects (IDs 0 through 4). For example, object 1 connects to object 0 at distance 1, to object 2 at distance 2, and to object 3 at distance 1.
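To see how the lines map onto edges, the following PowerShell sketch parses each line with ConvertFrom-Json and prints one entry per edge. It is only an illustration: the `$graphLines` and `$edgeList` variables and the output layout are assumptions made here, not part of the Giraph sample.

```powershell
# Sample graph, one JSON array per vertex:
# [source_id, source_value, [[dest_id, edge_value], ...]]
$graphLines = @(
    '[0,0,[[1,1],[3,3]]]',
    '[1,0,[[0,1],[2,2],[3,1]]]',
    '[2,0,[[1,2],[4,4]]]',
    '[3,0,[[0,3],[1,1],[4,4]]]',
    '[4,0,[[3,4],[2,4]]]'
)

# Expand every vertex line into one description per outgoing edge
$edgeList = foreach ($line in $graphLines) {
    $vertex = ConvertFrom-Json -InputObject $line
    foreach ($edge in $vertex[2]) {
        '{0} -> {1} (distance {2})' -f $vertex[0], $edge[0], $edge[1]
    }
}

# Display the edges
$edgeList
```

Each vertex contributes one entry per destination, so the five lines expand to twelve directed edges.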
- Run the SimpleShortestPathsComputation example. Use the following Azure PowerShell cmdlets to run the example by using the tiny_graph.txt file as input.
[!IMPORTANT] Azure PowerShell support for managing HDInsight resources using Azure Service Manager is deprecated, and was removed on January 1, 2017. The steps in this document use the new HDInsight cmdlets that work with Azure Resource Manager.
Please follow the steps in Install and configure Azure PowerShell to install the latest version of Azure PowerShell. If you have scripts that need to be modified to use the new cmdlets that work with Azure Resource Manager, see Migrating to Azure Resource Manager-based development tools for HDInsight clusters for more information.
    $clusterName = "clustername"

    # Giraph examples jar
    $jarFile = "wasb:///example/jars/giraph-examples.jar"

    # Arguments for this job
    $jobArguments = "org.apache.giraph.examples.SimpleShortestPathsComputation",
                    "-ca", "mapred.job.tracker=headnodehost:9010",
                    "-vif", "org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat",
                    "-vip", "wasb:///example/data/tiny_graph.txt",
                    "-vof", "org.apache.giraph.io.formats.IdWithValueTextOutputFormat",
                    "-op", "wasb:///example/output/shortestpaths",
                    "-w", "2"

    # Create the definition
    $jobDefinition = New-AzureHDInsightMapReduceJobDefinition -JarFile $jarFile -ClassName "org.apache.giraph.GiraphRunner" -Arguments $jobArguments

    # Run the job, write output to the Azure PowerShell window
    $job = Start-AzureHDInsightJob -Cluster $clusterName -JobDefinition $jobDefinition
    Write-Host "Wait for the job to complete ..." -ForegroundColor Green
    Wait-AzureHDInsightJob -Job $job
    Write-Host "STDERR"
    Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardError
    Write-Host "Display the standard output ..." -ForegroundColor Green
    Get-AzureHDInsightJobOutput -Cluster $clusterName -JobId $job.JobId -StandardOutput
In the above example, replace clustername with the name of your HDInsight cluster that has Giraph installed.
- View the results. Once the job has finished, the results are stored in two output files in the wasb:///example/output/shortestpaths folder. The files are named part-m-00001 and part-m-00002. Perform the following steps to download and view the output:
    $subscriptionName = "<SubscriptionName>"       # Azure subscription name
    $storageAccountName = "<StorageAccountName>"   # Azure Storage account name
    $containerName = "<ContainerName>"             # Blob storage container name

    # Select the current subscription
    Select-AzureSubscription $subscriptionName

    # Create the Storage account context object
    $storageAccountKey = Get-AzureStorageKey $storageAccountName | %{ $_.Primary }
    $storageContext = New-AzureStorageContext -StorageAccountName $storageAccountName -StorageAccountKey $storageAccountKey

    # Download the job output to the workstation
    Get-AzureStorageBlobContent -Container $containerName -Blob example/output/shortestpaths/part-m-00001 -Context $storageContext -Force
    Get-AzureStorageBlobContent -Container $containerName -Blob example/output/shortestpaths/part-m-00002 -Context $storageContext -Force
This will create the example/output/shortestpaths directory structure in the current directory on your workstation, and download the two output files to that location.
Use cat (an alias for the Get-Content cmdlet) to display the contents of the files:

    cat example/output/shortestpaths/part*
The output should appear similar to the following:
    0 1.0
    4 5.0
    2 2.0
    1 0.0
    3 1.0
The SimpleShortestPathsComputation example is hard-coded to start with object ID 1 and find the shortest path to other objects. The output should therefore be read as destination_id distance, where distance is the sum of the values (or weights) of the edges traveled between object ID 1 and the target ID.
You can verify the results by tracing the shortest paths between ID 1 and all other objects. Note that the shortest path between ID 1 and ID 4 is 5. This is the distance from ID 1 to ID 3 (1) plus the distance from ID 3 to ID 4 (4).
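The distances can also be checked locally. The following PowerShell sketch loosely mimics the Pregel model of supersteps and message passing on the sample graph: in each round, a vertex keeps the smallest distance it has been sent and, if that is an improvement, forwards updated distances to its neighbors. It is not the Giraph implementation, and all variable names are assumptions chosen for this illustration.

```powershell
# Outgoing edges per vertex, taken from tiny_graph.txt: each entry is @(dest_id, distance)
$edges = @{
    0 = @(@(1, 1), @(3, 3))
    1 = @(@(0, 1), @(2, 2), @(3, 1))
    2 = @(@(1, 2), @(4, 4))
    3 = @(@(0, 3), @(1, 1), @(4, 4))
    4 = @(@(3, 4), @(2, 4))
}
$sourceId = 1   # the start vertex that the example is hard-coded to use

# Every vertex starts infinitely far away; the source receives an initial message of 0
$distance = @{}
foreach ($id in $edges.Keys) { $distance[$id] = [double]::PositiveInfinity }
$inbox = @{}
$inbox[$sourceId] = @(0.0)

$superstep = 0
while ($inbox.Count -gt 0) {
    $outbox = @{}
    foreach ($id in $inbox.Keys) {
        # A vertex considers only the smallest distance it was sent
        $best = ($inbox[$id] | Measure-Object -Minimum).Minimum
        if ($best -lt $distance[$id]) {
            $distance[$id] = $best
            # Tell each neighbor how far away it is when reached through this vertex
            foreach ($edge in $edges[$id]) {
                $target = $edge[0]
                if (-not $outbox.ContainsKey($target)) { $outbox[$target] = @() }
                $outbox[$target] += $best + $edge[1]
            }
        }
    }
    # Messages sent in this superstep are delivered in the next one
    $inbox = $outbox
    $superstep++
}

# Print "destination_id distance" pairs, sorted by ID
foreach ($id in ($distance.Keys | Sort-Object)) { '{0} {1}' -f $id, $distance[$id] }
```

The loop stops once a superstep produces no messages, that is, when no vertex improved its distance. For the sample data, the computed distances for IDs 0 through 4 are 1, 0, 2, 1, and 5, which agrees with the job output above.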
To install Giraph by using Azure PowerShell, see Customize HDInsight clusters using Script Action. The sample there demonstrates how to install Apache Spark using Azure PowerShell. You need to customize the script to use https://hdiconfigactions.blob.core.windows.net/giraphconfigactionv01/giraph-installer-v01.ps1.
To install Giraph by using the .NET SDK, see Customize HDInsight clusters using Script Action. The sample there demonstrates how to install Spark using the .NET SDK. You need to customize the script to use https://hdiconfigactions.blob.core.windows.net/giraphconfigactionv01/giraph-installer-v01.ps1.
- Install and use Apache Spark on HDInsight clusters: Script Action sample about installing Spark.
- Install Apache Solr on HDInsight clusters: Script Action sample about installing Solr.