
Commit b37e22c

fix merge issue
2 parents ef5041d + 8eb2815 commit b37e22c

14 files changed

+9
-97
lines changed

articles/hdinsight-connect-excel-hive-ODBC-driver.md

-6
Original file line numberDiff line numberDiff line change
@@ -19,12 +19,6 @@ Before you begin this article, you must have the following:
1919
- A computer that is running Windows 8, Windows 7, Windows Server 2012, or Windows Server 2008 R2.
2020
- Office 2013 Professional Plus, Office 365 Pro Plus, Excel 2013 Standalone, or Office 2010 Professional Plus.
2121

22-
##In this article
23-
24-
1. [Install the Microsoft Hive ODBC Driver](#InstallHiveODBCDriver)
25-
2. [Create a Hive ODBC Data Source](#CreateHiveODBCDataSource)
26-
3. [Import data into Excel from an HDInsight cluster](#ImportData)
27-
4. [Next steps](#nextsteps)
2822

2923
##<a id="InstallHiveODBCDriver"></a>Install the Microsoft Hive ODBC Driver
3024

articles/hdinsight-connect-excel-power-query.md

-8
Original file line numberDiff line numberDiff line change
@@ -17,8 +17,6 @@
1717
ms.author="bradsev"/>
1818

1919

20-
21-
2220
#Connect Excel to Hadoop with Power Query
2321

2422
One key feature of Microsoft's big data solution is the integration of Microsoft Business Intelligence (BI) components with Hadoop clusters in HDInsight. A primary example of this integration is the ability to connect Excel to the Azure storage account containing the data associated with your Hadoop cluster by using Microsoft Power Query for Excel. This article walks you through how to set up and use Power Query from Excel to query data associated with a Hadoop cluster managed with HDInsight.
@@ -31,12 +29,6 @@ Before you begin this article, you must have the following:
3129
- A computer that is running Windows 7, Windows Server 2008 R2, or above.
3230
- Office 2013 Professional Plus, Office 365 Pro Plus, Excel 2013 Standalone, or Office 2010 Professional Plus.
3331

34-
## In this article
35-
36-
- [Install Microsoft Power Query for Excel](#InstallPowerQuery)
37-
- [Import data into Excel](#ImportData)
38-
- [Next steps](#NextSteps)
39-
4032

4133
## <a id="InstallPowerQuery"></a>Install Microsoft Power Query for Excel
4234

articles/hdinsight-dotnet-avro-serialization.md

-11
Original file line numberDiff line numberDiff line change
@@ -22,17 +22,6 @@
2222
##Overview
2323
This topic shows how to use the <a href="https://hadoopsdk.codeplex.com/wikipage?title=Avro%20Library" target="_blank">Microsoft Avro Library</a> to serialize objects and other data structures into streams in order to persist them to memory, a database or a file, and also how to deserialize them to recover the original objects.
2424

25-
## In this article
26-
27-
- [Apache Avro](#apacheAvro)
28-
- [The Hadoop scenario](#hadoopScenario)
29-
- [Serialization in the Microsoft Avro Library](#serializationMAL)
30-
- [Microsoft Avro Library prerequisites](#prerequisites)
31-
- [Microsoft Avro Library installation](#installation)
32-
- [Microsoft Avro Library source code](#sourceCode)
33-
- [Compiling the Schema with the Microsoft Avro Library](#compiling)
34-
- [Guide to the samples for the Microsoft Avro Library](#samples)
35-
3625

3726
##<a name="apacheAvro"></a>Apache Avro
3827
The <a href="https://hadoopsdk.codeplex.com/wikipage?title=Avro%20Library" target="_blank">Microsoft Avro Library</a> implements the Apache Avro data serialization system for the Microsoft .NET environment. Apache Avro provides a compact binary data interchange format for serialization. It uses <a href="http://www.json.org" target="_blank">JSON</a> to define a language-agnostic schema that underpins language interoperability: data serialized in one language can be read in another. Currently C, C++, C#, Java, PHP, Python, and Ruby are supported. Detailed information on the format can be found in the <a href="http://avro.apache.org/docs/current/spec.html" target="_blank">Apache Avro Specification</a>. Note that the current version of the Microsoft Avro Library does not support the Remote Procedure Calls (RPC) part of this specification.

articles/hdinsight-hadoop-access-yarn-app-logs.md

-6
Original file line numberDiff line numberDiff line change
@@ -34,12 +34,6 @@ To install the HDInsight SDK from a Visual Studio application, go the **Tools**
3434

3535
This command adds .NET libraries for HDInsight and references to them to the current Visual Studio project.
3636

37-
## In this article
38-
39-
- [YARN Timeline Server](#YARNTimelineServer)
40-
- [YARN Applications and Logs](#YARNAppsAndLogs)
41-
- [Enumerating Applications and Downloading Logs Programmatically](#enumerate-and-download)
42-
4337

4438
## <a name="YARNTimelineServer"></a>YARN Timeline Server
4539

articles/hdinsight-hadoop-collect-debug-heap-dumps.md

-7
Original file line numberDiff line numberDiff line change
@@ -22,13 +22,6 @@ Heap dumps can be automatically collected for Hadoop services and placed inside
2222

2323
The collection of heap dumps for various services must be enabled for services on individual clusters. The default for this feature is to be off for a cluster. These heap dumps can be large in size so it is advisable to monitor the blob storage account where they are being saved once the collection has been enabled.
2424

25-
## In this article
26-
27-
- [For which services can heap dumps be enabled?](#whichServices)
28-
- [The configuration elements that enable heap dumps](#configuration)
29-
- [How to enable heap dumps with Azure HDInsight PowerShell](#powershell)
30-
- [How to enable heap dumps with HDInsight .NET SDK](#sdk)
31-
3225

3326
## <a name="whichServices"></a>For which services can heap dumps be enabled?
3427

articles/hdinsight-hadoop-r-scripts.md

-9
Original file line numberDiff line numberDiff line change
@@ -23,15 +23,6 @@ You can install R on any type of cluster in Hadoop on HDInsight using **Script A
2323
Script action lets you run scripts to customize a cluster, only when the cluster is being created. For more information, see [Customize HDInsight cluster using script action][hdinsight-cluster-customize].
2424

2525

26-
## In this article
27-
28-
- [What is R?](#whatIs)
29-
- [How do I install R?](#install)
30-
- [How do I run R scripts in HDInsight](#useR)
31-
- [Install R on HDInsight Hadoop clusters using PowerShell](#usingPS)
32-
- [Install R on HDInsight Hadoop clusters using the .NET SDK](#usingSDK)
33-
- [See also](#seeAlso)
34-
3526

3627
## <a name="whatIs"></a>What is R?
3728

articles/hdinsight-hadoop-script-actions.md

-11
Original file line numberDiff line numberDiff line change
@@ -23,17 +23,6 @@ Script Actions provide Azure HDInsight functionality that is used to install add
2323
Script Action can be deployed from Azure PowerShell or by using the HDInsight .NET SDK. For more information, see [Customize HDInsight cluster using Script Actions][hdinsight-cluster-customize].
2424

2525

26-
## In this article
27-
28-
- [Best practices for script development](#bestPracticeScripting)
29-
- [Helper methods for custom scripts](#helpermethods)
30-
- [Checklist for deploying a Script Action](#deployScript)
31-
- [How to run a Script Action](#runScriptAction)
32-
- [Custom script samples](#sampleScripts)
33-
- [How to test your custom script with the HDInsight Emulator](#testScript)
34-
- [How to debug your custom script](#debugScript)
35-
- [See also](#seeAlso)
36-
3726

3827
## <a name="bestPracticeScripting"></a>Best practices for script development
3928

articles/hdinsight-sample-10gb-graysort.md

-7
Original file line numberDiff line numberDiff line change
@@ -44,13 +44,6 @@ The input and output format, used by all three applications, read and write the
4444

4545
- You must have installed Azure PowerShell, and have configured it for use with your account. For instructions on how to do this, see [Install and configure Azure PowerShell][powershell-install-configure].
4646

47-
##In this article
48-
This topic shows you how to run the series of MapReduce programs that make up the Sample, presents the Java code for the MapReduce program, summarizes what you have learned, and outlines some next steps. It has the following sections.
49-
50-
1. [Run the sample with Azure PowerShell](#run-sample)
51-
2. [The Java code for the TeraSort MapReduce program](#java-code)
52-
3. [Summary](#summary)
53-
4. [Next steps](#next-steps)
5447

5548
<h2><a id="run-sample"></a>Run the sample with Azure PowerShell</h2>
5649

articles/hdinsight-sample-csharp-streaming.md

-8
Original file line numberDiff line numberDiff line change
@@ -43,15 +43,7 @@ For more information on the Hadoop streaming interface, see [Hadoop Streaming][h
4343
- You must have provisioned an HDInsight cluster. For instructions on the various ways in which such clusters can be created, see [Provision HDInsight Clusters](../hdinsight-provision-clusters/)
4444

4545
- You must have installed Azure PowerShell, and have configured it for use with your account. For instructions on how to do this, see [Install and configure Azure PowerShell][powershell-install-configure].
46-
47-
48-
##In this article
49-
This topic shows you how to run the sample, presents the Java code for the MapReduce program, summarizes what you have learned, and outlines some next steps. It has the following sections.
5046

51-
1. [Run the sample with Azure PowerShell](#run-sample)
52-
2. [The C# code for Hadoop Streaming](#java-code)
53-
3. [Summary](#summary)
54-
4. [Next steps](#next-steps)
5547

5648
<h2><a id="run-sample"></a>Run the sample with Azure PowerShell</h2>
5749

articles/hdinsight-sample-pi-estimator.md

+2-9
Original file line numberDiff line numberDiff line change
@@ -42,18 +42,11 @@ The other samples that are available to help you get up to speed in using HDInsi
4242
- You must have provisioned an HDInsight cluster. For instructions on the various ways in which such clusters can be created, see [Provision HDInsight Clusters](../hdinsight-provision-clusters/).
4343

4444
- You must have installed Azure PowerShell, and have configured it for use with your account. For instructions on how to do this, see [Install and configure Azure PowerShell][powershell-install-configure].
45-
46-
##In this article
47-
This topic shows you how to run the sample, presents the Java code for the pi estimator MapReduce program, summarizes what you have learned, and outlines some next steps. It has the following sections:
48-
49-
1. [Run the sample with Azure PowerShell](#run-sample)
50-
2. [The Java code for the pi estimator MapReduce program](#java-code)
51-
3. [Summary](#summary)
52-
4. [Next steps](#next-steps)
45+
5346

5447
<h2><a id="run-sample"></a>Run the sample with Azure PowerShell</h2>
5548

56-
**To submit the MapReduce job**
49+
**To submit the MapReduce job**s
5750

5851
1. Open Azure PowerShell. For instructions on how to use the Azure PowerShell console window, see [Install and configure Azure PowerShell][powershell-install-configure].
5952
2. Set the two variables in the following commands, and then run them:

articles/hdinsight-sample-wordcount.md

-7
Original file line numberDiff line numberDiff line change
@@ -35,13 +35,6 @@ This tutorial shows you how to run a MapReduce word count example on an Hadoop c
3535

3636
- You must have installed Azure PowerShell, and have configured it for use with your account. For instructions on how to do this, see [Install and configure Azure PowerShell][powershell-install-configure].
3737

38-
##In this article
39-
This topic shows you how to run the sample, presents the Java code for the MapReduce program, summarizes what you have learned, and outlines some next steps. It has the following sections.
40-
41-
1. [Run the sample using Azure PowerShell](#run-sample)
42-
2. [The Java code for the WordCount MapReduce program](#java-code)
43-
3. [Summary](#summary)
44-
4. [Next steps](#next-steps)
4538

4639
<h2><a id="run-sample"></a>Run the sample using Azure PowerShell</h2>
4740

articles/machine-learning-consume-web-services.md

+1-1
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
<properties
22
pageTitle="How to consume a Machine Learning web service that has been published from a Machine Learning experiment | Azure"
3-
description="required"
3+
description="Once a machine learning service is published, the RESTful web service that is made available can be consumed either as a request-response service or as a batch execution service."
44
services="machine-learning"
55
solutions="big-data"
66
documentationCenter=""

articles/machine-learning-feature-selection-and-engineering.md

+6-7
Original file line numberDiff line numberDiff line change
@@ -21,7 +21,7 @@
2121

2222
This topic explains the purposes of feature engineering and feature selection in the data enhancement process of machine learning. It illustrates what these processes involve using examples provided by Azure Machine Learning Studio.
2323

24-
The training of models used in machine learning can often be enhanced by the selection or extraction of features from the raw data collected. An example of an engineered feature in the context of learning how to classify the images of handwritten characters is a bit density map constructed from the raw bit distribution data. This map can help locate the edges of the characters more efficiently than the raw distribution.
24+
The training data used in machine learning can often be enhanced by the selection or extraction of features from the raw data collected. An example of an engineered feature in the context of learning how to classify the images of handwritten characters is a bit density map constructed from the raw bit distribution data. This map can help locate the edges of the characters more efficiently than the raw distribution.
2525

2626
Engineered and selected features increase the efficiency of the training process which attempts to extract the key information contained in the data. They also improve the power of these models to classify the input data accurately and to predict outcomes of interest more robustly. Feature engineering and selection can also combine to make the learning more computationally tractable. It does so by enhancing and then reducing the number of features needed to calibrate or train a model. Mathematically speaking, the features selected to train the model are a minimal set of independent variables that explain the patterns in the data and then predict outcomes successfully.
2727

@@ -65,7 +65,7 @@ With the goal of constructing effective features in the training data, four regr
6565

6666
Besides feature set A, which already exists in the original raw data, the other three sets of features are created through the feature engineering process. Feature set B captures very recent demand for the bikes. Feature set C captures the demand for bikes at a particular hour. Feature set D captures demand for bikes at a particular hour on a particular day of the week. The four training datasets include feature sets A, A+B, A+B+C, and A+B+C+D, respectively.
6767

68-
In the Azure Machine Learning experiment, these four training datasets are formed via four branches from the pre-processed input dataset. Except for the leftmost branch, each of these branches contains an "Execute R Script" module, in which a set of derived features (feature set B, C, and D) are respectively constructed and appended to the imported dataset. The following figure demonstrates the R script used to create feature set B in the second branch from the left.
68+
In the Azure Machine Learning experiment, these four training datasets are formed via four branches from the pre-processed input dataset. Except for the leftmost branch, each of these branches contains an "Execute R Script" module, in which a set of derived features (feature set B, C, and D) are respectively constructed and appended to the imported dataset. The following figure demonstrates the R script used to create feature set B in the second left branch.
6969

7070
![create features](./media/machine-learning-feature-selection-and-engineering/addFeature-Rscripts.png)
7171

@@ -79,7 +79,7 @@ Feature engineering is widely applied in tasks related to text mining, such as d
7979

8080
To achieve this task, a technique called **feature hashing** is applied to efficiently turn arbitrary text features into indices. Instead of associating each text feature (words/phrases) to a particular index, this method functions by applying a hash function to the features and using their hash values as indices directly.
8181

82-
In Azure Machine Learning, there is a [Feature Hashing](http://msdn.microsoft.com/library/azure/c9a82660-2d9c-411d-8122-4d9e0b3ce92a) module that creates these word/phrase features conveniently. The following figure shows an example of using this module. The input dataset contains two columns: the book rating ranging from 1 to 5, and the actually review content. The goal of this "Feature Hashing" module is to retrieve a set of new features that show the occurrence frequency of the corresponding word(s)/phrase(s) within the particular book review. To use this module, we need to complete the following steps:
82+
In Azure Machine Learning, there is a [Feature Hashing](http://msdn.microsoft.com/library/azure/c9a82660-2d9c-411d-8122-4d9e0b3ce92a) module that creates these word/phrase features conveniently. The following figure shows an example of using this module. The input dataset contains two columns: the book rating ranging from 1 to 5, and the actual review content. The goal of this "Feature Hashing" module is to retrieve a set of new features that show the occurrence frequency of the corresponding word(s)/phrase(s) within the particular book review. To use this module, we need to complete the following steps:
8383

8484
* First, select the column that contains the input text ("Col2" in this example).
8585
* Second, set the "Hashing bitsize" to 8, which means 2^8=256 features will be created. The words/phrases in all the text will be hashed to 256 indices. The parameter "Hashing bitsize" ranges from 1 to 31. The larger the value, the less likely it is that two different word(s)/phrase(s) are hashed into the same index.
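
The hashing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by the "Feature Hashing" module: the function name `hash_features`, the word tokenizer, and the use of MD5 as the hash function are all assumptions made for the example. It shows only the core idea that each word is hashed and its hash value, reduced to `2**bitsize` buckets, is used directly as the feature index.

```python
import hashlib
import re


def hash_features(text, bitsize=8):
    """Count word occurrences in 2**bitsize hashed buckets (hypothetical helper)."""
    n_buckets = 2 ** bitsize
    counts = [0] * n_buckets
    for word in re.findall(r"[a-z']+", text.lower()):
        # MD5 is only a stand-in for a stable hash; the module's own
        # hash function is not specified in this article.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % n_buckets
        counts[index] += 1
    return counts


review = "A great book, a truly great read"
features = hash_features(review, bitsize=8)
print(len(features))  # one column per hash bucket
```

With a bitsize of 8 the output always has 256 columns regardless of vocabulary size. Two different words can land in the same bucket, which is why a larger bitsize reduces collisions at the cost of more columns.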
@@ -98,16 +98,15 @@ Feature selection is a process that is commonly applied for the construction of
9898
* First, feature selection often increases classification accuracy by eliminating irrelevant, redundant, or highly correlated features.
9999
* Second, it decreases the number of features, which makes the model training process more efficient. This is particularly important for learners that are expensive to train, such as support vector machines.
100100

101-
Although feature selection does seek to reduce the number of features in the dataset used to train the model, it is not usually referred to by the term "dimensionality reduction". Feature selection methods extract a subset of original features in the data without changing them. Dimensionality reduction methods employ engineered features that can transform the original features and thus modify them.
102-
103-
Examples of dimensionality reduction methods include Principal Component Analysis, canonical correlation analysis, and Singular Value Decomposition.
101+
Although feature selection does seek to reduce the number of features in the dataset used to train the model, it is not usually referred to by the term "dimensionality reduction". Feature selection methods extract a subset of original features in the data without changing them. Dimensionality reduction methods employ engineered features that can transform the original features and thus modify them. Examples of dimensionality reduction methods include Principal Component Analysis, canonical correlation analysis, and Singular Value Decomposition.
104102

105103
Among others, one widely applied category of feature selection methods in a supervised context is called "filter based feature selection". By evaluating the correlation between each feature and the target attribute, these methods apply a statistical measure to assign a score to each feature. The features are then ranked by the score, which may be used to help set the threshold for keeping or eliminating a specific feature. Examples of the statistical measures used in these methods include Pearson correlation, mutual information, and the chi-squared test.
106104

107-
In Azure Machine Learning Studio, there are modules provided for feature selection. As shown in the following figure, these modules include "Filter Based Feature Selection", "Fisher Linear Discriminant Analysis", and "Linear Discriminant Analysis".
105+
In Azure Machine Learning Studio, there are modules provided for feature selection. As shown in the following figure, these modules include "Filter Based Feature Selection" and "Fisher Linear Discriminant Analysis".
108106

109107
![Feature selection example](./media/machine-learning-feature-selection-and-engineering/feature-Selection.png)
110108

109+
111110
Consider, for example, the use of the [Filter Based Feature Selection](http://help.azureml.net/Content/html/818b356b-045c-412b-aa12-94a1d2dad90f.htm) module. For convenience, we continue to use the text mining example outlined above. Assume that we want to build a regression model after a set of 256 features has been created through the "Feature Hashing" module, and that the response variable is "Col1", which represents book review ratings ranging from 1 to 5. If we set "Feature scoring method" to "Pearson Correlation", "Target column" to "Col1", and "Number of desired features" to 50, the "Filter Based Feature Selection" module will produce a dataset containing 50 features together with the target attribute "Col1". The following figure shows the flow of this experiment and the input parameters we just described.
112111

113112
![Feature selection example](./media/machine-learning-feature-selection-and-engineering/feature-Selection1.png)
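
The filter-based approach can be illustrated with a small Python sketch. It is not the code behind the "Filter Based Feature Selection" module: the helper names `pearson` and `select_top_features`, the toy data, and the choice to rank by the absolute value of the correlation are assumptions made for the example. Each feature column is scored against the target, the columns are ranked by score, and the top `k` are kept.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_top_features(columns, target, k):
    """Rank columns by |correlation with target| and keep the best k (assumed rule)."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Toy data: five reviews, three hashed feature columns, ratings as the target.
ratings = [1, 2, 3, 4, 5]
columns = {
    "f0": [1, 2, 3, 4, 5],  # rises with the rating
    "f1": [2, 2, 2, 2, 2],  # constant, uninformative
    "f2": [5, 3, 4, 1, 2],  # falls as the rating rises
}
selected = select_top_features(columns, ratings, k=2)
print(selected)
```

In the experiment described above, the same idea would be applied to the 256 hashed columns with `k` set to 50 and "Col1" as the target.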