Update H2O1 docs
jessica0xdata committed Mar 19, 2015
1 parent 588007f commit dcd2658
Showing 3 changed files with 65 additions and 96 deletions.
55 changes: 28 additions & 27 deletions h2o-docs/source/tutorial/gbm.rst
@@ -31,7 +31,7 @@ Before modeling, parse data into H2O:

#. Click **Submit**. Parsing data into H2O generates a .hex key in the format "data name.hex". (A scripted equivalent is sketched below the screenshot.)

.. image:: GBMparse.png
:width: 100%
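
A rough scripted equivalent of the parse step, using the modern h2o-3 Python API rather than the H2O-1 web UI described in this tutorial (the file path below is a placeholder):

.. code-block:: python

   import h2o

   h2o.init()  # connect to (or start) a local H2O cluster
   arrhythmia = h2o.import_file("arrhythmia.data")  # placeholder path
   arrhythmia.describe()  # column types and summaries, akin to inspecting the parsed .hex key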

""""
@@ -41,44 +41,35 @@ Building a Model

#. Once data are parsed, a horizontal menu displays at the top
of the screen reading "Build model using ... ". Select
**Distributed GBM** here, or go to the drop-down **Model** menu and
select **Gradient Boosting Machine**.


#. In the "source" field, enter the .hex key for the Arrhythmia data set, if it is not already entered.


#. From the drop-down "response" list, select `C1`.

#. To run Gradient Boosted Classification, check the "classification" checkbox; to run Gradient Boosted Regression, uncheck it. GBM is set to classification by default. For this example, confirm that the **classification** checkbox is checked.

#. In the "Ignored Columns" section, select the subset of variables to
omit from the model. In this example, the only column to
omit is the index column, 0.




#. In the "validation" field, enter the .hex key associated with a holdout (testing)
data set to apply results to a new data set after the model is generated.

#. In the "ntrees" field, enter the number of trees to generate. For this example, enter `20`.

#. In the "max depth" field, specify the maximum number of edges between the top
node and the furthest node as a stopping criterion. For this example, enter `5`.

#. In the "min rows" field, specify the minimum number of observations (rows)
to include in any terminal node as a stopping criterion. For this example, use `25`.

#. In the "nbins" field, specify the number of bins to use for splitting data. For this example, use the default value of 20.
Split points are evaluated at the boundaries of each of these
bins. As the value for nbins increases, the algorithm more closely approximates
evaluating each individual observation as a split point. The trade-off
for this refinement is an increase in computational time.

#. In the "learn rate" field, specify a value to slow the convergence of the
algorithm to a solution and help prevent overfitting. This parameter is also referred to as shrinkage. For this example, enter `.3`.

#. To generate the model, click the **Submit** button. (A scripted equivalent is sketched below the request screenshot.)

.. image:: GBMrequest.png
:width: 70%
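
For readers scripting instead of using the browser, the request above maps roughly onto the h2o-3 Python API as sketched below. The column name `C1` and the data path are assumptions; the parameter names (`ntrees`, `max_depth`, `min_rows`, `nbins`, `learn_rate`) mirror the fields described in the steps above.

.. code-block:: python

   import h2o
   from h2o.estimators.gbm import H2OGradientBoostingEstimator

   h2o.init()
   arrhythmia = h2o.import_file("arrhythmia.data")   # placeholder path
   arrhythmia["C1"] = arrhythmia["C1"].asfactor()    # factor response => classification

   gbm = H2OGradientBoostingEstimator(
       ntrees=20,       # number of trees to generate
       max_depth=5,     # max edges from the top node to the furthest node
       min_rows=25,     # min observations in any terminal node
       nbins=20,        # bins evaluated for split points
       learn_rate=0.3,  # shrinkage
   )
   gbm.train(y="C1", training_frame=arrhythmia)  # remaining columns are predictors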
@@ -93,16 +84,26 @@ GBM Results
The GBM output for classification displays a confusion matrix with the
classifications for each group, the associated error by group, and
the overall average error. Regression models can be quite complex and
difficult to directly interpret. For that reason, H2O provides a model key for subsequent use in validation and prediction.

Both model types provide the MSE by tree.

- For classification models, the MSE is based on the classification error within the tree.
- For regression models, MSE is calculated from the squared deviances, as with standard regressions.

.. image:: GBMresults.png
:width: 100%
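
Programmatically, the same training metrics can be pulled from the model object in h2o-3 (a sketch, assuming the `gbm` estimator from the earlier example):

.. code-block:: python

   perf = gbm.model_performance()   # metrics on the training data
   print(perf.mse())                # MSE, as reported per tree in the UI
   print(perf.confusion_matrix())   # classification models only
   gbm.scoring_history()            # per-tree metrics over the course of training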

The output also includes a table containing the tree statistics (min, max, and mean for tree depth and number of leaves), as well as a table and chart representing the variable importances.

.. image:: GBMVarImp.png
:width: 100%
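
In h2o-3, the same importances are available from the model object (a sketch, assuming the `gbm` estimator from the earlier example):

.. code-block:: python

   print(gbm.varimp())  # relative, scaled, and percentage importance per variable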



To change the order in which the variable importances are displayed (from most important to least or vice versa), click the **Sort** button below the chart.

.. image:: GBMsort.png
:width: 100%

""""

47 changes: 17 additions & 30 deletions h2o-docs/source/tutorial/glm.rst
@@ -67,40 +67,39 @@ Before modeling, parse data into H2O:
Building a Model
""""""""""""""""

#. Once data are parsed, a horizontal menu displays at the top
of the screen reading "Build model using ... ". Click the **Generalized Linear Modeling** link here, or click the drop-down **Model** menu and select **Generalized Linear Model**.


#. In the **Source** field, enter the .hex key for the Abalone data set, if it is not already entered.


#. From the drop-down **Response** list, select the column associated with the Whole Weight variable (`C5`).


#. In the **Ignored Columns** field, select the columns to ignore. For this example, select `C6`, `C7`, and `C8`.

#. From the drop-down **Family** list, select *Gaussian*.


#. Confirm the value for **Tweedie Variance Power** is zero. This option is only used for the Tweedie family of GLM models (like zero-inflated Poisson).

#. To disable cross-validation, enter 0 in the **Nfolds** field. If the Nfolds value is greater than 0, the GLM model displays the specified number of cross-validation models.


#. In the **Alpha** field, enter .3. The alpha parameter is the mixing
parameter for the L1 and L2 penalty.

#. In the **Lambda** field, enter .002.


#. Confirm the **Standardize** option is not checked (disabled).


#. Do not change the default **Max Iter** value. "Max Iter"
defines the maximum number of iterations performed by the algorithm in the event that it fails to converge. (The **Classification** checkbox is used
when the dependent variable is a binomial classifier; leave it unchanged as well.)

#. Click **Submit**. (A scripted equivalent is sketched below.)
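
A rough h2o-3 Python equivalent of this GLM request (the data path is a placeholder; the column names assume a headerless Abalone file, and `lambda` is spelled `lambda_` because it is a Python keyword):

.. code-block:: python

   import h2o
   from h2o.estimators.glm import H2OGeneralizedLinearEstimator

   h2o.init()
   abalone = h2o.import_file("abalone.data")   # placeholder path

   glm = H2OGeneralizedLinearEstimator(
       family="gaussian",
       alpha=0.3,          # mixing parameter for the L1 and L2 penalty
       lambda_=0.002,      # regularization strength
       nfolds=0,           # disable cross-validation
       standardize=False,  # the Standardize option is unchecked above
   )
   glm.train(y="C5",                            # Whole Weight
             x=["C1", "C2", "C3", "C4", "C9"],  # C6-C8 are ignored
             training_frame=abalone)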

@@ -138,18 +137,6 @@ If you select a different lambda value, the page refreshes and the selected lambda ...

#. In the "data" field, enter the .hex key of the test data set and click **Submit**. (See the sketch below for a scripted equivalent.)
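
In h2o-3, the same prediction step is a single call (a sketch, assuming the `glm` model from the earlier example and a placeholder path for the holdout set):

.. code-block:: python

   test = h2o.import_file("abalone_holdout.data")  # placeholder path
   predictions = glm.predict(test)
   predictions.head()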



Validation results report the same model statistics that were generated when the model was originally specified.
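
A sketch of retrieving those statistics on the holdout frame in h2o-3 (assuming the `glm` model and `test` frame from the earlier examples):

.. code-block:: python

   perf = glm.model_performance(test_data=test)
   print(perf.mse())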

59 changes: 20 additions & 39 deletions h2o-docs/source/tutorial/kmeans.rst
@@ -28,7 +28,7 @@ Getting Started
This tutorial uses a publicly available data set that can be found at http://archive.ics.uci.edu/ml/datasets/seeds


The data are composed of 210 observations, 7 attributes, and an a priori
grouping assignment. All data are positively valued and
continuous. Before modeling, parse data into H2O:

@@ -58,25 +58,20 @@ Building a Model

#. In the "source" field, enter the .hex key associated with the
data set.


#. Select an option from the "Initialization" drop-down list. For this example, select **PlusPlus**.

- Plus Plus initialization chooses one initial center at random and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen.
- Furthest initialization chooses one initial center at random, and then chooses the next center to be the point furthest away in terms of Euclidean distance.
- The default ("None") results in K initial centers being chosen independently at random.

#. Specify a value for "k." For this dataset, use 3.

#. Enter a "Max Iter" (short for maximum iterations) value to specify the maximum number of iterations the algorithm processes.

#. Select the columns of attributes that should be used
in defining the clusters in the "Cols" section. In this example, all columns except column 7 (the a priori known clusters for this particular set) are selected.

#. Check the "normalize" checkbox to normalize data, but this is not required for this example.

#. Click **Submit**. (A scripted equivalent is sketched below.)
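
A rough h2o-3 Python equivalent of this K-Means request (the file path is a placeholder; `C1`-`C7` are assumed names for the seven attributes and `C8` for the a priori grouping, matching the column selection above):

.. code-block:: python

   import h2o
   from h2o.estimators.kmeans import H2OKMeansEstimator

   h2o.init()
   seeds = h2o.import_file("seeds_dataset.txt")   # placeholder path

   km = H2OKMeansEstimator(
       k=3,
       init="PlusPlus",     # k-means++-style seeding, as selected above
       max_iterations=100,  # assumed value for "Max Iter"
   )
   km.train(x=["C1", "C2", "C3", "C4", "C5", "C6", "C7"],  # omit the known grouping
            training_frame=seeds)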

@@ -88,9 +83,15 @@ Building a Model
K-Means Output
""""""""""""""

The output is a series of tables that contain:

- the cluster centers (in terms of the originally selected attributes)
- cluster sizes
- cluster variances
- overall totals (the total within the cluster sum of squares)

To view the cluster assignments by observation, click the **View the row-by-row cluster assignments** link.
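
In h2o-3, the same tables are available from the model object (a sketch, assuming the `km` model from the earlier example):

.. code-block:: python

   print(km.centers())       # cluster centers in the original attribute units
   print(km.size())          # rows assigned to each cluster
   print(km.withinss())      # within-cluster sums of squares
   print(km.tot_withinss())  # overall total within-cluster sum of squares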

K-Means randomly chooses starting points and converges on
optimal centroids. The cluster number is arbitrary and should
be thought of as a factor.
@@ -100,35 +101,15 @@ be thought of as a factor.

""""""

K-means Predictions
"""""""""""""""""""

To make a prediction based on the model, click the **Model Parameters** button. Copy the `destination_key` for the model, then click the drop-down **Score** menu. Select **Predict**, then paste the copied `destination_key` in the **model** field. Type the name of the seeds data set in the **data** field. When you begin typing, the fields auto-complete; press the arrow keys to select the correct entry and press Enter to confirm. To generate the prediction, click the **Submit** button.

The prediction results display in a two-column table. The first column shows the number of rows assigned to each cluster and the second column contains the squared error per cluster.

.. image:: KMPredict.png
:width: 90%
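
The same prediction in h2o-3 is a single call (a sketch, assuming the `km` model and `seeds` frame from the earlier example):

.. code-block:: python

   assignments = km.predict(seeds)  # one cluster label per observation
   assignments.head()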

""""""

