Update H2O1 docs
jessica0xdata committed Mar 19, 2015
1 parent 588007f commit dcd2658
Showing 3 changed files with 65 additions and 96 deletions.
55 changes: 28 additions & 27 deletions h2o-docs/source/tutorial/gbm.rst
@@ -31,7 +31,7 @@ Before modeling, parse data into H2O:

#. Click **Submit**. Parsing data into H2O generates a .hex key in the format "data name.hex". (A scripted equivalent is sketched below the screenshot.)

.. image:: GBMparse.png
:width: 100%
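
A rough scripted equivalent of the parse step, using the modern h2o-3 Python API rather than the H2O-1 web UI described in this tutorial (the file path below is a placeholder):

.. code-block:: python

   import h2o

   h2o.init()  # connect to (or start) a local H2O cluster
   arrhythmia = h2o.import_file("arrhythmia.data")  # placeholder path
   arrhythmia.describe()  # column types and summaries, akin to inspecting the parsed .hex key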

""""
@@ -41,44 +41,35 @@ Building a Model

#. Once data are parsed, a horizontal menu displays at the top
of the screen reading "Build model using ... ". Select
**Distributed GBM** here, or go to the drop-down **Model** menu and
select **Gradient Boosting Machine**.


#. In the "source" field, enter the .hex key for the Arrhythmia data set, if it is not already entered.


#. From the drop-down "response" list, select `C1`.

#. To run Gradient Boosted Classification, check the "classification" checkbox; to run Gradient Boosted Regression, uncheck it. GBM is set to classification by default. For this example, confirm that the **classification** checkbox is checked.

#. In the "Ignored Columns" section, select the subset of variables to
omit from the model. In this example, the only column to
omit is the index column, 0.




#. In the "validation" field, enter the .hex key associated with a holdout (testing)
data set to apply results to a new data set after the model is generated.

#. In the "ntrees" field, enter the number of trees to generate. For this example, enter `20`.

#. In the "max depth" field, specify the maximum number of edges between the top
node and the furthest node as a stopping criterion. For this example, enter `5`.

#. In the "min rows" field, specify the minimum number of observations (rows)
to include in any terminal node as a stopping criterion. For this example, use `25`.

#. In the "nbins" field, specify the number of bins to use for splitting data. For this example, use the default value of 20.
Split points are evaluated at the boundaries of each of these
bins. As the value for nbins increases, the algorithm more closely approximates
evaluating each individual observation as a split point. The trade-off
for this refinement is an increase in computational time.

#. In the "learn rate" field, specify a value to slow the convergence of the
algorithm to a solution and help prevent overfitting. This parameter is also referred to as shrinkage. For this example, enter `.3`.

#. To generate the model, click the **Submit** button. (A scripted equivalent is sketched below the request screenshot.)

.. image:: GBMrequest.png
:width: 70%
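
For readers scripting instead of using the browser, the request above maps roughly onto the h2o-3 Python API as sketched below. The column name `C1` and the data path are assumptions; the parameter names (`ntrees`, `max_depth`, `min_rows`, `nbins`, `learn_rate`) mirror the fields described in the steps above.

.. code-block:: python

   import h2o
   from h2o.estimators.gbm import H2OGradientBoostingEstimator

   h2o.init()
   arrhythmia = h2o.import_file("arrhythmia.data")   # placeholder path
   arrhythmia["C1"] = arrhythmia["C1"].asfactor()    # factor response => classification

   gbm = H2OGradientBoostingEstimator(
       ntrees=20,       # number of trees to generate
       max_depth=5,     # max edges from the top node to the furthest node
       min_rows=25,     # min observations in any terminal node
       nbins=20,        # bins evaluated for split points
       learn_rate=0.3,  # shrinkage
   )
   gbm.train(y="C1", training_frame=arrhythmia)  # remaining columns are predictors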
@@ -93,16 +84,26 @@ GBM Results
The GBM output for classification displays a confusion matrix with the
classifications for each group, the associated error by group, and
the overall average error. Regression models can be quite complex and
difficult to directly interpret. For that reason, H2O provides a model key for subsequent use in validation and prediction.

Both model types provide the MSE by tree.

- For classification models, the MSE is based on the classification error within the tree.
- For regression models, MSE is calculated from the squared deviances, as with standard regressions.

.. image:: GBMresults.png
:width: 100%
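
Programmatically, the same training metrics can be pulled from the model object in h2o-3 (a sketch, assuming the `gbm` estimator from the earlier example):

.. code-block:: python

   perf = gbm.model_performance()   # metrics on the training data
   print(perf.mse())                # MSE, as reported per tree in the UI
   print(perf.confusion_matrix())   # classification models only
   gbm.scoring_history()            # per-tree metrics over the course of training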

The output also includes a table containing the tree statistics (min, max, and mean for tree depth and number of leaves), as well as a table and chart representing the variable importances.

.. image:: GBMVarImp.png
:width: 100%
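
In h2o-3, the same importances are available from the model object (a sketch, assuming the `gbm` estimator from the earlier example):

.. code-block:: python

   print(gbm.varimp())  # relative, scaled, and percentage importance per variable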



To change the order in which the variable importances are displayed (from most important to least or vice versa), click the **Sort** button below the chart.

.. image:: GBMsort.png
:width: 100%

""""

47 changes: 17 additions & 30 deletions h2o-docs/source/tutorial/glm.rst
@@ -67,40 +67,39 @@ Before modeling, parse data into H2O:
Building a Model
""""""""""""""""

#. Once data are parsed, a horizontal menu displays at the top
of the screen reading "Build model using ... ". Click the **Generalized Linear Modeling** link here, or click the drop-down **Model** menu and select **Generalized Linear Model**.


#. In the **Source** field, enter the .hex key for the Abalone data set, if it is not already entered.


#. From the drop-down **Response** list, select the column associated with the Whole Weight variable (`C5`).


#. In the **Ignored Columns** field, select the columns to ignore. For this example, select `C6`, `C7`, and `C8`.

#. From the drop-down **Family** list, select *Gaussian*.


#. Confirm the value for **Tweedie Variance Power** is zero. This option is only used for the Tweedie family of GLM models (like zero-inflated Poisson).

#. To disable cross-validation, enter 0 in the **Nfolds** field. If the Nfolds value is greater than 0, the GLM model displays the specified number of cross-validation models.


#. In the **Alpha** field, enter .3. The alpha parameter is the mixing
parameter for the L1 and L2 penalty.

#. In the **Lambda** field, enter .002.


#. Confirm the **Standardize** option is not checked (disabled).


#. Do not change the default **Max Iter** value. "Max Iter"
defines the maximum number of iterations performed by the algorithm in the event that it fails to converge. (The **Classification** checkbox is used
when the dependent variable is a binomial classifier; leave it unchanged as well.)

#. Click **Submit**. (A scripted equivalent is sketched below.)
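
A rough h2o-3 Python equivalent of this GLM request (the data path is a placeholder; the column names assume a headerless Abalone file, and `lambda` is spelled `lambda_` because it is a Python keyword):

.. code-block:: python

   import h2o
   from h2o.estimators.glm import H2OGeneralizedLinearEstimator

   h2o.init()
   abalone = h2o.import_file("abalone.data")   # placeholder path

   glm = H2OGeneralizedLinearEstimator(
       family="gaussian",
       alpha=0.3,          # mixing parameter for the L1 and L2 penalty
       lambda_=0.002,      # regularization strength
       nfolds=0,           # disable cross-validation
       standardize=False,  # the Standardize option is unchecked above
   )
   glm.train(y="C5",                            # Whole Weight
             x=["C1", "C2", "C3", "C4", "C9"],  # C6-C8 are ignored
             training_frame=abalone)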

@@ -138,18 +137,6 @@ If you select a different lambda value, the page refreshes and the selected lambda ...

#. In the "data" field, enter the .hex key of the test data set and click **Submit**. (See the sketch below for a scripted equivalent.)
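
In h2o-3, the same prediction step is a single call (a sketch, assuming the `glm` model from the earlier example and a placeholder path for the holdout set):

.. code-block:: python

   test = h2o.import_file("abalone_holdout.data")  # placeholder path
   predictions = glm.predict(test)
   predictions.head()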



Validation results report the same model statistics that were generated when the model was originally specified.
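
A sketch of retrieving those statistics on the holdout frame in h2o-3 (assuming the `glm` model and `test` frame from the earlier examples):

.. code-block:: python

   perf = glm.model_performance(test_data=test)
   print(perf.mse())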

59 changes: 20 additions & 39 deletions h2o-docs/source/tutorial/kmeans.rst
@@ -28,7 +28,7 @@ Getting Started
This tutorial uses a publicly available data set that can be found at http://archive.ics.uci.edu/ml/datasets/seeds


The data are composed of 210 observations, 7 attributes, and an a priori
grouping assignment. All data are positively valued and
continuous. Before modeling, parse data into H2O:

@@ -58,25 +58,20 @@ Building a Model

#. In the "source" field, enter the .hex key associated with the
data set.


#. Select an option from the "Initialization" drop-down list. For this example, select **PlusPlus**.

- Plus Plus initialization chooses one initial center at random and weights the random selection of subsequent centers so that points furthest from the first center are more likely to be chosen.
- Furthest initialization chooses one initial center at random, and then chooses the next center to be the point furthest away in terms of Euclidean distance.
- The default ("None") results in K initial centers being chosen independently at random.

#. Specify a value for "k." For this dataset, use 3.

#. Enter a "Max Iter" (short for maximum iterations) value to specify the maximum number of iterations the algorithm processes.

#. Select the columns of attributes that should be used
in defining the clusters in the "Cols" section. In this example, all columns except column 7 (the a priori known clusters for this particular set) are selected.

#. Check the "normalize" checkbox to normalize data, but this is not required for this example.

#. Click **Submit**. (A scripted equivalent is sketched below.)
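
A rough h2o-3 Python equivalent of this K-Means request (the file path is a placeholder; `C1`-`C7` are assumed names for the seven attributes and `C8` for the a priori grouping, matching the column selection above):

.. code-block:: python

   import h2o
   from h2o.estimators.kmeans import H2OKMeansEstimator

   h2o.init()
   seeds = h2o.import_file("seeds_dataset.txt")   # placeholder path

   km = H2OKMeansEstimator(
       k=3,
       init="PlusPlus",     # k-means++-style seeding, as selected above
       max_iterations=100,  # assumed value for "Max Iter"
   )
   km.train(x=["C1", "C2", "C3", "C4", "C5", "C6", "C7"],  # omit the known grouping
            training_frame=seeds)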

@@ -88,9 +83,15 @@ Building a Model
K-Means Output
""""""""""""""

The output is a series of tables that contain:

- the cluster centers (in terms of the originally selected attributes)
- cluster sizes
- cluster variances
- overall totals (the total within the cluster sum of squares)

To view the cluster assignments by observation, click the **View the row-by-row cluster assignments** link.
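
In h2o-3, the same tables are available from the model object (a sketch, assuming the `km` model from the earlier example):

.. code-block:: python

   print(km.centers())       # cluster centers in the original attribute units
   print(km.size())          # rows assigned to each cluster
   print(km.withinss())      # within-cluster sums of squares
   print(km.tot_withinss())  # overall total within-cluster sum of squares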

K-Means randomly chooses starting points and converges on
optimal centroids. The cluster number is arbitrary and should
be thought of as a factor.
@@ -100,35 +101,15 @@ be thought of as a factor.

""""""

K-means Predictions
"""""""""""""""""""

To make a prediction based on the model, click the **Model Parameters** button. Copy the `destination_key` for the model, then click the drop-down **Score** menu. Select **Predict**, then paste the copied `destination_key` in the **model** field. Type the name of the seeds data set in the **data** field. When you begin typing, the fields auto-complete; press the arrow keys to select the correct entry and press Enter to confirm. To generate the prediction, click the **Submit** button.

The prediction results display in a two-column table. The first column shows the number of rows assigned to each cluster and the second column contains the squared error per cluster.

.. image:: KMPredict.png
:width: 90%
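
The same prediction in h2o-3 is a single call (a sketch, assuming the `km` model and `seeds` frame from the earlier example):

.. code-block:: python

   assignments = km.predict(seeds)  # one cluster label per observation
   assignments.head()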

""""""

