Commit 0a20766

Update H2O1 Data Science docs

jessica0xdata committed Mar 20, 2015
1 parent 949cc81 commit 0a20766
Showing 7 changed files with 305 additions and 76 deletions.
62 changes: 57 additions & 5 deletions h2o-docs/source/datascience/deeplearning.rst
@@ -36,7 +36,7 @@ Defining a Deep Learning Model

**Source**

A .hex key associated with the parsed training data.

**Response**

@@ -65,6 +65,18 @@ Defining a Deep Learning Model
training data to use in model validation (i.e., production of
error rates on data not used in model building).

**N folds**

The number of folds for cross-validation (used if no validation data is specified). To disable cross-validation, enter 0. If the number of folds is greater than 1, the **Validation** field must remain empty.

**Holdout fraction**

The fraction of training data (taken from the end of the dataset) to hold out for validation (if no validation data is specified).

**Keep cross validation dataset splits**

Preserve cross-validation dataset splits.
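
As a concrete illustration of these validation options, here is a minimal sketch using the H2O-3 Python client (the file path and column layout are hypothetical, and argument names may differ in other H2O versions):

.. code-block:: python

   import h2o
   from h2o.estimators.deeplearning import H2ODeepLearningEstimator

   h2o.init()
   train = h2o.import_file("train.csv")  # hypothetical path

   # 5-fold cross-validation; leave the validation frame unset.
   model = H2ODeepLearningEstimator(nfolds=5)
   model.train(x=train.columns[:-1], y=train.columns[-1],
               training_frame=train)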

**Checkpoint**

A model key associated with a previously trained Deep Learning
@@ -84,6 +96,16 @@ Defining a Deep Learning Model
values is fine for many problems, but best results on complex
datasets are often only attainable via expert mode options.

**Autoencoder**

Enable deep autoencoders. For more information, refer to `the Deep Learning booklet <https://leanpub.com/deeplearning/read#Autoencoder>`_.


**Use all factor levels**

Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of accuracy). This option is useful for variable importances and is enabled automatically for autoencoders.
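
As a sketch of how the autoencoder option is typically used for anomaly detection (H2O-3 Python client again; the layer size is illustrative, and ``train`` is the frame imported in the earlier snippet):

.. code-block:: python

   from h2o.estimators.deeplearning import H2ODeepLearningEstimator

   # Unsupervised: the network learns to reconstruct its input.
   ae = H2ODeepLearningEstimator(autoencoder=True,
                                 activation="Tanh",
                                 hidden=[10])
   ae.train(x=train.columns, training_frame=train)

   # Per-row reconstruction error; large values flag anomalies.
   errors = ae.anomaly(train)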


**Activation**

The activation function (non-linearity) to be used by the neurons in the
@@ -252,13 +274,13 @@ Defining a Deep Learning Model
A fraction of the inputs for each hidden layer to omit from training
to improve generalization. The default is 0.5 for each hidden layer.

**L1 Regularization**
**L1**

A regularization method that constrains the absolute value of the weights and
has the net effect of dropping some weights (setting them to zero) from a model
to reduce complexity and avoid overfitting.
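
Schematically, this penalty and the L2 penalty described next enter the training objective as follows (a standard formulation, shown for orientation rather than quoted from H2O's source):

.. math::

   \mathcal{L}(W) = \mathcal{L}_0(W) + \lambda_1 \sum_i |w_i| + \lambda_2 \sum_i w_i^2

where :math:`\mathcal{L}_0` is the unregularized loss and :math:`\lambda_1` and :math:`\lambda_2` are the strengths set by these two options.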

**L2 Regularization**
**L2**

A regularization method that constrains the sum of the squared
weights. This method introduces bias into parameter estimates, but
@@ -286,7 +308,7 @@ Defining a Deep Learning Model
scale. For Normal, the values are drawn from a Normal distribution
with the standard deviation of the initial weight scale.

**Loss Function**
**Loss**

The loss (error) function to be optimized by the model.
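
For orientation, classification models typically minimize the cross-entropy loss, while regression models typically minimize the mean squared error (standard definitions):

.. math::

   \mathcal{L}_{\mathrm{CE}} = -\sum_{j} y_j \log \hat{y}_j, \qquad
   \mathcal{L}_{\mathrm{MSE}} = \frac{1}{2} \sum_{j} \left( y_j - \hat{y}_j \right)^2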

@@ -317,6 +339,7 @@ Defining a Deep Learning Model
training dataset.

**Score Duty Cycle**

The maximum fraction of time to spend on model scoring (on training and validation samples) and on diagnostics, such as computation of feature importances (i.e., time not spent on training).

@@ -366,7 +389,7 @@ Defining a Deep Learning Model
Gather diagnostics for hidden layers, such as mean and RMS values for learning
rate, momentum, weights and biases.

**Variable Importance**
**Variable Importances**

Compute variable importances for input features.
The implemented method (by Gedeon) considers the weights connecting the
@@ -405,6 +428,35 @@ Defining a Deep Learning Model
the data. It is automatically enabled if the number of training
samples per iteration is set to -1 (or to N times the dataset size or larger).


**Sparse**

Enable sparse data handling (experimental).


**Col major**

Use a column-major weight matrix for the input layer; this can speed up forward propagation, but may slow down backpropagation.


**Average activation**

Average activation for sparse autoencoder (experimental).

**Sparsity beta**

Sparsity regularization (experimental).

**Max categorical features**

The maximum number of categorical features, enforced via hashing (experimental).

**Reproducible**

Force reproducibility on small data (this will be slow; it uses only a single thread).
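
A hedged sketch of how several of these expert options map onto constructor arguments in the H2O-3 Python client (all values are illustrative, and the sparsity options apply only to autoencoders):

.. code-block:: python

   from h2o.estimators.deeplearning import H2ODeepLearningEstimator

   # Experimental/expert settings; the defaults are usually fine.
   expert = H2ODeepLearningEstimator(
       sparse=True,                    # sparse data handling
       col_major=False,                # column-major input weight matrix
       autoencoder=True,
       average_activation=-0.5,        # target mean activation (sparse AE)
       sparsity_beta=0.1,              # sparsity regularization strength
       max_categorical_features=1000,  # hash wide categoricals down
       reproducible=True,              # single-threaded: slow but repeatable
       seed=1234,
   )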



""""

Interpreting A Model
48 changes: 48 additions & 0 deletions h2o-docs/source/datascience/gbm.rst
@@ -41,6 +41,18 @@ Defining a GBM Model
A .hex key associated with data to be used in validation of the
model built using the data specified in **Source**.

**N folds**

The number of folds for cross-validation (if no validation data is specified).

**Holdout fraction**

The fraction of training data (taken from the end of the dataset) to hold out for validation (if no validation data is specified).

**Keep cross validation dataset splits**

Preserve cross-validation dataset splits.

**NTrees:**

The number of trees to build. To specify models with different total numbers
@@ -84,6 +96,32 @@ Defining a GBM Model
Return information about each variable's importance
in training the specified model.

**Balance classes**

For imbalanced data, balance training data class counts via over/under-sampling for improved predictive accuracy.

**Class sampling factors**

Specify the over/under-sampling ratios per class (lexicographic order).

**Max after balance size**

If classes are balanced, limit the resulting dataset size to the specified multiple of the original dataset size.
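
To make the class-balancing options concrete, a minimal sketch with the H2O-3 Python client (sampling factors and the size cap are illustrative; ``train`` is assumed to be a parsed frame with the response in the last column):

.. code-block:: python

   from h2o.estimators.gbm import H2OGradientBoostingEstimator

   gbm = H2OGradientBoostingEstimator(
       ntrees=50,
       balance_classes=True,
       class_sampling_factors=[1.0, 2.5],  # per class, lexicographic order
       max_after_balance_size=3.0,         # cap at 3x the original size
   )
   gbm.train(x=train.columns[:-1], y=train.columns[-1],
             training_frame=train)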

**Checkpoint**

A model key associated with a previously trained model. Use this option to build a new model as a continuation of a previously generated model (e.g., by a grid search).

**Overwrite checkpoint**

Overwrite the checkpoint.
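
A sketch of continuing a model from a checkpoint (H2O-3 Python client; the continuation must request a larger total tree count than the original):

.. code-block:: python

   from h2o.estimators.gbm import H2OGradientBoostingEstimator

   # Train 50 trees, then grow the same model out to 100.
   first = H2OGradientBoostingEstimator(ntrees=50)
   first.train(x=train.columns[:-1], y=train.columns[-1],
               training_frame=train)

   more = H2OGradientBoostingEstimator(ntrees=100,
                                       checkpoint=first.model_id)
   more.train(x=train.columns[:-1], y=train.columns[-1],
              training_frame=train)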


**Family**

Select the family (auto or bernoulli).


**Learn Rate:**

A number between 0 and 1 that specifies the rate at which the
@@ -96,6 +134,16 @@ Defining a GBM Model
such as specification of multiple learning rates, selecting this
option will build the set of models in parallel, rather than
sequentially.
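
For illustration, a grid over several learning rates using the H2O-3 Python client's grid-search entry point (hyperparameter values are illustrative):

.. code-block:: python

   from h2o.estimators.gbm import H2OGradientBoostingEstimator
   from h2o.grid.grid_search import H2OGridSearch

   grid = H2OGridSearch(
       H2OGradientBoostingEstimator(ntrees=50),
       hyper_params={"learn_rate": [0.1, 0.05, 0.01]},
   )
   grid.train(x=train.columns[:-1], y=train.columns[-1],
              training_frame=train)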

**Seed**

The random seed controls sampling and initialization. Reproducible results are only expected with single-threaded operation (i.e., when running on one node, turning off load balancing and providing a small dataset that fits in one chunk). In general, the multi-threaded asynchronous updates to the model parameters will result in (intentional) race conditions and non-reproducible results. Note that deterministic sampling and initialization might still lead to some weak sense of determinism in the model.

**Group split**

Perform group splitting on categorical variables.



""""

