diff --git a/docs/deeplearning/DeepLearningRVignette.tex b/docs/deeplearning/DeepLearningRVignette.tex
index b58813999e..7e2697fb60 100644
--- a/docs/deeplearning/DeepLearningRVignette.tex
+++ b/docs/deeplearning/DeepLearningRVignette.tex
@@ -10,6 +10,7 @@
\hypersetup{colorlinks, urlcolor={blue}}
\usepackage{graphicx}
\graphicspath{ {images/} }
+\usepackage{parskip}
\begin{document}
@@ -22,13 +23,37 @@
\textsc{\small{Arno Candel \hspace{40pt} Viraj Parmar}} \\
\bigskip
-\textsc{Feb 2015}
+\textsc{February 2015}
\end{center}
\tableofcontents
\newpage
+\section{What is H2O?}
+
+H2O is an open source analytics platform for data scientists and business analysts who need scalable and fast machine learning capabilities. Our product helps organizations like PayPal, ShareThis, and Cisco reduce model building, training, and scoring times from months to days. Our use cases range from predictive modeling and fraud detection to customer intelligence, in industries as diverse as insurance, SaaS, finance, ad tech, and recruiting.
+
+With its in-memory compression techniques, H2O can handle billions of data rows in-memory — even with a fairly small cluster. The platform includes interfaces for R, Python, Scala, Java, JSON and CoffeeScript/JavaScript, along with its built-in web interface that makes it easier for non-engineers to stitch together a complete analytic workflow. The platform was built alongside (and on top of) both Hadoop and Spark clusters.
+
+H2O implements almost all common machine learning algorithms — such as generalized linear modeling (linear regression, logistic regression, etc.), Naive Bayes, principal components analysis, time series, k-means clustering and others. H2O also implements best-in-class algorithms such as Random Forest, Gradient Boosting and Deep Learning at scale. Customers can build thousands of models and compare them to get the best prediction results.
+
+H2O was built by a passionate team of computer scientists, systems engineers and data scientists, from the ground up, for the data science community. We’re driven by curiosity, a desire to learn, and a strong drive to tackle the scalability challenges of real-world data analysis. Our team members have come to H2O from organizations as diverse as Marketo, Oracle, Azul, Teradata, and SAS. Our advisory board comes from Stanford University’s engineering, statistics, and health research departments.
+
+We host meetups, run experiments, and spend our days learning alongside our customers.
+
+
+\textbf{Try it out}
+
+H2O offers an R package that can be installed from CRAN. H2O can be downloaded from \url{www.h2o.ai/download}.
+
+\textbf{Join the community}
+
+Connect with \url{h2ostream@googlegroups.com} and \url{https://github.com/h2oai} to learn about our meetups, training sessions, hackathons, and product updates.
+
+\textbf{Learn more about H2O}
+
+Visit \url{www.h2o.ai}.
\section{Introduction} \label{1}
Deep Learning has been dominating recent machine learning competitions with
@@ -38,11 +63,11 @@ \section{Introduction} \label{1}
accuracy. H2O is the world’s fastest open-source in-memory platform for machine learning and predictive analytics on big data.
-This documentation presents the Deep Learning framework in H2O, as experienced through the H2O R interface. Further documentation on H2O's system and algorithms can be found at \href{http://docs.h2o.ai}{http://docs.h2o.ai}, especially the ``R User documentation", and fully featured tutorials are available at \href{http://learn.h2o.ai}{http://learn.h2o.ai}. The datasets, R code and instructions for this document can be found at the \href{https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo}{H2O GitHub repository} at \\https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo/. This introductory section provides instructions on getting H2O started from R, followed by a brief overview of deep learning.
+This documentation presents the Deep Learning framework in H2O, as experienced through the H2O R interface. Further documentation on H2O's system and algorithms can be found at \href{http://docs.h2o.ai}{http://docs.h2o.ai}, especially the ``R User documentation", and fully featured tutorials are available at \href{http://learn.h2o.ai}{http://learn.h2o.ai}. The datasets, R code and instructions for this document can be found at the H2O GitHub repository at \url{https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo}. This introductory section provides instructions on getting H2O started from R, followed by a brief overview of deep learning.
\subsection{Installation} \label{1.1}
-To install H2O, follow the ``Download" link on \href{http://h2o.ai/}{H2O's website} at http://h2o.ai/. For multi-node operation, download the H2O zip file and deploy H2O on your cluster, following instructions from the ``Full Documentation". For single-node operation, follow the instructions in the ``Install in R" tab. Open your R Console and run the following to install and start H2O directly from R:
+To install H2O, follow the ``Download" link on H2O's website at \url{http://h2o.ai/}. For multi-node operation, download the H2O zip file and deploy H2O on your cluster, following instructions from the ``Full Documentation". For single-node operation, follow the instructions in the ``Install in R" tab. Open your R Console and run the following to install and start H2O directly from R:
\begin{spverbatim}
# The following two commands remove any previously installed H2O packages for R.
@@ -61,14 +86,14 @@ \subsection{Installation} \label{1.1}
Initialize H2O with
\begin{spverbatim}
-h2o_server = h2o.init()
+h2o_server = h2o.init(nthreads = -1)
\end{spverbatim}
\noindent
With this command, the H2O R module will start an instance of H2O automatically at localhost:54321. Alternatively, to specify a connection with an existing H2O cluster node (other than localhost at port 54321) you must explicitly state the IP address and port number in the \texttt{h2o.init()} call. An example is given below, but do not directly paste; you should specify the IP and port number appropriate to your specific environment.
\begin{spverbatim}
-h2o_server = h2o.init(ip = "192.555.1.123", port = 12345, startH2O = FALSE)
+h2o_server = h2o.init(ip = "192.555.1.123", port = 12345, startH2O = FALSE, nthreads = -1)
\end{spverbatim}
\noindent
@@ -80,9 +105,9 @@ \subsection{Installation} \label{1.1}
\subsection{Support} \label{1.2}
-Users of the H2O package may submit general enquiries and bug reports to the H2O.ai \href{mailto:h2ostream@googlegroups.com}{support address}. Alternatively, specific bugs or issues may be filed to the H2O.ai \href{https://0xdata.atlassian.net/secure/Dashboard.jspa}{JIRA}.
+Users of the H2O package may submit general enquiries and bug reports to the H2O.ai support address, \url{h2ostream@googlegroups.com}. Alternatively, specific bugs or issues may be filed in the H2O.ai JIRA at \url{https://0xdata.atlassian.net/secure/Dashboard.jspa}.
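+As a quick sanity check on the installation and initialization steps above, you can ask the running instance to describe itself. The following is a minimal sketch; it assumes the \texttt{h2o.clusterInfo()} helper shipped with the H2O R package and the \texttt{h2o\_server} object returned by \texttt{h2o.init()}.
+\begin{spverbatim}
+# Print the H2O version, the nodes that joined the cloud, and the total
+# cluster memory for the connection object returned by h2o.init().
+h2o.clusterInfo(h2o_server)
+\end{spverbatim}
+Checking this output before loading data is an easy way to confirm that all requested nodes are healthy and that the R package and the running H2O instance report the same version.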
-\subsection{Deep learning overview} \label{1.3}
+\subsection{Deep Learning Overview} \label{1.3}
First we present a brief overview of deep neural networks for supervised learning tasks. There are several theoretical frameworks for deep learning, and here we summarize the feedforward architecture used by H2O. \\
@@ -92,7 +117,7 @@ \subsection{Deep learning overview} \label{1.3}
\end{figure}
\\
\noindent
-The basic unit in the model (shown above) is the neuron, a biologically inspired model of the human neuron. For humans, varying strengths of neurons' output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron's activation. In the model, the weighted combination $\alpha = \sum_{i=1}^{n} w_i x_i + b$ of input signals is aggregated, and then an output signal $f(\alpha)$ transmitted by the connected neuron. The function $f$ represents the nonlinear activation function used throughout the network, and the bias $b$ accounts for the neuron's activation threshold.
+The basic unit in the model (shown above) is the neuron, a biologically inspired model of the human neuron. For humans, varying strengths of neurons' output signals travel along the synaptic junctions and are then aggregated as input for a connected neuron's activation. In the model, the weighted combination $\alpha = \sum_{i=1}^{n} w_i x_i + b$ of input signals is aggregated, and then an output signal $f(\alpha)$ is transmitted by the connected neuron. The function $f$ represents the nonlinear activation function used throughout the network, and the bias $b$ accounts for the neuron's activation threshold. \\
\begin{figure}[h!]
\centering
@@ -102,9 +127,9 @@ \subsection{Deep learning overview} \label{1.3}
\noindent
Multi-layer, feedforward neural networks consist of many layers of interconnected neuron units: beginning with an input layer to match the feature space; followed by multiple layers of nonlinearity; and terminating with a linear regression or classification layer to match the output space. The inputs and outputs of the model's units follow the basic logic of the single neuron described above. Bias units are included in each non-output layer of the network. The weights linking neurons and biases with other neurons fully determine the output of the entire network, and learning occurs when these weights are adapted to minimize the error on labeled training data. More specifically, for each training example $j$ the objective is to minimize a loss function
\begin{center}
-$L(W,b$ $|$ $j)$.
+$L(W,B$ $|$ $j)$.
\end{center}
-Here $W$ is the collection $\left\{W_i\right\}_{1:N-1}$, where $W_i$ denotes the weight matrix connecting layers $i$ and $i+1$ for a network of $N$ layers; similarly $b$ is the collection $\left\{b_i\right\}_{1:N-1}$, where $b_i$ denotes the column vector of biases for layer $i+1$.
+Here $W$ is the collection $\left\{W_i\right\}_{1:N-1}$, where $W_i$ denotes the weight matrix connecting layers $i$ and $i+1$ for a network of $N$ layers; similarly $B$ is the collection $\left\{b_i\right\}_{1:N-1}$, where $b_i$ denotes the column vector of biases for layer $i+1$.
\\
\\
This basic framework of multi-layer neural networks can be used to accomplish deep learning tasks. Deep learning architectures are models of hierarchical feature extraction, typically involving multiple levels of nonlinearity.
Such models are able to learn useful representations of raw data, and have exhibited high performance on complex data such as images, speech, and text \href{http://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf}{(Bengio, 2009)}. @@ -158,7 +183,6 @@ \subsection{Training protocol} \label{2.2} \subsubsection{Initialization} \label{2.2.1} \noindent Various deep learning architectures employ a combination of unsupervised pretraining followed by supervised training, but H2O uses a purely supervised training protocol. The default initialization scheme is the uniform adaptive option, which is an optimized initialization based on the size of the network. Alternatively, you may select a random initialization to be drawn from either a uniform or normal distribution, for which a scaling parameter may be specified as well. - \subsubsection{Activation and loss functions} \label{2.2.2} In the introduction we introduced the nonlinear activation function $f$, for which the choices are summarized in Table 1. Note here that $x_i$ and $w_i$ denote the firing neuron's input values and their weights, respectively; $\alpha$ denotes the weighted combination $\alpha = \sum_i w_i x_i+b$. \\ @@ -182,10 +206,7 @@ \subsubsection{Activation and loss functions} \label{2.2.2} \\ \\ The $\tanh$ function is a rescaled and shifted logistic function and its symmetry around 0 allows the training algorithm to converge faster. The rectified linear activation function has demonstrated high performance on image recognition tasks, and is a more biologically accurate model of neuron activations (\href{http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf}{LeCun et al, 1998}). Maxout activation works particularly well with dropout, a regularization method discussed later in this vignette (\href{http://arxiv.org/pdf/1302.4389.pdf}{Goodfellow et al, 2013}). It is difficult to determine a ``best" activation function to use; each may outperform the others in separate scenarios, but grid search models (also described later) can help to compare activation functions and other parameters. The default activation function is the Rectifier. Each of these activation functions can be operated with dropout regularization (see below). -\\ -\\ -\bigskip -\\ + The following choices for the loss function $L(W,B$ $|$ $ j)$ are summarized in Table 2. The system default enforces the table's typical use rule based on whether regression or classification is being performed. Note here that $t^{(j)}$ and $o^{(j)}$ are the predicted (target) output and actual output, respectively, for training example $j$; further, let $y$ denote the output units and $O$ the output layer. \\ \begin{table}[ht] @@ -228,7 +249,7 @@ \subsubsection{Parallel distributed network training} \label{2.2.3} \line(1,0){275} \\ \\ -Stochastic gradient descent is known to be fast and memory-efficient, but not easily parallelizable without becoming slow. We utilize \textsc{Hogwild!}, the recently developed lock-free parallelization scheme from \href{http://i.stanford.edu/hazy/papers/hogwild-nips.pdf}{Niu et al, 2011}. \textsc{Hogwild!} follows a shared memory model where multiple cores, each handling separate subsets (or all) of the training data, are able to make independent contributions to the gradient updates $\nabla L(W,B$ $ |$ $j)$ asynchronously. In a multi-node system this parallelization scheme works on top of H2O's distributed setup, where the training data is distributed across the cluster. 
Each node operates in parallel on its local data until the final parameters $W,b$ are obtained by averaging. Below is a rough summary. +Stochastic gradient descent is known to be fast and memory-efficient, but not easily parallelizable without becoming slow. We utilize \textsc{Hogwild!}, the recently developed lock-free parallelization scheme from \href{http://i.stanford.edu/hazy/papers/hogwild-nips.pdf}{Niu et al, 2011}. \textsc{Hogwild!} follows a shared memory model where multiple cores, each handling separate subsets (or all) of the training data, are able to make independent contributions to the gradient updates $\nabla L(W,B$ $ |$ $j)$ asynchronously. In a multi-node system this parallelization scheme works on top of H2O's distributed setup, where the training data is distributed across the cluster. Each node operates in parallel on its local data until the final parameters $W,B$ are obtained by averaging. Below is a rough summary. \\ \\ \noindent @@ -270,8 +291,10 @@ \subsubsection{Parallel distributed network training} \label{2.2.3} Here, the weights and bias updates follow the asynchronous $\textsc{Hogwild!}$ procedure to incrementally adjust each node's parameters $W_n,B_n$ after seeing example $i$. The Avg$_n$ notation refers to the final averaging of these local parameters across all nodes to obtain the global model parameters and complete training. \subsubsection{Specifying the number of training samples per iteration} \label{2.2.4} H2O Deep Learning is scalable and can take advantage of a large cluster of compute nodes. There are three modes in which to operate. The default behavior is to let every node train on the entire (replicated) dataset, but automatically locally shuffling (and/or using a subset of) the training examples for each iteration. For datasets that don't fit into each node's memory (also depending on the heap memory specified by the -Xmx option), it might not be possible to replicate the data, and each compute node can be instructed to train only with local data. An experimental single node mode is available for the case where slow final convergence is observed due to the presence of too many nodes, but we've never seen this become necessary. -\\ -The number of training examples (globally) presented to the distributed SGD worker nodes between model averaging is controlled by the important parameter \texttt{train\_samples\_per\_iteration}. One special value is -1, which results in all nodes processing all their local training data per iteration. Note that if \texttt{replicate\_training\_data} is enabled (true by default), this will result in training N epochs (passes over the data) per iteration on N nodes, otherwise 1 epoch will be trained per iteration. Another special value is 0, which always results in 1 epoch per iteration, independent of the number of compute nodes. In general, any user-given positive number is permissible for this parameter. For large datasets, it might make sense to specify a fraction of the dataset. For example, if the training data contains $10$ million rows, and we specify the number of training samples per iteration as $100,000$ when running on $4$ nodes, then each node will process $25,000$ examples per iteration, and it will take $40$ such distributed iterations to process one epoch. If the value is set too high, it might take too long between synchronization and model convergence can be slow. If the value is set too low, network communication overhead will dominate the runtime, and computational performance will suffer. 
The special value of -2 (the default) enables auto-tuning of this parameter based on the computational performance of the processors and the network of the system and attempts to find a good balance between computation and communication.
+
+The number of training examples (globally) presented to the distributed SGD worker nodes between model averaging is controlled by the important parameter \texttt{train\_samples\_per\_iteration}. One special value is -1, which results in all nodes processing all their local training data per iteration. Note that if \texttt{replicate\_training\_data} is enabled (true by default), this will result in training N epochs (passes over the data) per iteration on N nodes, otherwise 1 epoch will be trained per iteration. Another special value is 0, which always results in 1 epoch per iteration, independent of the number of compute nodes. In general, any user-given positive number is permissible for this parameter. For large datasets, it might make sense to specify a fraction of the dataset.
+
+For example, if the training data contains $10$ million rows, and we specify the number of training samples per iteration as $100,000$ when running on $4$ nodes, then each node will process $25,000$ examples per iteration, and it will take $40$ such distributed iterations to process one epoch. If the value is set too high, the time between model synchronizations might become too long and convergence can be slow. If the value is set too low, network communication overhead will dominate the runtime, and computational performance will suffer. The special value of -2 (the default) enables auto-tuning of this parameter based on the computational performance of the processors and the network of the system, and attempts to find a good balance between computation and communication. Note that this parameter can affect the convergence rate during training.
\\
\noindent
\subsection{Regularization} \label{2.3}
@@ -286,13 +309,15 @@ \subsection{Regularization} \label{2.3}
For $\ell_1$ regularization, $R_1(W,B$ $|$ $j)$ represents the sum of all $\ell_1$ norms of the weights and biases in the network; $R_2(W,B$ $|$ $j)$ represents the sum of squares of all the weights and biases in the network. The constants $\lambda_1$ and $\lambda_2$ are generally chosen to be very small, for example $10^{-5}$.
\\
-The second type of regularization available for deep learning is a recent innovation called dropout (\href{http://arxiv.org/pdf/1207.0580.pdf}{Hinton et al., 2012}). Dropout constrains the online optimization such that during forward propagation for a given training example, each neuron in the network suppresses its activation with probability $\textsc{P}$, generally taken to be less than 0.2 for input neurons and up to 0.5 for hidden neurons. The effect is twofold: as with $\ell_2$ regularization, the network weight values are scaled toward 0; furthermore, each training example trains a different model, albeit sharing the same global parameters. Thus dropout allows an exponentially large number of models to be averaged as an ensemble, which can prevent overfitting and improve generalization. Note that input dropout can be especially useful when the feature space is large and noisy.
+The second type of regularization available for deep learning is a recent innovation called dropout (\href{http://arxiv.org/pdf/1207.0580.pdf}{Hinton et al., 2012}).
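+For orientation, here is a minimal sketch of how the $\ell_1$/$\ell_2$ penalties above and the per-iteration sample count map onto the R interface. The frame \texttt{train.hex} and its column layout are placeholders; \texttt{activation}, \texttt{hidden}, \texttt{epochs} and \texttt{train\_samples\_per\_iteration} follow the parameter list in Appendix A, and the penalty constants are assumed to be exposed as \texttt{l1} and \texttt{l2}; exact argument names can vary between H2O releases.
+\begin{spverbatim}
+# Hedged sketch: tie the regularization constants and the per-iteration sample
+# count to an h2o.deeplearning() call. train.hex is a placeholder H2OParsedData
+# object whose response is in column 5.
+model = h2o.deeplearning(x = 1:4, y = 5, data = train.hex,
+                         activation = "RectifierWithDropout",
+                         hidden = c(200, 200),
+                         l1 = 1e-5,                          # lambda_1
+                         l2 = 1e-5,                          # lambda_2
+                         train_samples_per_iteration = -2,   # auto-tuning (default)
+                         epochs = 10)
+\end{spverbatim}
+Choosing \texttt{RectifierWithDropout} also enables the dropout scheme described next, in addition to the $\ell_1$/$\ell_2$ penalties.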
+ +Dropout constrains the online optimization such that during forward propagation for a given training example, each neuron in the network suppresses its activation with probability $\textsc{P}$, generally taken to be less than 0.2 for input neurons and up to 0.5 for hidden neurons. The effect is twofold: as with $\ell_2$ regularization, the network weight values are scaled toward 0; furthermore, each training example trains a different model, albeit sharing the same global parameters. Thus dropout allows an exponentially large number of models to be averaged as an ensemble, which can prevent overfitting and improve generalization. Note that input dropout can be especially useful when the feature space is large and noisy. \subsection{Advanced optimization} \label{2.4} H2O features manual and automatic versions of advanced optimization. The manual mode features include momentum training and learning rate annealing, while automatic mode features adaptive learning rate. + \subsubsection{Momentum training} \label{2.4.1} -Momentum modifies back-propagation by allowing prior iterations to influence the current update. In particular, a velocity vector $v$ is defined to modify the updates as follows, with $\theta$ representing the parameters $W,B$; $\mu$ representing the momentum coefficient, and $\alpha$ denoting -the learning rate. +Momentum modifies back-propagation by allowing prior iterations to influence the current update. In particular, a velocity vector $v$ is defined to modify the updates as follows, with $\theta$ representing the parameters $W,B$; $\mu$ representing the momentum coefficient, and $\alpha$ denoting the learning rate. \begin{center} $v_{t+1} = \mu v_t - \alpha \nabla L(\theta_t)$ \\ @@ -317,12 +342,10 @@ \subsubsection{Rate annealing} \label{2.4.2} \subsubsection{Adaptive learning} \label{2.4.3} The implemented adaptive learning rate algorithm ADADELTA (\href{http://arxiv.org/pdf/1212.5701v1.pdf}{Zeiler, 2012}) automatically combines the benefits of learning rate annealing and momentum training to avoid slow convergence. Specification of only two parameters $\rho$ and $\epsilon$ simplifies hyper parameter search. In some cases, manually controlled (non-adaptive) learning rate and momentum specifications can lead to better results, but require the hyperparameter search of up to 7 parameters. If the model is built on a topology with many local minima or long plateaus, it is possible for a constant learning rate to produce sub-optimal results. In general, however, we find adaptive learning rate to produce the best results, and this option is kept as the default. -\\ -\\ + The first of two hyper parameters for adaptive learning is $\rho$. It is similar to momentum and relates to the memory to prior weight updates. Typical values are between 0.9 and 0.999. The second of two hyper parameters $\epsilon$ for adaptive learning is similar to learning rate annealing during initial training and momentum at later stages where it allows forward progress. Typical values are between $10^{-10}$ and $10^{-4}$. -\\ -\\ + \subsection{Loading data} \label{2.5} Loading a dataset in R for use with H2O is slightly different from the usual methodology, as we must convert our datasets into \texttt{H2OParsedData} objects. 
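+In practice this conversion is a single parse call against the running cluster. A minimal sketch, reusing the \texttt{h2o\_server} connection from section \ref{1.1}; the file path, frame name and key below are illustrative placeholders only:
+\begin{spverbatim}
+# Parse a local CSV file into an H2OParsedData object held in the cluster's
+# memory; the R object is only a reference to that distributed frame.
+mydata.hex = h2o.uploadFile(h2o_server, path = "path/to/mydata.csv",
+                            header = TRUE, sep = ",", key = "mydata.hex")
+summary(mydata.hex)
+\end{spverbatim}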
For an example, we use a toy weather dataset included in the \href{https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo}{H2O GitHub repository for the H2O Deep Learning documentation} at \\https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo/. First load the data to your current working directory in your R Console (do this henceforth for dataset downloads), and then run the following command.
@@ -560,7 +583,7 @@ \section{Appendix A: Complete parameter list}
\item \texttt{holdout\_fraction}: (Optional) Fraction of the training data to hold out for validation.
\item \texttt{checkpoint}: Model checkpoint (either key or H2ODeepLearningModel) to resume training with.
\item \texttt{activation}: The choice of nonlinear, differentiable activation function used throughout the network. Options are \texttt{Tanh, TanhWithDropout, Rectifier, RectifierWithDropout, Maxout, MaxoutWithDropout}, and the default is \texttt{Rectifier}. See section \ref{2.2.2} for more details.
-\item \texttt{hidden}: The number and size of each hidden layer in the model. For example, if c(100,200,100) is specified, a model with 3 hidden layers will be produced, and the middle hidden layer will have 200 neurons. The default is c(200,200). For grid search, use list(c(10,10), c(20,20)) etc. See section \ref{3.2} for more details. .
+\item \texttt{hidden}: The number and size of each hidden layer in the model. For example, if c(100,200,100) is specified, a model with 3 hidden layers will be produced, and the middle hidden layer will have 200 neurons. The default is c(200,200). For grid search, use list(c(10,10), c(20,20)) etc. See section \ref{3.2} for more details.
\item \texttt{autoencoder}: Default is false. See section \ref{4} for more details.
\item \texttt{use\_all\_factor\_levels}: Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder.
\item \texttt{epochs}: The number of passes over the training dataset to be carried out. It is recommended to start with lower values for initial grid searches. The value can be modified during checkpoint restarts and allows continuation of selected models. Default is 10.
@@ -572,7 +595,7 @@ \section{Appendix A: Complete parameter list}
\item \texttt{rate}: The learning rate $\alpha$. Higher values lead to less stable models while lower values lead to slower convergence. Default is 0.005.
\item \texttt{rate\_annealing}: Default value is 1e-6 (when adaptive learning is disabled). See section \ref{2.4.2} for more details.
\item \texttt{rate\_decay}: Default is 1.0 (when adaptive learning is disabled). The learning rate decay parameter controls the change of learning rate across layers.
-\item \texttt{momentum\_start}: The momentum\_start parameter controls the amount of momentum at the beginning of training (when adaptive learning is disabled). Default is 0. See section \ref{2.4.1} for more details.
+\item \texttt{momentum\_start}: The momentum\_start parameter controls the amount of momentum at the beginning of training (when adaptive learning is disabled). Default is 0. See section \ref{2.4.1} for more details.
\item \texttt{momentum\_ramp}: The momentum\_ramp parameter controls the amount of learning for which momentum increases, assuming momentum\_stable is larger than momentum\_start. It can be enabled when adaptive learning is disabled. The ramp is measured in the number of training samples. Default is 1e6. See section \ref{2.4.1} for more details.
\item \texttt{momentum\_stable}: The momentum\_stable parameter controls the final momentum value reached after momentum\_ramp training samples (when adaptive learning is disabled). The momentum used for training will remain the same for training beyond reaching that point. Default is 0. See section \ref{2.4.1} for more details.
\item \texttt{nesterov\_accelerated\_gradient}: The default is true (when adaptive learning is disabled). See section \ref{2.4.1} for more details.
@@ -605,29 +628,53 @@ \section{Appendix A: Complete parameter list}
\item \texttt{replicate\_training\_data}: Replicate the entire training dataset onto every node for faster training on small datasets. Default is true.
\item \texttt{single\_node\_mode}: Run on a single node for fine-tuning of model parameters. Can be useful for faster convergence during checkpoint resumes after training on a very large count of nodes (for fast initial convergence). Default is false.
\item \texttt{shuffle\_training\_data}: Enable shuffling of training data (on each node). This option is recommended if training data is replicated on N nodes, and the number of training samples per iteration is close to N times the dataset size, where all nodes will train with (almost) all the data. It is automatically enabled if the number of training samples per iteration is set to -1 (or to N times the dataset size or larger), otherwise it is disabled by default.
-\item \texttt{max\_categorical\_features}: Max. number of categorical features, enforced via hashing (Experimental).
-\item \texttt{reproducible}: Force reproducibility on small data (will be slow - only uses 1 thread)
+\item \texttt{max\_categorical\_features}: Max. number of categorical features, enforced via hashing (Experimental).
+\item \texttt{reproducible}: Force reproducibility on small data (will be slow - only uses 1 thread).
+\item \texttt{sparse}: Enable sparse data handling (experimental).
+\item \texttt{col\_major}: Use a column-major weight matrix for the input layer; can speed up forward propagation, but may slow down backpropagation.
+\item \texttt{input\_dropout\_ratios}: Specifies the dropout ratio for the input layer, which can improve generalization, especially when the feature space is large and noisy. See section \ref{2.3} for more details.
+
\end{itemize}
\section{Appendix B: References}
-\href{http://h2o.ai/}{H2O website}
-\\\href{http://github.com/h2oai/h2o.git}{H2O Github Repository}
-\\\href{http://docs.h2o.ai/}{H2O documentation}
-\\\href{http://learn.h2o.ai/}{H2O Training}
-\\\href{http://data.h2o.ai/}{H2O Training Scripts and Data}
-\\\href{https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo}{Code for this Document}
-\\\href{mailto:h2ostream@googlegroups.com}{H2O support}
-\\\href{https://0xdata.atlassian.net/secure/Dashboard.jspa}{H2O JIRA}
-\\\href{http://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf}{(Bengio, 2009}
-\\\href{http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf}{LeCun et al, 1998}
-\\\href{http://arxiv.org/pdf/1302.4389.pdf}{Goodfellow et al, 2013}
-\\\href{http://i.stanford.edu/hazy/papers/hogwild-nips.pdf}{Niu et al, 2011}
-\\\href{http://arxiv.org/pdf/1207.0580.pdf}{Hinton et al., 2012}
-\\\href{http://www.cs.toronto.edu/~fritz/absps/momentum.pdf}{Sutskever et al, 2014}
-\\\href{http://arxiv.org/pdf/1212.5701v1.pdf}{Zeiler, 2012}
-\\\href{https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo}{H2O GitHub repository for the H2O Deep Learning documentation}
-\\\href{http://yann.lecun.com/exdb/mnist/}{MNIST database}
-\\\href{http://www.cs.toronto.edu/~hinton/science.pdf}{Hinton et al, 2006}
+
+\textbf{H2O website} \url{http://h2o.ai/}
+
+\textbf{H2O documentation} \url{http://docs.h2o.ai}
+
+\textbf{H2O GitHub Repository} \url{http://github.com/h2oai/h2o.git}
+
+\textbf{H2O Training} \url{http://learn.h2o.ai/}
+
+\textbf{H2O Training Scripts and Data} \url{http://data.h2o.ai/}
+
+\textbf{Code for this Document} \url{https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo}
+
+\textbf{H2O support} \url{h2ostream@googlegroups.com}
+
+\textbf{H2O JIRA} \url{https://0xdata.atlassian.net/secure/Dashboard.jspa}
+
+\textbf{H2O YouTube Channel} \url{https://www.youtube.com/user/0xdata}
+
+\textbf{Learning Deep Architectures for AI}. Bengio, Yoshua, 2009. \url{http://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf}
+
+\textbf{Efficient BackProp}. LeCun et al, 1998. \url{http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf}
+
+\textbf{Maxout Networks}. Goodfellow et al, 2013. \url{http://arxiv.org/pdf/1302.4389.pdf}
+
+\textbf{HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent}. Niu et al, 2011. \url{http://i.stanford.edu/hazy/papers/hogwild-nips.pdf}
+
+\textbf{Improving neural networks by preventing co-adaptation of feature detectors}. Hinton et al., 2012. \url{http://arxiv.org/pdf/1207.0580.pdf}
+
+\textbf{On the importance of initialization and momentum in deep learning}. Sutskever et al, 2014. \url{http://www.cs.toronto.edu/~fritz/absps/momentum.pdf}
+
+\textbf{ADADELTA: An Adaptive Learning Rate Method}. Zeiler, 2012. \url{http://arxiv.org/pdf/1212.5701v1.pdf}
+
+\textbf{H2O GitHub repository for the H2O Deep Learning documentation} \url{https://github.com/h2oai/h2o/tree/master/docs/deeplearning/DeepLearningRVignetteDemo}
+
+\textbf{MNIST database} \url{http://yann.lecun.com/exdb/mnist/}
+
+\textbf{Reducing the Dimensionality of Data with Neural Networks}. Hinton et al, 2006. \url{http://www.cs.toronto.edu/~hinton/science.pdf}
\end{document}
diff --git a/docs/gbm/gbmRVignette.pdf b/docs/gbm/gbmRVignette.pdf
index 28c18eda9a..e8a80094df 100644
Binary files a/docs/gbm/gbmRVignette.pdf and b/docs/gbm/gbmRVignette.pdf differ
diff --git a/docs/gbm/gbmRVignette.tex b/docs/gbm/gbmRVignette.tex
index d89dc38fdb..e679b1c5ec 100644
--- a/docs/gbm/gbmRVignette.tex
+++ b/docs/gbm/gbmRVignette.tex
@@ -28,19 +28,28 @@ \newpage
\section{What is H2O?}
-It is the only alternative to combine the power of highly advanced algorithms, the freedom of open source, and the capacity of truly scalable in-memory processing for big data on one or many nodes. Combined, these capabilities make it faster, easier, and more cost-effective to harness big data to maximum benefit for the business.
+H2O is an open source analytics platform for data scientists and business analysts who need scalable and fast machine learning capabilities. Our product helps organizations like PayPal, ShareThis, and Cisco reduce model building, training, and scoring times from months to days. Our use cases range from predictive modeling and fraud detection to customer intelligence, in industries as diverse as insurance, SaaS, finance, ad tech, and recruiting.
-Data collection is easy. Decision making is hard. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling. Existing Big Data stacks are batch-oriented. Search and analytics need to be interactive. Use machines to learn machine-generated data. And more data beats better algorithms.
+With its in-memory compression techniques, H2O can handle billions of data rows in-memory — even with a fairly small cluster. The platform includes interfaces for R, Python, Scala, Java, JSON and CoffeeScript/JavaScript, along with its built-in web interface that makes it easier for non-engineers to stitch together a complete analytic workflow. The platform was built alongside (and on top of) both Hadoop and Spark clusters.
-With H2O, you can make better predictions by harnessing sophisticated, ready-to-use algorithms and the processing power you need to analyze bigger data sets, more models, and more variables.
+H2O implements almost all common machine learning algorithms — such as generalized linear modeling (linear regression, logistic regression, etc.), Naive Bayes, principal components analysis, time series, k-means clustering and others. H2O also implements best-in-class algorithms such as Random Forest, Gradient Boosting and Deep Learning at scale. Customers can build thousands of models and compare them to get the best prediction results.
-Get started with minimal effort and investment. H2O is an extensible open source platform that offers the most pragmatic way to put big data to work for your business. With H2O, you can work with your existing languages and tools. You can further extend the platform seamlessly into your Hadoop environments. Get H2O!
+H2O was built by a passionate team of computer scientists, systems engineers and data scientists, from the ground up, for the data science community. We’re driven by curiosity, a desire to learn, and a strong drive to tackle the scalability challenges of real-world data analysis. Our team members have come to H2O from organizations as diverse as Marketo, Oracle, Azul, Teradata, and SAS. Our advisory board comes from Stanford University’s engineering, statistics, and health research departments.
-Download H2O
-{\url{http://www.h2o.ai/download}}
+We host meetups, run experiments, and spend our days learning alongside our customers.
-Join the Community
-{\url{h2ostream@googlegroups.com}} and {\url{github.com/h2oai/h2o.git}}
+
+
+\textbf{Try it out}
+
+H2O offers an R package that can be installed from CRAN. H2O can be downloaded from \url{www.h2o.ai/download}.
+
+\textbf{Join the community}
+
+Connect with \url{h2ostream@googlegroups.com} and \url{https://github.com/h2oai} to learn about our meetups, training sessions, hackathons, and product updates.
+
+\textbf{Learn more about H2O}
+
+Visit \url{www.h2o.ai}.
\section{Introduction}
@@ -200,7 +209,7 @@ \subsection{Launching on Hadoop}
$ hadoop jar h2odriver_hdp2.1.jar water.hadoop.h2odriver -libjars ../h2o.jar -mapperXmx 1g -nodes 1 -output hdfsOutputDirName
\end{spverbatim}
\begin{itemize}
-\item For each major release of each distribution of Hadoop, there is a driver jar file that the user will need to launch H2O with. Currently available driver jar files in each build of H2O includes {\texttt{h2odriver_cdh5.jar, h2odriver_hdp2.1.jar}}, and {\texttt{mapr2.1.3.jar}}.
+\item For each major release of each distribution of Hadoop, there is a driver jar file that the user will need to launch H2O with. Currently available driver jar files in each build of H2O include {\texttt{h2odriver_cdh5.jar, h2odriver_hdp2.1.jar}}, and {\texttt{mapr2.1.3.jar}}.
\item The above command launches exactly one 1g node of H2O; however, we recommend launching the cluster with 4 times the memory of your data file.
\item{\texttt{mapperXmx}} is the mapper size or the amount of memory allocated to each node.
\item{\texttt{nodes}} is the number of nodes requested to form the cluster.
@@ -337,7 +346,7 @@ \subsubsection{Theory and framework}
\indent \indent i. Compute $r_{ikm} = y_{ik} - p_k(x_i), i = 1,2,\dots,N$
\\
\indent \indent ii. Fit a regression tree to the targets $r_{ikm}, i = 1,2,\dots,N$,
-\par \hspace{3em} giving terminal regions $R_{jim}, 1,2,\dots,J_m$
+\par \hspace{3em} giving terminal regions $R_{jkm}, j = 1,2,\dots,J_m$
\\
\indent \indent iii. Compute $$\gamma_{jkm} = \frac{K-1}{K} \frac{\sum_{x_i \in R_{jkm}} (r_{ikm})}{\sum_{x_i \in R_{jkm}} |r_{ikm}| (1 - |r_{ikm}|)} , j=1,2,\dots,J_m$$
\\
diff --git a/docs/glm/GLM_Vignette.pdf b/docs/glm/GLM_Vignette.pdf
index 4e43b142de..708f42f00e 100644
Binary files a/docs/glm/GLM_Vignette.pdf and b/docs/glm/GLM_Vignette.pdf differ
diff --git a/docs/glm/GLM_Vignette.tex b/docs/glm/GLM_Vignette.tex
index a1f58c1ddd..859ae3d3d3 100644
--- a/docs/glm/GLM_Vignette.tex
+++ b/docs/glm/GLM_Vignette.tex
@@ -39,19 +39,28 @@ \section{Introduction} \label{1}
This document describes the Generalized Linear Model implementation on the H2O platform, the list of supported features, and how to use them from R.
\subsection{What is H2O?}
-It is the only alternative to combine the power of highly advanced algorithms, the freedom of open source, and the capacity of truly scalable in-memory processing for big data on one or many nodes. Combined, these capabilities make it faster, easier, and more cost-effective to harness big data to maximum benefit for the business.
+H2O is an open source analytics platform for data scientists and business analysts who need scalable and fast machine learning capabilities. Our product helps organizations like PayPal, ShareThis, and Cisco reduce model building, training, and scoring times from months to days. Our use cases range from predictive modeling and fraud detection to customer intelligence, in industries as diverse as insurance, SaaS, finance, ad tech, and recruiting.
-Data collection is easy. Decision making is hard. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling. Existing Big Data stacks are batch-oriented. Search and analytics need to be interactive. Use machines to learn machine-generated data. And more data beats better algorithms.
+With its in-memory compression techniques, H2O can handle billions of data rows in-memory — even with a fairly small cluster. The platform includes interfaces for R, Python, Scala, Java, JSON and CoffeeScript/JavaScript, along with its built-in web interface that makes it easier for non-engineers to stitch together a complete analytic workflow. The platform was built alongside (and on top of) both Hadoop and Spark clusters.
-With H2O, you can make better predictions by harnessing sophisticated, ready-to-use algorithms and the processing power you need to analyze bigger data sets, more models, and more variables.
+H2O implements almost all common machine learning algorithms — such as generalized linear modeling (linear regression, logistic regression, etc.), Naive Bayes, principal components analysis, time series, k-means clustering and others. H2O also implements best-in-class algorithms such as Random Forest, Gradient Boosting and Deep Learning at scale. Customers can build thousands of models and compare them to get the best prediction results.
-Get started with minimal effort and investment. H2O is an extensible open source platform that offers the most pragmatic way to put big data to work for your business. With H2O, you can work with your existing languages and tools. Further, you can extend the platform seamlessly into your Hadoop environments. Get H2O!
+H2O was built by a passionate team of computer scientists, systems engineers and data scientists, from the ground up, for the data science community. We’re driven by curiosity, a desire to learn, and a strong drive to tackle the scalability challenges of real-world data analysis. Our team members have come to H2O from organizations as diverse as Marketo, Oracle, Azul, Teradata, and SAS. Our advisory board comes from Stanford University’s engineering, statistics, and health research departments.
-Download H2O
-\url{http://www.h2o.ai/download}
+We host meetups, run experiments, and spend our days learning alongside our customers.
-Join the Community
-\url{h2ostream@googlegroups.com} and \url{github.com/h2oai/h2o.git}
+
+
+\textbf{Try it out}
+
+H2O offers an R package that can be installed from CRAN. H2O can be downloaded from \url{www.h2o.ai/download}.
+
+\textbf{Join the community}
+
+Connect with \url{h2ostream@googlegroups.com} and \url{https://github.com/h2oai} to learn about our meetups, training sessions, hackathons, and product updates.
+
+\textbf{Learn more about H2O}
+
+Visit \url{www.h2o.ai}.
\subsection{What is GLM?}
Generalized linear models (GLM) are the workhorse for most predictive analysis use cases. GLM can be used for both regression and classification; it scales well to large datasets and is based on a solid statistical background. It is a generalization of linear models, allowing for modeling of data with exponential family distributions and for categorical data (classification). GLM models are fitted by solving the maximum likelihood optimization problem.
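+As a brief orientation, and using the same $x_i$, $\beta$, $\beta_0$ notation as the model-fitting formulas later in this document, a generalized linear model ties the expected response to a linear predictor through a link function $g$, with $y_i$ drawn from an exponential family distribution:
+\[ E[y_i] = g^{-1}\left(x_i^{T}\beta + \beta_0\right) \]
+For the Gaussian family the link is the identity and the model reduces to ordinary linear regression; for the binomial family the link is the logit, which gives logistic regression. This form is only a summary; the per-family objectives are spelled out in the sections below.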
@@ -66,7 +75,7 @@ \subsection{GLM on H2O}
The main advantage of an L1 penalty is that with sufficiently high $\lambda$, it produces a sparse solution; the L2-only penalty does not reduce coefficients to exactly 0. The two penalties also differ in the case of correlated predictors. The L2 penalty shrinks coefficients for correlated columns towards each other, while the L1 penalty will pick one and drive the others to zero. Using the elastic net argument $\alpha$, you can combine these two behaviors. It is also useful to always add a small L2 penalty to increase numerical stability.
-Similarly to \cite{glmnet}, h2o can compute the full regularization path, starting from null-model (maximum penalty) going down to minimally penalized model. This search is made efficient by employing strong-rules \cite{strong} to filter out inactive coefficients (coefficients pushed to zero by penalty). Computing full regularization path is usefull in that it gives more insight about the importance of indiviaul coefficvient, quiality of the model and it allows to select optimal amount of penalization for the given problem and data.
+Similarly to \cite{glmnet}, H2O can compute the full regularization path, starting from the null model (maximum penalty) and going down to the minimally penalized model. This search is made efficient by employing strong rules \cite{strong} to filter out inactive coefficients (coefficients pushed to zero by the penalty). Computing the full regularization path is useful in that it gives more insight into the importance of individual coefficients and the quality of the model, while allowing selection of the optimal amount of penalization for the given problem and data.
\subsubsection{Summary of features}
@@ -222,7 +231,7 @@ \subsubsection{Linear Regression (Gaussian family) }
The model is fitted by solving the least squares problem (maximum likelihood for the Gaussian family):
-\[ \min\limits_{\beta,\beta_0} { {1 \over 2N}\sum\limits_{i=1}\limits^{N}(x_i^{T}\beta + \beta_0- y_i)^T (x_i^{T}\beta + \beta_0 - y_i)) + \lambda (\alpha \|\beta \|_1 + {1-\alpha \over 2}) \| \beta \|_2^2} \]
+\[ \min\limits_{\beta,\beta_0} { {1 \over 2N}\sum\limits_{i=1}\limits^{N}(x_i^{T}\beta + \beta_0 - y_i)^T (x_i^{T}\beta + \beta_0 - y_i) + \lambda \left(\alpha \|\beta \|_1 + {1-\alpha \over 2} \| \beta \|_2^2\right)} \]
Deviance is simply the sum of squared errors:
@@ -544,25 +553,27 @@ \subsubsection{Loading data} \label{2.5}
\subsection{Performing a trial run} \label{3.2}
Returning to the Airline dataset demo, we first load the dataset into H2O and select the variables we want to use to predict a chosen response. For example, we can model if flights are delayed based on the departure's scheduled day of the week and day of the month.
-% @TODO localH2O not found FIXXXXXXX & CHECK VARIABLES - [Code below tested & run in R - Jessica]%
\begin{spverbatim}
+library(h2o)
+localH2O = h2o.init(nthreads = -1)
#Load the data and prepare for modeling
-air_train.hex = h2o.uploadFile(localH2O, path = "Downloads/AirlinesTrain.csv", header = TRUE, sep = ",", key = "airline_train.hex")
-
-air_test.hex = h2o.uploadFile(localH2O, path = "Downloads/AirlinesTest.csv", header = TRUE, sep = ",", key = "airline_test.hex")
-
-x = c("fYear", "fMonth", "fDayofMonth", "fDayOfWeek", "UniqueCarrier", "Origin", "Dest", "Distance")
-y = "IsDepDelayed"
+air_train.hex = h2o.uploadFile(localH2O, path = "~/Downloads/AirlinesTrain.csv",
+  header = TRUE, sep = ",", key = "airline_train.hex")
+air_test.hex = h2o.uploadFile(localH2O, path = "~/Downloads/AirlinesTest.csv",
+  header = TRUE, sep = ",", key = "airline_test.hex")
+x = c("fYear", "fMonth", "fDayofMonth", "fDayOfWeek", "UniqueCarrier", "Origin",
+  "Dest", "Distance")
+y = "IsDepDelayed"
\end{spverbatim}
Now we train the GLM model:
\begin{spverbatim}
-airline.glm <- h2o.glm(x=x,
-                       y=y,
+airline.glm <- h2o.glm(x=x,
+                       y=y,
                        data=air_train.hex,
                        key = "glm_model",
                        family="binomial",
@@ -576,30 +587,27 @@ \subsubsection{Extracting and handling the results} \label{3.2.1}
We can extract the parameters of our model, examine the scoring process, and make predictions on new data.
-%@ NOTE typoe in R demo test… i spelled performance wrong :(
-% [Jessica: Need some help with code below - 2nd to last line throws error still]
\begin{spverbatim}
print("Predict on GLM model")
-best_glm = airlines.glm@models[[airlines.glm@best_model]]
+best_glm = airline.glm@models[[airline.glm@best_model]]
air.results = h2o.predict(object = best_glm, newdata = air_test.hex)
print("Check performance and AUC")
perf = h2o.performance(air.results$YES,air_test.hex$IsDepDelayed )
print(perf)
perf@model$auc
print("Show distribution of predictions with quantile.")
-quantile.H2OParsedData(air.results$YES)
+quant = quantile.H2OParsedData(air.results$YES)
print("Extract strongest predictions.")
-top.air <- h2o.assign(air.results[air.results$YES > quant‘75%,key="top.air")
+top.air <- h2o.assign(air.results[air.results$YES > quant["75%"], ], key="top.air")
top.air
\end{spverbatim}
\noindent
\\
\\
Once we have a satisfactory model, the \texttt{h2o.predict()} command can be used to compute and store predictions on the new data, which can then be used for further tasks in the interactive modeling process.
-%[Jessica: I think this code needs work as well]
\begin{spverbatim}
#Perform classification on the held out data
-prediction = h2o.predict(airline.glm, newdata=air_test.hex)
+prediction = h2o.predict(object = best_glm, newdata=air_test.hex)
#Copy predictions from H2O to R
pred = as.data.frame(prediction)
head(pred)
@@ -677,7 +685,7 @@ \section{Appendix: Parameters}
April 29, 2009
\bibitem{strong}
- Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Tay- lor, and Ryan J. Tibshirani
+ Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J. Tibshirani
Strong Rules for Discarding Predictors in Lasso-type Problems
J. R. Statist. Soc. B, vol. 74, 2012.