added lots of stuff/ data options are great now/ users like more docs
dearirenelang committed May 8, 2014
1 parent d9b6037 commit 8d53f5e
Showing 15 changed files with 354 additions and 116 deletions.
125 changes: 12 additions & 113 deletions h2o-docs/source/userguide/data.rst
@@ -3,120 +3,19 @@
Data
=====

Ingesting Data
---------------
Ingesting data is the process of moving data from outside of H\ :sub:`2`\ O into
the running instance of H\ :sub:`2`\ O. To ingest data, start from the **Data**
drop down menu and select the appropriate option. Options and their uses are
described below.

.. toctree::
   :maxdepth: 1

   inspect
   dataviewall
   datasummary
   dataparse
   datainspect
   dataimportfiles
   dataexportfiles
   dataquantiles
   datauploadfiles
   quantiles

**Import Files:**

In the path field, specify an absolute path to the
file, for example: Users/UserName/Work/dataset.csv. Press **Submit**.

On the resulting screen the specified path will appear as a
highlighted link. Clicking on the path automatically parses the
data.

**Import URL:**

Copy the URL where the raw data are displayed into the URL
field. Users may wish to specify a Key; if none is given, one is usually
assigned based on the original file name, and the URL becomes part of
the .hex key. For example, original data can be found at:
http://archive.ics.uci.edu/ml/machine-learning-databases/internet_ads/ad.data

Once the data are imported, users are automatically sent to the
Import URL page, where they can click on the key, which leads to the
Inspect page. Users should not be concerned at this point
if the data do not look as expected; this is corrected when the data are
parsed.

**Import S3:**

In the field marked Bucket, give the path to an existing AWS bucket
where the data are stored.

**Upload:**

Click on the **Select File** button. A menu of files on the
computer or working directory will appear. Select the appropriate
file and click **Choose**. When returned to the H\ :sub:`2`\ O
screen, press **Upload**.



Parsing Data
------------

Once data are ingested, they are available to H\ :sub:`2`\ O, but are
not yet in a format that H\ :sub:`2`\ O can process. Converting the data to
an H\ :sub:`2`\ O usable format is called parsing.

After ingestion, users are directed to a **Request Parse** screen. To
parse data, users can leave most options at their defaults. For example, H\ :sub:`2`\ O
automatically determines separators in data sets. For most data
formats users are automatically redirected to a page to request
parse, where they can simply press Submit. Exceptions to this are
noted below. Once data are parsed, a .hex key is displayed for the
user. This .hex key is used to refer to the data set in all H\ :sub:`2`\ O
analysis, and should be noted. It can also be found at a later time
through the Admin menu by selecting Jobs, or through the **Data**
menu by choosing **View All.**

**Import URL:**

Click on "Parse into .hex format" displayed at the top of
the inspect page after data are inhaled. Import URL takes users
directly to parse.

**Parser Behavior**

The data type in each column must be consistent. For example, when
data are alpha-coded categorical, all entries must be alpha or
alphanumeric. If numeric entries are detected by the parser, the
column will not be processed; it will register all entries as
NA. This is also true when NA entries are included in columns
consisting of numeric data. Columns of alpha-coded categorical
variables containing NA entries will register NA as a distinct
factor level. When missing data are coded as periods or dots in the
original data set, those entries are converted to zero.


Other Data Capabilities
-----------------------

Each of the following actions can be found in the Data drop down
menu.

**Inspect:**

Used to view an ingested or parsed data set. Select Inspect
from the **Data** drop down menu. In the Key field, enter the key or .hex key
associated with the desired data.

**View All:**

Used to view all data sets that have been ingested or
parsed into H\ :sub:`2`\ O. To remove a data set from H\ :sub:`2`\ O,
click on the red X next to the data set key.

**Summary:**

Used to display descriptive statistics and histograms of
any columns within a specific data set. Specify data by the
associated .hex key in the Key field, and select variables of
interest from the resulting list of variables. Summary can be found
under the **Model** drop down menu.




Data Manipulation
------------------

Users who wish to manipulate their data after they have been parsed into
H\ :sub:`2`\ O have a set of tools to do so via H\ :sub:`2`\ O + R, as
sketched below.
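
A minimal sketch of this workflow with the h2o R package is shown below. The
file path, frame name, and column name are hypothetical, and argument names
vary somewhat across H\ :sub:`2`\ O releases, so treat this as illustrative
rather than definitive.

.. code-block:: r

   library(h2o)
   h2o.init()                           # connect to (or start) a local H2O instance

   # Ingest and parse a CSV file; the result is an H2O frame referenced
   # by a .hex key on the cluster.
   df <- h2o.importFile(path = "/Users/UserName/Work/dataset.csv")

   df$label <- as.factor(df$label)      # treat a numeric column as categorical
   summary(df)                          # per-column summary, computed in H2O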

21 changes: 21 additions & 0 deletions h2o-docs/source/userguide/dataexportfiles.rst
@@ -0,0 +1,21 @@
.. _DataExport:

Data: Export Files
====================

Data files can be exported to S3, HDFS, or NFS.

**Src key**

The key associated with the data to be exported.


**Path**

The S3, HDFS, NFS, or URL path to which the data will be
exported.

**Force**

A checkbox option that, when checked, will overwrite existing
files.
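
For users working from R, a rough equivalent of this screen is
``h2o.exportFile``. The sketch below assumes a parsed frame named ``df`` and
a hypothetical export path; the ``force`` argument mirrors the Force
checkbox.

.. code-block:: r

   library(h2o)
   h2o.init()

   df <- h2o.importFile(path = "/Users/UserName/Work/dataset.csv")

   # Write the frame back out; force = TRUE overwrites an existing file,
   # matching the Force checkbox in the browser UI.
   h2o.exportFile(df, path = "/data/exports/dataset_export.csv", force = TRUE)
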
19 changes: 19 additions & 0 deletions h2o-docs/source/userguide/dataimportfiles.rst
@@ -0,0 +1,19 @@


Data: Import Files
====================

In the path field, specify an absolute path to the
file, for example: Users/UserName/Work/dataset.csv. Press **Submit**.

On the resulting screen the specified path will appear as a
highlighted link. Clicking on the path automatically parses the
data.

Import Files also enables users to import data from S3 or a URL.


**Path**

The S3, HDFS, NFS, or URL path from which the data will be
imported.
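
A rough R analogue of this screen is ``h2o.importFile``. The sketch below is
illustrative only: the local path is hypothetical, the URL is the example
data set referenced elsewhere in this guide, and argument names vary between
H2O releases.

.. code-block:: r

   library(h2o)
   h2o.init()

   # Import from a local absolute path (as in the Path field above).
   local_df <- h2o.importFile(path = "/Users/UserName/Work/dataset.csv")

   # The same call generally accepts S3 locations and URLs as well.
   url_df <- h2o.importFile(
     path = "http://archive.ics.uci.edu/ml/machine-learning-databases/internet_ads/ad.data")
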
16 changes: 16 additions & 0 deletions h2o-docs/source/userguide/datainspect.rst
@@ -0,0 +1,16 @@


Data: Inspect Request
=======================


**Src Key**

The source key for the parsed data (with keys usually ending in
.hex).

Once the source key has been specified, an inspect table displaying the parsed
data is returned to the user. Basic summary information is given at
the top, along with click-button options to specify whether each column in the
data is a factor or numeric. For more information, visit :ref:`InspectReturn`.

61 changes: 61 additions & 0 deletions h2o-docs/source/userguide/dataparse.rst
@@ -0,0 +1,61 @@
.. _DataParse:

Data: Parse
====================

Once data are ingested, they are available to H\ :sub:`2`\ O, but are
not yet in a format that H\ :sub:`2`\ O can process. Converting the data to
an H\ :sub:`2`\ O usable format is called parsing.

Parser Behavior
------------------

The data type in each column must be consistent. For example, when
data are alpha-coded categorical, all entries must be alpha or
alphanumeric. If numeric entries are detected by the parser, the
column will not be processed; it will register all entries as
NA. This is also true when NA entries are included in columns
consisting of numeric data. Columns of alpha-coded categorical
variables containing NA entries will register NA as a distinct
factor level. When missing data are coded as periods or dots in the
original data set, those entries are converted to zero.

**In general, options can be left at their defaults and the parser just works.**

**Parser Type**
A drop down menu that allows users to specify whether data are formatted as
CSV, XLS, or SVMlight. This option is best left at its default; the
parser recognizes data formats with rare exceptions.

**Separator**
A list of common separators is given; however, this option is best
left at its default.

**Header**
Checkbox to be checked if the first line of the file being parsed is
a header (includes column names or indices).

**Header From File**
Specify a file key if the header for the data to be parsed is found
in another file that has already been imported to H2O.

**Exclude**
A comma separated list of columns to be omitted from parse.

**Source Key**
The file key associated with the imported data to be parsed.

**Destination Key**
An optional user-specified name by which the parsed data will be referenced
later in modeling. If left at the default, a destination key is
automatically assigned as the original file name with a .hex suffix.

**Preview**
Auto-generated preview of parsed data.

**Delete on done**
A checkbox indicating whether the imported data should be deleted once
parsed. In general, this option is recommended: retaining the unparsed data
takes memory resources but does not aid in modeling, because unparsed data
cannot be acted on by H2O.

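For reference, several of these options can also be set when importing from
R. The sketch below is illustrative only: the path and destination key are
hypothetical, and the argument names (``destination_frame``, ``header``,
``sep``) follow recent h2o R package releases and may differ in older
versions.

.. code-block:: r

   library(h2o)
   h2o.init()

   # Import and parse in one step, overriding a few parser defaults:
   #   - destination_frame plays the role of the Destination Key field
   #   - header and sep correspond to the Header checkbox and Separator menu
   df <- h2o.importFile(path = "/Users/UserName/Work/dataset.csv",
                        destination_frame = "dataset.hex",
                        header = TRUE,
                        sep = ",")
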
48 changes: 48 additions & 0 deletions h2o-docs/source/userguide/dataquantiles.rst
@@ -0,0 +1,48 @@
.. _DataQuantiles:

Data: Quantiles (Request)
==========================

**Source Key**

The key associated with the data set of interest.

**Column**

The column of interest.

**Quantile**

A value bounded on the interval (0, 1) giving the fraction of the data
that should fall below the returned value. For instance, if the
quantile .25 is requested, the value returned will be the value
within the range of the column of data below which 25% of the data
fall.

**Max Qbins**

The number of bins into which the column should be split before the
quantile is calculated. As the number of bins approaches the number
of observations, the approximate solution approaches the exact
solution.

**Multiple Pass**

Only three possible entries:

*0*: Calculate the best approximation of the requested quantile in
one pass.

*1*: Return the exact result (with a maximum of 16 passes).

*2*: Return both a single-pass approximation and the multi-pass exact
answer.

**Interpolation Type**

When the quantile falls between two values in the data, it is necessary
to interpolate the true value of the quantile. This can be done by
mean interpolation or linear interpolation.

*2*: Mean interpolation
*7*: Linear interpolation

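From R, a similar request can be made on a parsed column with
``h2o.quantile``. The frame and column names below are hypothetical, and the
binning and interpolation behavior follows whatever defaults the installed
h2o package applies.

.. code-block:: r

   library(h2o)
   h2o.init()

   df <- h2o.importFile(path = "/Users/UserName/Work/dataset.csv")

   # Request the 25th, 50th, and 75th percentiles of one numeric column.
   h2o.quantile(df$value, probs = c(0.25, 0.50, 0.75))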


31 changes: 31 additions & 0 deletions h2o-docs/source/userguide/datasummary.rst
@@ -0,0 +1,31 @@


Data: Summary (Request)
==========================

Summary returns a detailed column-by-column summary of parsed
data. For more information on the returned information, see
:ref:`Summary`.

**Source**

The .hex key associated with the data to be summarized.

**Cols**

If a summary of only a subset of columns is desired, specify that subset
here. The default is to return a summary for all columns.

**Max Ncols**

The maximum number of columns to be summarized.

**Max Qbins**

The number of bins for quantiles. When large data are parsed, they
are also binned and distributed across a cluster. When data are
multimodal (or otherwise distinctly shaped), increasing the number
of bins will allocate fewer data points to each bin and thus
increase the accuracy of the quantiles returned. Increasing the
number of bins for extremely large data can slow results depending
on the memory allocated to computational tasks.
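
From R, a rough equivalent of this request is the ``summary`` method on a
parsed frame; restricting it to a few columns plays the same role as the
Cols option. The column names here are hypothetical.

.. code-block:: r

   library(h2o)
   h2o.init()

   df <- h2o.importFile(path = "/Users/UserName/Work/dataset.csv")

   summary(df)                          # summarize all columns
   summary(df[, c("age", "income")])    # summarize only a chosen subset of columns
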
9 changes: 9 additions & 0 deletions h2o-docs/source/userguide/datauploadfiles.rst
@@ -0,0 +1,9 @@


Data: Upload Files
====================

Upload Files enables users to upload data from their local computer
or server. Click on *Select File*, and an upload helper will appear to
walk users through their file structure to find the data to be
uploaded and parsed.
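
A rough R analogue pushes a file from the machine running the R client up to
the H2O cluster; ``h2o.uploadFile`` serves this purpose, shown here with a
hypothetical local path.

.. code-block:: r

   library(h2o)
   h2o.init()

   # Upload (rather than import) sends the file through the client, which is
   # convenient when the H2O cluster cannot see the client's filesystem.
   df <- h2o.uploadFile(path = "/Users/UserName/Downloads/dataset.csv")
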
9 changes: 9 additions & 0 deletions h2o-docs/source/userguide/dataviewall.rst
@@ -0,0 +1,9 @@


Data: View All
==================

Users can view all keys and associated data by selecting the **View
All** option from the **Data** drop down menu. Keys are listed in the
far left column, and can be removed from the cluster by clicking on
the large red X next to the key name.
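
A rough R analogue of this screen is shown below; ``h2o.ls`` lists the keys
currently held by the cluster, and ``h2o.rm`` removes one, much like the red
X in the browser. The key name is hypothetical, and the exact signature of
``h2o.rm`` has changed between H2O releases.

.. code-block:: r

   library(h2o)
   h2o.init()

   h2o.ls()                 # list all keys and associated data in the cluster
   h2o.rm("dataset.hex")    # remove a frame by key, like clicking the red X
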
2 changes: 1 addition & 1 deletion h2o-docs/source/userguide/general.rst
@@ -24,7 +24,7 @@
of this documentation. Advanced users may find additional documentation on
running in specialized environments helpful: :ref:`Developer`.

For multinode clusters utilizing several servers, it is strongly
recommended that all servers and nodes be symmetric and identically
configured. For example, allocating different amounts of memory to
nodes in the same cluster can adversely impact performance.
