PyDataProvider English Document

Thanks to caoying for checking the English grammar. ISSUE=4598269 git-svn-id: https://svn.baidu.com/idl/trunk/paddle@1462 1ad973e4-5ce8-4261-8a94-b56d1f490c56
simonway · Aug 31, 2016 · ed67165 · ed67165
1 parent cecdede
commit ed67165

Showing 6 changed files with 294 additions and 176 deletions.
diff --git a/doc/ui/api/py_data_provider_wrapper.rst b/doc/ui/api/py_data_provider_wrapper.rst
diff --git a/doc/ui/data_provider/index.md b/doc/ui/data_provider/index.md
diff --git a/doc/ui/data_provider/index.rst b/doc/ui/data_provider/index.rst
@@ -0,0 +1,42 @@
+PaddlePaddle DataProvider Introduction
+======================================
+DataProvider is a module that loads training or testing data into CPU or GPU
+memory for the subsequent training or testing process.
+
+For simple use cases, users can write a Python :code:`PyDataProvider` that
+dynamically reads the original data in any format and converts it into the
+data format PaddlePaddle requires. This process is extremely flexible and highly
+customizable, at only a small cost in efficiency. It is especially useful when
+you have to generate certain kinds of data dynamically according to, for
+example, the training performance.
+
+Users can also customize a C++ :code:`DataProvider` for more complex usage or
+for higher efficiency.
+
+The following parameters must be defined in the PaddlePaddle network
+configuration file (trainer_config.py): which DataProvider to use, and the
+parameters specific to that DataProvider, including the training file list
+(train.list) and the testing file list (test.list).
+
+train.list and test.list are simply two plain text files that list the paths
+of the training or testing data files. It is recommended to place them directly
+in the training directory and to refer to them by a relative path (relative
+to the PaddlePaddle program).
+
+No testing or evaluation is performed during training if test.list is not set
+or is set to None. Otherwise, PaddlePaddle evaluates the trained model on the
+specified testing data once every testing period (a user-defined command line
+parameter in PaddlePaddle) during training, to help detect over-fitting.
+
+Each line of train.list and test.list is an absolute or relative path (relative
+to where the PaddlePaddle program runs) of a data file. Each line can even be
+an HDFS file path or a SQL connection string, as long as the user's
+DataProvider knows how to access it.
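+
+As a rough illustration of the file list format, the plain-Python sketch below
+writes a throwaway train.list and reads it back one entry per line. It does not
+use PaddlePaddle at all; the helper :code:`read_file_list` and the two data
+file names are hypothetical and only serve the example.
+

```python
import os
import tempfile


def read_file_list(list_path):
    """Return the non-empty lines of a file list, one entry per line."""
    with open(list_path) as f:
        return [line.strip() for line in f if line.strip()]


# Build a throwaway train.list with two hypothetical entries.
workdir = tempfile.mkdtemp()
list_path = os.path.join(workdir, "train.list")
with open(list_path, "w") as f:
    f.write("data/mnist_train_part1.txt\n")
    f.write("data/mnist_train_part2.txt\n")

entries = read_file_list(list_path)
for entry in entries:
    # Each entry is an opaque string; a data provider decides how to open it.
    print(entry)
```

+
+Because every entry is just a string, the same loop would work unchanged if an
+entry were an HDFS path or a connection string instead of a local file name.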
+
+Please refer to the following articles for more details on how to use
+DataProvider and how to implement a new DataProvider:
+
+..  toctree::
+
+    pydataprovider2.rst
+    write_new_dataprovider.rst
diff --git a/doc/ui/data_provider/pydataprovider2.rst b/doc/ui/data_provider/pydataprovider2.rst
@@ -0,0 +1,250 @@
+How to use PyDataProvider2
+==========================
+
+We highly recommend using PyDataProvider2 to provide training or testing
+data to PaddlePaddle. With PyDataProvider2, the user only needs to focus on how
+to read a single sample from the original data file, leaving all of the
+routine work, including transferring data into CPU/GPU memory, shuffling, and
+binary serialization, to PyDataProvider2. PyDataProvider2 uses multithreading
+and a simple but effective cache strategy to optimize the efficiency of the
+data providing process.
+
+DataProvider for the non-sequential model
+-----------------------------------------
+
+Here we use the MNIST handwriting recognition data as an example to illustrate
+how to write a simple PyDataProvider.
+
+MNIST is a handwritten digit classification data set. It contains 70,000
+grayscale images of digits. The labels of the samples range from 0 to 9. All the
+images have been size-normalized and centered into images of the same size,
+28 x 28 pixels.
+
+A small excerpt of the original data is shown below as an example:
+
+.. literalinclude:: ../../../doc_cn/ui/data_provider/mnist_train.txt
+
+Each line of the data contains two parts, separated by ';'. The first part is
+the label of an image. The second part contains the 28x28 pixel values as floats.
+
+Simply write the path of the above data file into train.list. It looks like this:
+
+.. literalinclude:: ../../../doc_cn/ui/data_provider/train.list
+
+The corresponding data provider is shown below:
+
+.. literalinclude:: ../../../doc_cn/ui/data_provider/mnist_provider.py
+   :linenos:
+
+The first line imports the PyDataProvider2 package.
+The main function is the process function, which has two parameters.
+The first parameter is settings, which is not used in this example.
+The second parameter is filename, which is exactly one line of train.list.
+This parameter is passed to the process function by PaddlePaddle.
+
+:code:`@provider` is a Python
+`Decorator <http://www.learnpython.org/en/Decorators>`_ .
+It sets some properties of the DataProvider, and constructs a real PaddlePaddle
+DataProvider from a very simple user-implemented Python function. It does not
+matter if you are not familiar with `Decorator`_. You can keep it simple by
+just treating :code:`@provider` as a fixed mark above the provider function you
+implemented.
+
+`input_types`_ defines the data format that a DataProvider returns.
+In this example, it is set to a 28x28-dimensional dense vector and an integer
+scalar whose value ranges from 0 to 9.
+`input_types`_ can be set to several kinds of input formats. Please refer to the
+documentation of `input_types`_ for more details.
+
+
+The process method is the core part of constructing a real DataProvider in
+PaddlePaddle. It implements how to open the text file, how to read one sample
+from the original text file, how to convert it into `input_types`_, and how to
+give it back to the PaddlePaddle process at line 23.
+Note that the data yielded by the process function must follow the same order
+in which `input_types`_ are defined.
+
+
+With the help of PyDataProvider2, the user can focus on how to generate ONE
+training sample by using the keyword :code:`yield`.
+:code:`yield` is a Python keyword, and a related concept is the
+:code:`generator`.
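+
+To make the role of :code:`yield` concrete, here is a minimal sketch of a
+generator that parses the 'label;pixels' format described above. It is plain
+Python without the :code:`@provider` decorator, so it is not the provider file
+included above. The toy data file, its 4 pixels per image, and the assumption
+that pixel values are separated by spaces are all made up for the example.
+

```python
import os
import tempfile


def process(settings, filename):
    """Yield one (pixels, label) sample per line of a 'label;pixels' file."""
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            label, pixels = line.split(";")
            # Order matters: the dense vector comes first, the label second.
            yield [float(p) for p in pixels.split()], int(label)


# Hypothetical two-line data file with 4 pixels per image instead of 28x28.
path = os.path.join(tempfile.mkdtemp(), "toy_train.txt")
with open(path, "w") as f:
    f.write("5;0.0 0.5 1.0 0.25\n")
    f.write("0;1.0 1.0 0.0 0.0\n")

# The settings argument is unused here, so None is passed in its place.
samples = list(process(None, path))
print(samples)
```

+
+The function never builds a batch or a list of all samples itself. It hands
+back one sample per :code:`yield`, and the caller decides how many to pull.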
+
+Only a few lines of code need to be added to the training configuration file.
+You can take this as an example:
+
+.. literalinclude:: ../../../doc_cn/ui/data_provider/mnist_config.py
+
+Here we specify training data by 'train.list', and no testing data is specified.
+
+This completes the simple example of using PyDataProvider.
+The only thing the user needs to know is how to generate **one sample** from
+**one data file**.
+PaddlePaddle will do all of the rest:
+
+* Form a training batch
+* Shuffle the training data
+* Read data with multithreading
+* Cache the training data (Optional)
+* CPU -> GPU double buffering
+
+Isn't this cool?
+
+DataProvider for the sequential model
+-------------------------------------
+A sequence model takes sequences as its input. A sequence is made up of several
+timesteps. A so-called timestep does not necessarily have anything to do
+with 'time'. It can also be understood as meaning that the order of the data is
+taken into consideration in model design and training.
+For example, a sentence can be interpreted as a kind of sequence data in NLP
+tasks.
+
+Here is an example of a data provider for English sentiment classification data.
+The original input data are simple English text, labeled as positive or
+negative sentiment (marked by 0 and 1 respectively).
+
+A small excerpt of the original data is shown below as an example:
+
+.. literalinclude:: ../../../doc_cn/ui/data_provider/sentimental_train.txt
+
+The corresponding data provider is shown below:
+
+.. literalinclude:: ../../../doc_cn/ui/data_provider/sentimental_provider.py
+
+This data provider for a sequential model is a little more complex than the one
+for the MNIST data set.
+A new initialization method is introduced here.
+The method :code:`on_init` is registered with the DataProvider through
+:code:`@provider`'s :code:`init_hook` parameter, and it is invoked once the
+DataProvider is initialized. The :code:`on_init` function has the following
+parameters:
+
+* The first parameter is the settings object.
+* The remaining parameters are passed as keyword arguments. Some of them are
+  passed by PaddlePaddle, see the reference for `init_hook`_.
+  The :code:`dictionary` object is a Python dict passed from the trainer
+  configuration file, and it maps each word string to a word id.
+
+To pass these parameters into the DataProvider, the following lines should be
+added to the trainer configuration file.
+
+.. literalinclude:: ../../../doc_cn/ui/data_provider/sentimental_config.py
+
+The definition is basically the same as in the MNIST example, except that it:
+
+* Loads the dictionary in this configuration
+* Passes it as a parameter to the DataProvider
+
+The :code:`input_types` is configured in the method :code:`on_init`. This has
+the same effect as configuring it through :code:`@provider`'s
+:code:`input_types` parameter. However, here :code:`input_types` is set at
+runtime, so it can be set to different types according to the input data. The
+input of the neural network is a sequence of word ids, so :code:`seq_type` is
+set to :code:`integer_value_sequence`.
+
+During :code:`on_init`, we save the :code:`dictionary` variable to
+:code:`settings`, and it will be used in :code:`process`. Note that the settings
+parameter of the process function and that of the on_init function are the same
+object.
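+
+The sharing of one settings object between the two functions can be sketched
+in plain Python as follows. The :code:`Settings` class is a hypothetical
+stand-in for the object PaddlePaddle creates, the attribute name
+:code:`word_dict` and the tab-separated "label, sentence" layout are
+assumptions, and :code:`process` takes a list of lines instead of a file name
+to keep the sketch short.
+

```python
class Settings(object):
    """Hypothetical stand-in for the settings object PaddlePaddle creates."""


def on_init(settings, dictionary, **kwargs):
    # Save the dictionary so that process() can reach it later.
    settings.word_dict = dictionary


def process(settings, lines):
    # Each line is assumed to look like "<label>\t<sentence>".
    for line in lines:
        label, sentence = line.strip().split("\t")
        # Words missing from the dictionary are silently skipped.
        word_ids = [settings.word_dict[w] for w in sentence.split()
                    if w in settings.word_dict]
        yield word_ids, int(label)


settings = Settings()
on_init(settings, dictionary={"good": 0, "bad": 1, "movie": 2}, is_train=True)
samples = list(process(settings, ["0\tgood movie", "1\tbad unknown movie"]))
print(samples)
```

+
+Whatever :code:`on_init` attaches to the object is visible inside
+:code:`process`, because both receive the very same instance.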
+
+The basic processing logic is the same as in MNIST's :code:`process` method. Each
+sample in the data file is given back to the PaddlePaddle process.
+
+This covers the basic usage of PyDataProvider.
+Please refer to the following Reference section for details.
+
+Reference
+---------
+
+.. _@provider::
+@provider
++++++++++
+
+:code:`@provider` is a Python `Decorator`_. It constructs a PyDataProvider in
+PaddlePaddle from a user-defined function. Its parameters are:
+
+* `input_types`_ defines the format of the input data.
+* should_shuffle defines whether to shuffle the data or not. By default, it is
+  set to true during training and false during testing.
+* pool_size is the memory pool size (in number of samples) in DataProvider.
+  -1 means no limit.
+* can_over_batch_size defines whether PaddlePaddle can store a few more
+  samples than pool_size. It is better to set it to True to avoid some deadlocks.
+* calc_batch_size is a function that defines how to calculate the batch size.
+  This is useful in sequential models, where the batch size can be counted by
+  sequence or by token. By default, each sample or sequence counts as 1 when
+  calculating the batch size.
+* cache is the data cache strategy, see `cache`_.
+* init_hook is a function that is invoked once the data provider is
+  initialized, see `init_hook`_.
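+
+The difference between counting a batch by sequence and by token can be
+illustrated with a small plain-Python sketch. The function
+:code:`count_tokens` plays the role of a calc_batch_size callback under the
+assumption that it receives one sample and returns its size. The
+:code:`make_batches` loop is a simplified illustration written for this
+example, not PaddlePaddle's actual batching logic.
+

```python
def count_tokens(sample):
    """Hypothetical calc_batch_size: a sample counts as its number of tokens."""
    word_ids, label = sample
    return len(word_ids)


def make_batches(samples, batch_size, calc_batch_size=lambda sample: 1):
    """Group samples until their summed size reaches batch_size."""
    batch, size = [], 0
    for sample in samples:
        batch.append(sample)
        size += calc_batch_size(sample)
        if size >= batch_size:
            yield batch
            batch, size = [], 0
    if batch:
        # Emit whatever is left over as a final, smaller batch.
        yield batch


samples = [([1, 2, 3], 0), ([4], 1), ([5, 6], 0), ([7, 8, 9, 10], 1)]

# Default: every sequence counts as 1, so a batch holds two sequences.
by_sequence = [len(b) for b in make_batches(samples, 2)]
# Token counting: a batch closes once it holds at least three tokens.
by_token = [len(b) for b in make_batches(samples, 3, count_tokens)]
print(by_sequence, by_token)
```

+
+With token counting, long sequences fill a batch on their own while short ones
+are grouped together, which keeps the amount of work per batch more even.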
+
+.. _input_types::
+input_types
++++++++++++
+
+PaddlePaddle has four data types and three sequence types.
+The four data types are:
+
+* dense_vector represents a dense float vector.
+* sparse_binary_vector represents a sparse binary vector. Most of its values
+  are 0, and the nonzero elements are fixed to 1.
+* sparse_float_vector represents a sparse float vector. Most of its values are
+  0, and the nonzero elements can be any float value given by the user.
+* integer_value represents an integer scalar, which is typically used for a
+  label or a word index.
+
+
+The three sequence types are:
+
+* SequenceType.NO_SEQUENCE means the sample is not a sequence.
+* SequenceType.SEQUENCE means the sample is a sequence.
+* SequenceType.SUB_SEQUENCE means the sample is a nested sequence, in which
+  each timestep of the input sequence is also a sequence.
+
+Different input types have different input formats. Their formats are shown
+in the table below.
+
++----------------------+---------------------+-----------------------------------+------------------------------------------------+
+|                      | NO_SEQUENCE         | SEQUENCE                          |  SUB_SEQUENCE                                  |
++======================+=====================+===================================+================================================+
+| dense_vector         | [f, f, ...]         | [[f, ...], [f, ...], ...]         | [[[f, ...], ...], [[f, ...], ...],...]         |
++----------------------+---------------------+-----------------------------------+------------------------------------------------+
+| sparse_binary_vector | [i, i, ...]         | [[i, ...], [i, ...], ...]         | [[[i, ...], ...], [[i, ...], ...],...]         |
++----------------------+---------------------+-----------------------------------+------------------------------------------------+
+| sparse_float_vector  | [(i,f), (i,f), ...] | [[(i,f), ...], [(i,f), ...], ...] | [[[(i,f), ...], ...], [[(i,f), ...], ...],...] |
++----------------------+---------------------+-----------------------------------+------------------------------------------------+
+| integer_value        |  i                  | [i, i, ...]                       | [[i, ...], [i, ...], ...]                      |
++----------------------+---------------------+-----------------------------------+------------------------------------------------+
+
+where f represents a float value and i represents an integer value.
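+
+As a reading aid for the table, the sketch below writes out a few cells as
+plain Python literals. The values are made up. Reading the integers of a
+sparse vector as the indices of its nonzero elements is our interpretation of
+the table, and the helper :code:`to_dense` is hypothetical, added only to show
+what the (i, f) pairs stand for.
+

```python
# dense_vector, NO_SEQUENCE: a flat list of floats.
dense = [0.1, 0.0, 0.7]
# dense_vector, SEQUENCE: one list of floats per timestep.
dense_seq = [[0.1, 0.2], [0.3, 0.4]]
# sparse_binary_vector, NO_SEQUENCE: indices of the elements equal to 1.
sparse_binary = [2, 5]
# sparse_float_vector, NO_SEQUENCE: (index, value) pairs of nonzero elements.
sparse_float = [(2, 0.5), (5, 1.5)]
# integer_value, NO_SEQUENCE: a single integer, e.g. a label.
label = 3
# integer_value, SEQUENCE: one integer per timestep, e.g. word ids.
word_ids = [12, 7, 42]
# integer_value, SUB_SEQUENCE: a sequence whose timesteps are sequences.
nested_word_ids = [[12, 7], [42], [3, 3, 8]]


def to_dense(pairs, dim):
    """Expand (index, value) pairs into a dense list of length dim."""
    out = [0.0] * dim
    for index, value in pairs:
        out[index] = value
    return out


expanded = to_dense(sparse_float, 6)
print(expanded)
```
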
+
+.. _init_hook::
+.. _settings::
+init_hook
++++++++++
+
+init_hook is a function that is invoked once the data provider is initialized.
+Its parameters are as follows:
+
+* The first parameter is a settings object, which is the same object as
+  :code:`settings` in the :code:`process` method. The object contains several
+  attributes, including:
+
+  * settings.input_types: the input types, see `input_types`_.
+  * settings.logger: a logging object.
+
+* The remaining parameters are keyword arguments. They are made up of
+  PaddlePaddle pre-defined parameters and user-defined parameters.
+
+  * The PaddlePaddle pre-defined parameters include:
+
+    * is_train is a bool parameter that indicates whether the DataProvider is
+      used in training or testing.
+    * file_list is the list of all files.
+
+  * User-defined parameters (args) can be set in the training configuration.
+
+Note that PaddlePaddle reserves the right to add pre-defined parameters, so
+please use :code:`**kwargs` in init_hook to ensure compatibility by accepting
+the parameters that your init_hook does not use.
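+
+The effect of :code:`**kwargs` can be seen in a plain-Python sketch. Both hook
+functions, the dict used in place of the settings object, and the parameter
+name :code:`some_future_parameter` are hypothetical. Only :code:`is_train` and
+:code:`file_list` are named in the text above.
+

```python
def tolerant_hook(settings, dictionary, **kwargs):
    # Only 'dictionary' is used; all other keyword arguments land in kwargs.
    settings["word_dict"] = dictionary
    settings["ignored"] = sorted(kwargs)


def strict_hook(settings, dictionary):
    # No **kwargs: any extra keyword argument raises a TypeError.
    settings["word_dict"] = dictionary


# Keyword arguments as a caller might pass them, including one added later.
call_args = dict(dictionary={"a": 0}, is_train=True,
                 file_list=["train.txt"], some_future_parameter=42)

settings = {}  # a dict stands in for the real settings object
tolerant_hook(settings, **call_args)

try:
    strict_hook({}, **call_args)
    strict_failed = False
except TypeError:
    strict_failed = True

print(settings["ignored"], strict_failed)
```
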
+
+.. _cache ::
+cache
++++++
+DataProvider provides two simple cache strategies:
+
+* CacheType.NO_CACHE means no data is cached, so the data is read at runtime by
+  the user-implemented Python module in every pass.
+* CacheType.CACHE_PASS_IN_MEM means the first pass reads data through the
+  user-implemented Python module, and the remaining passes read data directly
+  from memory.
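+
+The difference between the two strategies can be mimicked with a toy reader in
+plain Python. The class :code:`CachingReader` and its :code:`cache_in_mem`
+flag are hypothetical and only imitate the behavior described above. They say
+nothing about how PaddlePaddle implements its cache internally.
+

```python
class CachingReader(object):
    """Toy reader: the first pass runs the generator, later passes may reuse memory."""

    def __init__(self, generator_fn, cache_in_mem):
        self.generator_fn = generator_fn
        self.cache_in_mem = cache_in_mem
        self.cache = None

    def read_pass(self):
        if self.cache_in_mem and self.cache is not None:
            # Later passes: serve a copy of the samples kept in memory.
            return list(self.cache)
        data = list(self.generator_fn())
        if self.cache_in_mem:
            self.cache = data
        return data


calls = {"count": 0}


def generate():
    # Stands in for the user-implemented process function.
    calls["count"] += 1
    for i in range(3):
        yield i


cached = CachingReader(generate, cache_in_mem=True)
passes = [cached.read_pass() for _ in range(3)]
cached_calls = calls["count"]

calls["count"] = 0
uncached = CachingReader(generate, cache_in_mem=False)
for _ in range(3):
    uncached.read_pass()
uncached_calls = calls["count"]

print(cached_calls, uncached_calls)
```

+
+Caching trades memory for speed: the user code runs only once, but every
+sample of a pass has to fit in memory.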