Merge branch 'blis/ops14todo' of https://github.com/Microsoft/CNTK into blis/ops14todo
wilrich-msft committed May 4, 2016
2 parents d916bc8 + daa2203 commit 9d04eda
Showing 2 changed files with 56 additions and 1 deletion.
26 changes: 25 additions & 1 deletion contrib/Python/cntk/tests/context_test.py
@@ -6,7 +6,9 @@

import numpy as np
from ..context import *

from ..ops.cntk2 import Input
from ..sgd import *
from ..reader import *

def test_parse_shapes_1():
output = '''\
@@ -107,3 +109,25 @@ def test_parse_test_result_output():
assert result['eval_node'] == 2.77790430
assert result['crit_node'] == 0.44370050
assert len(result) == 3

def test_export_deferred_context():
X = Input(2)
reader = CNTKTextFormatReader("Data.txt")
my_sgd = SGDParams()

with DeferredExecutionContext() as ctx:
        input_map = reader.map(X, alias='I', dim=2)
ctx.train(
root_nodes=[X],
training_params=my_sgd,
input_map=input_map)

ctx.test(
root_nodes=[X],
input_map=input_map)

ctx.write(input_map=input_map)
ctx.eval(X, input_map)
with open(ctx.export("name")) as config_file:
assert config_file.readlines()[-1] == "command=Train:Test:Write:Eval"

31 changes: 31 additions & 0 deletions contrib/Python/doc/gettingstarted.rst
@@ -285,3 +285,34 @@ that the minibatch layout for the labels and the data with dynamic axes is compa
For the full explanation of how ``lstm_layer()`` is defined, please see the full example in the
Examples section.

How to pass Python data as train/test data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Python CNTK API allows you to pass training / testing data either by specifying external input files or by passing Python data directly to CNTK.
The second alternative - using in-memory Python data - is especially useful if you want to do some quick experimentation with small synthetic data sets.
In what follows you will learn how this data has to be structured.

Let us start with a scenario coming from one of our code examples (`logreg_numpy.py <https://github.com/Microsoft/CNTK/tree/master/contrib/Python/cntk/examples/LogReg/logreg_numpy.py>`_).
In this example we want to classify a 250-dimensional feature vector into one of two classes. In this case we have two *inputs*:

- The feature values for each training item. In the example these are 500 vectors, each of dimension 250.
- The expected class. In this example the class is encoded as a two-dimensional vector where the element for the expected class is set to 1 and the other to 0.

For each of these inputs we have to provide one data structure containing all training instances.

You might notice that this is conceptually different from the case where we provide the data from external files using the CNTKTextFormatReader.
In the input file for the CNTKTextFormatReader we provide the data for the different *inputs* of one instance on the same line, so the data from different inputs are much more intertwined.

In Python, the feature data are represented by a NumPy array of dimension ``number_of_instances X dimension_of_feature_space``, so in our example it is a NumPy array of dimension ``500 X 250``.
Likewise, the expected output is represented by another NumPy array of dimension ``500 X 2``.
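As a quick sketch of what this layout looks like (random data standing in for a real training set; the shapes are the ones from the example above, the variable names are our own):

```python
import numpy as np

# 500 training instances, 250-dimensional features, 2 classes
num_instances, feature_dim, num_classes = 500, 250, 2

# Feature data: one row per training instance
features = np.random.randn(num_instances, feature_dim).astype(np.float32)

# Labels: one-hot encoded, i.e. the element for the expected class is
# set to 1 and the other to 0
expected = np.random.randint(0, num_classes, size=num_instances)
labels = np.zeros((num_instances, num_classes), dtype=np.float32)
labels[np.arange(num_instances), expected] = 1

print(features.shape)  # (500, 250)
print(labels.shape)    # (500, 2)
```

Each of the two arrays would then be passed as the data for one *input*.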

Passing sequence data from Python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

CNTK can handle sequences of arbitrary length. This feature is also called *dynamic-axis*.
To represent an input with a dynamic axis in Python, you have to provide each sequence as a NumPy array whose first axis has a dimension equal to the sequence length.
The complete data set is then just a normal one-dimensional NumPy array of these sequences.

Take as an artificial example a sentence classification problem. Each sentence has a different number of words, i.e. it is a *sequence* of words. The individual words might each be represented by some latent vector.
So each sentence is represented by a NumPy array of dimension ``sequence_length X embedding_dimension``. The whole set of instances (sentences) is then represented by putting them into a one-dimensional array whose size equals the number of instances.

