@@ -12,11 +12,11 @@ Getting started: an introduction to machine learning with scikit-learn
Machine learning: the problem setting
---------------------------------------
- In general, a learning problem considers a set of n *samples* of data and
- try to predict properties of unknown data. If each sample is more than a
- single number, and for instance a multi-dimensional entry (aka
- *multivariate* data), is it said to have several attributes, or
- *features*.
+ In general, a learning problem considers a set of n **samples** of
+ data and tries to predict properties of unknown data. If each sample is
+ more than a single number, and for instance a multi-dimensional entry
+ (aka **multivariate** data), it is said to have several attributes,
+ or **features**.
We can separate learning problems into a few large categories:
@@ -46,12 +46,12 @@ We can separate learning problems in a few large categories:
.. topic:: Training set and testing set
- Machine learning is about learning some properties of a data set and
- applying them to new data. This is why a common practice in machine
- learning to evaluate an algorithm is to split the data at hand in two
- sets, one that we call a *training set* on which we learn data
- properties, and one that we call a *testing set*, on which we test
- these properties.
+ Machine learning is about learning some properties of a data set
+ and applying them to new data. This is why a common practice in
+ machine learning for evaluating an algorithm is to split the data
+ at hand in two sets, one that we call a **training set**, on which
+ we learn data properties, and one that we call a **testing set**,
+ on which we test these properties.
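
For example, a minimal sketch of such a split (not part of the original
page; it slices off the last 100 samples for testing, and in practice the
split should be randomized)::

    >>> from sklearn import datasets
    >>> digits = datasets.load_digits()
    >>> X_train, y_train = digits.data[:-100], digits.target[:-100]
    >>> X_test, y_test = digits.data[-100:], digits.target[-100:]
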
Loading an example dataset
@@ -63,65 +63,57 @@ Loading an example dataset
datasets for classification and the `boston house prices dataset
<http://archive.ics.uci.edu/ml/datasets/Housing>`_ for regression.::
- >>> from sklearn import datasets
- >>> iris = datasets.load_iris()
- >>> digits = datasets.load_digits()
+ >>> from sklearn import datasets
+ >>> iris = datasets.load_iris()
+ >>> digits = datasets.load_digits()
A dataset is a dictionary-like object that holds all the data and some
- metadata about the data. This data is stored in the `.data` member, which
- is a `n_samples, n_features` array. In the case of supervised problem,
- explanatory variables are stored in the `.target` member. More details on
- the different datasets can be found in the
- :ref:`dedicated section <datasets>`.
+ metadata about the data. This data is stored in the ``.data`` member,
+ which is a ``n_samples, n_features`` array. In the case of a supervised
+ problem, the response variables are stored in the ``.target`` member. More
+ details on the different datasets can be found in the :ref:`dedicated
+ section <datasets>`.
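
As a quick check of these shapes (a sketch, not part of the original page,
using the ``iris`` data loaded above)::

    >>> iris.data.shape
    (150, 4)
    >>> iris.target.shape
    (150,)
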
- For instance, in the case of the digits dataset, `digits.data` gives
+ For instance, in the case of the digits dataset, ``digits.data`` gives
access to the features that can be used to classify the digits samples::
- >>> print digits.data
- [[ 0. 0. 5. ..., 0. 0. 0.]
- [ 0. 0. 0. ..., 10. 0. 0.]
- [ 0. 0. 0. ..., 16. 9. 0.]
- ...,
- [ 0. 0. 1. ..., 6. 0. 0.]
- [ 0. 0. 2. ..., 12. 0. 0.]
- [ 0. 0. 10. ..., 12. 1. 0.]]
+ >>> print digits.data
+ [[ 0. 0. 5. ..., 0. 0. 0.]
+ [ 0. 0. 0. ..., 10. 0. 0.]
+ [ 0. 0. 0. ..., 16. 9. 0.]
+ ...,
+ [ 0. 0. 1. ..., 6. 0. 0.]
+ [ 0. 0. 2. ..., 12. 0. 0.]
+ [ 0. 0. 10. ..., 12. 1. 0.]]
and `digits.target` gives the ground truth for the digit dataset, that
is the number corresponding to each digit image that we are trying to
- learn:
+ learn::
- >>> digits.target
- array([0, 1, 2, ..., 8, 9, 8])
+ >>> digits.target
+ array([0, 1, 2, ..., 8, 9, 8])
.. topic:: Shape of the data arrays
The data is always a 2D array, `n_samples, n_features`, although
the original data may have had a different shape. In the case of the
digits, each original sample is an image of shape `8, 8` and can be
- accessed using:
+ accessed using::
- >>> digits.images[0]
- array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
- [ 0., 0., 13., 15., 10., 15., 5., 0.],
- [ 0., 3., 15., 2., 0., 11., 8., 0.],
- [ 0., 4., 12., 0., 0., 8., 8., 0.],
- [ 0., 5., 8., 0., 0., 9., 8., 0.],
- [ 0., 4., 11., 0., 1., 12., 7., 0.],
- [ 0., 2., 14., 5., 10., 12., 0., 0.],
- [ 0., 0., 6., 13., 10., 0., 0., 0.]])
+ >>> digits.images[0]
+ array([[ 0., 0., 5., 13., 9., 1., 0., 0.],
+ [ 0., 0., 13., 15., 10., 15., 5., 0.],
+ [ 0., 3., 15., 2., 0., 11., 8., 0.],
+ [ 0., 4., 12., 0., 0., 8., 8., 0.],
+ [ 0., 5., 8., 0., 0., 9., 8., 0.],
+ [ 0., 4., 11., 0., 1., 12., 7., 0.],
+ [ 0., 2., 14., 5., 10., 12., 0., 0.],
+ [ 0., 0., 6., 13., 10., 0., 0., 0.]])
- The :ref:`simple example on this dataset <example_plot_digits_classification.py>`
- illustrates how starting from the original problem one can shape the
- data for consumption in the `scikit-learn`.
-
-
- ``sklearn`` also offers the possibility to reuse external datasets coming
- from the http://mlcomp.org online service that provides a repository of public
- datasets for various tasks (binary & multi label classification, regression,
- document classification, ...) along with a runtime environment to compare
- program performance on those datasets. Please refer to the following example
- for instructions on the ``mlcomp`` dataset loader:
- :ref:`example mlcomp sparse document classification <example_mlcomp_sparse_document_classification.py>`.
+ The :ref:`simple example on this dataset
+ <example_plot_digits_classification.py>` illustrates how starting
+ from the original problem one can shape the data for consumption in
+ `scikit-learn`.
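
A sketch of the reshaping involved (not part of the original page): each
``8, 8`` image is flattened into a feature vector of length 64::

    >>> data = digits.images.reshape((digits.images.shape[0], -1))
    >>> data.shape
    (1797, 64)
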
Learning and Predicting
@@ -132,35 +124,42 @@ hand-written digit from an image. We are given samples of each of the 10
possible classes on which we *fit* an `estimator` to be able to *predict*
the labels corresponding to new data.
- In `scikit-learn`, an *estimator* is just a plain Python class that
+ In `scikit-learn`, an **estimator** is just a plain Python class that
implements the methods `fit(X, Y)` and `predict(T)`.
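
To make this concrete, here is a minimal sketch of such a class (a toy,
hypothetical ``MajorityClassifier``, not part of scikit-learn; it always
predicts the most frequent training label)::

    >>> import numpy as np
    >>> class MajorityClassifier(object):
    ...     def fit(self, X, Y):
    ...         # remember the most frequent label seen during training
    ...         self.majority_ = np.bincount(Y).argmax()
    ...         return self
    ...     def predict(self, T):
    ...         # predict that label for every sample in T
    ...         return np.repeat(self.majority_, len(T))
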
An example of an estimator is the class ``sklearn.svm.SVC`` that
implements `Support Vector Classification
<http://en.wikipedia.org/wiki/Support_vector_machine>`_. The
constructor of an estimator takes as arguments the parameters of the
model, but for the time being, we will consider the estimator as a black
- box and not worry about these:
- >>> from sklearn import svm
- >>> clf = svm.SVC()
+ box::
+
+ >>> from sklearn import svm
+ >>> clf = svm.SVC(gamma=0.001)
+ .. topic:: Choosing the parameters of the model
+
+ In this example we set the value of ``gamma`` manually. It is possible
+ to automatically find good values for the parameters by using tools
+ such as :ref:`grid search <grid_search>` and :ref:`cross validation
+ <cross_validation>`.
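
For illustration, a sketch of such a grid search (not part of the original
page; it assumes the ``GridSearchCV`` helper from ``sklearn.grid_search``,
the module path used in scikit-learn releases of this era)::

    >>> from sklearn.grid_search import GridSearchCV  # doctest: +SKIP
    >>> search = GridSearchCV(svm.SVC(), {'gamma': [0.001, 0.01, 0.1]})  # doctest: +SKIP
    >>> search.fit(digits.data, digits.target)  # doctest: +SKIP
    >>> search.best_estimator_.gamma  # doctest: +SKIP
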
We call our estimator instance `clf` as it is a classifier. It now must
be fitted to the data, that is, it must `learn` from the data. This is
done by passing our training set to the ``fit`` method. As a training
set, let us use all the images of our dataset apart from the last
- one:
+ one::
- >>> clf.fit(digits.data[:-1], digits.target[:-1])
- SVC(kernel='rbf', C=1.0, probability=False, degree=3, coef0=0.0, tol=0.001,
- cache_size=100.0, shrinking=True, gamma=0.000556792873051)
+ >>> clf.fit(digits.data[:-1], digits.target[:-1])
+ SVC(C=1.0, coef0=0.0, degree=3, gamma=0.001, kernel='rbf', probability=False,
+ shrinking=True, tol=0.001)
Now we can predict new values; in particular, we can ask the
classifier to identify the digit of our last image in the `digits` dataset,
- which we have not used to train the classifier:
+ which we have not used to train the classifier::
- >>> clf.predict(digits.data[-1])
- array([ 8.])
+ >>> clf.predict(digits.data[-1])
+ array([ 8.])
The corresponding image is the following:
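
To display that image yourself, here is a sketch (not part of the original
page) using ``pylab``, which the scikit-learn examples of this era rely on::

    >>> import pylab as pl  # doctest: +SKIP
    >>> pl.imshow(digits.images[-1], cmap=pl.cm.gray_r)  # doctest: +SKIP
    >>> pl.show()  # doctest: +SKIP
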
@@ -175,32 +174,35 @@ A complete example of this classification problem is available as an
example that you can run and study:
:ref:`example_plot_digits_classification.py`.
+
Model persistence
-----------------
It is possible to save a model in the scikit by using Python's built-in
- persistence model, namely `pickle <http://docs.python.org/library/pickle.html>`_.
-
- >>> from sklearn import svm
- >>> from sklearn import datasets
- >>> clf = svm.SVC()
- >>> iris = datasets.load_iris()
- >>> X, y = iris.data, iris.target
- >>> clf.fit(X, y)
- SVC(kernel='rbf', C=1.0, probability=False, degree=3, coef0=0.0, tol=0.001,
- cache_size=100.0, shrinking=True, gamma=0.00666666666667)
- >>> import pickle
- >>> s = pickle.dumps(clf)
- >>> clf2 = pickle.loads(s)
- >>> clf2.predict(X[0])
- array([ 0.])
- >>> y[0]
- 0
+ persistence model, namely `pickle <http://docs.python.org/library/pickle.html>`_::
+
+ >>> from sklearn import svm
+ >>> from sklearn import datasets
+ >>> clf = svm.SVC()
+ >>> iris = datasets.load_iris()
+ >>> X, y = iris.data, iris.target
+ >>> clf.fit(X, y)
+ SVC(C=1.0, coef0=0.0, degree=3, gamma=0.25, kernel='rbf', probability=False,
+ shrinking=True, tol=0.001)
+
+ >>> import pickle
+ >>> s = pickle.dumps(clf)
+ >>> clf2 = pickle.loads(s)
+ >>> clf2.predict(X[0])
+ array([ 0.])
+ >>> y[0]
+ 0
In the specific case of the scikit, it may be more interesting to use
- joblib's replacement of pickle, which is more efficient on big data, but
- can only pickle to the disk and not to a string:
+ joblib's replacement of pickle (``joblib.dump`` & ``joblib.load``),
+ which is more efficient on big data, but can only pickle to the disk
+ and not to a string::
- >>> from sklearn.externals import joblib
- >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
+ >>> from sklearn.externals import joblib
+ >>> joblib.dump(clf, 'filename.pkl') # doctest: +SKIP
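
Loading the model back is symmetric (a sketch, not part of the original
page; ``joblib.load`` reads the file written by ``joblib.dump`` above)::

    >>> clf2 = joblib.load('filename.pkl')  # doctest: +SKIP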