Add more documentation datasets (apple#251)

afranklin · web-flow · commit 39e7f27998fb · 2018-02-07T14:37:22.000-08:00
diff --git a/userguide/activity_classifier/README.md b/userguide/activity_classifier/README.md
@@ -8,7 +8,7 @@ The activity classifier in Turi Create creates a deep learning model capable of
 
 #### Introductory Example
 
-In this example we create a model to classify physical activities done by users of a handheld phone, using both accelerometer and gyroscope data. We will use data from the [HAPT experiment](http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions) which contains recording sessions of multiple users, each performing certain physical activities. The performed activities are walking, climbing up stairs, climbing down stairs, sitting, standing, and laying.
+In this example we create a model to classify physical activities done by users of a handheld phone, using both accelerometer and gyroscope data. We will use data from the [HAPT experiment](http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions) which contains recording sessions of multiple users, each performing certain physical activities.[<sup>1</sup>](../datasets.md) The performed activities are walking, climbing up stairs, climbing down stairs, sitting, standing, and laying.
 
 Sensor data can be collected at varying frequencies. In the HAPT dataset, the sensors were sampled at 50Hz each - meaning 50 times per second. However, most applications would want to show outputs to the user at larger intervals. We control the output prediction rate via the ```prediction_window``` parameter. For example, if we want to produce a prediction every 5 seconds, and the sensors are sampled at 50Hz - we would set the ```prediction_window``` to 250 (5 sec * 50 samples per second).
 
@@ -66,4 +66,4 @@ We've seen how we can quickly create an activity classifier given recorded sessi
 
 * [Advanced usage](advanced-usage.md)
 * [Deployment via Core ML](export_coreml.md)
-* [How does it work](how-it-works.md)
+* [How does it work](how-it-works.md)
diff --git a/userguide/activity_classifier/data-preperation.md b/userguide/activity_classifier/data-preperation.md
@@ -1,6 +1,6 @@
 # HAPT Data Preparation
 
-In this section we will see how to get the [HAPT experiment](http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions) data into the SFrame format expected by the activity classifier.
+In this section we will see how to get the [HAPT experiment](http://archive.ics.uci.edu/ml/datasets/Smartphone-Based+Recognition+of+Human+Activities+and+Postural+Transitions) data into the SFrame format expected by the activity classifier.[<sup>1</sup>](../datasets.md)
 
 First we need to download the data from [here](http://archive.ics.uci.edu/ml/machine-learning-databases/00341/HAPT%20Data%20Set.zip) in zip format. The code below assumes the data was unzipped into a directory named `HAPT Data Set`. This folder contains 3 types of files - a file containing the performed activities for each experiment, files containing the collected accelerometer samples, and files containing the collected gyroscope samples.
 
@@ -93,4 +93,4 @@ data = data.remove_column('activity_id')
 data.save('hapt_data.sframe')
 ```
 
-To learn more about the expected input format of the activity classifier please visit the [advanced usage](advanced-usage.md) section.
+To learn more about the expected input format of the activity classifier please visit the [advanced usage](advanced-usage.md) section.
diff --git a/userguide/clustering/dbscan.md b/userguide/clustering/dbscan.md
@@ -50,7 +50,7 @@ advantages:
 
 To illustrate the basic usage of DBSCAN and how the results can differ from
 K-means, we simulate non-spherical, low-dimensional data using the scikit-learn
-datasets module.
+datasets module.[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc
diff --git a/userguide/clustering/kmeans.md b/userguide/clustering/kmeans.md
@@ -28,8 +28,9 @@ distance from point $$x$$ to center $$B$$ when assigning $$x$$ to a cluster.
 
 #### Basic Usage
 
-We illustrate usage of Turi Create K-means with a dataset used to classify 
-schizophrenic subjects based on MRI scans. The original data consists of
+We illustrate usage of Turi Create K-means with the dataset from the [June
+2014 Kaggle competition to classify schizophrenic subjects based on MRI
+scans](https://www.kaggle.com/c/mlsp-2014-mri). Download **Train.zip** from the data tab.[<sup>1</sup>](../datasets.md) The original data consists of
 two sets of features: functional network connectivity (FNC) features and
 source-based morphometry (SBM) features, which we incorporate into a single
 [`SFrame`](https://apple.github.io/turicreate/docs/api/generated/turicreate.SFrame.html)
diff --git a/userguide/datasets.md b/userguide/datasets.md
@@ -0,0 +1,2 @@
+# User Guide Datasets
+Apple has provided links to certain datasets for reference purposes only and on an “as is” basis. You are solely responsible for your use of the datasets and for complying with applicable terms and conditions, including any use restrictions and attribution requirements. Apple shall not be liable for, and specifically disclaims any warranties, express or implied, in connection with, the use of the datasets, including any warranties of fitness for a particular purpose or non-infringement. 
diff --git a/userguide/image_classifier/README.md b/userguide/image_classifier/README.md
@@ -12,16 +12,16 @@ create a high quality image classifier model.
 
 #### Loading Data
 
-Suppose we have a dataset containing labeled cat and dog images.
+The [Kaggle Cats and Dogs Dataset](https://www.microsoft.com/en-us/download/details.aspx?id=54765) provides labeled cat and dog images.[<sup>1</sup>](../datasets.md) After downloading and decompressing the dataset, navigate to the main **kagglecatsanddogs** folder, which contains a **PetImages** subfolder.
 
 ```python
 import turicreate as tc
 
-# Load images
-data = tc.image_analysis.load_images('train', with_path=True)
+# Load images (Note: you can ignore 'Not a JPEG file' errors)
+data = tc.image_analysis.load_images('PetImages', with_path=True)
 
 # From the path-name, create a label column
-data['label'] = data['path'].apply(lambda path: 'dog' if 'dog' in path else 'cat')
+data['label'] = data['path'].apply(lambda path: 'dog' if '/Dog' in path else 'cat')
 
 # Save the data for future use
 data.save('cats-dogs.sframe')
@@ -44,7 +44,8 @@ data =  tc.SFrame('cats-dogs.sframe')
 # Make a train-test split
 train_data, test_data = data.random_split(0.8)
 
-# Automatically picks the right model based on your data.
+# Automatically pick the right model based on your data.
+# Note: Because the dataset is large, model creation may take hours.
 model = tc.image_classifier.create(train_data, target='label')
 
 # Save predictions to an SArray
diff --git a/userguide/image_similarity/README.md b/userguide/image_similarity/README.md
@@ -13,7 +13,7 @@ unsupervised.
 In this example, we use the [Caltech-101
 dataset](http://www.vision.caltech.edu/Image_Datasets/Caltech101/)
 which contains images objects belonging to 101 categories with about 40
-to 800 images per category.
+to 800 images per category.[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc
diff --git a/userguide/recommender/README.md b/userguide/recommender/README.md
@@ -9,7 +9,7 @@ interaction data and use that model to make recommendations.
 Creating a recommender model typically requires a data set to use for
 training the model, with columns that contain the user IDs, the item
 IDs, and (optionally) the ratings. For this example, we use the [MovieLens
- dataset](https://grouplens.org/datasets/movielens/).
+20M dataset](https://grouplens.org/datasets/movielens/20m/).[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc
diff --git a/userguide/sframe/sframe-intro.md b/userguide/sframe/sframe-intro.md
@@ -13,7 +13,7 @@ A very common data format is the comma separated value (csv) file, which
 is what we'll use for these examples.  We will use some preprocessed data from
 the
 [Million Song Dataset](https://labrosa.ee.columbia.edu/millionsong/) to
-aid our SFrame-related examples.  The first table contains metadata
+aid our SFrame-related examples.[<sup>1</sup>](../datasets.md)  The first table contains metadata
 about each song in the database.  Here's how we load it into an SFrame:
 
 ```python
diff --git a/userguide/supervised-learning/boosted_trees_classifier.md b/userguide/supervised-learning/boosted_trees_classifier.md
@@ -10,7 +10,7 @@ decision trees.
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 ```python
 import turicreate as tc
 
diff --git a/userguide/supervised-learning/boosted_trees_regression.md b/userguide/supervised-learning/boosted_trees_regression.md
@@ -51,7 +51,7 @@ The algorithm simply fit a new decision tree to the residual at each iteration.
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc
diff --git a/userguide/supervised-learning/decision_tree_classifier.md b/userguide/supervised-learning/decision_tree_classifier.md
@@ -8,7 +8,7 @@ on decision trees.
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 ```python
 import turicreate as tc
 
diff --git a/userguide/supervised-learning/decision_tree_regression.md b/userguide/supervised-learning/decision_tree_regression.md
@@ -11,7 +11,7 @@ for more details).
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc
diff --git a/userguide/supervised-learning/random_forest_classifier.md b/userguide/supervised-learning/random_forest_classifier.md
@@ -8,7 +8,7 @@ forests.
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 ```python
 import turicreate as tc
 
diff --git a/userguide/supervised-learning/random_forest_regression.md b/userguide/supervised-learning/random_forest_regression.md
@@ -24,7 +24,7 @@ forests, all the base models are constructed independently using a
 
 ##### Introductory Example
 
-In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).
+In this example, we will use the [Mushrooms dataset](https://archive.ics.uci.edu/ml/datasets/mushroom).[<sup>1</sup>](../datasets.md)
 
 ```python
 import turicreate as tc
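
Note on the label rule changed in userguide/image_classifier/README.md: the new lambda tests for the substring `'/Dog'`, which relies on forward-slash separators and on no file name beginning with "Dog". A minimal sketch of a stricter variant is shown below. It assumes the **PetImages/Dog** and **PetImages/Cat** folder layout described in the diff; the helper name `label_from_path` is hypothetical and is not part of Turi Create or of this commit.

```python
from pathlib import PureWindowsPath


def label_from_path(path: str) -> str:
    """Return 'dog' or 'cat' from the name of the image's parent folder.

    Hypothetical helper: assumes images live in folders named 'Dog' and 'Cat'.
    """
    # PureWindowsPath treats both '/' and '\\' as separators,
    # so the same code parses POSIX-style and Windows-style paths.
    parent = PureWindowsPath(path).parent.name.lower()
    return 'dog' if parent == 'dog' else 'cat'


if __name__ == '__main__':
    for p in ('PetImages/Dog/42.jpg', 'PetImages\\Cat\\42.jpg'):
        print(p, label_from_path(p))
```

If this variant were adopted, it could be passed to the same call the diff already uses, for example `data['path'].apply(label_from_path)`.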
