updated notebooks

ayalhassan · Jan 4, 2018 · c3171ce
1 parent bdba20f
Showing 5 changed files with 150 additions and 18 deletions.
diff --git a/notebooks/Lesson-0-Getting-Started.ipynb b/notebooks/Lesson-0-Getting-Started.ipynb
@@ -6,15 +6,17 @@
    "source": [
     "# Overview\n",
     "\n",
-    "This is a class for engineers who want to apply machine learning/deep learning to real world problems.  It assumes no prior experience with machine learning or deep learning or scientific computing.  Unlike other introductory courses it covers the real world pitfalls like messy data preparation, comparing models to baselines and deploying models.  The material works on real-world data sets and real wolrd applications.  I'm passionate about teaching this material becuase I think machine learning is incredibly powerful and it still isn't used as much as it could be and I believe the limiting factor is a lack of people who know how to really make it work.\n",
+    "This is a class for engineers who want to apply machine learning/deep learning to real-world problems.  \n",
     "\n",
-    "This course covers much more practice than theory because there are many excellent resources online to learn the theory.  \n",
+    "The material assumes no prior experience with machine learning or deep learning and tries to keep the math prerequisites to a minimum.  Unlike other introductory courses, it covers real-world pitfalls like messy data preparation, comparing models to baselines and deploying models.  All of the material uses real-world data sets, real-world applications and best-in-class libraries.  I'm passionate about teaching this material because I think machine learning is incredibly powerful and still isn't used as much as it could be, and I believe the limiting factor is a lack of people who know how to really make it work.\n",
+    "\n",
+    "Personally, I love machine learning theory as well, but this course covers much more practice than theory because there are many excellent resources online to learn the theory.  \n",
     "\n",
     "I strongly believe in learning by doing and learning by experimentation and I hope putting this course in a Jupyter notebook encourages students to explore the material and ask questions.  It should be possible to work through each notebook in under 30 minutes and if you get through all the notebooks you will learn a lot.  \n",
     "\n",
     "## Language\n",
     "\n",
-    "I over everything in python because python is by far the most popular language for deep learning.  If you haven't done any python but have done a lot of programming, you shouldn't find python too hard to pick up.  If you haven't done any programming before, I might recommend checking out Learn Python the Hard Way (https://learnpythonthehardway.org/) or many of the online courses, or if you feel ambitious, you can try to just jump right in.\n",
+    "I cover everything in python because python is by far the most popular language for deep learning.  If you haven't done any python but have done a lot of programming, you shouldn't find python too hard to pick up.  If you haven't done any programming before, I recommend checking out Learn Python the Hard Way (https://learnpythonthehardway.org/) or one of the many online courses, or if you feel ambitious, you can try to just jump right in.\n",
     "\n",
     "## Libraries\n",
     "\n",
@@ -31,11 +33,11 @@
     "\n",
     "If you want to jump right in and skip this section - go for it!  It might make more sense once you have finished the courses.\n",
     "\n",
-    "### What is Supervised Machine Learning\n",
+    "### Supervised Machine Learning\n",
     "\n",
     "Most of the real world applications of Machine Learning are instances of *supervised machine learning*.  In supervised machine learning you take *training data* which includes inputs and outputs and you build a model that predicts outputs from new inputs.\n",
     "\n",
-    "You have probably done linear regression in your life.  We don't always think of that as machine learning but it's definitely a simple type of machine learning.\n",
+    "You have probably done a linear regression in your life.  We don't always think of that as machine learning but it's definitely a simple type of machine learning.\n",
     "\n",
     "Let's do a simple example where we do a regression to predict regional house prices based on the size of houses.\n",
     "\n",
@@ -112,6 +114,8 @@
    "source": [
     "Here the black dots are data points and the blue line is our model.  If we feed 6.25 into our model, it will output 24.  If we feed 7 into our model it will output around 30. \n",
     "\n",
+    "### More Complicated Regression\n",
+    "\n",
     "We can build a more complicated model on just these data points in a lot of different ways."
    ]
   },
@@ -168,6 +172,8 @@
    "source": [
     "Here we used a modeling tool called a Gaussian Process to draw a line through the points.  It's no longer linear and it might fit our data better.\n",
     "\n",
+    "### Too-Complicated Regression\n",
+    "\n",
     "We can draw an infinite number of lines through our data.  Here's another example:"
    ]
   },
@@ -241,6 +247,21 @@
     "I typically start teaching the fundamentals of machine learning without deep learning because the models are a little faster to run and there is a lot of overlap.  But I try to get to deep learning quickly because I know that's what students are excited about and that's where most of the best models are these days."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Key Takeaways\n",
+    "\n",
+    "1. Supervised machine learning/deep learning means you build a model from training data.\n",
+    "2. Deep learning is a type of machine learning.\n",
+    "\n",
+    "## Questions\n",
+    "\n",
+    "1. Why aren't deep learning/neural networks the most commonly used class of algorithms?\n",
+    "2. Why isn't a more complicated model always better?"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,

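Illustrative sketch for Lesson 0 (not part of this commit): the notebook's regression cells are not shown in this diff, so the following is only a rough guess at what "doing a linear regression" on size/price pairs involves. The data points are made up, chosen so the fitted line reproduces the outputs quoted in the text (6.25 gives 24, 7 gives 30), and `fit_line` and `predict` are hypothetical helper names, not functions from the notebook.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one input variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of the inputs, and how the inputs and outputs vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply the fitted line to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical training data (NOT the notebook's): house size -> regional price.
sizes = [5.0, 6.0, 7.0, 8.0]
prices = [14.0, 22.0, 30.0, 38.0]

model = fit_line(sizes, prices)   # "training": learn the line from inputs and outputs
print(predict(model, 6.25))       # "prediction": feed a new input into the model
```

The split between `fit_line` and `predict` mirrors the supervised learning description in the lesson: training data goes in once, and the resulting model is then asked about inputs it has not seen.
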
diff --git a/notebooks/Lesson-1-Sentiment-Analysis-Data-Exploration.ipynb b/notebooks/Lesson-1-Sentiment-Analysis-Data-Exploration.ipynb
@@ -19,7 +19,7 @@
     "\n",
     "A useful application of machine learning is \"sentiment analysis\".  Here we are trying to determine if a person feels positively or negatively about what they're writing about.  One important application of sentiment analysis is for marketing departments to understand what people are saying about them on social media.  Nearly every medium or large company with any sort of social media presence does some sort of sentiment analysis like the task we are about to do.\n",
     "\n",
-    "Here we have a collection of tweets from the tech conference SXSW talking about apple brands.  These tweet are hand labeled by humans using a tool I built called CrowdFlower.  Our goal is to build a classifier that can generalize the human labels to more tweets.\n",
+    "Here we have a collection of tweets from the tech conference SXSW talking about Apple brands.  These tweets were hand-labeled by humans using a tool I built called [CrowdFlower](https://crowdflower.com).  Our goal is to build a classifier that can generalize the human labels to more tweets.\n",
     "\n",
     "The labels are what's known as training data, and we're going to use it to teach our classifier what text is positive sentiment and what text is negative sentiment.\n",
     "\n",
@@ -61,12 +61,14 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Ok, that looks good.  Let's open the file with some python"
+    "Ok, that looks good - if a little messy.  Let's open the file with some python\n",
+    "\n",
+    "## Loading Data"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": 2,
    "metadata": {},
    "outputs": [
     {
@@ -115,7 +117,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": 3,
    "metadata": {},
    "outputs": [
     {
@@ -192,7 +194,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 25,
+   "execution_count": 12,
    "metadata": {},
    "outputs": [
     {
@@ -205,7 +207,7 @@
        "Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: object"
       ]
      },
-     "execution_count": 25,
+     "execution_count": 12,
      "metadata": {},
      "output_type": "execute_result"
     }
@@ -314,20 +316,84 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Ok - the ipad2 was released in 2011, these tweets must be from 2011."
+    "Ok - the iPad 2 was released in 2011, so these tweets must be from 2011.\n",
+    "\n",
+    "## Data Cleanup\n",
+    "\n",
+    "If we dig into the data set, one thing we'll notice is that some of the tweets are actually empty.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "nan\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(tweets[6])"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "## Conclusion\n",
+    "It is best practice not to change the input data.  It's better to clearly show the ways that you've modified your data in your code.  In this case, we can use pandas to easily keep only the rows where the tweets are not empty.  Here we are indexing into our data with the boolean mask returned by pd.notnull - this notation is really convenient."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fixed_tweets = tweets[pd.notnull(tweets)]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We also need to remove the same rows of labels so that our \"tweets\" and \"target\" lists have the same length."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "fixed_target = target[pd.notnull(tweets)]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Take a second to think about why I wrote \n",
+    "*fixed_target = target[pd.notnull(tweets)]* instead of *fixed_target = target[pd.notnull(target)]*"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Key Takeaways\n",
+    "\n",
+    "1. The most important thing to do when building a machine learning model is to actually look at your data.  \n",
+    "2. Clean up your data in code, not in the original file.\n",
     "\n",
-    "The two most important things to do when building a machine learning model are:\n",
-    "1. Look at your data\n",
-    "2. Get something working end-to-end that you can iterate on as soon as you can\n",
+    "## Questions\n",
     "\n",
-    "Hopefully we have a good sense of the dataset we're working with, but we can come back to our tools as needed.\n"
+    "1. How messy is this data?  It was labeled by humans - how many mislabels?\n",
+    "2. Why is there a \"Can't Tell\" label - what kind of tweets get that?\n",
+    "3. Are all the tweets in English?"
    ]
   },
   {
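
Illustrative sketch for Lesson 1 (not part of this commit): a small, self-contained version of the `pd.notnull` masking step added above, using made-up stand-ins for the notebook's `tweets` and `target` columns. The values and the `keep` and `wrong_target` names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the notebook's columns (the real data is not in this diff).
tweets = pd.Series(["love my new ipad", np.nan, "battery died again"])
target = pd.Series(["positive", "none", "negative"])

# Boolean mask: True where a tweet is present, False where it is NaN.
keep = pd.notnull(tweets)

fixed_tweets = tweets[keep]
fixed_target = target[keep]   # same mask, so each label stays paired with its tweet

# Masking target by its OWN nulls drops nothing here, so the lengths no longer match.
wrong_target = target[pd.notnull(target)]

print(len(fixed_tweets), len(fixed_target), len(wrong_target))
```

In this toy case the label for the empty tweet is itself not null, which is one reason the mask has to come from `tweets` and not from `target`.
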

diff --git a/notebooks/Lesson-2-Feature-Extraction.ipynb b/notebooks/Lesson-2-Feature-Extraction.ipynb
@@ -183,6 +183,29 @@
    "source": [
     "Great!  Now we have a feature matrix that we can feed in to our machine learning algorithm.  It has 9092 rows corresponding to 9092 tweets and 9706 columns corresponding to 9706 words."
    ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Takeaways\n",
+    "\n",
+    "1. Nearly all machine learning algorithms take the same kind of input - a list of fixed-length vectors of numbers - also known as a feature matrix.\n",
+    "2. Data almost never comes as a list of fixed-length vectors, so this transformation is critical and highly application-dependent.\n",
+    "3. When dealing with text data, \"bag of words\" is a common way to do feature extraction.\n",
+    "\n",
+    "## Questions\n",
+    "\n",
+    "1. What would be another way to transform text?\n",
+    "2. What information is lost in the \"bag-of-words\" transformation?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
   }
  ],
  "metadata": {

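Illustrative sketch for Lesson 2 (not part of this commit): the notebook's own feature extraction code is not shown in this diff, so this is a plain-Python approximation of the "bag of words" idea from the takeaways, with whitespace tokenization as a simplifying assumption. `build_vocabulary` and `bag_of_words` are hypothetical helper names.

```python
def build_vocabulary(docs):
    """Map each distinct lowercase token to a column index, in first-seen order."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def bag_of_words(docs, vocab):
    """Return one fixed-length count vector per document; unknown words are ignored."""
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for word in doc.lower().split():
            if word in vocab:
                row[vocab[word]] += 1
        matrix.append(row)
    return matrix


# Made-up example documents.
docs = ["the ipad is great", "the line is the worst"]
vocab = build_vocabulary(docs)
features = bag_of_words(docs, vocab)   # rows = documents, columns = words

# Reordering the words gives the same row: word order is lost.
shuffled = bag_of_words(["great is the ipad"], vocab)
print(features, shuffled)
```

Every row has the same length no matter how long the original text was, which is what makes the result usable as a feature matrix.
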
diff --git a/notebooks/Lesson-3-First-Classifier.ipynb b/notebooks/Lesson-3-First-Classifier.ipynb
@@ -192,6 +192,23 @@
     "But before we get too fancy, we need to put in place a framework to evaluate our algorithms.\n"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Takeaways\n",
+    "\n",
+    "1. There are many types of algorithms, but they all generally have the same API, so one great way to pick an algorithm is by trial and error.\n",
+    "2. Algorithms generally have similar accuracy if configured properly and given good features.\n",
+    "3. Speed of training and speed of runtime are really important things to consider when choosing an algorithm.\n",
+    "4. SVMs can work great for text data, but the runtime usually gets slower with more training data.\n",
+    "\n",
+    "## Questions\n",
+    "\n",
+    "1. Is there a better way we could have done the feature extraction step?\n",
+    "2. What happens when we see a new word that wasn't in the training data?\n"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,

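Illustrative sketch for the evaluation framework that Lesson 3 closes on and Lesson 4 develops (not part of this commit): a majority-class baseline and a simple k-fold split, written in plain Python. The labels are made up, the folds are contiguous with no shuffling (a simplification), and `baseline_accuracy` and `k_fold_indices` are hypothetical helper names, not the notebook's code.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


# Made-up, heavily imbalanced labels.
labels = ["no quake"] * 9 + ["quake"]
print(baseline_accuracy(labels))

# Each sample lands in exactly one test fold and is used for training in the others.
folds = list(k_fold_indices(len(labels), 5))
print(folds[0])
```

A model is only interesting if it beats `baseline_accuracy` on the held-out folds; with imbalanced labels like these, that bar can be surprisingly high.
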
diff --git a/notebooks/Lesson-4-Evaluating-Classifiers.ipynb b/notebooks/Lesson-4-Evaluating-Classifiers.ipynb
@@ -303,7 +303,12 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "Check your understanding:\n",
+    "## Takeaways\n",
+    "1. When building a machine learning model, get to the evaluation step as quickly as possible.\n",
+    "2. Always compare your model's performance against a baseline.\n",
+    "3. If you don't have an infinite amount of data, use cross-validation to evaluate performance.\n",
+    "\n",
+    "## Questions\n",
     "\n",
     "1. Imagine we were trying to predict whether or not an earthquake was going to happen tomorrow - what would the baseline accuracy be?\n",
     "2. In what scenario would you want to do more folds of cross-validation?  When would you want to do less?\n",