diff --git a/README.md b/README.md index aca93a6..e899e5c 100644 --- a/README.md +++ b/README.md @@ -419,7 +419,7 @@ Tuesday | Thursday ----- ### Class 15: Natural Language Processing -* Yelp review text homework due (solution) +* Yelp review text homework due ([solution](notebooks/14_yelp_review_text_homework.ipynb)) * Natural language processing ([notebook](notebooks/15_natural_language_processing.ipynb)) * Introduction to our [Kaggle competition](https://inclass.kaggle.com/c/dat8-stack-overflow) * Create a Kaggle account, join the competition using the invitation link, download the sample submission, and then submit the sample submission (which will require SMS account verification). diff --git a/notebooks/14_yelp_review_text_homework.ipynb b/notebooks/14_yelp_review_text_homework.ipynb new file mode 100644 index 0000000..1cc5ffa --- /dev/null +++ b/notebooks/14_yelp_review_text_homework.ipynb @@ -0,0 +1,965 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Naive Bayes homework with Yelp review text" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 1\n", + "\n", + "Read `yelp.csv` into a DataFrame." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
business_iddatereview_idstarstexttypeuser_idcoolusefulfunny
09yKzy9PApeiPPOUJEtnvkg2011-01-26fWKvX83p0-ka4JS3dc6E5A5My wife took me here on my birthday for breakf...reviewrLtl8ZkDX5vH5nAx9C3q5Q250
\n", + "
" + ], + "text/plain": [ + " business_id date review_id stars \\\n", + "0 9yKzy9PApeiPPOUJEtnvkg 2011-01-26 fWKvX83p0-ka4JS3dc6E5A 5 \n", + "\n", + " text type \\\n", + "0 My wife took me here on my birthday for breakf... review \n", + "\n", + " user_id cool useful funny \n", + "0 rLtl8ZkDX5vH5nAx9C3q5Q 2 5 0 " + ] + }, + "execution_count": 1, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# access yelp.csv using a relative path\n", + "import pandas as pd\n", + "yelp = pd.read_csv('../data/yelp.csv')\n", + "yelp.head(1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 2\n", + "\n", + "Create a new DataFrame that only contains the 5-star and 1-star reviews." + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# filter the DataFrame using an OR condition\n", + "yelp_best_worst = yelp[(yelp.stars==5) | (yelp.stars==1)]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 3\n", + "\n", + "Split the new DataFrame into training and testing sets, using the review text as the only feature and the star rating as the response." + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# define X and y\n", + "X = yelp_best_worst.text\n", + "y = yelp_best_worst.stars" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# split into training and testing sets\n", + "from sklearn.cross_validation import train_test_split\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 4\n", + "\n", + "Use CountVectorizer to create document-term matrices from X_train and X_test." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# import and instantiate the vectorizer\n", + "from sklearn.feature_extraction.text import CountVectorizer\n", + "vect = CountVectorizer()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# fit and transform X_train, but only transform X_test\n", + "X_train_dtm = vect.fit_transform(X_train)\n", + "X_test_dtm = vect.transform(X_test)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 5\n", + "\n", + "Use Naive Bayes to predict the star rating for reviews in the testing set, and calculate the accuracy." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" + ] + }, + "execution_count": 7, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# import/instantiate/fit\n", + "from sklearn.naive_bayes import MultinomialNB\n", + "nb = MultinomialNB()\n", + "nb.fit(X_train_dtm, y_train)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# make class predictions\n", + "y_pred_class = nb.predict(X_test_dtm)" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.918786692759\n" + ] + } + ], + "source": [ + "# calculate accuracy\n", + "from sklearn import metrics\n", + "print metrics.accuracy_score(y_test, y_pred_class)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 6\n", + "\n", + "Calculate the AUC." + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([5, 5, 5, 5, 5, 1, 1, 5, 5, 5], dtype=int64)" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# y_test contains fives and ones, which will confuse the roc_auc_score function\n", + "y_test[:10].values" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "array([1, 1, 1, 1, 1, 0, 0, 1, 1, 1])" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# create y_test_binary, which contains ones and zeros instead\n", + "import numpy as np\n", + "y_test_binary = np.where(y_test==5, 1, 0)\n", + "y_test_binary[:10]" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# predict class probabilities\n", + "y_pred_prob = nb.predict_proba(X_test_dtm)[:, 1]" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.940353585141\n" + ] + } + ], + "source": [ + "# calculate the AUC using y_test_binary and y_pred_prob\n", + "print metrics.roc_auc_score(y_test_binary, y_pred_prob)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 7\n", + "\n", + "Plot the ROC curve." + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "%matplotlib inline\n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + }, + { + "data": { + "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYYAAAEPCAYAAABGP2P1AAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3XmcHFW99/HPN4FAIAkxBBQjEGUz7HuCIEwAISwKosjD\nIoJ65YVsXrlXNpe56gUR3ACvILLpg4AgXqJAEIEBDAhiQgiQYPJAZA1bQhIhwMT8nj9OTdLTmemu\nmXR190y+79drXl1Vfbrql8pM/frUqXOOIgIzM7MOAxodgJmZNRcnBjMz68SJwczMOnFiMDOzTpwY\nzMysEycGMzPrpNDEIOlKSS9Lml6hzEWSZkmaJmmHIuMxM7Pqiq4xXAVM6O5NSQcCm0bEZsCXgJ8V\nHI+ZmVVRaGKIiPuB+RWKfAK4Jiv7EDBc0nuLjMnMzCprdBvDKOC5kvXngQ80KBYzM6PxiQFAZese\no8PMrIFWa/DxXwA2LFn/QLatE0lOFmZmvRAR5V++q2p0YpgInAxcL2kc8EZEvNxVQQ/2l7S2ttLa\n2troMJpCXz4Xr74KM2cuX//0p+GVV9LywQd3/ZkFC2C//WDo0BXfu/32Vg44oLXmcS5ZAnvs0fUx\nuzNkCGy0Uc1Dya0v/17UmtTjnAAUnBgkXQfsBYyU9BzwLWB1gIi4LCJuk3SgpNnAm8DxRcZjlsfT\nT8PLL8Of/wyrr16bfc6cCZddBmuskdbfeQfWXRfGjEnrO+0EEyfCar38i5w/H047rTaxmhWaGCLi\nyBxlTi4yBut/3norXQgXLoQXshuPb76ZLuQDB6b1hx9OZSZNgsWLYZ118u373XfT5zbaKH1m771r\nE/Maa8AVV8BRRy3ftuaatdm3Wa01+laS9VBLS0ujQ1gpEXDnnfDMMzB1ar5v5LNmwbRpMHx4Wu+4\nBbPuui3ccENanj8/fQPvuJAvWQKbbQb77w/jx8OgQfljXGstGDYsf/lm0Nd/L2rJ52LlqS/cu5cU\nfSHOVcmLL8Ls2V2/9+yz6eI9IHvmra0tfbtfay144w2YMQMOOADWXhv23DPf8TbYALbaKi0PGJAu\n+gOa4Zk6syYmqVeNz04Mtsy0aXD33fnKXnst/POfsP76K763eDGstx6MG5fW29th7FgYMSKtjxwJ\nm29em5jNrHtODJZLWxvMmbPitmuuSctjxqTbL9VIcOqpMHp0beMzs9pxYrAuRaSG2YkTYfLk9E1/\n770730NfuhRaWuCgg9I3fd+iMesfnBgMSE/VfP/76XFIgP/9X3j88bR89NHpnv7xx9fuMUwza15O\nDMa//Rv8/e8pEXzlK8u3H3oobLNN4+Iys8bobWLw46r9wAMPpKeAfvWrdKtok01g++0bHZWZ9VWu\nMfRhF18Ml1ySagm77QY77ww/+tHyTl5mtmpzjaGfe+QReOIJOPfc5R29nnkGTjkFjjwy1RJ6OSyK\nmVknrjE0uXHj4KGH0vLYsbDppikZdNh2Wxg8uDGxmVlzc42hn2hvT72Kr7sOpkxJDclPPZUSgh8j\nNbN68KWmyVxwQaoFfPvbaSC3X/7SScHM6ss1hibzzjvw1a/Ct77V6EjMbFXlxNBgCxbAMcekjmmQ\nBqY77riGhmRmqzgnhga65ZbUhvC3v8FVVy3fvvPOjYvJzMxPJTXQ4MFwxBGpV/Lppzc6GjPrbzwk\nRh/w0kvw5JPwwx+mPgh//3sa4K5jukczs1ry46pN4h//gNdfT8sLF6YRTWfNSvMI339/mpxmyy1T\nR7UttnBSMLPm4xpDjW2ySZrLd4010kQ2w4alqSXf+970GOpuu8HQoY2O0sxWBa4xNIklS+C222Dj\njRsdiZlZ77jblJmZdeLEUCOf+1waxG7u3NSOYGbWVzkx1Mi8eWn6zHfeSZPdm5n1VW5jWEnPP58m\nynnxxUZHYmZWG04MK+nSS9O8yltvnR5DNTPr65wYemnuXLj6avjzn9NEOeec0+iIzMxqw20MvXTv\nvXDNNalfwsEHNzoaM7PacQe3Hlq0CM44A2bOhPXWgxtuaHREZmZdcwe3OrjgApgxI3Vg+853PAqq\nmfVPrjHk8NxzaQC8j3wkJYQxY+DQQxsWjplZLh5dtUDjxqVbSO9/P9xxh6fZNLO+wbeSCnDeeelx\n1JdeSpPpbLNNoyMyMytexcQgaX3gcGBPYDQQwD+A+4AbI+KVogNspFmz4KST4OijYdSoRkdjZlYf\n3SYGSVcAmwC3A5cCLwECNgB2BX4jaXZEfLEegTbKyJFOCma2aqlUY/hJRDzWxfYZwN3A9yRtW0xY\njdXenmZYW7Cg0ZGYmdVft82oHUlB0scldVmum8SxjKQJkmZKmiXpjC7eHylpkqRHJT0u6bgexl+I\nyy+HXXeF6dNh9OhGR2NmVl95nq/5P8BsSd+X9OG8O5Y0ELgEmABsCRwpaUxZsZOBqRGxPdAC/EBS\nwxvE330Xjjsuzcm8996NjsbMrL6qJoaIOBrYAXgauFrSg5K+JKnaBJW7ArMjYk5EtAPXA4eUlXkJ\nGJYtDwNej4glPfoXmJlZTeV6Ij8iFgA3ATcA7wc+CUyVdGqFj40CnitZfz7bVupyYCtJLwLTgNNy\nxm1mZgWpettG0iHAccBmwC+BXSLiFUlrAU8CF3Xz0Tw90s4GHo2IFkmbAHdK2i4iFpUXbG1tXbbc\n0tJCS0tLjt2bma062traaGtrW+n9VO35LOka4IqIuK+L9/aNiD9187lxQGtETMjWzwKWRsT5JWVu\nA/47IiZn63cBZ0TEI2X7qmvP5x//GObMSa9mZn1Vb3s+57mV9HJ5UpB0PkB3SSHzCLCZpNGSBgFH\nABPLyswE9s32+V5gC1JbhpmZNUiexPCxLrYdWO1DWSPyycAdpFtON0TEDEknSDohK3YusLOkacCf\ngK9FxLx8oZuZWREq9Xw+EfgysImk6SVvDQUm59l5RNxO6jlduu2ykuXXgI/3JOCivfoqLFzY6CjM\nzBqnUuPzr0kX9e8BZ5CGwwBYFBGvFx1YIzz4IOy1F4wYAWee2ehozMwao1JiiIiYI+kkyp4wkjSi\nP97yWbwY9tgD7r670ZGYmTVOpcRwHXAQ8DdWfPQ0gA8VFZSZmTVOt4khIg7KXkfXLZo6WrwY3ngD\nHnoI5s2D3/wGnn3WYyOZmeXpx/B7Uu3hloh4sy5RrRhDzfsxHHYY3HNPSg6HHgrrrAMnnghbbQVD\nhtT0UGZmDVHkDG4/IPVBOE/SI6Qk8YeIeLunB2sGU6bAxInw2GPw61/DAQc0OiIzs+ZSNTFERBvQ\nlo16Oh74N+BKlg9+16fccEOapvPYY2HnnRsdjZlZ88k1xLWkwcAngM8AOwLXFBlU0T72MThjhdkh\nzMwM8g2i9xtgLDCJNL/CfRHxr6IDK8LcuelnxIhGR2Jm1rzyDIlxBfChiDghIu7pq0lh4kTYdFOY\nPRt22aXR0ZiZNa9KQ2LsExF3AUOAQ6RlDdsidX67uQ7x1cyiRXDIIXDttY2OxMysuVW6lbQncBdp\nLKOunhXtU4nBzMzyqdTB7VvZ4rcjotNQ2JL6VK/nxYvh7T75cK2ZWf3laWO4qYttN9Y6kKI8/TQM\nHQqnngobbNDoaMzMml+lNoYxwJbAcEmHkbUtkPovrFmf8FbOE0/A5ZfDNtvA1KmNjsbMrG+oVGPY\ngtS+sE72enD2uiOpk1vTGzsWJLjqqkZHYmbWd+QZK2m3iHiwTvF0F0OvxkoaMACWLEmvZmarmt6O\nldRtYpB0RkScL+niLt6OiDi1pwfrLScGM7OeK2IQvSez19L5GDoOUNuhTmvs1lth5kyo8YCsZmar\nhKq3kjoVlgYCQyJiQXEhdXncHtUYtt8+NThvsQWcc05qZzAzW9X0tsZQ9SaLpF9LGiZpbWA68KSk\nr/UmyHo6/XT4+tedFMzMeirP3fetImIhcChwOzAa+GyRQZmZWePkSQyrSVqdlBh+HxHtNHkbg5mZ\n9V6exHAZMIc0mN59kkYDdW1jMDOz+qmaGCLioogYFREHRMRS4B+kmdzMzKwfyjNRz5rAp0htCx3l\nA/h2cWGZmVmj5Jna8xbgDVJ/Bo9RambWz+VJDKMiYv/CI6mRRx+FV15xb2czs97Kc/l8QNK2hUdS\nA+edBwcfDKedBltv3ehozMz6pjyD6M0ANgWeAd7JNkdE1C1Z5O35fOyxsO++6dXMbFVXxFhJHQ7o\nRTxmZtZH5XlcdQ6wITA+W36T5YPpmZlZP5NnrKRW4GvAWdmmQcD/LTAmMzNroDyNz58EDiHVFIiI\nF4ChRQbVU48/DmeeCVOmNDoSM7O+L09ieCfr8QxANspqU5k0CSZPhmOOgX32aXQ0ZmZ9W57EcKOk\ny4Dhkr4E3AX8Is/OJU2QNFPSLElndFOmRdJUSY9LassdeZmxY1OtYdSo3u7BzMwgx1NJEXGBpP2A\nRcDmwDci4s5qn8sm9bkE2Bd4AfirpIkRMaOkzHDgp8D+EfG8pJG9/HeYmVmN5HlclYj4o6QpwJ7A\nvJz73hWYnT3JhKTrSW0VM0rKHAX8NiKez47zWs59m5lZQbq9lSTpVklbZ8sbAI8DxwO/kvTvOfY9\nCniuZP35bFupzYARku6R9IgkTwBkZtZglWoMoyPi8Wz5eOCPEXGspKHAA8CPquw7z2Q+qwM7AvsA\nawEPSvpLRMzK8VkzMytApcTQXrK8L3A5QEQskrS064908gKpY1yHDUm1hlLPAa9FxGJgsaT7gO2A\nFRJDa2vrsuWWlhZaWlpyhGBmtupoa2ujra1tpffT7VhJkv4A3EG6wF8BfCgi5ktaC/hrRGxVccfS\nasBTpNrAi8DDwJFljc8fJjVQ7w+sATwEHBERT5btq+JYSRdeCHPnplczM0uKGCvpC6TJePYlXazn\nZ9vHAldV23FELJF0Mim5DASuiIgZkk7I3r8sImZKmgQ8BiwFLi9PCmZmVl9VR1dtBtVqDOefD6+9\nBhdcUMegzMyaXG9rDJWeSrpS0i4V3h8rqWrNoR4WLIDhwxsdhZlZ/1DpVtKPgP+UNI7UVvASaVTV\n9wFbkJ5Maoq7+vPmwYYbVi9nZmbVdZsYImI6cKykNYAdgI1Jj6D+A5gWEU0z//O8eTBiRKOjMDPr\nH/IMifEO8JfspynNn+/EYGZWK3kG0Wt6rjGYmdVOv0kM73lPo6MwM+sfcieGrGNbU3KNwcysdvJM\n7fkRSU+SnkxC0vaS/qfwyHJqb4c334RhwxodiZlZ/5CnxvBjYALwGkBEPArsVWRQPfHGG6kPw4B+\ncVPMzKzxcl1OI+LZsk1LCoilV3wbycystvJM1POspN0BJA0CTqXzZDsN5cRgZlZbeWoMJwInkSbZ\neYHU2e2kIoPqCfdhMDOrrTw1hs0j4qjSDVkNYnIxIfWMawxmZrWVp8ZwSc5tDeE+DGZmtdVtjUHS\nbsBHgPUkfZU0gB7AUJqoY5xrDGZmtVXpAj+IlAQGZq9Dsp+FwKeLDy0fJwYzs9qqNLrqvcC9kq6O\niDn1C6ln5s2DXbqdNcLMzHoqT+PzW5IuBLYEBmfbIiL2Li6s/FxjMDOrrTxtBdcCM4EPAa3AHOCR\n4kLqGScGM7PaypMY1o2IXwDvRsS9EXE80BS1BXA/BjOzWstzK+nd7HWupIOBF4GmeUDUNQYzs9pS\nRFQuIH0cuB/YELgYGAa0RsTE4sNbFkN0FefSpTBoECxeDKuvXq9ozMz6BklEhKqXLPtctcTQzcF2\njYiHe/zBXuouMSxYABtuCAsX1isSM7O+o7eJoVIHtwHAJ4FNgMcj4jZJOwPnAusD2/c22FrxbSQz\ns9qr1Mbwc+CDwMPA1yV9AfgwcA5wSx1iq8qJwcys9iolhnHAthGxVNKawFxgk4h4vT6hVefEYGZW\ne5UeV22PiKUAEfE28EwzJQVwYjAzK0KlGsOHJU0vWd+kZD0iYtsC48rFicHMrPYqJYYxdYuil9y5\nzcys9ioNojenjnH0yrx58L73NToKM7P+pWnmVegNT9JjZlZ7fT4x+FaSmVlt5UoMktaStEXRwfSU\nE4OZWe1VTQySPgFMBe7I1neQVLdxkipxYjAzq708NYZWYCwwHyAippLmZmg4JwYzs9rLkxjaI+KN\nsm1Liwimp5wYzMxqL09ieELS0cBqkjaTdDHwQJ6dS5ogaaakWZLOqFBuF0lLJB2WM24WLwYJBg+u\nXtbMzPLLkxhOAbYC3gGuAxYCX6n2IUkDgUuACaT5oo+UtEKnuazc+cAkIPfwsK4tmJkVI88MbltE\nxNnA2T3c967A7I6OcpKuBw4BZpSVOwW4CdilJzt3HwYzs2LkqTH8MLsd9B1JW/dg36OA50rWn8+2\nLSNpFClZ/CzblHvWINcYzMyKUTUxREQLMB54DbhM0nRJ38ix7zwX+R8DZ2bTswnfSjIza7g8t5KI\niJeAn0i6GzgD+CbwnSofe4E0T3SHDUm1hlI7AddLAhgJHCCpvav5pFtbW5ctt7S0MG9eixODmVmJ\ntrY22traVno/Ved8lrQl8Bng08DrwA3ATRHxSpXPrQY8BewDvEiaCe7IiChvY+gofxXw+4i4uYv3\nVpjz+YIL4OWX4cILK4ZvZrbKqvmczyWuBK4H9o+IF/LuOCKWSDqZ1GN6IHBFRMyQdEL2/mU9DbaU\nbyWZmRWjamKIiHG93XlE3A7cXraty4QQEcf3ZN/z58PGG/c2MjMz6063iUHSjRFxeNksbh0aPoOb\nawxmZsWoVGM4LXs9mBWfFsr9WGlR3I/BzKwY3T6uGhEvZotfjog5pT/Al+sSXQWuMZiZFSNPB7f9\nuth2YK0D6SknBjOzYlRqYziRVDPYpKydYSgwuejAqnFiMDMrRrf9GCStA7wH+B6pU1tHO8OiiHi9\nPuEti6VTP4b29jSqant7GmHVzMxWVEQ/hoiIOZJOoqyxWdKIiJjX04PVyvz5qeHZScHMrPYqJYbr\ngIOAv9H1U0gfLCSiHHwbycysON0mhog4KHsdXbdocpo/34nBzKwoVZ9KkrS7pCHZ8mcl/VBSQ/sc\nu8ZgZlacPI+rXgq8JWk74KvA08AvC42qCnduMzMrTp7EsCQilgKHAj+NiEtIj6w2jGsMZmbFyTO6\n6iJJZwPHAB/N5mhevdiwKnNiMDMrTp4awxHAO8DnI2IuaXrOCwqNqgonBjOz4uSZ2vMl4FpguKSD\ngbcjouFtDE4MZmbFyPNU0meAh4DDSTO5PSzp8KIDq8SJwcysOHnaGL4O7NIxlaek9YC7gBuLDKwS\n92MwMytOnjYGAa+WrL/OivMz1JVrDGZmxclTY5gE3CHp16SEcARl03XWm/sxmJkVp9vRVTsVkg4D\n9shW74+I3xUa1YrHXza66tKlMGgQvP02rJYnrZmZraJqPrqqpM1Jj6VuCjwG/GdEPN/7EGtj4UJY\ne20nBTOzolRqY7gS+APwKWAKcFFdIqrC7QtmZsWq9L17SERcni3PlDS1HgFV48RgZlasSolhTUk7\nZssCBmfrIk3iM6Xw6LrgxGBmVqxKiWEu8IMK6+MLiagK92EwMytWpYl6WuoYR26uMZiZFStPB7em\n4j4MZmbF6pOJwTUGM7PiODGYmVkneUZXHZDN9fzNbH0jSbsWH1rXnBjMzIqVp8bwP8BuwFHZ+j+z\nbQ3hxGBmVqw8A0uMjYgdOjq4RcQ8SQ2b2tOJwcysWHlqDO9m8zwDy+ZjWFpcSJU5MZiZFStPYrgY\n+B2wvqRzgcnAeYVG1Y2I1MHNj6uamRUn77DbY4B9stW7ImJGoVGtePyICN56C0aOhLfequfRzcz6\nppoPu12y442AN4HfZ5tC0kYR8WxPD7ay3LnNzKx4eW4l3QbcShqC+0/A0/RgBjdJEyTNlDRL0hld\nvH+0pGmSHpM0WdK23e3L7QtmZsWrWmOIiK1L17MRVk/Ks/Os0foSYF/gBeCvkiaW3Yp6GtgzIhZI\nmgD8HBjX1f6cGMzMitfjns/ZcNtjcxbfFZgdEXMioh24HjikbH8PRsSCbPUh4APd7cyJwcyseHna\nGE4vWR0A7Ej69p/HKOC5kvXnqZxUvkC6ddUlJwYzs+Ll6eA2pGR5Camt4bc591/9kaeMpPHA54Hd\nu3q/tbWVyZPhzTehra2FlpaWvLs2M1sltLW10dbWttL7qfi4atZG8P2IOL3bQpV2Lo0DWiNiQrZ+\nFrA0Is4vK7ctcDMwISJmd7GfiAjOOguGDYOzzupNNGZmq5bePq7abRuDpNUi4l/A7pJ6vOPMI8Bm\nkkZLGgQcAUwsO85GpKRwTFdJoZRvJZmZFa/SraSHSe0JjwK3SLoR6OhaFhFxc7WdR8QSSScDdwAD\ngSsiYoakE7L3LwO+CbwH+FmWf9ojosvRW92PwcyseJUSQ0ctYU3gdWDvsverJgaAiLidsn4PWULo\nWP4i8MU8+3KNwcyseJUSw3qSvgpMr1cw1TgxmJkVr1JiGAgMrVcgeTgxmJkVr1JimBsR/1W3SHJw\nYjAzK16fmfO5vR3efhuGNlUdxsys/6mUGPatWxQ5dMzD0OsHZ83MLJduE0NEvF7PQKrxbSQzs/ro\nM7eS3IfBzKw++lRicI3BzKx4TgxmZtaJE4OZmXXixGBmZp04MZiZWSdODGZm1kmfSQzz5zsxmJnV\nQ59JDK4xmJnVR59KDO7gZmZWvD6VGFxjMDMrniKi0TFUJSkGDgzefhtWqzRQuJmZLSOJiOjx0KN9\npsYwZIiTgplZPfSZxODbSGZm9eHEYGZmnTgxmJlZJ04MZmbWSZ9JDO7DYGZWH30mMbjGYGZWH04M\nZmbWiRODmZl14sRgZmadODGYmVknTgxmZtaJE4OZmXXSZxKD+zGYmdVHn0kMa67Z6AjMzFYNfSYx\nmJlZfTgxmJlZJ4UmBkkTJM2UNEvSGd2UuSh7f5qkHYqMx8zMqissMUgaCFwCTAC2BI6UNKaszIHA\nphGxGfAl4GdFxdNftLW1NTqEpuFzsZzPxXI+FyuvyBrDrsDsiJgTEe3A9cAhZWU+AVwDEBEPAcMl\nvbfAmPo8/9Iv53OxnM/Fcj4XK6/IxDAKeK5k/flsW7UyHygwJjMzq6LIxBA5y6mXnzMzswIoopjr\nsKRxQGtETMjWzwKWRsT5JWUuBdoi4vpsfSawV0S8XLYvJwszs16IiPIv31WtVkQgmUeAzSSNBl4E\njgCOLCszETgZuD5LJG+UJwXo3T/MzMx6p7DEEBFLJJ0M3AEMBK6IiBmSTsjevywibpN0oKTZwJvA\n8UXFY2Zm+RR2K8nMzPqmpur57A5xy1U7F5KOzs7BY5ImS9q2EXHWQ57fi6zcLpKWSDqsnvHVS86/\njxZJUyU9LqmtziHWTY6/j5GSJkl6NDsXxzUgzLqQdKWklyVNr1CmZ9fNiGiKH9LtptnAaGB14FFg\nTFmZA4HbsuWxwF8aHXcDz8VuwDrZ8oRV+VyUlLsb+APwqUbH3aDfieHAE8AHsvWRjY67geeiFTiv\n4zwArwOrNTr2gs7HR4EdgOndvN/j62Yz1RjcIW65quciIh6MiAXZ6kP03/4feX4vAE4BbgJerWdw\ndZTnPBwF/DYingeIiNfqHGO95DkXLwHDsuVhwOsRsaSOMdZNRNwPzK9QpMfXzWZKDO4Qt1yec1Hq\nC8BthUbUOFXPhaRRpAtDx5Aq/bHhLM/vxGbACEn3SHpE0mfrFl195TkXlwNbSXoRmAacVqfYmlGP\nr5tFPq7aU+4Qt1zuf5Ok8cDngd2LC6eh8pyLHwNnRkRIEiv+jvQHec7D6sCOwD7AWsCDkv4SEbMK\njaz+8pyLs4FHI6JF0ibAnZK2i4hFBcfWrHp03WymxPACsGHJ+oakzFapzAeybf1NnnNB1uB8OTAh\nIipVJfuyPOdiJ1JfGEj3kw+Q1B4RE+sTYl3kOQ/PAa9FxGJgsaT7gO2A/pYY8pyLjwD/DRAR/0/S\nM8AWpP5Vq5oeXzeb6VbSsg5xkgaROsSV/2FPBI6FZT2ru+wQ1w9UPReSNgJuBo6JiNkNiLFeqp6L\niPhQRHwwIj5Iamc4sZ8lBcj393ELsIekgZLWIjU0PlnnOOshz7mYCewLkN1P3wJ4uq5RNo8eXzeb\npsYQ7hC3TJ5zAXwTeA/ws+ybcntE7NqomIuS81z0ezn/PmZKmgQ8BiwFLo+IfpcYcv5OnAtcJWka\n6Qvw1yJiXsOCLpCk64C9gJGSngO+Rbqt2Ovrpju4mZlZJ810K8nMzJqAE4OZmXXixGBmZp04MZiZ\nWSdODGZm1okTg5mZdeLEsIqR9K9sWOaOn40qlP1nDY53taSns2P9Letg09N9XC7pw9ny2WXvTV7Z\nGLP9dJyXxyTdLGlIlfLbSTqgF8dZX9Kt2fK62bhGiyRd3Mu4z8mGlZ6WxV/TviySbpU0LFs+VdKT\nkn4l6eOVhkDPyk/OXjeWVD57Y1flPyHpG7WJ3FaG+zGsYiQtioihtS5bYR9XAb+PiJslfQy4MCK2\nW4n9rXRM1fYr6WrSEMY/qFD+OGCniDilh8f5drbvG7PeyTsAWwNb92JfuwE/IM2T3i5pBLBGRLzU\nk/304HgzgH0i4sUefq4FOD0iPl6lnICpwC7ZqKnWIK4xrOIkrS3pT9m3+cckfaKLMhtIui/7Rjpd\n0h7Z9v0kPZB99jeS1u7uMNnr/cCm2We/mu1ruqTTSmK5VWlylemSDs+2t0naSdL3gMFZHL/K3vtn\n9nq9pANLYr5a0mGSBki6QNLD2bfqL+U4LQ8Cm2T72TX7N05RmhBp82wYhm8DR2SxHJ7FfqWkh7Ky\nK5zHzKeBWwEi4q2ImAy8kyOmrryPNDZSe7a/eR1JQdIcSedn/6cPKQ0kh6T1JN2UnY+HJX0k2z5E\n0lVZ+WmSPlmyn3UlXQp8CJgk6SuSjuuo5Uh6r6TfZf9vj3bUCrW8xvk94KPZufqKpHslLftyIOnP\nkraJ9C31QWC/Xp4Pq5VGTzLhn/r+AEtI38qmAr8lDSkwNHtvJDCrpOyi7PV04OxseQAwJCt7LzA4\n234G8I0ujncV2cQ5wOGkP/wdScM2DAbWBh4Htgc+Bfy85LPDstd7gB1LY+oixkOBq7PlQcCzwBrA\nl4Bzsu3lxFMhAAAEg0lEQVRrAH8FRncRZ8d+Bmbn5cvZ+lBgYLa8L3BTtvw54KKSz58LHJ0tDwee\nAtYqO8b76GIylWxfF/fi/3Lt7P/xKeCnwJ4l7z0DnJUtf5ZUawP4NbB7trwR8GS2fD7ww5LPDy/Z\nz4gulpfFDNwAnFry+9Hx/9ZxTvfqOH62fizwo2x5c+CvJe8dD5zf6L+TVf2nacZKsrpZHBHLpvaT\ntDpwnqSPksbXeb+k9SPilZLPPAxcmZX934iYlt0e2BJ4IN0BYBDwQBfHE3CBpK8Dr5DmjvgYcHOk\nUUCRdDNpFqpJwIVZzeAPEfHnHvy7JgE/yb7NHwDcGxHvSNoP2EbSp7Nyw0i1ljllnx8saSpp7Po5\nwKXZ9uHALyVtShqquONvpnx47/2Aj0v6j2x9DdKIlk+VlNmYNIFMTUTEm5J2Ip278cANks6MiGuy\nItdlr9cDP8qW9wXGZP9nAEOzmt4+pMHoOvb9Rg9CGQ8ck31uKbCw7P3yIZ9vAr4h6T9JQ8ZfVfLe\ni6QZCa2BnBjsaNK3/x0j4l9KwxOvWVogIu7PEsfBwNWSfkiaMerOiDiqyv4D+I+IuLljg6R96Xyx\nUDpMzFKaj/Yg4LuS7oqI7+T5R0TE20pzHO8PfIblF0WAkyPiziq7WBwRO0gaTBqc7RDgd8B3gLsi\n4pOSNgbaKuzjsKg+90GP5opQakzuGCjwGxHxh9L3swvxvcC9SnP+fo5stq4yHY2JAsZGxLtlx+lx\nbOWh5i0YEW9JupNUyzucVIPsMIAezEdixXAbgw0DXsmSwnjSt9pOlJ5cejUifgH8gtRg+hdg95J7\n12tL2qybY5RfNO4HDpU0OPu2eihwv6QNgLcj4lrgwuw45doldfeF5gbSN9CO2geki/yXOz6TtRGs\n1c3nyWoxpwL/rXS1HEb6FgudR6VcSLrN1OGO7HNkx+kq9n+QbieV6/aiGhEPR8QO2U+npJD9W0rP\n+Q50rgkdUfLaUZv7Y1mcHff67wROKtk+vLuYuoj5LuDE7HMDlT3FVGIRnc8VpN+ji4CHY/kUtQAb\nkM6TNZATw6qn/NvYtcDOkh4j3Yue0UXZ8cCjkqaQvo3/JNJ8wscB1ykNbfwAacz7qseMiKnA1aRb\nVH8hDQ89DdgGeCi7pfNN4Ltd7OvnwGMdjc9l+/4jsCepJtMxv+8vSHMSTMm+Uf+MrmvKy/YTEY+S\nJpv/DPB90q22KaT2h45y9wBbdjQ+k2oWq2eNt48D/7XCASLmAquppJFe0hzSk0XHSXpW2WO5OQ0h\n1eCeyP4PPgy0lrz/nmz7KcC/Z9tOJf1/T5P0BHBCtv27Wfnpkh4FWro4XpQtd6yfBozPfoceAcaU\nlZ8G/CtrmD4NICKmAAvofBsJ0nzO9+X5x1tx/LiqWR1JagVmRMQNBR/nGdLjtE05B4Gk9wP3RMQW\nJdsGAFOAnUsSuzWAawxm9fVTUjtA0Zr2G5+kY0k1xbPL3jqY9NSXk0KDucZgZmaduMZgZmadODGY\nmVknTgxmZtaJE4OZmXXixGBmZp04MZiZWSf/HyifrJsO5bsLAAAAAElFTkSuQmCC\n", + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# plot ROC curve using y_test_binary and y_pred_prob\n", + "fpr, tpr, thresholds = metrics.roc_curve(y_test_binary, y_pred_prob)\n", + "plt.plot(fpr, tpr)\n", + "plt.xlim([0.0, 1.0])\n", + "plt.ylim([0.0, 1.0])\n", + "plt.xlabel('False Positive Rate (1 - Specificity)')\n", + "plt.ylabel('True Positive Rate (Sensitivity)')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 8\n", + "\n", + "Print the confusion matrix, and calculate the sensitivity and specificity. Comment on the results." + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[126 58]\n", + " [ 25 813]]\n" + ] + } + ], + "source": [ + "# print the confusion matrix\n", + "print metrics.confusion_matrix(y_test, y_pred_class)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0.9701670644391408" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# calculate sensitivity\n", + "813 / float(813 + 25)" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "0.6847826086956522" + ] + }, + "execution_count": 18, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# calculate specificity\n", + "126 / float(126 + 58)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The model is having a much easier time detecting five-star reviews than one-star reviews." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 9\n", + "\n", + "Browse through the review text for some of the false positives and false negatives. Based on your knowledge of how Naive Bayes works, do you have any theories about why the model is incorrectly classifying these reviews?" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "2175 This has to be the worst restaurant in terms o...\n", + "1781 If you like the stuck up Scottsdale vibe this ...\n", + "2674 I'm sorry to be what seems to be the lone one ...\n", + "9984 Went last night to Whore Foods to get basics t...\n", + "3392 I found Lisa G's while driving through phoenix...\n", + "8283 Don't know where I should start. Grand opening...\n", + "2765 Went last week, and ordered a dozen variety. I...\n", + "2839 Never Again,\\r\\nI brought my Mountain Bike in ...\n", + "321 My wife and I live around the corner, hadn't e...\n", + "1919 D-scust-ing.\n", + "Name: text, dtype: object" + ] + }, + "execution_count": 19, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# first 10 false positives (meaning they were incorrectly classified as 5-star reviews)\n", + "X_test[y_test < y_pred_class][:10]" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\"If you like the stuck up Scottsdale vibe this is a good place for you. The food isn't impressive. Nice outdoor seating.\"" + ] + }, + "execution_count": 20, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# false positive: model is reacting to the words \"good\", \"impressive\", \"nice\"\n", + "X_test[1781]" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "\"Went last night to Whore Foods to get basics to make pizza with, most clutch to the process was a three pack of yeast. Low and behold, the dirty hippie kids they have working there again didn't put something in the bag.\\r\\n\\r\\nAnd this time it was the yeast.\\r\\n\\r\\nI love the food there, but the employees are nothing more than entitled hippie kids from Scottsdale who can't be bothered to do their goddamn jobs! I am so sick of this crap with this corporation. Maybe its a Phoenix thing, or maybe its the hiring and firing processes of Whore Foods, but I am done shopping at any Whore Foods. In a place like Phoenix, where you have alternatives such as Sprouts, you'd think Whore Foods would smarten up.\\r\\n\\r\\nOr when you try to ask someone who works here where something is, and they just walk by with their nose high in the air. I understand its important to show your fellow dirt merchants that you're a super star, but I don't work with you, I contribute to your salary and over inflated set of benefits for a grocery clerk, so do your goddamn job. Useless little girl in need of a shower and orthodontist. \\r\\n\\r\\nBut alas, Whore Foods and its dirty hippie employees have alienated yet another person with a job and a degree into hating everything that place stands for.\\r\\n\\r\\nOh yea, one other thing, take the concealed firearm prohibition off of your stores. It didn't stop Jared Loughner from doing something horrid, and all it does is alienate law abiding citizens from their Constitutional rights. I understand you think that only parts of the Constitution should apply, but honestly I think you need to pull your collective heads out of your ass and take a shower.\\r\\n\\r\\nUseless motherf*&#$@s!\"" + ] + }, + "execution_count": 21, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# false positive: model had likely never seen many of these words in the training data\n", + "X_test[9984]" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'D-scust-ing.'" + ] + }, + "execution_count": 22, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# false positive: model does not have enough data to work with\n", + "X_test[1919]" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "7148 I now consider myself an Arizonian. If you dri...\n", + "4963 This is by far my favourite department store, ...\n", + "6318 Since I have ranted recently on poor customer ...\n", + "380 This is a must try for any Mani Pedi fan. I us...\n", + "5565 I`ve had work done by this shop a few times th...\n", + "3448 I was there last week with my sisters and whil...\n", + "6050 I went to sears today to check on a layaway th...\n", + "2504 I've passed by prestige nails in walmart 100s ...\n", + "2475 This place is so great! I am a nanny and had t...\n", + "241 I was sad to come back to lai lai's and they n...\n", + "Name: text, dtype: object" + ] + }, + "execution_count": 23, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# first 10 false negatives (meaning they were incorrectly classified as 1-star reviews)\n", + "X_test[y_test > y_pred_class][:10]" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "'This is by far my favourite department store, hands down. I have had nothing but perfect experiences in this store, without exception, no matter what department I\\'m in. The shoe SA\\'s will bend over backwards to help you find a specific shoe, and the staff will even go so far as to send out hand-written thank you cards to your home address after you make a purchase - big or small. Tim & Anthony in the shoe salon are fabulous beyond words! \\r\\n\\r\\nI am not completely sure that I understand why people complain about the amount of merchandise on the floor or the lack of crowds in this store. Frankly, I would rather not be bombarded with merchandise and other people. One of the things I love the most about Barney\\'s is not only the prompt attention of SA\\'s, but the fact that they aren\\'t rushing around trying to help 35 people at once. The SA\\'s at Barney\\'s are incredibly friendly and will stop to have an actual conversation, regardless or whether you are purchasing something or not. I have also never experienced a \"high pressure\" sale situation here.\\r\\n\\r\\nAll in all, Barneys is pricey, and there is no getting around it. But, um, so is Neiman\\'s and that place is a crock. Anywhere that ONLY accepts American Express or their charge card and then treats you like scum if you aren\\'t carrying neither is no place that I want to spend my hard earned dollars. Yay Barneys!'" + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# false negative: model is reacting to the words \"complain\", \"crowds\", \"rushing\", \"pricey\", \"scum\"\n", + "X_test[4963]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 10\n", + "\n", + "Let's pretend that you want to balance sensitivity and specificity. You can achieve this by changing the threshold for predicting a 5-star review. What threshold approximately balances sensitivity and specificity?" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# create a list that will store the results of the process below\n", + "results = []\n", + "\n", + "# loop through the thresholds returned by the metrics.roc_curve function\n", + "for threshold in thresholds:\n", + " \n", + " # make a class prediction of 5 if the predicted probability is higher than the threshold\n", + " y_pred_class = np.where(y_pred_prob > threshold, 5, 1)\n", + " \n", + " # generate the confusion matrix and slice it into four pieces\n", + " confusion = metrics.confusion_matrix(y_test, y_pred_class)\n", + " TP = confusion[1][1]\n", + " TN = confusion[0][0]\n", + " FP = confusion[0][1]\n", + " FN = confusion[1][0]\n", + " \n", + " # calculate the sensitivity and specificity\n", + " sensitivity = TP / float(TP + FN)\n", + " specificity = TN / float(TN + FP)\n", + " \n", + " # calculate the absolute difference between sensitivity and specificity\n", + " difference = np.absolute(sensitivity - specificity)\n", + " \n", + " # append a tuple to the results list\n", + " results.append((difference, sensitivity, specificity, threshold))" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "(0.0002983293556085842, 0.87470167064439142, 0.875, 0.99855196916444533)" + ] + }, + "execution_count": 26, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# locate the minimum difference (at which sensitivity and specificity are balanced)\n", + "min(results)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "At a threshold of approximately **0.9986**, the sensitivity and specificity are both approximately **0.875**." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 10 (alternative solution)\n", + "\n", + "This solution simplifies the for loop by utilizing the \"fpr\" and \"tpr\" objects returned by the `metrics.roc_curve` function." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "collapsed": false + }, + "outputs": [], + "source": [ + "# create a list that will store the results of the process below\n", + "results = []\n", + "\n", + "# loop through the thresholds returned by the metrics.roc_curve function (skipping the first threshold)\n", + "for threshold in thresholds[1:]:\n", + " \n", + " # calculate the sensitivity and specificity\n", + " sensitivity = tpr[thresholds > threshold][-1]\n", + " specificity = 1 - fpr[thresholds > threshold][-1]\n", + " \n", + " # calculate the absolute difference between sensitivity and specificity\n", + " difference = np.absolute(sensitivity - specificity)\n", + " \n", + " # append a tuple to the results list\n", + " results.append((difference, sensitivity, specificity, threshold))" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "(0.0002983293556085842, 0.87470167064439142, 0.875, 0.99855196916444533)" + ] + }, + "execution_count": 28, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# locate the minimum difference (at which sensitivity and specificity are balanced)\n", + "min(results)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Task 11\n", + "\n", + "Let's see how well Naive Bayes performs when all reviews are included, rather than just 1-star and 5-star reviews:\n", + "\n", + "- Define X and y using the original DataFrame from step 1. (y should contain 5 different classes.)\n", + "- Split the data into training and testing sets.\n", + "- Calculate the testing accuracy of a Naive Bayes model.\n", + "- Compare the testing accuracy with the null accuracy.\n", + "- Print the confusion matrix.\n", + "- Comment on the results." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# define X and y using the original DataFrame\n", + "X = yelp.text\n", + "y = yelp.stars" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# split into training and testing sets\n", + "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# create document-term matrices\n", + "X_train_dtm = vect.fit_transform(X_train)\n", + "X_test_dtm = vect.transform(X_test)" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)" + ] + }, + "execution_count": 32, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# fit a Naive Bayes model\n", + "nb.fit(X_train_dtm, y_train)" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "collapsed": true + }, + "outputs": [], + "source": [ + "# make class predictions\n", + "y_pred_class = nb.predict(X_test_dtm)" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0.4712\n" + ] + } + ], + "source": [ + "# calculate the testing accuary\n", + "print metrics.accuracy_score(y_test, y_pred_class)" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "data": { + "text/plain": [ + "4 0.3536\n", + "dtype: float64" + ] + }, + "execution_count": 35, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# calculate the null accuracy\n", + "y_test.value_counts().head(1) / len(y_test)" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": { + "collapsed": false + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[[ 55 14 24 65 27]\n", + " [ 28 16 41 122 27]\n", + " [ 5 7 35 281 37]\n", + " [ 7 0 16 629 232]\n", + " [ 6 4 6 373 443]]\n" + ] + } + ], + "source": [ + "# print the confusion matrix\n", + "print metrics.confusion_matrix(y_test, y_pred_class)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Comments:\n", + "\n", + "- Nearly all 4-star and 5-star reviews are classified as 4 or 5 stars, but they are hard for the model to distinguish between.\n", + "- 1-star, 2-star, and 3-star reviews are often classified as 4 stars, probably because it's the predominant class in the training data.\n", + "- When the model predicts 1 or 2 stars, it's usually correct.\n", + "- 47% accuracy is relatively impressive, given that humans would also have a hard time precisely identifying the star rating for many of these reviews." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 2", + "language": "python", + "name": "python2" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 2 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython2", + "version": "2.7.6" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +}