From d237c0c0d7e30765ef4619ec34f0a9869ce0d180 Mon Sep 17 00:00:00 2001
From: srikris <srikris@dato.com>
Date: Tue, 12 Jul 2016 07:19:45 -0700
Subject: [PATCH] [DSS Lead Scoring] Updated the lead scoring notebook.

---
 .../lead_scoring/lead_scoring_tutorial.ipynb  | 678 ++++++++++++++++++
 .../README.md                                 |   0
 .../Recommender DeepDive - Part 1.ipynb       |   0
 .../Recommender DeepDive - Part 2.ipynb       |   0
 .../book-recommender-exercises.ipynb          |   0
 .../book-recommender-solutions.ipynb          |   0
 6 files changed, 678 insertions(+)
 create mode 100644 dss-2016/lead_scoring/lead_scoring_tutorial.ipynb
 rename dss-2016/{recommendation-systems => recommendation_systems}/README.md (100%)
 rename dss-2016/{recommendation-systems => recommendation_systems}/Recommender DeepDive - Part 1.ipynb (100%)
 rename dss-2016/{recommendation-systems => recommendation_systems}/Recommender DeepDive - Part 2.ipynb (100%)
 rename dss-2016/{recommendation-systems => recommendation_systems}/book-recommender-exercises.ipynb (100%)
 rename dss-2016/{recommendation-systems => recommendation_systems}/book-recommender-solutions.ipynb (100%)

diff --git a/dss-2016/lead_scoring/lead_scoring_tutorial.ipynb b/dss-2016/lead_scoring/lead_scoring_tutorial.ipynb
new file mode 100644
index 0000000..b427ca4
--- /dev/null
+++ b/dss-2016/lead_scoring/lead_scoring_tutorial.ipynb
@@ -0,0 +1,678 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 1. Introduction"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**The scenario**: suppose we run an online travel agency. We would like to convince our users to book overseas vacations, rather than domestic ones. Each of the users in this dataset will definitely book *something* at the end of a given trial period, i.e. we are only looking at engaged customers.\n",
+    "\n",
+    "**Goals**:\n",
+    "1. predict which new users are most likely to book an overseas trip,\n",
+    "2. generate segmentation rules to group similar users based on features and propensity to convert.\n",
+    "\n",
+    "**Data**: mimics the [AirBnB challenge on Kaggle](https://www.kaggle.com/c/airbnb-recruiting-new-user-bookings).\n",
+    "- Users\n",
+    "- Website or app sessions.\n",
+    "\n",
+    "I've simulated data that's very similar in terms of features and distributions, but I've added timestamps to the sessions, and changed the target from country to a binary domestic vs. international variable.\n",
+    "\n",
+    "**Sections**:\n",
+    "1. Introduction\n",
+    "2. The basic scenario - account data only\n",
+    "3. What's happening under the hood?\n",
+    "4. Incorporating activity data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "from __future__ import print_function\n",
+    "import graphlab as gl"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 2. The basic scenario"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Import the data: sales accounts"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- **Sales accounts need not be synonymous with users**, although that is the case here. At Turi, our sales accounts consist of a mix of individual users, companies, and teams within large companies.\n",
+    "\n",
+    "- **The accounts dataset typically comes from a customer relationship management (CRM) tool**, like Salesforce, SAP, or HubSpot. In practice, there is an extra step here: extracting the data from that system into an SFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "users = gl.SFrame('synthetic_airbnb_users.sfr')\n",
+    "users.print_rows(3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "users['status'].sketch_summary()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Encode the target variable"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "There are three types of accounts:\n",
+    "- **Successful accounts**, i.e. conversions, are coded as 1.\n",
+    "- **Failed accounts** are coded as -1.\n",
+    "- **Open accounts**, i.e. accounts that have not been decided, are coded as 0.\n",
+    "\n",
+    "Together, successful and failed accounts constitute the **training accounts**."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "status_code = {'international': 1,\n",
+    "               'domestic': -1,\n",
+    "               'new': 0}\n",
+    "\n",
+    "users['outcome'] = users['status'].apply(lambda x: status_code[x])\n",
+    "users[['status', 'outcome']].print_rows(10)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Define the schema"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In a complex problem like lead scoring, there are potentially many columns with \"meaning\". To help the lead scoring tool recognize these columns, we define a dictionary that maps standard lead scoring inputs to the columns in our particular dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "user_schema = {'conversion_status': 'outcome',\n",
+    "               'account_id': 'id',\n",
+    "               'features': ['gender', 'age', 'signup_method', 'signup_app',\n",
+    "                            'first_device_type', 'first_browser']}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create the lead scoring tool"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**All accounts are passed to the tool when it's created. There is no separate `predict` method.**\n",
+    "- We typically want to score the same set of open accounts each day during the trial period.\n",
+    "- Very rarely do we want to predict lead scores for different accounts.\n",
+    "- It makes more sense to keep the open accounts in the model, so we can incrementally update the lead scores and market segments, as new data comes in.\n",
+    "- The `update` method is not yet implemented :("
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false,
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "scorer = gl.lead_scoring.create(users, user_schema)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Retrieve the model output and export"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The lead scoring model's summary contains many fields. Let's focus on three of the accessible ones:\n",
+    "- **open_account_scores**: conversion probability and market segment for *open accounts*\n",
+    "- **training_account_scores**: conversion probability and market segment for *existing successes and failures*\n",
+    "- **segment_descriptions**: definitions and summary statistics for the market segments"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "print(scorer)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "scorer.open_account_scores.head(3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "scorer.open_account_scores.topk('conversion_prob', k=3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "scorer.training_account_scores.head(3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "scorer.segment_descriptions.head(3)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "scorer.segment_descriptions[['segment_id', 'segment_features']].print_rows(max_column_width=65)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To get the training or open accounts that belong to a particular market segment, use the respective SFrame's `filter_by` method."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "seg = scorer.training_account_scores.filter_by(8, 'segment_id').head(3)\n",
+    "print(seg)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 3. What's happening under the hood?"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The scoring model: gradient boosted trees"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "print(scorer.scoring_model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Additional keyword arguments to the lead scoring `create` function are passed through to the gradient boosted trees model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "scorer2 = gl.lead_scoring.create(users, user_schema, max_iterations=20, verbose=False)\n",
+    "print(\"Original num trees:\", scorer.scoring_model.num_trees)\n",
+    "print(\"New num trees:\", scorer2.scoring_model.num_trees)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Validating the scoring model "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "By default, the gradient boosted trees model withholds ??? percent of the training accounts as a validation set. The validation accuracy is accessible as a field of the scoring model."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "print(\"Validation accuracy:\", scorer.scoring_model.validation_accuracy)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## The segmentation model: decision tree"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "print(scorer.segmentation_model)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Because training the lead scoring tool can take some time with large datasets, the number of segments can be changed *after* a lead scoring tool has been created. This function **creates a new model**; the original model is **immutable**."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "scorer2 = scorer.resize_segmentation_model(max_segments=20)\n",
+    "\n",
+    "print(\"original number of segments:\", scorer.segment_descriptions.num_rows())\n",
+    "print(\"new number of segments:\", scorer2.segment_descriptions.num_rows())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 4. Incorporating activity data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Account activity data** describes interactions between accounts and aspects of your business, like web assets, email campaigns, or products. Conceptually, each interaction involves at a minimum:\n",
+    "- an account\n",
+    "- a timestamp\n",
+    "\n",
+    "Interactions may also have:\n",
+    "- an \"item\"\n",
+    "- a user\n",
+    "- other features"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "sessions = gl.SFrame('synthetic_airbnb_sessions.sfr')\n",
+    "sessions = gl.TimeSeries(sessions, index='timestamp')\n",
+    "sessions.head(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As with the accounts table, we need to indicate what each relevant column in the activity table represents. If we had a column indicating which user was involved, we could specify that here as well. In this scenario, we don't have users that are distinct from accounts."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "session_schema = {'account_id': 'user_id',\n",
+    "                  'item': 'action_detail'}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Define relevant dates"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To use account activity data, a lead scoring tool needs to know the time window for each account's relevant interactions. There are three key dates for each account.\n",
+    "\n",
+    "- **open date**: when a new sales account was created\n",
+    "- **close date**: when the *trial period* ends for a new sales account\n",
+    "- **decision date**: when a final decision was reached by a training account, either success (conversion) or failure. May be *before or after* the close date.\n",
+    "\n",
+    "The **trial duration** is the difference between the open date and the close date. The lead scoring tool in GraphLab Create (GLC) assumes this is fixed for all accounts, but in general this need not be the case.\n",
+    "\n",
+    "Open accounts do not have a decision date yet, by definition. They may or may not be still within the trial period."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": true
+   },
+   "outputs": [],
+   "source": [
+    "user_schema.update({'open_date': 'date_account_created',\n",
+    "                    'decision_date': 'booking_date'})"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The trial duration is represented by an instance of the `timedelta` class from Python's `datetime` module."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Create the lead scoring tool "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "import datetime as dt\n",
+    "\n",
+    "scorer3 = gl.lead_scoring.create(users, user_schema,\n",
+    "                                 sessions, session_schema,\n",
+    "                                 trial_duration=dt.timedelta(days=30))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "print(scorer3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Under the hood: date-based data validation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Invalid accounts** have a decision date earlier than their open date. This is impossible, and these accounts are simply dropped from the set of training accounts."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "invalid_ids = scorer3.invalid_accounts\n",
+    "print(invalid_ids)\n",
+    "\n",
+    "invalid_accounts = users.filter_by(invalid_ids, 'id')\n",
+    "invalid_accounts[['id', 'date_account_created', 'booking_date']].print_rows(3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Implicit failure accounts** are accounts that are *open*, but have been open for so long they are extremely unlikely to convert.\n",
+    "\n",
+    "- The threshold for implicit failure is the 95th percentile of the time it took training accounts to reach a decision, or the trial period duration, whichever is longer.\n",
+    "\n",
+    "- Implicit failures are included in *both* the training and open account output, because they are used to train the scoring and segmentation models, but are technically still open.\n",
+    "\n",
+    "- The user **doesn't *have* to explicitly specify failure accounts** - the model can do that automatically."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "print(scorer3.num_implicit_failures)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Under the hood: activity-based feature engineering "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The lead scoring tool constructs account-level features based on the number of interactions, items, and users (not applicable in this scenario) per day that the accounts are open (up to the maximum of the trial duration). The names of these features are accessible as a model field."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "scorer3.final_features"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The values for these features are included in the primary model outputs (`training_account_scores` and `open_account_scores`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "scorer3.open_account_scores.print_rows(3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The activity-based features are also used to define market segments."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "cols = ['segment_features', 'median_conversion_prob', 'num_training_accounts']\n",
+    "scorer3.segment_descriptions[cols].print_rows(max_row_width=80, max_column_width=60)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Results: improved validation accuracy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "collapsed": false
+   },
+   "outputs": [],
+   "source": [
+    "print(\"Account-only validation accuracy:\", scorer.scoring_model.validation_accuracy)\n",
+    "print(\"Validation accuracy including activity features:\", scorer3.scoring_model.validation_accuracy)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 2",
+   "language": "python",
+   "name": "python2"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 2
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython2",
+   "version": "2.7.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
diff --git a/dss-2016/recommendation-systems/README.md b/dss-2016/recommendation_systems/README.md
similarity index 100%
rename from dss-2016/recommendation-systems/README.md
rename to dss-2016/recommendation_systems/README.md
diff --git a/dss-2016/recommendation-systems/Recommender DeepDive - Part 1.ipynb b/dss-2016/recommendation_systems/Recommender DeepDive - Part 1.ipynb
similarity index 100%
rename from dss-2016/recommendation-systems/Recommender DeepDive - Part 1.ipynb
rename to dss-2016/recommendation_systems/Recommender DeepDive - Part 1.ipynb
diff --git a/dss-2016/recommendation-systems/Recommender DeepDive - Part 2.ipynb b/dss-2016/recommendation_systems/Recommender DeepDive - Part 2.ipynb
similarity index 100%
rename from dss-2016/recommendation-systems/Recommender DeepDive - Part 2.ipynb
rename to dss-2016/recommendation_systems/Recommender DeepDive - Part 2.ipynb
diff --git a/dss-2016/recommendation-systems/book-recommender-exercises.ipynb b/dss-2016/recommendation_systems/book-recommender-exercises.ipynb
similarity index 100%
rename from dss-2016/recommendation-systems/book-recommender-exercises.ipynb
rename to dss-2016/recommendation_systems/book-recommender-exercises.ipynb
diff --git a/dss-2016/recommendation-systems/book-recommender-solutions.ipynb b/dss-2016/recommendation_systems/book-recommender-solutions.ipynb
similarity index 100%
rename from dss-2016/recommendation-systems/book-recommender-solutions.ipynb
rename to dss-2016/recommendation_systems/book-recommender-solutions.ipynb