Merge pull request microsoft#155 from davidsalgado/master
Added a 'predictive analytics' article + lab
perrysk-msft authored Nov 14, 2016
2 parents f7d98fc + ac86014 commit 5e7741e
Showing 15 changed files with 589 additions and 0 deletions.
@@ -0,0 +1,51 @@
Implementing Predictive Analytics in Your Applications
============================================================

Predictive analytics is a powerful way to add intelligence to your application: it enables you to predict outcomes for data that is new to your application. The Microsoft Data Platform provides numerous ways to add predictive analytics to your applications.

What is predictive analytics?
-------------------------------
Before we get into the implementation, let’s address a fundamental question—**what is predictive analytics?** At their core, predictive tasks are those that predict one value given a set of other values as input. In other words, predictive tasks learn (or are taught) how to make predictions. This learning is captured in a model by an algorithm. Think of the model as the way the learnings are compactly summarized. When you want to make a prediction, you invoke a prediction operation and provide the model as one of the inputs, along with the input values against which you want to form a prediction. Predictive analytics is the act of applying prediction (and your model) to your data to gain new insights.

So, **what are some examples of predictive analytics?** These fall into two basic categories. You have _prediction that aims to predict the class (or category) of something_. For example, you can have binary (two-class) classification that tries to predict whether an email is spam or not, so the class is either “spam” or “not spam”. You can also have multi-class classification, which predicts one outcome from a set of more than two possible outcomes. For example, a multi-class classifier might predict whether a consumer is at “high risk”, “moderate risk”, or “low risk” of defaulting on a loan.

You also have _numeric prediction_. Instead of trying to predict a class from a fixed set of options, numeric prediction tries to predict a numeric value from a continuous range of numbers. For example, you might try to predict how many minutes of delay a flight will experience or the price of a particular stock in the stock market.
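
To make the two categories concrete, here is a minimal Python sketch. The functions, input names, thresholds and coefficients are all hypothetical, hand-written stand-ins for what a trained model would learn; the only point is the shape of the output: a label from a fixed set versus a number from a continuous range.

```python
from typing import Literal

RiskClass = Literal["high risk", "moderate risk", "low risk"]


def classify_default_risk(debt_to_income: float) -> RiskClass:
    """Multi-class prediction: the result is one label from a fixed set.

    The thresholds are made up for illustration, not learned from data.
    """
    if debt_to_income >= 0.5:
        return "high risk"
    if debt_to_income >= 0.3:
        return "moderate risk"
    return "low risk"


def predict_delay_minutes(departure_backlog: int) -> float:
    """Numeric prediction: the result is a value from a continuous range.

    The intercept and slope are made up for illustration, not learned from data.
    """
    return 4.0 + 2.5 * departure_backlog


print(classify_default_risk(0.42))
print(predict_delay_minutes(3))
```

In a real application both functions would be replaced by a call that applies a trained model to the input values.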

Prediction on the Microsoft Data Platform
----------------------------------------------
The Microsoft Data Platform provides numerous ways to build predictive models that you can then integrate into your application. The following diagram summarizes the options:

![Alternatives to train and use a model](imgs/UseModelForPrediction.png "Model Train and use")

As you can infer from the diagram, the act of incorporating predictive analytics into your applications involves two major phases: model creation and model operationalization. Conceptually, these are very simple to understand.

**Model Creation:** During model creation, you train your predictive model (by showing it sample data along with the outcomes) and test that it works (at least that it predicts results better than random chance would). You save this model so you can use it later when you want to make predictions against new data.

**Model Operationalization:** During model operationalization, you are implementing predictions that use your model in whatever hosting environment (such as a web service) makes sense for integration with your application. In other words, operationalization is how you add predictive analytics to your application.
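
The two phases can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the lab's implementation: the training data is a made-up toy set with a single feature, the logistic regression is fitted with plain gradient descent instead of a library routine, and `pickle` stands in for whatever serialization your platform uses to save a model.

```python
import math
import pickle


def predict_proba(weights, x):
    """Probability of the positive class; weights[0] is the intercept."""
    z = weights[0] + sum(w * v for w, v in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=2000, lr=0.1):
    """Fit logistic regression weights with batch gradient descent."""
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            err = predict_proba(weights, x) - y
            grad[0] += err
            for i, v in enumerate(x):
                grad[i + 1] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad)]
    return weights


def accuracy(weights, rows, labels):
    """Fraction of rows whose predicted class matches the label."""
    hits = sum((predict_proba(weights, x) >= 0.5) == bool(y)
               for x, y in zip(rows, labels))
    return hits / len(rows)


# --- Model creation: train, test against chance, save ---
# Toy data (hypothetical): one feature, label 1 for the larger values.
train_x = [[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x = [[0.8], [1.8], [3.2], [4.2]]
test_y = [0, 0, 1, 1]

weights = train_logistic(train_x, train_y)
chance_level = 0.5  # two balanced classes
print("beats chance:", accuracy(weights, test_x, test_y) > chance_level)
saved_model = pickle.dumps(weights)  # bytes you could store in a file or table

# --- Model operationalization: load the saved model and predict ---
restored = pickle.loads(saved_model)
print("P(positive) for new input:", predict_proba(restored, [2.7]))
```

The saved bytes are the only thing the operationalization step needs from the creation step, which is why the two phases can run in different environments.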

Options for Model Creation
-----------------------------
Let’s begin by understanding the various ways you can train and test your model. When creating your model, you can train it locally. This amounts to authoring and running R or Python scripts on your development workstation. For example, you might use the integrated development environment R GUI (a component in Microsoft R Open) or R Tools for Visual Studio to author R scripts that train your model, help you test it and visualize the results.

Alternately, the training can be done using resources that are remote to your development workstation. The Microsoft Data Platform offers the following options for this:

* **Azure Machine Learning (Azure ML):** Azure ML enables you to design predictive experiments (referred to as scoring experiments) using its browser-based Machine Learning Studio. The visual drag-and-drop experience is like designing a flowchart, where each box of the flowchart is called a “module”. Modules can retrieve data, transform data, process data, create predictive models and evaluate their predictive performance. There are numerous built-in modules, including modules that let you define and run custom script code written in R or Python as desired.

* **HDInsight:** HDInsight provides numerous ways you can train predictive models using a cluster of servers running in Azure. With R Server on Spark and R Server on Hadoop, you author R scripts whose execution runs across the cluster to train (and test) your model. If you deploy an HDInsight cluster with Spark, you can use Spark ML to program the training and testing of models using Scala, Java, Python or R. Generally, the data used to train models in HDInsight comes from a form of highly scalable distributed storage such as HDFS, Azure Data Lake Store or Azure Storage Blobs.

* **SQL R Services:** SQL R Services enables you to train and test predictive models in the context of SQL Server 2016. You author T-SQL procedures that contain embedded R scripts, and the SQL Server database engine takes care of the execution. Because it executes in the context of SQL Server, your models can be easily trained against data stored in tables in your database.

Options for Model Operationalization
---------------------------------------
The Microsoft Data Platform also provides multiple ways of adding predictive analytics to an application. As you can see in the diagram, while there are many ways to train a model, there tend to be only a few practical ways to use the model from an application.

* **Invoke Predict within a Script:** When running within a local environment, you can easily use the trained model as input into your predictive script.

* **Invoke Predictive Web Service:** When running in a remote environment, a common approach is to encapsulate the call to prediction in a web service or REST API operation that is readily invoked from an application. For Azure ML, this is as easy as a few clicks to deploy a predictive experiment as a web service. For HDInsight, this amounts to exporting your trained model to a file and importing the model into a compatible host such as Microsoft’s DeployR (for models created with R Server on Spark or Hadoop), which wraps a web services layer around a prediction script written in R.

* **Invoke a Predictive Stored Procedure:** When using SQL R Services, you can package the code that invokes prediction using your model within a stored procedure. Therefore, integrating a prediction into your application becomes a matter of executing a stored procedure in SQL Server, something that most applications can easily accomplish whether they are written in .NET, node.js, Java or another language.

While each approach has its merits, in the accompanying lab, we’ll examine how to augment a node.js application with predictive analytics using this last approach that invokes a predictive stored procedure running in SQL Server 2016.
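
The pattern behind this last approach, keeping the serialized model in a database table and loading it whenever a prediction is requested, can be sketched without SQL Server at all. The following Python example is only a stand-in under loud assumptions: it uses an in-memory SQLite database (which has no stored procedures), a hypothetical `models` table, and hand-picked coefficients in place of a trained model. In the lab, the equivalent logic lives in a SQL Server stored procedure.

```python
import math
import pickle
import sqlite3

# Hypothetical, hand-picked coefficients; a real model would come from training.
MODEL = {"bias": -1.0, "weights": {"passenger_count": 0.1, "trip_distance": 0.4}}


def save_model(conn, model):
    """Replace any previously stored model with a serialized copy of `model`."""
    conn.execute("CREATE TABLE IF NOT EXISTS models (model BLOB NOT NULL)")
    conn.execute("DELETE FROM models")
    conn.execute("INSERT INTO models (model) VALUES (?)", (pickle.dumps(model),))
    conn.commit()


def predict_tip_probability(conn, **features):
    """Load the stored model and score one row of input features."""
    (blob,) = conn.execute("SELECT model FROM models LIMIT 1").fetchone()
    model = pickle.loads(blob)
    z = model["bias"] + sum(model["weights"][name] * value
                            for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


conn = sqlite3.connect(":memory:")
save_model(conn, MODEL)
print(predict_tip_probability(conn, passenger_count=1, trip_distance=2.5))
```

From the application's point of view, the whole prediction is one call that takes input values and returns a score; the model itself never leaves the database layer.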

**Done with the intro?**
[Start the lab](scripts/Lab.md)
@@ -0,0 +1,2 @@
CREATE DATABASE taxidata;
GO
@@ -0,0 +1,24 @@
USE [taxidata]
GO


CREATE FUNCTION [dbo].[fnEngineerFeatures] (
@passenger_count int = 0,
@trip_distance float = 0,
@trip_time_in_secs int = 0,
@direct_distance float = 0)
RETURNS TABLE
AS
RETURN
(

SELECT
@passenger_count AS passenger_count,
@trip_distance AS trip_distance,
@trip_time_in_secs AS trip_time_in_secs,
@direct_distance as direct_distance
)

GO


Binary file not shown.
@@ -0,0 +1,33 @@
use taxidata
go

CREATE PROCEDURE [dbo].[TrainTipPredictionModel]
AS
BEGIN
DECLARE @inquery nvarchar(max) = N'
select tipped, passenger_count, trip_time_in_secs, trip_distance, direct_distance
from nyctaxi_features
'

--delete previous stored models
truncate table dbo.nyc_taxi_models

-- Insert the trained model into a database table
INSERT INTO nyc_taxi_models
EXEC sp_execute_external_script
@language = N'R',
@script = N'
## Create model
logitObj <- rxLogit(tipped ~ passenger_count + trip_distance + trip_time_in_secs + direct_distance, data = InputDataSet)
## Serialize model and put it in data frame
trained_model <- data.frame(model=as.raw(serialize(logitObj, NULL)));
',
@input_data_1 = @inquery,
@output_data_1_name = N'trained_model'
;

END
GO
@@ -0,0 +1,9 @@
USE [taxidata]
GO

CREATE TABLE [dbo].[nyc_taxi_models](
[model] [varbinary](max) NOT NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
GO


@@ -0,0 +1,15 @@
USE [taxidata]
GO

CREATE TABLE [dbo].[nyctaxi_features](
[passenger_count] [int] NULL,
[trip_time_in_secs] [bigint] NULL,
[trip_distance] [float] NULL,
[direct_distance] [float] NULL,
[tip_amount] [float] NULL,
[tipped] [int] NULL
) ON [PRIMARY]

GO


@@ -0,0 +1,23 @@
## Use this script if you want to go through the steps in an R tool (such as RGui, RStudio or R Tools for VS)
## It trains the model with rxLogit, as the TrainTipPredictionModel stored procedure does, so the RevoScaleR package must be available
## These are the steps to reproduce the lab

install.packages("RODBC")
library(RODBC)

##Connect to SQL Server 2016, assumes a Windows Authentication method
dbhandle <- odbcDriverConnect('driver={SQL Server};server=<yourservername>;database=taxidata;trusted_connection=true')

##Run the query to bring in the data we'll use to create the model
res <- sqlQuery(dbhandle, 'select tipped, passenger_count, trip_time_in_secs, trip_distance, direct_distance from nyctaxi_features')

##Create the model...
model <- rxLogit(tipped ~ passenger_count + trip_distance + trip_time_in_secs + direct_distance, res)
summary(model)

##Now, let's create the frame with the parameters for the prediction
prediction_parameters <- data.frame(passenger_count = 1, trip_time_in_secs = 631, trip_distance = 2.5, direct_distance = 2)

##predict
OutputDataFrame <- rxPredict(model, prediction_parameters, outData = NULL, predVarNames = "Score", type = "response", writeModelVars = FALSE, overwrite = TRUE)
OutputDataFrame
Binary file not shown.
Binary file not shown.