TabPy Tools

TabPy tools is a Python package for managing the Python functions published on a TabPy server.

Connecting to TabPy

The tools library uses the notion of connecting to a service to avoid having to specify the service location for all subsequent operations:

from tabpy_tools.client import Client

client = Client('http://localhost:9004/')

The URL and port are those of the running Tableau-Python-Server (TabPy) process; more information can be found in the server section of the documentation.
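Once connected, you can check the connection and see what is already published by listing the server's endpoints (the get_endpoints method is described in more detail later in this document):

# Returns a dictionary of the endpoints currently deployed on the server,
# keyed by endpoint name.
client.get_endpoints()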

Deploying a Function

A persisted endpoint is backed by a Python method. For example:

def add(x, y):
    import numpy as np
    return np.add(x, y).tolist()

client.deploy('add', add, 'Adds two numbers x and y')

The next example is more complex, using scikit-learn's clustering API:

def clustering(x, y):
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler
    X = np.column_stack([x, y])
    X = StandardScaler().fit_transform(X)
    db = DBSCAN(eps=1, min_samples=3).fit(X)
    return db.labels_.tolist()


client.deploy('clustering',
              clustering,
              'Returns cluster Ids for each data point specified by the '
              'pairs in x and y')

In this example the function clustering expects a set of two-dimensional data points, represented by the list of all x-coordinates and the list of all y-coordinates. It returns a set of numerical labels corresponding to the clusters each data point is assigned to. We deploy this function as an endpoint named clustering. It is now reachable as a REST API, as well as through the TabPy tools - for details see the next section.

You can re-deploy a function (for example, after you modified its code) by setting the override parameter to True:

client.deploy('add', add, 'Adds two numbers x and y', override=True)

Each re-deployment of an endpoint will increment its version number, which is also returned as part of the query result.
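For example, after re-deploying add as above, a query against the endpoint returns the new version number (the query method is covered later in this document; the argument values here are illustrative):

result = client.query('add', [1, 2], [3, 4])
# 'version' increases by one with every re-deployment of the endpoint.
print(result['version'])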

When deploying endpoints that rely on supervised learning models, you may want to load a saved model instead of training on-the-fly for performance reasons.

Below is an excerpt from the training stage of a hypothetical model that predicts whether or not a loan will default:

from sklearn.ensemble import GradientBoostingClassifier

predictors = [x for x in train.columns if x not in [target, RowID]]
gbm = GradientBoostingClassifier(learning_rate=0.01, n_estimators=600,
                                 max_depth=9, min_samples_split=1200,
                                 min_samples_leaf=60, subsample=0.85,
                                 random_state=10)
modelfit(gbm, train, test, predictors)

When the trained model (named gbm in this case) is used in a function being deployed (as in gbm.predict(...) below), Tableau will automatically save its definition using cloudpickle along with the definition of the function. The model will also be kept in memory on the server to achieve fast response times. If you instead persist your model to disk and read it as part of your scoring function code, response times will be noticeably longer, because the code (including the model loading) is executed every time a client hits the endpoint. To get the best performance, we recommend following the methodology outlined in this example.

def LoanDefaultClassifier(Loan_Amount, Loan_Tenure, Monthly_Income, Age):
    import pandas as pd
    data = pd.concat([Loan_Amount, Loan_Tenure, Monthly_Income, Age], axis=1)
    return gbm.predict(data)

client.deploy('WillItDefault',
              LoanDefaultClassifier,
              'Returns whether a loan application is likely to default.')
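Once deployed, the endpoint can be queried like any other endpoint. The sketch below uses made-up argument values and assumes one list of values per column, in the order expected by LoanDefaultClassifier:

loan_amount = [5000, 12000]
loan_tenure = [36, 60]
monthly_income = [4200, 3100]
age = [29, 45]

# Query the deployed endpoint with one list per input column.
client.query('WillItDefault', loan_amount, loan_tenure, monthly_income, age)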

You can find a detailed working example, with a downloadable sample Tableau workbook and an accompanying Jupyter notebook that walks through the model fitting, evaluation, and publishing steps, on our blog.

Endpoints that are no longer needed can be removed as follows:

client.remove('WillItDefault')

Predeployed Functions

To set up the models, download the latest version of TabPy and follow the instructions to install and start your server. Once the server is running, navigate to the models directory and run setup.py. If your TabPy server is running on the default port (9004), you do not need to specify a port when launching the script. If your server is running on a different port, pass it on the command line like so:

python setup.py 4047 

The setup file will install all of the necessary dependencies (sklearn, nltk, textblob, pandas, and numpy) and deploy all of the prebuilt models located in ./models/scripts. For every model that is successfully deployed, a message is printed to the console:

"Successfully deployed PCA"

If you would like to deploy additional models using the deploy script, you can copy any Python files into the ./models/scripts directory and modify setup.py to include all necessary packages when installing dependencies, or alternatively install the required dependencies manually.
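As a rough sketch of what such a drop-in script could look like (an assumption about the flow, not the structure of the shipped scripts; use the existing files in ./models/scripts as the authoritative template), a script might simply connect to the server and deploy a function:

# hypothetical_model.py - illustrative only; the shipped scripts in
# ./models/scripts may follow a different structure.
from tabpy_tools.client import Client


def word_count(texts):
    # Return the number of whitespace-separated tokens for each input string.
    return [len(str(t).split()) for t in texts]


if __name__ == '__main__':
    client = Client('http://localhost:9004/')
    client.deploy('Word Count', word_count,
                  'Returns the number of words in each piece of text',
                  override=True)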

Principal Component Analysis (PCA)

Principal component analysis is a statistical technique which extracts new, linearly uncorrelated variables out of a dataset that capture the maximum variance in the data. In this way, PCA can be used to reduce the number of variables in a high-dimensional dataset, a process called dimensionality reduction. The first principal component captures the largest amount of variance; the second captures the largest portion of the remaining variance while remaining orthogonal to the first, and so on. This allows the number of dimensions to be reduced while preserving as much of the information in the original data as possible. PCA is useful in exploratory data analysis because complex relationships can be visualized in a 2D scatter plot of the first few principal components.

TabPy’s implementation of PCA uses the scikit-learn decomposition.PCA algorithm, which is further documented here. It requires the selected component to be > 0 and <= the number of variables you pass in to the function. When passing categorical variables, we perform the scikit-learn One Hot Encoding to transform your non-numeric variables into a table of 0s and 1s. In order for One Hot Encoding to be performant, the number of unique values your categorical column may contain is limited to 25, and nulls or empty strings are not permitted in the column. Before PCA is performed, all variables are normalized to have a mean of 0 and unit variance using the scikit-learn StandardScaler.
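A simplified sketch of the kind of computation described above, not TabPy's actual implementation: the function name and column handling are illustrative, and pandas get_dummies is used here in place of scikit-learn's One Hot Encoding for brevity.

import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def pca_component(component, *columns):
    # Assemble the passed columns into a DataFrame (hypothetical helper).
    data = pd.DataFrame({'col%d' % i: c for i, c in enumerate(columns)})

    # Encode non-numeric columns into 0/1 indicator columns.
    encoded = pd.get_dummies(data)

    # Normalize every variable to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(encoded)

    # Return the requested principal component (1-based index).
    pca = PCA(n_components=component)
    return pca.fit_transform(scaled)[:, component - 1].tolist()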

A Tableau calculated field to perform PCA will look like the following:

tabpy.query('PCA', 1, _arg1, _arg2, _arg3)['response']

Sentiment Analysis

Sentiment analysis is a technique which uses natural language processing to extract the emotional positivity or negativity (the sentiment) behind a piece of text and convert it into a numeric value. Our implementation of sentiment analysis returns a polarity score between -1 and 1 that rates the positivity of the string, with 1 being very positive and -1 being very negative. Calling the Sentiment Analysis function from TabPy in Tableau will look like the following, where _arg1 is a dimension containing text:

tabpy.query('Sentiment Analysis', _arg1)['response']

Python provides multiple packages that compute sentiment analysis; our implementation defaults to NLTK's sentiment package. If you would like to use TextBlob's sentiment analysis algorithm instead, specify the optional argument library='textblob' when calling the Sentiment Analysis function through a calculated field in Tableau:

tabpy.query('Sentiment Analysis', _arg1, library='textblob')['response']
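For reference, the polarity scores these two libraries produce can be computed directly in Python roughly as follows (a sketch of the underlying library calls, not the deployed function's exact code):

from nltk.sentiment.vader import SentimentIntensityAnalyzer
from textblob import TextBlob

text = 'TabPy makes it easy to extend Tableau with Python.'

# NLTK's VADER analyzer: the 'compound' score is a polarity in [-1, 1].
# Requires the 'vader_lexicon' corpus (nltk.download('vader_lexicon')).
nltk_score = SentimentIntensityAnalyzer().polarity_scores(text)['compound']

# TextBlob's polarity is also a float in [-1, 1].
textblob_score = TextBlob(text).sentiment.polarity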

Providing Schema Metadata

As soon as you share your deployed functions, you also need to share metadata about them. The consumer of an endpoint needs to know the details of how to use it, such as:

  • The general purpose of the endpoint
  • Input parameter names, data types, and their meaning
  • Return data type and description

This goes beyond the single description string we used above when deploying the function add. You can use the optional schema parameter of deploy to provide such a structured description, which can then be retrieved by other users connected to the same server. The schema is interpreted as a JSON Schema object, which you can either create manually or generate using a utility method provided in this tools package:

from tabpy_tools.schema import generate_schema

schema = generate_schema(
  input={'x': 3, 'y': 2},
  output=5,
  input_description={'x': 'first value',
                     'y': 'second value'},
  output_description='the sum of x and y')

client.deploy('add', add, 'Adds two numbers x and y', schema=schema)

To describe more complex input, like arrays, you would use the following syntax:

from tabpy_tools.schema import generate_schema

schema = generate_schema(
  input={'x': [6.35, 6.40, 6.65, 8.60],
         'y': [1.95, 1.95, 2.05, 3.05]},
  output=[0, 0, 0, 1],
  input_description={'x': 'list of x values',
                     'y': 'list of y values'},
  output_description='cluster Ids for each point x, y')

client.deploy('clustering',
              clustering,
              'Returns cluster Ids for each data point specified by the '
              'pairs in x and y',
              schema=schema)

A schema described as such can be retrieved through the REST Endpoints API or through the get_endpoints client API as follows:

client.get_endpoints()['add']['schema']

Querying an Endpoint

Once a Python function has been deployed to the server process, you can use the client's query method to query it (assumes you’re already connected to the service):

x = [6.35, 6.40, 6.65, 8.60, 8.90, 9.00, 9.10]
y = [1.95, 1.95, 2.05, 3.05, 3.05, 3.10, 3.15]

client.query('clustering', x, y)

Response:

{
  'model': 'clustering',
  'response': [0, 0, 0, 1, 1, 1, 1],
  'uuid': '1ca01e46-733c-4a77-b3da-3ded84dff4cd',
  'version': 2
}
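To work with only the returned values, index into the response dictionary, mirroring the ['response'] pattern used in the Tableau calculated fields above:

labels = client.query('clustering', x, y)['response']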

Evaluating Arbitrary Python Scripts

The other core functionality, besides deploying and querying methods as endpoints, is the ad-hoc execution of Python code, called evaluate. Evaluate does not have a Python API in tabpy-tools, only a raw REST interface that other client bindings can easily implement. Tableau connects to TabPy using REST Evaluate.

evaluate allows calling a deployed endpoint from within the Python code block. The convention for this is to use the provided function tabpy.query in the code, which behaves like the query method in tabpy-tools. See the REST API documentation for an example.
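As an illustration of that REST interface, a raw evaluate request might look roughly like the following. This is a sketch that assumes the default server address and the /evaluate request shape described in the REST API documentation; check that documentation for the authoritative format.

import requests

# Submit an ad-hoc script to the evaluate endpoint; _arg1 and _arg2 are
# made available to the script, and tabpy.query calls deployed endpoints.
resp = requests.post(
    'http://localhost:9004/evaluate',
    json={
        'data': {'_arg1': [1, 2], '_arg2': [3, 4]},
        'script': "return tabpy.query('add', _arg1, _arg2)['response']"
    })
print(resp.json())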