TabPy tools is a Python package for managing the Python functions published on a TabPy server.
- Connecting to TabPy
- Authentication
- Deploying a Function
- Predeployed Functions
- Providing Schema Metadata
- Querying an Endpoint
- Evaluating Arbitrary Python Scripts
The tools library uses the notion of connecting to a service to avoid having to specify the service location for all subsequent operations:
from tabpy.tabpy_tools.client import Client
client = Client('http://localhost:9004/')
The URL and port are where the Tableau-Python-Server process has been started - more info can be found in the Starting TabPy section of the documentation.
When TabPy is configured with the authentication feature on, client code
has to specify the credentials to use during model deployment with the
set_credentials call for a client:
client.set_credentials('username', 'P@ssw0rd')
Credentials only need to be set once for all further client operations.
In cases where credentials are not provided but are required, the deployment will fail with an "Unauthorized" code (401).
For instructions on how to configure and enable the authentication feature for TabPy, see TabPy Server Configuration Instructions.
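For clients that talk to the REST interface directly instead of going through tabpy-tools, the same credentials would have to accompany each request. The sketch below assumes the server uses HTTP Basic authentication; the helper name basic_auth_header is hypothetical and not part of TabPy.

```python
import base64


def basic_auth_header(username, password):
    """Build an HTTP Basic Authorization header (hypothetical helper, not a TabPy API)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


# Same placeholder credentials as in the set_credentials example above.
headers = basic_auth_header("username", "P@ssw0rd")
```

The resulting dictionary can be passed as request headers by whichever HTTP client you use. Because Basic authentication only encodes the credentials and does not encrypt them, it is normally combined with HTTPS.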
A persisted endpoint is backed by a Python method. For example:
def add(x, y):
    import numpy as np
    return np.add(x, y).tolist()

client.deploy('add', add, 'Adds two numbers x and y')
The next example is more complex, using scikit-learn's clustering API:
def clustering(x, y):
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.preprocessing import StandardScaler
    X = np.column_stack([x, y])
    X = StandardScaler().fit_transform(X)
    db = DBSCAN(eps=1, min_samples=3).fit(X)
    return db.labels_.tolist()

client.deploy('clustering',
              clustering,
              'Returns cluster Ids for each data point specified by the '
              'pairs in x and y')
In this example the function clustering expects a set of two-dimensional
data points, represented by the list of all x-coordinates and the list of all
y-coordinates. It will return a set of numerical labels corresponding to the
clusters to which each datapoint is assigned. We deploy this function as an
endpoint named clustering. It is now reachable as a REST API, as
well as through the TabPy tools - for details see the next section.
You can re-deploy a function (for example, after you modified its code) by setting
the override parameter to True:
client.deploy('add', add, 'Adds two numbers x and y', override=True)
Each re-deployment of an endpoint will increment its version number, which is also returned as part of the query result.
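A deployment script that runs repeatedly may not know in advance whether an endpoint already exists. The sketch below decides on override by looking the name up in the result of get_endpoints. The helper deploy_or_update and the FakeClient stand-in are hypothetical and exist only so the example runs without a server; with a real connection you would pass the client object created earlier.

```python
def deploy_or_update(client, name, func, description):
    """Deploy func, setting override=True only when the name is already taken."""
    already_deployed = name in client.get_endpoints()
    client.deploy(name, func, description, override=already_deployed)
    return already_deployed


class FakeClient:
    """In-memory stand-in for a tabpy-tools Client (illustration only)."""

    def __init__(self):
        self.endpoints = {}

    def get_endpoints(self):
        return dict(self.endpoints)

    def deploy(self, name, func, description, override=False):
        if name in self.endpoints and not override:
            raise RuntimeError(f"endpoint {name!r} already exists")
        version = self.endpoints.get(name, {}).get("version", 0) + 1
        self.endpoints[name] = {"description": description, "version": version}


def add(x, y):
    # Pure-Python variant of the add example so the sketch has no dependencies.
    return [a + b for a, b in zip(x, y)]


fake = FakeClient()
first = deploy_or_update(fake, "add", add, "Adds two numbers x and y")
second = deploy_or_update(fake, "add", add, "Adds two numbers x and y")
```

The first call deploys a new endpoint and the second call overrides it, which in this stand-in bumps the version from 1 to 2.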
When deploying endpoints which rely on supervised learning models, you may want to load a saved model instead of training on-the-fly for performance reasons.
Below is an excerpt from the training stage of a hypothetical model that predicts whether or not a loan will default:
from sklearn.ensemble import GradientBoostingClassifier
predictors = [x for x in train.columns if x not in [target, RowID]]
gbm = GradientBoostingClassifier(learning_rate=0.01, n_estimators=600, max_depth=9,
                                 min_samples_split=1200, min_samples_leaf=60,
                                 subsample=0.85, random_state=10)
modelfit(gbm, train, test, predictors)
When the trained model (named gbm in this case) is used in a function being
deployed (as in gbm.predict(...) below), TabPy will automatically save its
definition using cloudpickle along with the definition of the function. The model
will also be kept in memory on the server in order to achieve faster response times.
If you instead persist your model to disk manually and read it as part of your scoring
function code, response times will be noticeably longer, because the code (including
model loading) is executed every time a client hits the endpoint. To get the best
performance, we recommend following the methodology outlined in this example.
def LoanDefaultClassifier(Loan_Amount, Loan_Tenure, Monthly_Income, Age):
    import pandas as pd
    data = pd.concat([Loan_Amount, Loan_Tenure, Monthly_Income, Age], axis=1)
    return gbm.predict(data)

client.deploy('WillItDefault',
              LoanDefaultClassifier,
              'Returns whether a loan application is likely to default.')
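The cost difference between the two approaches described above can be illustrated without TabPy. In this sketch load_model is a hypothetical stand-in for an expensive step such as reading a model from disk, and the "model" is a toy function rather than a trained classifier.

```python
load_count = 0


def load_model():
    """Stand-in for an expensive load; counts how often it runs."""
    global load_count
    load_count += 1
    return lambda rows: [sum(row) > 10 for row in rows]  # toy "model"


def score_reloading(rows):
    # Slow pattern: the model is loaded again on every call to the endpoint.
    model = load_model()
    return model(rows)


preloaded_model = load_model()  # loaded once, before the function is deployed


def score_preloaded(rows):
    # Recommended pattern: the function only references the in-memory model.
    return preloaded_model(rows)


rows = [[1, 2], [8, 9]]

load_count = 0
for _ in range(3):
    score_reloading(rows)
reloading_loads = load_count  # one load per call

load_count = 0
for _ in range(3):
    score_preloaded(rows)
preloaded_loads = load_count  # no loads at query time
```

Three queries trigger three loads in the first variant and none in the second, which is the effect the recommendation above is aiming for.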
You can find a detailed working example with a downloadable sample Tableau workbook and an accompanying Jupyter notebook that walks through model fitting, evaluation and publishing steps on our blog.
Endpoints that are no longer needed can be removed as follows:
client.remove('WillItDefault')
To deploy models shipped with TabPy, follow the TabPy Installation Instructions and then the TabPy Server Configuration Instructions. Once your server is running, execute the following command:
tabpy-deploy-models
If your server is running with a custom config, specify the config file on the command line:
tabpy-deploy-models custom.conf
The command will deploy all of the prebuilt models. For every successfully deployed model a message will be printed to the console:
"Successfully deployed PCA"
Use the code in tabpy/models/scripts as an example of how to create a model and
tabpy/models/deploy_models.py as an example of how to deploy a model. Before
executing the deployment script, install all the required dependencies with pip.
You can deploy models individually by navigating to tabpy/models/scripts and running
each file in isolation like so:
python PCA.py
Similarly to the setup script, if your server is running using a custom config, you can specify the config's file path through the command line.
Principal component analysis (PCA) is a statistical technique which extracts new,
linearly uncorrelated variables out of a dataset which capture the maximum
variance in the data. In this way, PCA can be used to reduce the number of
variables in a high dimensional dataset, a process that is called dimensionality
reduction. The first principal component captures the largest amount of
variance, while the second captures the largest portion of the remaining
variance while remaining orthogonal to the first, and so on. This allows the
reduction of the number of dimensions while maintaining as much of the
information from the original data as possible. PCA is useful in
exploratory data analysis because complex linear relationships can be
visualized in a 2D scatter plot of the first few principal components.
TabPy's implementation of PCA uses the scikit-learn decomposition.PCA
algorithm, which is further documented in the scikit-learn documentation.
In the Tableau script, after the function name PCA, you must specify a
principal component to return. This integer input should be > 0 and <= the
number of variables you pass in to the function. When you pass categorical
variables, we apply the scikit-learn One Hot Encoding to transform your
non-numeric variables into a one-hot numeric array of 0s and 1s. In order for
One Hot Encoding to be performant, we have limited the number of unique values
your categorical column may contain to 25 and do not permit any nulls or empty
strings in the column. Before PCA is performed, all variables are normalized to
have a mean of 0 and unit variance using the scikit-learn StandardScaler.
A Tableau calculated field to perform PCA will look like:
tabpy.query('PCA', 1, _arg1, _arg2, _arg3)['response']
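The input constraints described above can be checked before the data ever reaches the server. The following is a minimal pure-Python sketch of those checks and of one-hot encoding. It is not TabPy's implementation (which relies on scikit-learn), and both function names are hypothetical.

```python
def find_pca_input_problems(component, columns, max_categories=25):
    """Return a list of problems; an empty list means the inputs pass the checks."""
    problems = []
    if not 1 <= component <= len(columns):
        problems.append("component must be > 0 and <= the number of variables")
    for index, column in enumerate(columns, start=1):
        if not any(isinstance(value, str) for value in column):
            continue  # treated as numeric, nothing more to check
        if any(value is None or value == "" for value in column):
            problems.append(f"_arg{index} contains nulls or empty strings")
        if len(set(column)) > max_categories:
            problems.append(f"_arg{index} has more than {max_categories} unique values")
    return problems


def one_hot(column):
    """Encode a categorical column as rows of 0s and 1s, one slot per category."""
    categories = sorted(set(column))
    return [[1 if value == category else 0 for category in categories]
            for value in column]


numeric = [1.0, 2.0, 3.0]
ok_problems = find_pca_input_problems(1, [numeric, ["a", "b", "a"]])
bad_problems = find_pca_input_problems(1, [numeric, ["a", "", "b"]])
encoded = one_hot(["a", "b", "a"])
```

Here ok_problems is empty, bad_problems reports the empty string in the second argument, and encoded holds one row per value with a single 1 marking its category.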
Sentiment analysis is a technique which uses natural language processing to
extract the emotional positivity or negativity (the sentiment) behind a piece of
text and converts that into a numeric value. Our implementation of sentiment
analysis returns a polarity score between -1 and 1 which rates the positivity of
the string, with 1 being very positive and -1 being very negative. Calling the
Sentiment Analysis function from TabPy in Tableau will look like the following,
where _arg1 is a Tableau dimension containing text:
tabpy.query('Sentiment Analysis', _arg1)['response']
Python provides multiple packages that compute sentiment analysis. Our
implementation defaults to NLTK's sentiment package. If you would like to use
TextBlob's sentiment analysis algorithm, you can do so by specifying the optional
argument library='textblob' when calling the Sentiment Analysis function through
a calculated field in Tableau:
tabpy.query('Sentiment Analysis', _arg1, library='textblob')['response']
A t-test is a statistical hypothesis test that is used to compare two sample means or a sample's mean against a known population mean. The t-test should be used when the means of the samples follow a normal distribution but the variance may not be known.
TabPy’s pre-deployed t-test implementation can be called using the following syntax,
tabpy.query('ttest', _arg1, _arg2)['response']
and is capable of performing two types of t-tests:
- A t-test for the means of two independent samples with equal variance. This is a two-sided t-test with the null hypothesis being that the mean of sample1 is equal to the mean of sample2:
  - _arg1 (list of numeric values): a list of independent observations.
  - _arg2 (list of numeric values): a list of independent observations equal to the length of _arg1.

  Alternatively, your data may not be split into separate measures. If that is the case you can pass the following fields to ttest:
  - _arg1 (list of numeric values): a list of independent observations.
  - _arg2 (list of categorical variables with cardinality two): a binary factor that maps each observation in _arg1 to either sample1 or sample2 (this list should be equal to the length of _arg1).
- A t-test for the mean of one group:
  - _arg1 (list of numeric values): a list of independent observations.
  - _arg2 (a numeric value): the known population mean.

  A two-sided t-test with the null hypothesis being that the mean of a sample of independent observations is equal to the given population mean.
The function returns a two-tailed p-value (between 0 and 1). Depending on your significance level you may reject or fail to reject the null hypothesis.
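To make the input forms above concrete, the following sketch splits observations by a binary factor and computes the t statistic for both the two-sample (equal variance) and the one-sample case, using only the standard library. It is not the code of the pre-deployed function, all function names are hypothetical, and it stops at the statistic: turning it into a p-value requires the CDF of Student's t distribution, which a statistics library such as SciPy provides.

```python
import math
from statistics import mean, variance


def split_by_factor(values, factor):
    """Split observations into two samples using a binary factor."""
    levels = sorted(set(factor))
    if len(levels) != 2:
        raise ValueError("the factor must have exactly two distinct values")
    if len(values) != len(factor):
        raise ValueError("values and factor must have the same length")
    sample1 = [v for v, f in zip(values, factor) if f == levels[0]]
    sample2 = [v for v, f in zip(values, factor) if f == levels[1]]
    return sample1, sample2


def pooled_t_statistic(sample1, sample2):
    """t statistic for two independent samples, assuming equal variance."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = ((n1 - 1) * variance(sample1)
                  + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(sample1) - mean(sample2)) / standard_error


def one_sample_t_statistic(sample, population_mean):
    """t statistic for one sample against a known population mean."""
    return (mean(sample) - population_mean) / math.sqrt(variance(sample) / len(sample))


observations = [1, 2, 3, 4, 5, 6]
groups = ["a", "a", "a", "b", "b", "b"]
sample1, sample2 = split_by_factor(observations, groups)
t_two_sample = pooled_t_statistic(sample1, sample2)  # about -3.674
t_one_sample = one_sample_t_statistic(sample1, 2)    # sample mean equals 2, so 0
```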
Analysis of variance (ANOVA) helps determine whether two or more group means within a sample differ. By measuring the variation between and within groups and computing the resulting F-statistic, we are able to obtain a p-value. While a statistically significant p-value will inform you that at least two of your groups' means are different from each other, it will not tell you which groups differ.
You can call ANOVA from Tableau in the following way:
tabpy.query('anova', _arg1, _arg2, _arg3)['response']
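The F-statistic mentioned above is the ratio of the between-group mean square to the within-group mean square. The sketch below computes it for a one-way layout in pure Python, assuming each argument is one group of observations. The function name is hypothetical, and the final conversion to a p-value (which needs the F distribution) is left to a statistics library.

```python
from statistics import mean


def one_way_f_statistic(*groups):
    """F statistic for a one-way ANOVA over two or more groups of numbers."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


f_stat = one_way_f_statistic([1, 2, 3], [2, 3, 4], [5, 6, 7])
```

For these three groups the between-group sum of squares is 26 on 2 degrees of freedom and the within-group sum of squares is 6 on 6 degrees of freedom, giving an F-statistic of 13.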
As soon as you share your deployed functions, you also need to share metadata about the function. The consumer of an endpoint needs to know the details of how to use the endpoint, such as:
- The general purpose of the endpoint
- Input parameter names, data types, and their meaning
- Return data type and description
This data goes beyond the single string that we used above when deploying the
function add. You can pass the optional schema parameter to deploy to provide such
a structured description, which can then be retrieved by other users connected
to the same server. The schema is interpreted as a JSON Schema
object, which you can either manually create or generate using a utility
method provided in this tools package:
from tabpy.tabpy_tools.schema import generate_schema
schema = generate_schema(
    input={'x': 3, 'y': 2},
    output=5,
    input_description={'x': 'first value',
                       'y': 'second value'},
    output_description='the sum of x and y')
client.deploy('add', add, 'Adds two numbers x and y', schema=schema)
To describe more complex input, like arrays, you would use the following syntax:
from tabpy.tabpy_tools.schema import generate_schema
schema = generate_schema(
    input={'x': [6.35, 6.40, 6.65, 8.60],
           'y': [1.95, 1.95, 2.05, 3.05]},
    output=[0, 0, 0, 1],
    input_description={'x': 'list of x values',
                       'y': 'list of y values'},
    output_description='cluster Ids for each point x, y')
client.deploy('clustering',
              clustering,
              'Returns cluster Ids for each data point specified by the pairs in x and y',
              schema=schema)
A schema described as such can be retrieved through the REST Endpoints API
or through the get_endpoints client API as follows:
client.get_endpoints()['add']['schema']
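Since the schema is optional, it can be useful to find out which endpoints were deployed without one. The sketch below assumes only what the lookup above shows, namely that get_endpoints returns a dictionary keyed by endpoint name whose values may carry a 'schema' entry. The helper name and the sample metadata are hypothetical placeholders.

```python
def endpoints_without_schema(endpoints):
    """Return the sorted names of endpoints whose metadata carries no schema."""
    return sorted(name for name, info in endpoints.items()
                  if not info.get("schema"))


# Placeholder metadata shaped like client.get_endpoints(); values are made up.
sample_endpoints = {
    "add": {"schema": {"placeholder": True}},
    "clustering": {"schema": None},
    "WillItDefault": {},
}
missing = endpoints_without_schema(sample_endpoints)
```

With a live connection you would call endpoints_without_schema(client.get_endpoints()) instead of using the sample dictionary.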
Once a Python function has been deployed to the server process, you can use the
client's query method to query it (assuming that you're already connected to the
service):
x = [6.35, 6.40, 6.65, 8.60, 8.90, 9.00, 9.10]
y = [1.95, 1.95, 2.05, 3.05, 3.05, 3.10, 3.15]
client.query('clustering', x, y)
Response:
{
    "model": "clustering",
    "response": [0, 0, 0, 1, 1, 1, 1],
    "uuid": "1ca01e46-733c-4a77-b3da-3ded84dff4cd",
    "version": 2
}
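The same query can be issued without tabpy-tools by posting JSON to the server. The sketch below prepares such a request with the standard library and parses a response shaped like the one above. It assumes the REST layout POST /query/<endpoint> with a body of the form {"data": {...}}; the helper name is hypothetical, and the request is only built here, not sent.

```python
import json
from urllib import request


def build_query_request(base_url, endpoint, **arguments):
    """Prepare (but do not send) a POST request for /query/<endpoint>."""
    body = json.dumps({"data": arguments}).encode("utf-8")
    return request.Request(
        url=f"{base_url.rstrip('/')}/query/{endpoint}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_query_request(
    "http://localhost:9004/", "clustering",
    x=[6.35, 6.40, 6.65], y=[1.95, 1.95, 2.05],
)
# Against a running server the request could be sent with request.urlopen(req).

# Parsing a response shaped like the one shown above.
raw = '{"model": "clustering", "response": [0, 0, 0, 1, 1, 1, 1], "version": 2}'
result = json.loads(raw)
labels, version = result["response"], result["version"]
```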
The other core functionality, aside from deploying and querying methods as endpoints,
is the ad-hoc execution of Python code, called evaluate. Evaluate does not
have a Python API in tabpy-tools, only a raw REST interface
that other client bindings can easily implement. Tableau connects to TabPy
using REST Evaluate.
evaluate allows calling a deployed endpoint from within the Python code block.
The convention for this is to call the provided function tabpy.query in the
code, which behaves like the query method in tabpy-tools. See the
REST API documentation for an example.
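As a rough illustration of what a client binding for evaluate might send, the sketch below builds a request whose script calls a deployed endpoint through tabpy.query. It assumes the REST layout POST /evaluate with a JSON body containing "data" (arguments named _arg1, _arg2, ...) and "script"; the helper name is hypothetical, and the request is only prepared, not sent.

```python
import json
from urllib import request


def build_evaluate_request(base_url, script, *args):
    """Prepare (but do not send) a POST /evaluate request.

    Positional values are exposed to the script as _arg1, _arg2, ...
    """
    data = {f"_arg{i}": value for i, value in enumerate(args, start=1)}
    body = json.dumps({"data": data, "script": script}).encode("utf-8")
    return request.Request(
        url=f"{base_url.rstrip('/')}/evaluate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# The script runs on the server and calls the 'add' endpoint deployed earlier.
script = "return tabpy.query('add', _arg1, _arg2)['response']"
req = build_evaluate_request("http://localhost:9004/", script, [1, 2], [3, 4])
payload = json.loads(req.data)
```

If authentication is enabled on the server, the request would additionally need the credentials described in the authentication section.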