diff --git a/README.md b/README.md
index ff59829..0ca9cca 100644
--- a/README.md
+++ b/README.md
@@ -83,8 +83,14 @@ it an outlier. There's a threshold parameter for how strict you might want to be
This allows you to define a function that handles preprocessing. It's
constructed in such a way that you can use the arguments of the function as a parameter
that you can benchmark in a grid-search. This is especially powerful in combination
-with the pandas `.pipe` method. If you're unfamiliar with this amazing feature, yo may appreciate
-[this tutorial](https://calmcode.io/pandas-pipe/introduction.html).
+with the pandas `.pipe` method. If you're unfamiliar with this amazing feature, you
+may appreciate [this tutorial](https://calmcode.io/pandas-pipe/introduction.html).
+
+#### InteractivePreprocessor
+
+This allows you to draw features that you'd like to add to your dataset or
+your machine learning pipeline. You can use it via `tfm.fit(df).transform(df)` and
+`df.pipe(tfm)`.
### Datasets
diff --git a/docs/api/preprocessing.md b/docs/api/preprocessing.md
index 4d8b3e6..7f5767e 100644
--- a/docs/api/preprocessing.md
+++ b/docs/api/preprocessing.md
@@ -1,3 +1,5 @@
# `from hulearn.preprocessing import *`
::: hulearn.preprocessing.pipetransformer
+
+::: hulearn.preprocessing.interactivepreprocessor
diff --git a/docs/examples/examples.md b/docs/examples/examples.md
index 90fcac0..dac6935 100644
--- a/docs/examples/examples.md
+++ b/docs/examples/examples.md
@@ -1,27 +1,77 @@
This page contains a list of short examples that demonstrate the utility of
the tools in this package. The goal for each example is to be small and concise.
-This page is still under construction.
+## Precision and Subgroups
+
+It can be the case that for a subgroup of the population you do not need a model.
+Suppose that we have a session log dataset from "World of Warcraft". We know when
+people logged in, if they were part of a guild and when they stopped playing. You
+can create a machine learning model to predict which players are at risk of quitting
+the game, but you might also be able to come up with some simple rules.
+
+Here is one rule that might work out swell:
+
+> "If any player was playing the video game at 24:00 on New Year's Eve,
+odds are that this person is very invested in the game and won't stop playing."
+
+This one rule will not cover the entire population, but for this subgroup it can be
+very effective.
+
+![](wow.png)
+
+As an illustrative example we'll implement this diagram as a `Classifier`.
-## Insurance for Future Data
+```python
+import numpy as np
+from hulearn.classification import FunctionClassifier
+
+
+# Stand-in for any trained scikit-learn model you already have.
+classifier = SomeScikitLearnModel()
+
+def make_decision(dataf):
+ # First we create a resulting array with all the predictions
+ res = classifier.predict(dataf)
+
+ # Override model prediction if a user is a heavy_user, no matter what
+ res = np.where(dataf['heavy_user'], "stays", res)
+
+ return res
+
+fallback_model = FunctionClassifier(make_decision)
+```
## No Data No Problem
-## Model Guarantees
+Let's say that we're interested in detecting fraud at a tax office. Even without
+looking at the data we can already come up with some sensible rules.
-## Precision and Subgroups
+- Any minor making over the median income is "suspicious".
+- Any person who started more than 2 companies in a year is "suspicious".
+- Any person who has more than 10 bank accounts is "suspicious".
-## Dealing with NA
+The thing about these rules is that they are easy to explain, but they are not based
+on data at all. In fact, the patterns they describe may not even occur in the data.
+This means that a machine learning model may never have picked up the pattern that
+we're interested in. Thankfully, the lack of data can be compensated for with
+business rules.
+
+![](tree.png)
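
The three rules above can be sketched as a plain function; the column names
(`age`, `income`, `companies_started`, `bank_accounts`) are hypothetical here,
purely for illustration. Such a function could then be wrapped in a
`FunctionClassifier`.

```python
import numpy as np
import pandas as pd


def fraud_rules(dataf, median_income=35_000):
    # Hypothetical column names; adapt these to your own dataset.
    # Flag a row as "suspicious" when any of the three rules fires.
    suspicious = (
        ((dataf["age"] < 18) & (dataf["income"] > median_income))  # minor, high income
        | (dataf["companies_started"] > 2)                         # many new companies
        | (dataf["bank_accounts"] > 10)                            # many bank accounts
    )
    return np.where(suspicious, "suspicious", "ok")


df = pd.DataFrame({
    "age": [16, 45, 30],
    "income": [50_000, 20_000, 40_000],
    "companies_started": [0, 3, 1],
    "bank_accounts": [2, 1, 3],
})
preds = fraud_rules(df)
```

Because the rules live in an ordinary function, they remain easy to read, test
and explain.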
## Comfort Zone
+Models typically have a "comfort zone". If a new data point comes in that is
+very different from what the model saw during training, it should not be treated
+the same way. You can also argue that points with a low `proba` score should
+not be automated.
+
If you want to prevent predictions where the model is "unsure" then you
-might want to construct a `FunctionClassifier` that handles the logic you
-require. For example, we might draw a diagram like;
+might want to follow this diagram:
![](../guide/finding-outliers/diagram.png)
-As an illustrative example we'll implement a diagram like above as a `Classifier`.
+You can construct such a system by creating a `FunctionClassifier` that
+handles the logic you require. As an illustrative example we'll implement
+this diagram as a `Classifier`.
```python
import numpy as np
@@ -55,5 +105,3 @@ For more information on why this tactic is helpful:
- [blogpost](https://koaning.io/posts/high-on-probability-low-on-certainty/)
- [pydata talk](https://www.youtube.com/watch?v=Z8MEFI7ZJlA)
-
-## Risk Class Translation
diff --git a/docs/examples/faq.md b/docs/examples/faq.md
index 4c9aee4..4255fa3 100644
--- a/docs/examples/faq.md
+++ b/docs/examples/faq.md
@@ -2,6 +2,15 @@
Feel free to ask questions [here](https://github.com/koaning/human-learn/issues).
+## What are the Lessons Learned?
+
+If you're interested in some of the lessons the creators of this tool learned
+while building it, all you need to do is follow the Python tradition.
+
+```python
+from hulearn import this
+```
+
## Why Make This?
Back in the old days, it was common to write rule-based systems. Systems that do;
diff --git a/docs/examples/tree.png b/docs/examples/tree.png
new file mode 100644
index 0000000..5a8c8b0
Binary files /dev/null and b/docs/examples/tree.png differ
diff --git a/docs/examples/wow.png b/docs/examples/wow.png
new file mode 100644
index 0000000..7705a59
Binary files /dev/null and b/docs/examples/wow.png differ
diff --git a/docs/guide/drawing-features/custom-features.md b/docs/guide/drawing-features/custom-features.md
new file mode 100644
index 0000000..7832dce
--- /dev/null
+++ b/docs/guide/drawing-features/custom-features.md
@@ -0,0 +1,57 @@
+So far we've explored drawing as a tool for building models, but it can also be
+used as a tool to generate features. To explore this, let's load the penguins
+dataset again.
+
+```python
+from sklego.datasets import load_penguins
+
+df = load_penguins(as_frame=True).dropna()
+```
+
+## Drawing
+
+We can draw over this dataset. It's like before, but with one crucial difference.
+
+```python
+from hulearn.experimental.interactive import InteractiveCharts
+
+# Note that the `labels` argument here is a list, not a string! This
+# tells the tool that we want to be able to add custom groups that are
+# not defined by a column in the dataframe.
+charts = InteractiveCharts(df, labels=['group_one', 'group_two'])
+```
+
+Let's make a custom drawing.
+
+```python
+charts.add_chart(x="flipper_length_mm", y="body_mass_g")
+```
+
+Let's assume the new drawing looks something like this.
+
+![](drawing.png)
+
+So far these drawn features have been used to construct models. But they
+can also be used to help label data or to generate extra features for machine
+learning models.
+
+## Features
+
+This library makes it easy to add these features to scikit-learn
+pipelines or to pandas. To get started, you'll want to import the
+`InteractivePreprocessor`.
+
+```python
+from hulearn.preprocessing import InteractivePreprocessor
+tfm = InteractivePreprocessor(json_desc=charts.data())
+```
+
+This `tfm` object can be used as a preprocessing step inside of
+scikit-learn, but it can also be used in a pandas pipeline.
+
+```python
+# The flow for scikit-learn
+tfm.fit(df).transform(df)
+# The flow for pandas
+df.pipe(tfm.pandas_pipe)
+```
diff --git a/docs/guide/drawing-features/drawing.png b/docs/guide/drawing-features/drawing.png
new file mode 100644
index 0000000..072cef3
Binary files /dev/null and b/docs/guide/drawing-features/drawing.png differ
diff --git a/docs/guide/notebooks/05-custom-features.ipynb b/docs/guide/notebooks/05-custom-features.ipynb
new file mode 100644
index 0000000..fa038fb
--- /dev/null
+++ b/docs/guide/notebooks/05-custom-features.ipynb
@@ -0,0 +1,674 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "%load_ext autoreload\n",
+ "%autoreload 2"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 70,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
+ "0 Adelie Torgersen 39.1 18.7 181.0 \n",
+ "1 Adelie Torgersen 39.5 17.4 186.0 \n",
+ "2 Adelie Torgersen 40.3 18.0 195.0 \n",
+ "4 Adelie Torgersen 36.7 19.3 193.0 \n",
+ "5 Adelie Torgersen 39.3 20.6 190.0 \n",
+ "\n",
+ " body_mass_g sex \n",
+ "0 3750.0 male \n",
+ "1 3800.0 female \n",
+ "2 3250.0 female \n",
+ "4 3450.0 female \n",
+ "5 3650.0 male "
+ ]
+ },
+ "execution_count": 70,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "from sklego.datasets import load_penguins\n",
+ "\n",
+ "df = load_penguins(as_frame=True).dropna()\n",
+ "df.head()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 71,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from hulearn.experimental.interactive import InteractiveCharts"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 72,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "clf = InteractiveCharts(df, labels=['group_one', 'group_two'])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 74,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "clf.add_chart(x=\"flipper_length_mm\", y=\"body_mass_g\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 75,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from hulearn.preprocessing import InteractivePreprocessor"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 76,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "tfm = InteractivePreprocessor(json_desc=clf.data())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 78,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n",
+ "0 Adelie Torgersen 39.1 18.7 181.0 \n",
+ "1 Adelie Torgersen 39.5 17.4 186.0 \n",
+ "2 Adelie Torgersen 40.3 18.0 195.0 \n",
+ "3 Adelie Torgersen 36.7 19.3 193.0 \n",
+ "4 Adelie Torgersen 39.3 20.6 190.0 \n",
+ "\n",
+ " body_mass_g sex group_one group_two \n",
+ "0 3750.0 male 0 0 \n",
+ "1 3800.0 female 0 0 \n",
+ "2 3250.0 female 0 0 \n",
+ "3 3450.0 female 0 0 \n",
+ "4 3650.0 male 0 0 "
+ ]
+ },
+ "execution_count": 78,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df.pipe(tfm.pandas_pipe).head()"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.7"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/docs/index.md b/docs/index.md
index e78eb72..01fe444 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -31,11 +31,22 @@ You can install this tool via `pip`.
python -m pip install human-learn
```
+## Guides
+
+To help you get started, we've written a few getting started guides.
+
+1. [Functions as a Model](guide/function-classifier/function-classifier.html)
+2. [Human Preprocessing](guide/function-preprocess/function-preprocessing.html)
+3. [Drawing as a Model](guide/drawing-classifier/drawing.html)
+4. [Outliers and Comfort](guide/finding-outliers/outliers.html)
+5. [Drawing Features](guide/drawing-features/custom-features.html)
+
+You can also check out the API documentation [here](api/classification.html).
+
## Features
This library hosts a couple of models that you can play with.
-
### Classification Models
#### FunctionClassifier
@@ -79,8 +90,14 @@ it an outlier. There's a threshold parameter for how strict you might want to be
This allows you to define a function that handles preprocessing. It's
constructed in such a way that you can use the arguments of the function as a parameter
that you can benchmark in a grid-search. This is especially powerful in combination
-with the pandas `.pipe` method. If you're unfamiliar with this amazing feature, yo may appreciate
-[this tutorial](https://calmcode.io/pandas-pipe/introduction.html).
+with the pandas `.pipe` method. If you're unfamiliar with this amazing feature, you
+may appreciate [this tutorial](https://calmcode.io/pandas-pipe/introduction.html).
+
+#### InteractivePreprocessor
+
+This allows you to draw features that you'd like to add to your dataset or
+your machine learning pipeline. You can use it via `tfm.fit(df).transform(df)` and
+`df.pipe(tfm.pandas_pipe)`.
### Datasets
diff --git a/hulearn/__init__.py b/hulearn/__init__.py
index 3dc1f76..485f44a 100644
--- a/hulearn/__init__.py
+++ b/hulearn/__init__.py
@@ -1 +1 @@
-__version__ = "0.1.0"
+__version__ = "0.1.1"
diff --git a/hulearn/experimental/interactive.py b/hulearn/experimental/interactive.py
index ee4fad2..edff2d9 100644
--- a/hulearn/experimental/interactive.py
+++ b/hulearn/experimental/interactive.py
@@ -6,7 +6,7 @@
from bokeh.plotting import figure, show
from bokeh.models import PolyDrawTool, PolyEditTool
from bokeh.layouts import row
-from bokeh.models import Label
+from bokeh.models.annotations import Label
from bokeh.models.widgets import Div
from bokeh.io import output_notebook
@@ -37,7 +37,7 @@ def __init__(self, dataf, labels):
self.labels = labels
self.charts = []
- def add_chart(self, x, y):
+ def add_chart(self, x, y, size=5, alpha=0.5):
"""
Generate an interactive chart to a cell.
@@ -71,7 +71,12 @@ def add_chart(self, x, y):
```
"""
chart = SingleInteractiveChart(
- dataf=self.dataf.copy(), labels=self.labels, x=x, y=y
+ dataf=self.dataf.copy(),
+ labels=self.labels,
+ x=x,
+ y=y,
+ size=size,
+ alpha=alpha,
)
self.charts.append(chart)
chart.show()
@@ -84,7 +89,7 @@ def to_json(self, path):
class SingleInteractiveChart:
- def __init__(self, dataf, labels, x, y):
+ def __init__(self, dataf, labels, x, y, size=5, alpha=0.5):
self.uuid = str(uuid.uuid4())[:10]
self.x = x
self.y = y
@@ -103,7 +108,9 @@ def __init__(self, dataf, labels, x, y):
if len(self.labels) > 5:
raise ValueError("We currently only allow for 5 classes max.")
- self.plot.circle(x=x, y=y, color="color", source=self.source)
+ self.plot.circle(
+ x=x, y=y, color="color", source=self.source, size=size, alpha=alpha
+ )
# Create all the tools for drawing
self.poly_patches = {}
diff --git a/hulearn/preprocessing/__init__.py b/hulearn/preprocessing/__init__.py
index 1956717..cee21c1 100644
--- a/hulearn/preprocessing/__init__.py
+++ b/hulearn/preprocessing/__init__.py
@@ -1,3 +1,4 @@
from hulearn.preprocessing.pipetransformer import PipeTransformer
+from hulearn.preprocessing.interactivepreprocessor import InteractivePreprocessor
-__all__ = ["PipeTransformer"]
+__all__ = ["PipeTransformer", "InteractivePreprocessor"]
diff --git a/hulearn/preprocessing/interactivepreprocessor.py b/hulearn/preprocessing/interactivepreprocessor.py
new file mode 100644
index 0000000..3004d21
--- /dev/null
+++ b/hulearn/preprocessing/interactivepreprocessor.py
@@ -0,0 +1,147 @@
+import json
+import pathlib
+
+import numpy as np
+import pandas as pd
+from shapely.geometry import Point
+from shapely.geometry.polygon import Polygon
+
+from sklearn.base import BaseEstimator
+from sklearn.utils.validation import check_is_fitted
+
+
+class InteractivePreprocessor(BaseEstimator):
+ """
+ This tool allows you to take a drawn model and use it as a featurizer.
+
+ Arguments:
+ json_desc: chart data in dictionary form
+ refit: if `True`, you no longer need to call `.fit(X, y)` in order to `.transform(X)`
+ """
+
+ def __init__(self, json_desc, refit=True):
+ self.json_desc = json_desc
+ self.refit = refit
+
+ @classmethod
+ def from_json(cls, path, refit=True):
+ """
+ Load the classifier from json stored on disk.
+
+ Arguments:
+ path: path of the json file
+ refit: if `True`, you no longer need to call `.fit(X, y)` in order to `.transform(X)`
+
+ Usage:
+
+ ```python
+ from hulearn.preprocessing import InteractivePreprocessor
+
+ InteractivePreprocessor.from_json("path/to/file.json")
+ ```
+ """
+ json_desc = json.loads(pathlib.Path(path).read_text())
+ return InteractivePreprocessor(json_desc=json_desc, refit=refit)
+
+ @property
+ def poly_data(self):
+ for chart in self.json_desc:
+ chart_id = chart["chart_id"]
+ labels = chart["polygons"].keys()
+ coords = chart["polygons"].values()
+ for lab, p in zip(labels, coords):
+ x_lab, y_lab = p.keys()
+ x_coords, y_coords = list(p.values())
+ for i in range(len(x_coords)):
+ poly_data = list(zip(x_coords[i], y_coords[i]))
+ if len(poly_data) >= 3:
+ poly = Polygon(poly_data)
+ yield {
+ "x_lab": x_lab,
+ "y_lab": y_lab,
+ "poly": poly,
+ "label": lab,
+ "chart_id": chart_id,
+ }
+
+ def _count_hits(self, clf_data, data_in):
+ counts = {k: 0 for k in self.classes_}
+ for c in clf_data:
+ point = Point(data_in[c["x_lab"]], data_in[c["y_lab"]])
+ if c["poly"].contains(point):
+ counts[c["label"]] += 1
+ return counts
+
+ def fit(self, X, y=None):
+ """
+ Fit the preprocessor. Mostly a formality; it only records the class labels.
+ """
+ self.classes_ = list(self.json_desc[0]["polygons"].keys())
+ self.fitted_ = True
+ return self
+
+ def transform(self, X):
+ """
+ Apply the counting/binning based on the drawings.
+
+ Usage:
+
+ ```python
+ from hulearn.preprocessing import InteractivePreprocessor
+ tfm = InteractivePreprocessor(clf_data)
+ X, y = load_data(...)
+
+ # This doesn't do much, but scikit-learn demands it.
+ tfm.fit(X, y)
+
+ # This generates the new features, based on your drawn charts.
+ tfm.transform(X)
+ ```
+ """
+ # Because we're not doing anything during training, for convenience this
+ # method can formally "fit" during the transform call. This is a scikit-learn
+ # anti-pattern, so we allow you to turn it off.
+ if self.refit and not hasattr(self, "fitted_"):
+ self.fit(X)
+ check_is_fitted(self, ["classes_", "fitted_"])
+ if isinstance(X, pd.DataFrame):
+ hits = [
+ self._count_hits(self.poly_data, x[1].to_dict()) for x in X.iterrows()
+ ]
+ else:
+ hits = [
+ self._count_hits(self.poly_data, {k: v for k, v in enumerate(x)})
+ for x in X
+ ]
+ count_arr = np.array([[h[c] for c in self.classes_] for h in hits])
+ return count_arr
+
+ def pandas_pipe(self, dataf):
+ """
+ Use this transformer as part of a `.pipe()` method chain in pandas.
+
+ Usage:
+
+ ```python
+ import numpy as np
+ import pandas as pd
+
+ # Load in a dataframe from somewhere
+ df = load_data(...)
+
+ # Load in drawn chart data
+ from hulearn.preprocessing import InteractivePreprocessor
+ tfm = InteractivePreprocessor.from_json("path/file.json")
+
+ # This adds new columns to the dataframe
+ df.pipe(tfm.pandas_pipe)
+ ```
+ """
+ new_dataf = pd.DataFrame(
+ self.fit(dataf).transform(dataf), columns=self.classes_
+ )
+ return pd.concat(
+ [dataf.copy().reset_index(drop=True), new_dataf.reset_index(drop=True)],
+ axis=1,
+ )
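
The core of `transform` above is a per-row, per-class point-in-polygon count. The library delegates the geometry to shapely; purely as an illustration of the idea (none of the names below exist in hulearn), here is a ray-casting sketch of the same hit counting:

```python
def point_in_poly(x, y, poly):
    """Ray-casting point-in-polygon test; poly is a list of (x, y) tuples."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does a horizontal ray from (x, y) to the right cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# One (possibly empty) polygon per class, mirroring the drawn chart data.
polygons = {"suspect": [(20.0, 10.0), (60.0, 10.0), (20.0, 90.0)], "normal": []}


def count_hits(row, polygons):
    counts = {}
    for label, poly in polygons.items():
        # Degenerate drawings (fewer than 3 points) are skipped, mirroring
        # the `len(poly_data) >= 3` guard in the library.
        hit = len(poly) >= 3 and point_in_poly(row["x"], row["y"], poly)
        counts[label] = 1 if hit else 0
    return counts


count_hits({"x": 25.0, "y": 20.0}, polygons)  # {'suspect': 1, 'normal': 0}
```

Stacking these per-row count dictionaries into an array, one column per class, gives the same shape of output that `transform` returns.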
diff --git a/hulearn/this.py b/hulearn/this.py
index 17edfec..e361ca5 100644
--- a/hulearn/this.py
+++ b/hulearn/this.py
@@ -1,18 +1,18 @@
msg = """
-Computers are like calculators,
-they never really learn.
-Natural Intelligence is a pretty good idea,
-and Artifical Stupidity a valid concern.
-
Why worry about state of the art,
maybe it's time that we all admit.
That one-size-fits-all suit,
-usually does not fit.
+is not bespoke and usually does not fit.
+
+Computers can flow a lot of tensors,
+but they never really learn.
+Natural Intelligence is a pretty good idea,
+and Artificial Stupidity a valid concern.
There are many ways to solve a problem,
-but don't let it go to your head.
+but don't let fancy tools go to your head.
Why would you use a predefined solution,
-if you can customize one instead.
+if you can create a custom one instead.
"""
print(msg)
diff --git a/mkdocs.yml b/mkdocs.yml
index 60e8002..e23a2d4 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -13,6 +13,7 @@ nav:
- Human Preprocessing: guide/function-preprocess/function-preprocessing.md
- Drawing as a Model: guide/drawing-classifier/drawing.md
- Outliers and Comfort: guide/finding-outliers/outliers.md
+ - Drawing Features: guide/drawing-features/custom-features.md
- Api:
- Models:
- Classification: api/classification.md
diff --git a/tests/test_transformers/test_interactive_preprocessor.py b/tests/test_transformers/test_interactive_preprocessor.py
new file mode 100644
index 0000000..b532234
--- /dev/null
+++ b/tests/test_transformers/test_interactive_preprocessor.py
@@ -0,0 +1,111 @@
+import pytest
+
+from sklego.datasets import load_penguins
+from sklearn.pipeline import Pipeline, FeatureUnion
+from hulearn.preprocessing import InteractivePreprocessor, PipeTransformer
+
+from hulearn.common import flatten
+
+from tests.conftest import (
+ select_tests,
+ general_checks,
+ nonmeta_checks,
+)
+
+
+@pytest.mark.parametrize(
+ "test_fn",
+ select_tests(
+ include=flatten([general_checks, nonmeta_checks]),
+ exclude=[
+ "check_estimators_pickle",
+ "check_estimators_nan_inf",
+ "check_estimators_empty_data_messages",
+ "check_complex_data",
+ "check_dtype_object",
+ "check_estimators_dtypes",
+ "check_dict_unchanged",
+ "check_fit1d",
+ "check_methods_subset_invariance",
+ "check_fit2d_predict1d",
+ ],
+ ),
+)
+def test_estimator_checks(test_fn):
+ """
+ We're skipping a lot of tests here, mainly because this model is "bespoke";
+ it is *not* general, so many of scikit-learn's assumptions are broken.
+ """
+ clf = InteractivePreprocessor.from_json("tests/test_classification/demo-data.json")
+ test_fn(InteractivePreprocessor, clf)
+
+
+def test_base_predict_usecase():
+ clf = InteractivePreprocessor.from_json("tests/test_classification/demo-data.json")
+ df = load_penguins(as_frame=True).dropna()
+ X, y = df.drop(columns=["species"]), df["species"]
+
+ preds = clf.fit(X, y).transform(X)
+
+ assert preds.shape[0] == df.shape[0]
+ assert preds.shape[1] == 3
+
+
+def identity(x):
+ return x
+
+
+def test_grid_predict_usecase():
+ tfm = InteractivePreprocessor.from_json("tests/test_classification/demo-data.json")
+ pipe = Pipeline(
+ [
+ (
+ "features",
+ FeatureUnion(
+ [("original", PipeTransformer(identity)), ("new_feats", tfm)]
+ ),
+ ),
+ ]
+ )
+ df = load_penguins(as_frame=True).dropna()
+ X, y = df.drop(columns=["species", "island", "sex"]), df["species"]
+
+ preds = pipe.fit(X, y).transform(X)
+
+ assert preds.shape[0] == df.shape[0]
+ assert preds.shape[1] == X.shape[1] + 3
+
+
+def test_ignore_bad_data():
+ """
+ There might be some "bad data" drawn. For example, when you quickly hit double-click you might
+ draw a line instead of a poly. Bokeh is "okeh" with it, but our point-in-poly algorithm is not.
+ """
+ data = [
+ {
+ "chart_id": "9ec8e755-2",
+ "x": "bill_length_mm",
+ "y": "bill_depth_mm",
+ "polygons": {
+ "Adelie": {"bill_length_mm": [], "bill_depth_mm": []},
+ "Gentoo": {"bill_length_mm": [], "bill_depth_mm": []},
+ "Chinstrap": {"bill_length_mm": [], "bill_depth_mm": []},
+ },
+ },
+ {
+ "chart_id": "11640372-c",
+ "x": "flipper_length_mm",
+ "y": "body_mass_g",
+ "polygons": {
+ "Adelie": {
+ "flipper_length_mm": [[214.43261376806052, 256.2612913545137]],
+ "body_mass_g": [[3950.9482324534456, 3859.9137496948247]],
+ },
+ "Gentoo": {"flipper_length_mm": [], "body_mass_g": []},
+ "Chinstrap": {"flipper_length_mm": [], "body_mass_g": []},
+ },
+ },
+ ]
+
+ clf = InteractivePreprocessor(json_desc=data)
+ assert len(list(clf.poly_data)) == 0