diff --git a/README.md b/README.md index ff59829..0ca9cca 100644 --- a/README.md +++ b/README.md @@ -83,8 +83,14 @@ it an outlier. There's a threshold parameter for how strict you might want to be This allows you to define a function that can make handle preprocessing. It's constructed in such a way that you can use the arguments of the function as a parameter that you can benchmark in a grid-search. This is especially powerful in combination -with the pandas `.pipe` method. If you're unfamiliar with this amazing feature, yo may appreciate -[this tutorial](https://calmcode.io/pandas-pipe/introduction.html). +with the pandas `.pipe` method. If you're unfamiliar with this amazing feature, you +may appreciate [this tutorial](https://calmcode.io/pandas-pipe/introduction.html). + +#### InteractivePreprocessor + +This allows you to draw features that you'd like to add to your dataset or +your machine learning pipeline. You can use it via `tfm.fit(df).transform(df)` and +`df.pipe(tfm)`. ### Datasets diff --git a/docs/api/preprocessing.md b/docs/api/preprocessing.md index 4d8b3e6..7f5767e 100644 --- a/docs/api/preprocessing.md +++ b/docs/api/preprocessing.md @@ -1,3 +1,5 @@ # `from hulearn.preprocessing import *` ::: hulearn.preprocessing.pipetransformer + +::: hulearn.preprocessing.interactivepreprocessor diff --git a/docs/examples/examples.md b/docs/examples/examples.md index 90fcac0..dac6935 100644 --- a/docs/examples/examples.md +++ b/docs/examples/examples.md @@ -1,27 +1,77 @@ This page contains a list of short examples that demonstrate the utility of the tools in this package. The goal for each example is to be small and consise. -This page is still under construction. +## Precision and Subgroups + +It can be the case that for a subgroup of the population you do not need a model. +Suppose that we have a session log dataset from "World of Warcraft". We know when +people logged in, if they were part of a guild and when they stopped playing. You +can create a machine learning model to predict which players are at risk of quitting +the game but you might also be able to come up with some simple rules. + +Here is one rule that might work out swell: + +> "If any player was playing the video game at 24:00 on new-years eve, +odds are that this person is very invested in the game and won't stop playing." + +This one rule will not cover the entire population but for the subgroup it can be an +effective rule. + +![](wow.png) + +As an illustrative example we'll implement this diagram as a `Classifier`. -## Insurance for Future Data +```python +import numpy as np +from hulearn.outlier import InteractiveOutlierDetector +from hulearn.classification import FunctionClassifier, InteractiveClassifier + + +classifier = SomeScikitLearnModel() + +def make_decision(dataf): + # First we create a resulting array with all the predictions + res = classifier.predict(dataf) + + # Override model prediction if a user is a heavy_user, no matter what + res = np.where(dataf['heavy_user'], "stays", res) + + return res + +fallback_model = FunctionClassifier(make_decision) +``` ## No Data No Problem -## Model Guarantees +Let's say that we're interested in detecting fraud at a tax office. Even without +looking at the data we can already come up with some sensible rules. -## Precision and Subgroups +- Any minor making over the median income is "suspicious". +- Any person who started more than 2 companies in a year is "suspicious". +- Any person who has more than 10 bank accounts is "suspicious". 
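+
+As a sketch, rules like these can be written up as a `FunctionClassifier`. Note
+that the column names and the income threshold below are made up for illustration.
+
+```python
+import numpy as np
+from hulearn.classification import FunctionClassifier
+
+MEDIAN_INCOME = 35_000  # made-up stand-in for the population median
+
+
+def tax_rules(dataf):
+    # Flag a row as "suspicious" if any of the three rules fire.
+    suspicious = (
+        ((dataf['age'] < 18) & (dataf['income'] > MEDIAN_INCOME))
+        | (dataf['companies_started'] > 2)
+        | (dataf['bank_accounts'] > 10)
+    )
+    return np.where(suspicious, "suspicious", "not suspicious")
+
+
+rules_model = FunctionClassifier(tax_rules)
+```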
-## Dealing with NA +The thing with these rules is that they are easy to explain but they are not based +on data at all. In fact, they may not occur in the data at all. This means that +a machine learning model may not have picked up this pattern that we're interested +in. Thankfully, the lack in data can be compensated with business rules. + +![](tree.png) ## Comfort Zone +Models typically have a "comfort zone". If a new data point comes in that is +very different from what the models saw before it should not be treated the same way. +You can also argue that points with low `proba` score should also not be +automated. + If you want to prevent predictions where the model is "unsure" then you -might want to construct a `FunctionClassifier` that handles the logic you -require. For example, we might draw a diagram like; +might want to follow this diagram; ![](../guide/finding-outliers/diagram.png) -As an illustrative example we'll implement a diagram like above as a `Classifier`. +You can construct such a system by creating a `FunctionClassifier` that +handles the logic you require. As an illustrative example we'll implement +this diagram as a `Classifier`. ```python import numpy as np @@ -55,5 +105,3 @@ For more information on why this tactic is helpful: - [blogpost](https://koaning.io/posts/high-on-probability-low-on-certainty/) - [pydata talk](https://www.youtube.com/watch?v=Z8MEFI7ZJlA) - -## Risk Class Translation diff --git a/docs/examples/faq.md b/docs/examples/faq.md index 4c9aee4..4255fa3 100644 --- a/docs/examples/faq.md +++ b/docs/examples/faq.md @@ -2,6 +2,15 @@ Feel free to ask questions [here](https://github.com/koaning/human-learn/issues). +## What are the Lessons Learned + +If you're interested in some of the lessons the creators of this tool learned +while creating it, all you need to do is follow the python tradition. + +```python +from hulearn import this +``` + ## Why Make This? Back in the old days, it was common to write rule-based systems. Systems that do; diff --git a/docs/examples/tree.png b/docs/examples/tree.png new file mode 100644 index 0000000..5a8c8b0 Binary files /dev/null and b/docs/examples/tree.png differ diff --git a/docs/examples/wow.png b/docs/examples/wow.png new file mode 100644 index 0000000..7705a59 Binary files /dev/null and b/docs/examples/wow.png differ diff --git a/docs/guide/drawing-features/custom-features.md b/docs/guide/drawing-features/custom-features.md new file mode 100644 index 0000000..7832dce --- /dev/null +++ b/docs/guide/drawing-features/custom-features.md @@ -0,0 +1,57 @@ +Sofar we've explored drawing as a tool for models, but it can also be used +as a tool to generate features. To explore this, let's load in the penguins +dataset again. + +```python +from sklego.datasets import load_penguins + +df = load_penguins(as_frame=True).dropna() +``` + +## Drawing + +We can draw over this dataset. It's like before but with one crucial differenc + +```python +from hulearn.experimental.interactive import InteractiveCharts + +# Note that the `labels` arugment here is a list, not a string! This +# tells the tool that we want to be able to add custom groups that are +# not defined by a column in the dataframe. +charts = InteractiveCharts(df, labels=['group_one', 'group_two']) +``` + +Let's make a custom drawing. + +```python +charts.add_chart(x="flipper_length_mm", y="body_mass_g") +``` + +Let's assume the new drawing looks something like this. + +![](drawing.png) + +Sofar these drawn features have been used to construct models. 
But they +can also be used to help label data or generate extra features for machine +learning models. + +## Features + +This library makes it easy to add these features to scikit-learn +pipelines or to pandas. To get started, you'll want to import the +`InteractivePreprocessor`. + +```python +from hulearn.preprocessing import InteractivePreprocessor +tfm = InteractivePreprocessor(json_desc=charts.data()) +``` + +This `tfm` object is can be used as a preprocessing step inside of +scikit-learn but it can also be used in a pandas pipeline. + +```python +# The flow for scikit-learn +tfm.fit(df).transform(df) +# The flow for pandas +df.pipe(tfm.pandas_pipe) +``` diff --git a/docs/guide/drawing-features/drawing.png b/docs/guide/drawing-features/drawing.png new file mode 100644 index 0000000..072cef3 Binary files /dev/null and b/docs/guide/drawing-features/drawing.png differ diff --git a/docs/guide/notebooks/05-custom-features.ipynb b/docs/guide/notebooks/05-custom-features.ipynb new file mode 100644 index 0000000..fa038fb --- /dev/null +++ b/docs/guide/notebooks/05-custom-features.ipynb @@ -0,0 +1,674 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext autoreload\n", + "%autoreload 2" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsex
0AdelieTorgersen39.118.7181.03750.0male
1AdelieTorgersen39.517.4186.03800.0female
2AdelieTorgersen40.318.0195.03250.0female
4AdelieTorgersen36.719.3193.03450.0female
5AdelieTorgersen39.320.6190.03650.0male
\n", + "
" + ], + "text/plain": [ + " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", + "0 Adelie Torgersen 39.1 18.7 181.0 \n", + "1 Adelie Torgersen 39.5 17.4 186.0 \n", + "2 Adelie Torgersen 40.3 18.0 195.0 \n", + "4 Adelie Torgersen 36.7 19.3 193.0 \n", + "5 Adelie Torgersen 39.3 20.6 190.0 \n", + "\n", + " body_mass_g sex \n", + "0 3750.0 male \n", + "1 3800.0 female \n", + "2 3250.0 female \n", + "4 3450.0 female \n", + "5 3650.0 male " + ] + }, + "execution_count": 70, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "import numpy as np\n", + "import pandas as pd\n", + "from sklego.datasets import load_penguins\n", + "\n", + "df = load_penguins(as_frame=True).dropna()\n", + "df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [], + "source": [ + "from hulearn.experimental.interactive import InteractiveCharts" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "\n", + "
\n", + " \n", + " Loading BokehJS ...\n", + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/javascript": [ + "\n", + "(function(root) {\n", + " function now() {\n", + " return new Date();\n", + " }\n", + "\n", + " var force = true;\n", + "\n", + " if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n", + " root._bokeh_onload_callbacks = [];\n", + " root._bokeh_is_loading = undefined;\n", + " }\n", + "\n", + " var JS_MIME_TYPE = 'application/javascript';\n", + " var HTML_MIME_TYPE = 'text/html';\n", + " var EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n", + " var CLASS_NAME = 'output_bokeh rendered_html';\n", + "\n", + " /**\n", + " * Render data to the DOM node\n", + " */\n", + " function render(props, node) {\n", + " var script = document.createElement(\"script\");\n", + " node.appendChild(script);\n", + " }\n", + "\n", + " /**\n", + " * Handle when an output is cleared or removed\n", + " */\n", + " function handleClearOutput(event, handle) {\n", + " var cell = handle.cell;\n", + "\n", + " var id = cell.output_area._bokeh_element_id;\n", + " var server_id = cell.output_area._bokeh_server_id;\n", + " // Clean up Bokeh references\n", + " if (id != null && id in Bokeh.index) {\n", + " Bokeh.index[id].model.document.clear();\n", + " delete Bokeh.index[id];\n", + " }\n", + "\n", + " if (server_id !== undefined) {\n", + " // Clean up Bokeh references\n", + " var cmd = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n", + " cell.notebook.kernel.execute(cmd, {\n", + " iopub: {\n", + " output: function(msg) {\n", + " var id = msg.content.text.trim();\n", + " if (id in Bokeh.index) {\n", + " Bokeh.index[id].model.document.clear();\n", + " delete Bokeh.index[id];\n", + " }\n", + " }\n", + " }\n", + " });\n", + " // Destroy server and session\n", + " var cmd = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n", + " cell.notebook.kernel.execute(cmd);\n", + " }\n", + " }\n", + "\n", + " /**\n", + " * Handle when a new output is added\n", + " */\n", + " function handleAddOutput(event, handle) {\n", + " var output_area = handle.output_area;\n", + " var output = handle.output;\n", + "\n", + " // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n", + " if ((output.output_type != \"display_data\") || (!output.data.hasOwnProperty(EXEC_MIME_TYPE))) {\n", + " return\n", + " }\n", + "\n", + " var toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n", + "\n", + " if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n", + " toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n", + " // store reference to embed id on output_area\n", + " output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n", + " }\n", + " if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n", + " var bk_div = document.createElement(\"div\");\n", + " bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n", + " var script_attrs = bk_div.children[0].attributes;\n", + " for (var i = 0; i < script_attrs.length; i++) {\n", + " toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n", + " toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n", + " }\n", + " // store reference to server id on output_area\n", + " output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n", + " }\n", 
+ " }\n", + "\n", + " function register_renderer(events, OutputArea) {\n", + "\n", + " function append_mime(data, metadata, element) {\n", + " // create a DOM node to render to\n", + " var toinsert = this.create_output_subarea(\n", + " metadata,\n", + " CLASS_NAME,\n", + " EXEC_MIME_TYPE\n", + " );\n", + " this.keyboard_manager.register_events(toinsert);\n", + " // Render to node\n", + " var props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n", + " render(props, toinsert[toinsert.length - 1]);\n", + " element.append(toinsert);\n", + " return toinsert\n", + " }\n", + "\n", + " /* Handle when an output is cleared or removed */\n", + " events.on('clear_output.CodeCell', handleClearOutput);\n", + " events.on('delete.Cell', handleClearOutput);\n", + "\n", + " /* Handle when a new output is added */\n", + " events.on('output_added.OutputArea', handleAddOutput);\n", + "\n", + " /**\n", + " * Register the mime type and append_mime function with output_area\n", + " */\n", + " OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n", + " /* Is output safe? */\n", + " safe: true,\n", + " /* Index of renderer in `output_area.display_order` */\n", + " index: 0\n", + " });\n", + " }\n", + "\n", + " // register the mime type if in Jupyter Notebook environment and previously unregistered\n", + " if (root.Jupyter !== undefined) {\n", + " var events = require('base/js/events');\n", + " var OutputArea = require('notebook/js/outputarea').OutputArea;\n", + "\n", + " if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n", + " register_renderer(events, OutputArea);\n", + " }\n", + " }\n", + "\n", + " \n", + " if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n", + " root._bokeh_timeout = Date.now() + 5000;\n", + " root._bokeh_failed_load = false;\n", + " }\n", + "\n", + " var NB_LOAD_WARNING = {'data': {'text/html':\n", + " \"
\\n\"+\n", + " \"

\\n\"+\n", + " \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n", + " \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n", + " \"

\\n\"+\n", + " \"\\n\"+\n", + " \"\\n\"+\n", + " \"from bokeh.resources import INLINE\\n\"+\n", + " \"output_notebook(resources=INLINE)\\n\"+\n", + " \"\\n\"+\n", + " \"
\"}};\n", + "\n", + " function display_loaded() {\n", + " var el = document.getElementById(\"1788\");\n", + " if (el != null) {\n", + " el.textContent = \"BokehJS is loading...\";\n", + " }\n", + " if (root.Bokeh !== undefined) {\n", + " if (el != null) {\n", + " el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n", + " }\n", + " } else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(display_loaded, 100)\n", + " }\n", + " }\n", + "\n", + "\n", + " function run_callbacks() {\n", + " try {\n", + " root._bokeh_onload_callbacks.forEach(function(callback) {\n", + " if (callback != null)\n", + " callback();\n", + " });\n", + " } finally {\n", + " delete root._bokeh_onload_callbacks\n", + " }\n", + " console.debug(\"Bokeh: all callbacks have finished\");\n", + " }\n", + "\n", + " function load_libs(css_urls, js_urls, callback) {\n", + " if (css_urls == null) css_urls = [];\n", + " if (js_urls == null) js_urls = [];\n", + "\n", + " root._bokeh_onload_callbacks.push(callback);\n", + " if (root._bokeh_is_loading > 0) {\n", + " console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n", + " return null;\n", + " }\n", + " if (js_urls == null || js_urls.length === 0) {\n", + " run_callbacks();\n", + " return null;\n", + " }\n", + " console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n", + " root._bokeh_is_loading = css_urls.length + js_urls.length;\n", + "\n", + " function on_load() {\n", + " root._bokeh_is_loading--;\n", + " if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n", + " run_callbacks()\n", + " }\n", + " }\n", + "\n", + " function on_error() {\n", + " console.error(\"failed to load \" + url);\n", + " }\n", + "\n", + " for (var i = 0; i < css_urls.length; i++) {\n", + " var url = css_urls[i];\n", + " const element = document.createElement(\"link\");\n", + " element.onload = on_load;\n", + " element.onerror = on_error;\n", + " element.rel = \"stylesheet\";\n", + " element.type = \"text/css\";\n", + " element.href = url;\n", + " console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n", + " document.body.appendChild(element);\n", + " }\n", + "\n", + " const hashes = {\"https://cdn.bokeh.org/bokeh/release/bokeh-2.2.1.min.js\": \"qkRvDQVAIfzsJo40iRBbxt6sttt0hv4lh74DG7OK4MCHv4C5oohXYoHUM5W11uqS\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.2.1.min.js\": \"Sb7Mr06a9TNlet/GEBeKaf5xH3eb6AlCzwjtU82wNPyDrnfoiVl26qnvlKjmcAd+\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.2.1.min.js\": \"HaJ15vgfmcfRtB4c4YBOI4f1MUujukqInOWVqZJZZGK7Q+ivud0OKGSTn/Vm2iso\"};\n", + "\n", + " for (var i = 0; i < js_urls.length; i++) {\n", + " var url = js_urls[i];\n", + " var element = document.createElement('script');\n", + " element.onload = on_load;\n", + " element.onerror = on_error;\n", + " element.async = false;\n", + " element.src = url;\n", + " if (url in hashes) {\n", + " element.crossOrigin = \"anonymous\";\n", + " element.integrity = \"sha384-\" + hashes[url];\n", + " }\n", + " console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n", + " document.head.appendChild(element);\n", + " }\n", + " };\n", + "\n", + " function inject_raw_css(css) {\n", + " const element = document.createElement(\"style\");\n", + " element.appendChild(document.createTextNode(css));\n", + " document.body.appendChild(element);\n", + " }\n", + "\n", + " \n", + " var js_urls = 
[\"https://cdn.bokeh.org/bokeh/release/bokeh-2.2.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.2.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.2.1.min.js\"];\n", + " var css_urls = [];\n", + " \n", + "\n", + " var inline_js = [\n", + " function(Bokeh) {\n", + " Bokeh.set_log_level(\"info\");\n", + " },\n", + " function(Bokeh) {\n", + " \n", + " \n", + " }\n", + " ];\n", + "\n", + " function run_inline_js() {\n", + " \n", + " if (root.Bokeh !== undefined || force === true) {\n", + " \n", + " for (var i = 0; i < inline_js.length; i++) {\n", + " inline_js[i].call(root, root.Bokeh);\n", + " }\n", + " if (force === true) {\n", + " display_loaded();\n", + " }} else if (Date.now() < root._bokeh_timeout) {\n", + " setTimeout(run_inline_js, 100);\n", + " } else if (!root._bokeh_failed_load) {\n", + " console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n", + " root._bokeh_failed_load = true;\n", + " } else if (force !== true) {\n", + " var cell = $(document.getElementById(\"1788\")).parents('.cell').data().cell;\n", + " cell.output_area.append_execute_result(NB_LOAD_WARNING)\n", + " }\n", + "\n", + " }\n", + "\n", + " if (root._bokeh_is_loading === 0) {\n", + " console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n", + " run_inline_js();\n", + " } else {\n", + " load_libs(css_urls, js_urls, function() {\n", + " console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n", + " run_inline_js();\n", + " });\n", + " }\n", + "}(window));" + ], + "application/vnd.bokehjs_load.v0+json": "\n(function(root) {\n function now() {\n return new Date();\n }\n\n var force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n \n\n \n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n var NB_LOAD_WARNING = {'data': {'text/html':\n \"
\\n\"+\n \"

\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"

\\n\"+\n \"\\n\"+\n \"\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"\\n\"+\n \"
\"}};\n\n function display_loaded() {\n var el = document.getElementById(\"1788\");\n if (el != null) {\n el.textContent = \"BokehJS is loading...\";\n }\n if (root.Bokeh !== undefined) {\n if (el != null) {\n el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(display_loaded, 100)\n }\n }\n\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error() {\n console.error(\"failed to load \" + url);\n }\n\n for (var i = 0; i < css_urls.length; i++) {\n var url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error;\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n const hashes = {\"https://cdn.bokeh.org/bokeh/release/bokeh-2.2.1.min.js\": \"qkRvDQVAIfzsJo40iRBbxt6sttt0hv4lh74DG7OK4MCHv4C5oohXYoHUM5W11uqS\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.2.1.min.js\": \"Sb7Mr06a9TNlet/GEBeKaf5xH3eb6AlCzwjtU82wNPyDrnfoiVl26qnvlKjmcAd+\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.2.1.min.js\": \"HaJ15vgfmcfRtB4c4YBOI4f1MUujukqInOWVqZJZZGK7Q+ivud0OKGSTn/Vm2iso\"};\n\n for (var i = 0; i < js_urls.length; i++) {\n var url = js_urls[i];\n var element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error;\n element.async = false;\n element.src = url;\n if (url in hashes) {\n element.crossOrigin = \"anonymous\";\n element.integrity = \"sha384-\" + hashes[url];\n }\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n \n var js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-2.2.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-2.2.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-2.2.1.min.js\"];\n var css_urls = [];\n \n\n var inline_js = [\n function(Bokeh) {\n Bokeh.set_log_level(\"info\");\n },\n function(Bokeh) {\n \n \n }\n ];\n\n function run_inline_js() {\n \n if (root.Bokeh !== undefined || force === true) {\n \n for (var i = 0; i < inline_js.length; i++) {\n inline_js[i].call(root, root.Bokeh);\n }\n if (force === true) {\n display_loaded();\n }} else if (Date.now() < root._bokeh_timeout) {\n 
setTimeout(run_inline_js, 100);\n } else if (!root._bokeh_failed_load) {\n console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n root._bokeh_failed_load = true;\n } else if (force !== true) {\n var cell = $(document.getElementById(\"1788\")).parents('.cell').data().cell;\n cell.output_area.append_execute_result(NB_LOAD_WARNING)\n }\n\n }\n\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n run_inline_js();\n } else {\n load_libs(css_urls, js_urls, function() {\n console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n run_inline_js();\n });\n }\n}(window));" + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "clf = InteractiveCharts(df, labels=['group_one', 'group_two'])" + ] + }, + { + "cell_type": "code", + "execution_count": 74, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.bokehjs_exec.v0+json": "", + "text/html": [ + "\n", + "" + ] + }, + "metadata": { + "application/vnd.bokehjs_exec.v0+json": { + "server_id": "3ddc2b24612c4d0a9d900f17c973459a" + } + }, + "output_type": "display_data" + } + ], + "source": [ + "clf.add_chart(x=\"flipper_length_mm\", y=\"body_mass_g\")" + ] + }, + { + "cell_type": "code", + "execution_count": 75, + "metadata": {}, + "outputs": [], + "source": [ + "from hulearn.preprocessing import InteractivePreprocessor" + ] + }, + { + "cell_type": "code", + "execution_count": 76, + "metadata": {}, + "outputs": [], + "source": [ + "tfm = InteractivePreprocessor(json_desc=clf.data())" + ] + }, + { + "cell_type": "code", + "execution_count": 78, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
speciesislandbill_length_mmbill_depth_mmflipper_length_mmbody_mass_gsexgroup_onegroup_two
0AdelieTorgersen39.118.7181.03750.0male00
1AdelieTorgersen39.517.4186.03800.0female00
2AdelieTorgersen40.318.0195.03250.0female00
3AdelieTorgersen36.719.3193.03450.0female00
4AdelieTorgersen39.320.6190.03650.0male00
\n", + "
" + ], + "text/plain": [ + " species island bill_length_mm bill_depth_mm flipper_length_mm \\\n", + "0 Adelie Torgersen 39.1 18.7 181.0 \n", + "1 Adelie Torgersen 39.5 17.4 186.0 \n", + "2 Adelie Torgersen 40.3 18.0 195.0 \n", + "3 Adelie Torgersen 36.7 19.3 193.0 \n", + "4 Adelie Torgersen 39.3 20.6 190.0 \n", + "\n", + " body_mass_g sex group_one group_two \n", + "0 3750.0 male 0 0 \n", + "1 3800.0 female 0 0 \n", + "2 3250.0 female 0 0 \n", + "3 3450.0 female 0 0 \n", + "4 3650.0 male 0 0 " + ] + }, + "execution_count": 78, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "df.pipe(tfm.pandas_pipe).head()" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} diff --git a/docs/index.md b/docs/index.md index e78eb72..01fe444 100644 --- a/docs/index.md +++ b/docs/index.md @@ -31,11 +31,22 @@ You can install this tool via `pip`. python -m pip install human-learn ``` +## Guides + +To help you get started we've written some helpful getting started guides. + +1. [Functions as a Model](guide/function-classifier/function-classifier.html) +2. [Human Preprocessing](guide/function-preprocess/function-preprocessing.html) +3. [Drawing as a Model](guide/drawing-classifier/drawing.html) +4. [Outliers and Comfort](guide/finding-outliers/outliers.html) +5. [Drawing Features](guide/function-classifier/function-classifier.html) + +You can also check out the API documentation [here](api/classification.html). + ## Features This library hosts a couple of models that you can play with. - ### Classification Models #### FunctionClassifier @@ -79,8 +90,14 @@ it an outlier. There's a threshold parameter for how strict you might want to be This allows you to define a function that can make handle preprocessing. It's constructed in such a way that you can use the arguments of the function as a parameter that you can benchmark in a grid-search. This is especially powerful in combination -with the pandas `.pipe` method. If you're unfamiliar with this amazing feature, yo may appreciate -[this tutorial](https://calmcode.io/pandas-pipe/introduction.html). +with the pandas `.pipe` method. If you're unfamiliar with this amazing feature, you +may appreciate [this tutorial](https://calmcode.io/pandas-pipe/introduction.html). + +#### InteractivePreprocessor + +This allows you to draw features that you'd like to add to your dataset or +your machine learning pipeline. You can use it via `tfm.fit(df).transform(df)` and +`df.pipe(tfm)`. 
### Datasets diff --git a/hulearn/__init__.py b/hulearn/__init__.py index 3dc1f76..485f44a 100644 --- a/hulearn/__init__.py +++ b/hulearn/__init__.py @@ -1 +1 @@ -__version__ = "0.1.0" +__version__ = "0.1.1" diff --git a/hulearn/experimental/interactive.py b/hulearn/experimental/interactive.py index ee4fad2..edff2d9 100644 --- a/hulearn/experimental/interactive.py +++ b/hulearn/experimental/interactive.py @@ -6,7 +6,7 @@ from bokeh.plotting import figure, show from bokeh.models import PolyDrawTool, PolyEditTool from bokeh.layouts import row -from bokeh.models import Label +from bokeh.models.annotations import Label from bokeh.models.widgets import Div from bokeh.io import output_notebook @@ -37,7 +37,7 @@ def __init__(self, dataf, labels): self.labels = labels self.charts = [] - def add_chart(self, x, y): + def add_chart(self, x, y, size=5, alpha=0.5): """ Generate an interactive chart to a cell. @@ -71,7 +71,12 @@ def add_chart(self, x, y): ``` """ chart = SingleInteractiveChart( - dataf=self.dataf.copy(), labels=self.labels, x=x, y=y + dataf=self.dataf.copy(), + labels=self.labels, + x=x, + y=y, + size=size, + alpha=alpha, ) self.charts.append(chart) chart.show() @@ -84,7 +89,7 @@ def to_json(self, path): class SingleInteractiveChart: - def __init__(self, dataf, labels, x, y): + def __init__(self, dataf, labels, x, y, size=5, alpha=0.5): self.uuid = str(uuid.uuid4())[:10] self.x = x self.y = y @@ -103,7 +108,9 @@ def __init__(self, dataf, labels, x, y): if len(self.labels) > 5: raise ValueError("We currently only allow for 5 classes max.") - self.plot.circle(x=x, y=y, color="color", source=self.source) + self.plot.circle( + x=x, y=y, color="color", source=self.source, size=size, alpha=alpha + ) # Create all the tools for drawing self.poly_patches = {} diff --git a/hulearn/preprocessing/__init__.py b/hulearn/preprocessing/__init__.py index 1956717..cee21c1 100644 --- a/hulearn/preprocessing/__init__.py +++ b/hulearn/preprocessing/__init__.py @@ -1,3 +1,4 @@ from hulearn.preprocessing.pipetransformer import PipeTransformer +from hulearn.preprocessing.interactivepreprocessor import InteractivePreprocessor -__all__ = ["PipeTransformer"] +__all__ = ["PipeTransformer", "InteractivePreprocessor"] diff --git a/hulearn/preprocessing/interactivepreprocessor.py b/hulearn/preprocessing/interactivepreprocessor.py new file mode 100644 index 0000000..3004d21 --- /dev/null +++ b/hulearn/preprocessing/interactivepreprocessor.py @@ -0,0 +1,147 @@ +import json +import pathlib + +import numpy as np +import pandas as pd +from shapely.geometry import Point +from shapely.geometry.polygon import Polygon + +from sklearn.base import BaseEstimator +from sklearn.utils.validation import check_is_fitted + + +class InteractivePreprocessor(BaseEstimator): + """ + This tool allows you to take a drawn model and use it as a featurizer. + + Arguments: + json_desc: chart da ta in dictionary form + refit: if `True`, you no longer need to call `.fit(X, y)` in order to `.predict(X)` + """ + + def __init__(self, json_desc, refit=True): + self.json_desc = json_desc + self.refit = refit + + @classmethod + def from_json(cls, path, refit=True): + """ + Load the classifier from json stored on disk. 
+ + Arguments: + path: path of the json file + refit: if `True`, you no longer need to call `.fit(X, y)` in order to `.predict(X)` + + Usage: + + ```python + from hulearn.classification import InteractivePreprocessor + + InteractivePreprocessor.from_json("path/to/file.json") + ``` + """ + json_desc = json.loads(pathlib.Path(path).read_text()) + return InteractivePreprocessor(json_desc=json_desc, refit=refit) + + @property + def poly_data(self): + for chart in self.json_desc: + chard_id = chart["chart_id"] + labels = chart["polygons"].keys() + coords = chart["polygons"].values() + for lab, p in zip(labels, coords): + x_lab, y_lab = p.keys() + x_coords, y_coords = list(p.values()) + for i in range(len(x_coords)): + poly_data = list(zip(x_coords[i], y_coords[i])) + if len(poly_data) >= 3: + poly = Polygon(poly_data) + yield { + "x_lab": x_lab, + "y_lab": y_lab, + "poly": poly, + "label": lab, + "chart_id": chard_id, + } + + def _count_hits(self, clf_data, data_in): + counts = {k: 0 for k in self.classes_} + for c in clf_data: + point = Point(data_in[c["x_lab"]], data_in[c["y_lab"]]) + if c["poly"].contains(point): + counts[c["label"]] += 1 + return counts + + def fit(self, X, y=None): + """ + Fit the classifier. Bit of a formality, it's not doing anything specifically. + """ + self.classes_ = list(self.json_desc[0]["polygons"].keys()) + self.fitted_ = True + return self + + def transform(self, X): + """ + Apply the counting/binning based on the drawings. + + Usage: + + ```python + from hulearn.preprocessing import InteractivePreprocessor + clf = InteractivePreprocessor(clf_data) + X, y = load_data(...) + + # This doesn't do anything. But scikit-learn demands it. + clf.fit(X, y) + + # This makes predictions, based on your drawn model. + clf.transform(X) + ``` + """ + # Because we're not doing anything during training, for convenience this + # method can formally "fit" during the predict call. This is a scikit-learn + # anti-pattern so we allow you to turn this off. + if self.refit: + if not self.fitted_: + self.fit(X) + check_is_fitted(self, ["classes_", "fitted_"]) + if isinstance(X, pd.DataFrame): + hits = [ + self._count_hits(self.poly_data, x[1].to_dict()) for x in X.iterrows() + ] + else: + hits = [ + self._count_hits(self.poly_data, {k: v for k, v in enumerate(x)}) + for x in X + ] + count_arr = np.array([[h[c] for c in self.classes_] for h in hits]) + return count_arr + + def pandas_pipe(self, dataf): + """ + Use this transformer as part of a `.pipe()` method chain in pandas. + + Usage: + + ```python + import numpy as np + import pandas as pd + + # Load in a dataframe from somewhere + df = load_data(...) + + # Load in drawn chart data + from hulearn.preprocessing import InteractivePreprocessor + tfm = InteractivePreprocessor.from_json("path/file.json") + + # This adds new columns to the dataframe + df.pipe(pandas_pipe) + ``` + """ + new_dataf = pd.DataFrame( + self.fit(dataf).transform(dataf), columns=self.classes_ + ) + return pd.concat( + [dataf.copy().reset_index(drop=True), new_dataf.reset_index(drop=True)], + axis=1, + ) diff --git a/hulearn/this.py b/hulearn/this.py index 17edfec..e361ca5 100644 --- a/hulearn/this.py +++ b/hulearn/this.py @@ -1,18 +1,18 @@ msg = """ -Computers are like calculators, -they never really learn. -Natural Intelligence is a pretty good idea, -and Artifical Stupidity a valid concern. - Why worry about state of the art, maybe it's time that we all admit. That one-size-fits-all suit, -usually does not fit. +is not bespoke and usually does not fit. 
+ +Computers can flow a lot of tensors, +but they never really learn. +Natural Intelligence is a pretty good idea, +and Artifical Stupidity a valid concern. There are many ways to solve a problem, -but don't let it go to your head. +but don't let fancy tools go to your head. Why would you use a predefined solution, -if you can customize one instead. +if you can create a custom one instead. """ print(msg) diff --git a/mkdocs.yml b/mkdocs.yml index 60e8002..e23a2d4 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -13,6 +13,7 @@ nav: - Human Preprocessing: guide/function-preprocess/function-preprocessing.md - Drawing as a Model: guide/drawing-classifier/drawing.md - Outliers and Comfort: guide/finding-outliers/outliers.md + - Drawing Features: guide/drawing-features/custom-features.md - Api: - Models: - Classification: api/classification.md diff --git a/tests/test_transformers/test_interactive_preprocessor.py b/tests/test_transformers/test_interactive_preprocessor.py new file mode 100644 index 0000000..b532234 --- /dev/null +++ b/tests/test_transformers/test_interactive_preprocessor.py @@ -0,0 +1,111 @@ +import pytest + +from sklego.datasets import load_penguins +from sklearn.pipeline import Pipeline, FeatureUnion +from hulearn.preprocessing import InteractivePreprocessor, PipeTransformer + +from hulearn.common import flatten + +from tests.conftest import ( + select_tests, + general_checks, + nonmeta_checks, +) + + +@pytest.mark.parametrize( + "test_fn", + select_tests( + include=flatten([general_checks, nonmeta_checks]), + exclude=[ + "check_estimators_pickle", + "check_estimators_nan_inf", + "check_estimators_empty_data_messages", + "check_complex_data", + "check_dtype_object", + "check_estimators_dtypes", + "check_dict_unchanged", + "check_fit1d", + "check_methods_subset_invariance", + "check_fit2d_predict1d", + ], + ), +) +def test_estimator_checks(test_fn): + """ + We're skipping a lot of tests here mainly because this model is "bespoke" + it is *not* general. Therefore a lot of assumptions are broken. + """ + clf = InteractivePreprocessor.from_json("tests/test_classification/demo-data.json") + test_fn(InteractivePreprocessor, clf) + + +def test_base_predict_usecase(): + clf = InteractivePreprocessor.from_json("tests/test_classification/demo-data.json") + df = load_penguins(as_frame=True).dropna() + X, y = df.drop(columns=["species"]), df["species"] + + preds = clf.fit(X, y).transform(X) + + assert preds.shape[0] == df.shape[0] + assert preds.shape[1] == 3 + + +def identity(x): + return x + + +def test_grid_predict_usecase(): + tfm = InteractivePreprocessor.from_json("tests/test_classification/demo-data.json") + pipe = Pipeline( + [ + ( + "features", + FeatureUnion( + [("original", PipeTransformer(identity)), ("new_feats", tfm)] + ), + ), + ] + ) + df = load_penguins(as_frame=True).dropna() + X, y = df.drop(columns=["species", "island", "sex"]), df["species"] + + preds = pipe.fit(X, y).transform(X) + + assert preds.shape[0] == df.shape[0] + assert preds.shape[1] == X.shape[1] + 3 + + +def test_ignore_bad_data(): + """ + There might be some "bad data" drawn. For example, when you quickly hit double-click you might + draw a line instead of a poly. Bokeh is "okeh" with it, but our point-in-poly algorithm is not. 
+ """ + data = [ + { + "chart_id": "9ec8e755-2", + "x": "bill_length_mm", + "y": "bill_depth_mm", + "polygons": { + "Adelie": {"bill_length_mm": [], "bill_depth_mm": []}, + "Gentoo": {"bill_length_mm": [], "bill_depth_mm": []}, + "Chinstrap": {"bill_length_mm": [], "bill_depth_mm": []}, + }, + }, + { + "chart_id": "11640372-c", + "x": "flipper_length_mm", + "y": "body_mass_g", + "polygons": { + "Adelie": { + "flipper_length_mm": [[214.43261376806052, 256.2612913545137]], + "body_mass_g": [[3950.9482324534456, 3859.9137496948247]], + }, + "Gentoo": {"flipper_length_mm": [], "body_mass_g": []}, + "Chinstrap": {"flipper_length_mm": [], "body_mass_g": []}, + }, + }, + ] + + clf = InteractivePreprocessor(json_desc=data) + assert len(list(clf.poly_data)) == 0
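+
+
+def test_keeps_valid_polygons():
+    """
+    Counterpart sketch to the test above: a drawn polygon with at least three
+    points should be picked up by `poly_data`. The chart id, label and
+    coordinates here are made up for illustration.
+    """
+    data = [
+        {
+            "chart_id": "made-up-1",
+            "x": "flipper_length_mm",
+            "y": "body_mass_g",
+            "polygons": {
+                "Adelie": {
+                    "flipper_length_mm": [[180.0, 200.0, 220.0]],
+                    "body_mass_g": [[3000.0, 5000.0, 3500.0]],
+                },
+            },
+        },
+    ]
+
+    clf = InteractivePreprocessor(json_desc=data)
+    assert len(list(clf.poly_data)) == 1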