
Commit

Preprocessor + Docs (#23)
* start

* fixed-colors

* milestone

* gogog

* grid-search-fail

* base-tests-work

* added-sklego-test-dep

* go-go-go

* added-shapely-dep

* bokeh-dep

* added-guide

* docs-ready

* added-extra-test

* fixed-final-test

* outliers

* added-outlier-func-tests

* moar-testing

* fix-test

* added-3.8-test

* added-3.9-test

* lol-3.9-no-exist-yet

* removed-deps-not-used

* docs-updated

* end-of-day

* cleanup

* added-preprocessor

* docs-added

* docs

* more-docs-and-tests

* this

* added-label
koaning authored Oct 1, 2020
1 parent 968b400 commit 07bc44a
Showing 17 changed files with 1,110 additions and 30 deletions.
10 changes: 8 additions & 2 deletions README.md
@@ -83,8 +83,14 @@ it an outlier. There's a threshold parameter for how strict you might want to be
This allows you to define a function that can handle preprocessing. It's
constructed in such a way that you can use the arguments of the function as a parameter
that you can benchmark in a grid-search. This is especially powerful in combination
with the pandas `.pipe` method. If you're unfamiliar with this amazing feature, you
may appreciate [this tutorial](https://calmcode.io/pandas-pipe/introduction.html).
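
A minimal sketch of how that can look, using the penguins dataset from `sklego` purely for illustration (the exact behaviour of `PipeTransformer` is described in the API docs):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklego.datasets import load_penguins

from hulearn.preprocessing import PipeTransformer


def preprocess(dataf, scale=True):
    # The keyword arguments of this function become tunable parameters.
    dataf = dataf.copy()
    if scale:
        dataf['flipper_length_mm'] = dataf['flipper_length_mm'] / 100
    return dataf[['flipper_length_mm', 'body_mass_g']]


df = load_penguins(as_frame=True).dropna()
X, y = df.drop(columns=['species']), df['species']

pipe = Pipeline([
    ('prep', PipeTransformer(preprocess, scale=True)),
    ('model', LogisticRegression(max_iter=1000)),
])

# The function argument `scale` is now a parameter you can benchmark.
grid = GridSearchCV(pipe, cv=3, param_grid={'prep__scale': [True, False]})
grid.fit(X, y)
```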

#### InteractivePreprocessor

This allows you to draw features that you'd like to add to your dataset or
your machine learning pipeline. You can use it via `tfm.fit(df).transform(df)` and
`df.pipe(tfm)`.
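
As a rough sketch of that flow (assuming you've already drawn some groups in an `InteractiveCharts` session named `charts`, as shown in the drawing-features guide):

```python
from hulearn.preprocessing import InteractivePreprocessor

# `charts.data()` holds the polygons that were drawn interactively.
tfm = InteractivePreprocessor(json_desc=charts.data())

tfm.fit(df).transform(df)     # scikit-learn style
df.pipe(tfm.pandas_pipe)      # pandas style
```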

### Datasets

Expand Down
2 changes: 2 additions & 0 deletions docs/api/preprocessing.md
@@ -1,3 +1,5 @@
# `from hulearn.preprocessing import *`

::: hulearn.preprocessing.pipetransformer

::: hulearn.preprocessing.interactivepreprocessor
68 changes: 58 additions & 10 deletions docs/examples/examples.md
@@ -1,27 +1,77 @@
This page contains a list of short examples that demonstrate the utility of
the tools in this package. The goal for each example is to be small and concise.

## Precision and Subgroups

It can be the case that for a subgroup of the population you do not need a model.
Suppose that we have a session-log dataset from "World of Warcraft". We know when
people logged in, whether they were part of a guild, and when they stopped playing. You
can create a machine learning model to predict which players are at risk of quitting
the game, but you might also be able to come up with some simple rules.

Here is one rule that might work out swell:

> "If any player was playing the video game at 24:00 on new-years eve,
odds are that this person is very invested in the game and won't stop playing."

This one rule will not cover the entire population, but for that subgroup it can be an
effective one.

![](wow.png)

As an illustrative example we'll implement this diagram as a `Classifier`.

```python
import numpy as np
from hulearn.outlier import InteractiveOutlierDetector
from hulearn.classification import FunctionClassifier, InteractiveClassifier


classifier = SomeScikitLearnModel()

def make_decision(dataf):
    # First we create a resulting array with all the predictions
    res = classifier.predict(dataf)

    # Override model prediction if a user is a heavy_user, no matter what
    res = np.where(dataf['heavy_user'], "stays", res)

    return res

fallback_model = FunctionClassifier(make_decision)
```
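
As a sketch of how this might be used downstream, assuming `df` holds the session-log features (including the `heavy_user` column) and `y` the churn labels, the fallback model behaves like any scikit-learn classifier:

```python
# Heavy users are always labelled "stays"; everyone else gets the
# underlying model's prediction.
fallback_model.fit(df, y)
preds = fallback_model.predict(df)
```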

## No Data No Problem

Let's say that we're interested in detecting fraud at a tax office. Even without
looking at the data we can already come up with some sensible rules.

- Any minor making over the median income is "suspicious".
- Any person who started more than 2 companies in a year is "suspicious".
- Any person who has more than 10 bank accounts is "suspicious".

The thing with these rules is that they are easy to explain, but they are not based
on data at all. In fact, the cases they describe may not occur in the data at all. This
means that a machine learning model may not have picked up the pattern that we're
interested in. Thankfully, the lack of data can be compensated for with business rules.

![](tree.png)
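
A rough sketch of how such rules could be encoded, using hypothetical column names (`age`, `income`, `companies_started`, `n_bank_accounts`) and a fixed threshold standing in for the median income:

```python
import numpy as np
from hulearn.classification import FunctionClassifier


def fraud_rules(dataf, income_threshold=40_000):
    # Each rule flags a case as "suspicious"; none of them needs training data.
    minor_high_income = (dataf['age'] < 18) & (dataf['income'] > income_threshold)
    many_companies = dataf['companies_started'] > 2
    many_accounts = dataf['n_bank_accounts'] > 10
    suspicious = minor_high_income | many_companies | many_accounts
    return np.where(suspicious, "suspicious", "not suspicious")


rule_model = FunctionClassifier(fraud_rules, income_threshold=40_000)
```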

## Comfort Zone

Models typically have a "comfort zone". If a new data point comes in that is
very different from what the model saw before, it should not be treated the same way.
You can also argue that points with a low `proba` score should not be
automated.

If you want to prevent predictions where the model is "unsure" then you
might want to follow this diagram;

![](../guide/finding-outliers/diagram.png)

You can construct such a system by creating a `FunctionClassifier` that
handles the logic you require. As an illustrative example we'll implement
this diagram as a `Classifier`.

```python
import numpy as np
@@ -55,5 +105,3 @@ For more information on why this tactic is helpful:

- [blogpost](https://koaning.io/posts/high-on-probability-low-on-certainty/)
- [pydata talk](https://www.youtube.com/watch?v=Z8MEFI7ZJlA)

9 changes: 9 additions & 0 deletions docs/examples/faq.md
@@ -2,6 +2,15 @@

Feel free to ask questions [here](https://github.com/koaning/human-learn/issues).

## What are the Lessons Learned?

If you're interested in some of the lessons the creators of this tool learned
while creating it, all you need to do is follow the Python tradition.

```python
from hulearn import this
```

## Why Make This?

Back in the old days, it was common to write rule-based systems. Systems that do;
Binary file added docs/examples/tree.png
Binary file added docs/examples/wow.png
57 changes: 57 additions & 0 deletions docs/guide/drawing-features/custom-features.md
@@ -0,0 +1,57 @@
So far we've explored drawing as a tool for models, but it can also be used
as a tool to generate features. To explore this, let's load in the penguins
dataset again.

```python
from sklego.datasets import load_penguins

df = load_penguins(as_frame=True).dropna()
```

## Drawing

We can draw over this dataset. It's like before, but with one crucial difference.

```python
from hulearn.experimental.interactive import InteractiveCharts

# Note that the `labels` argument here is a list, not a string! This
# tells the tool that we want to be able to add custom groups that are
# not defined by a column in the dataframe.
charts = InteractiveCharts(df, labels=['group_one', 'group_two'])
```

Let's make a custom drawing.

```python
charts.add_chart(x="flipper_length_mm", y="body_mass_g")
```

Let's assume the new drawing looks something like this.

![](drawing.png)

So far these drawn features have been used to construct models. But they
can also be used to help label data or generate extra features for machine
learning models.

## Features

This library makes it easy to add these features to scikit-learn
pipelines or to pandas. To get started, you'll want to import the
`InteractivePreprocessor`.

```python
from hulearn.preprocessing import InteractivePreprocessor
tfm = InteractivePreprocessor(json_desc=charts.data())
```

This `tfm` object can be used as a preprocessing step inside of
scikit-learn but it can also be used in a pandas pipeline.

```python
# The flow for scikit-learn
tfm.fit(df).transform(df)
# The flow for pandas
df.pipe(tfm.pandas_pipe)
```
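
As a rough follow-up (assuming the pandas route appends one feature column per drawn label, e.g. `group_one` and `group_two`), you can inspect what was added:

```python
# Compare the columns before and after applying the drawn features.
enriched = df.pipe(tfm.pandas_pipe)
print([c for c in enriched.columns if c not in df.columns])
```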
Binary file added docs/guide/drawing-features/drawing.png
