
Commit

Preprocessor + Docs (#23)
* start

* fixed-colors

* milestone

* gogog

* grid-search-fail

* base-tests-work

* added-sklego-test-dep

* go-go-go

* added-shapely-dep

* bokeh-dep

* added-guide

* docs-ready

* added-extra-test

* fixed-final-test

* outliers

* added-outlier-func-tests

* moar-testing

* fix-test

* added-3.8-test

* added-3.9-test

* lol-3.9-no-exist-yet

* removed-deps-not-used

* docs-updated

* end-of-day

* cleanup

* added-preprocessor

* docs-added

* docs

* more-docs-and-tests

* this

* added-label
koaning authored Oct 1, 2020
1 parent 968b400 commit 07bc44a
Showing 17 changed files with 1,110 additions and 30 deletions.
10 changes: 8 additions & 2 deletions README.md
@@ -83,8 +83,14 @@ it an outlier. There's a threshold parameter for how strict you might want to be
This allows you to define a function that can handle preprocessing. It's
constructed in such a way that you can use the arguments of the function as a parameter
that you can benchmark in a grid-search. This is especially powerful in combination
with the pandas `.pipe` method. If you're unfamiliar with this amazing feature, you
may appreciate [this tutorial](https://calmcode.io/pandas-pipe/introduction.html).
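
A minimal sketch of how that can look, using the penguins dataset from `sklego` purely for illustration (the exact behaviour of `PipeTransformer` is described in the API docs):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklego.datasets import load_penguins

from hulearn.preprocessing import PipeTransformer


def preprocess(dataf, scale=True):
    # The keyword arguments of this function become tunable parameters.
    dataf = dataf.copy()
    if scale:
        dataf['flipper_length_mm'] = dataf['flipper_length_mm'] / 100
    return dataf[['flipper_length_mm', 'body_mass_g']]


df = load_penguins(as_frame=True).dropna()
X, y = df.drop(columns=['species']), df['species']

pipe = Pipeline([
    ('prep', PipeTransformer(preprocess, scale=True)),
    ('model', LogisticRegression(max_iter=1000)),
])

# The function argument `scale` is now a parameter you can benchmark.
grid = GridSearchCV(pipe, cv=3, param_grid={'prep__scale': [True, False]})
grid.fit(X, y)
```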

#### InteractivePreprocessor

This allows you to draw features that you'd like to add to your dataset or
your machine learning pipeline. You can use it via `tfm.fit(df).transform(df)` and
`df.pipe(tfm)`.
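
As a rough sketch of that flow (assuming you've already drawn some groups in an `InteractiveCharts` session named `charts`, as shown in the drawing-features guide):

```python
from hulearn.preprocessing import InteractivePreprocessor

# `charts.data()` holds the polygons that were drawn interactively.
tfm = InteractivePreprocessor(json_desc=charts.data())

tfm.fit(df).transform(df)     # scikit-learn style
df.pipe(tfm.pandas_pipe)      # pandas style
```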

### Datasets

Expand Down
2 changes: 2 additions & 0 deletions docs/api/preprocessing.md
@@ -1,3 +1,5 @@
# `from hulearn.preprocessing import *`

::: hulearn.preprocessing.pipetransformer

::: hulearn.preprocessing.interactivepreprocessor
68 changes: 58 additions & 10 deletions docs/examples/examples.md
@@ -1,27 +1,77 @@
This page contains a list of short examples that demonstrate the utility of
the tools in this package. The goal for each example is to be small and concise.

## Precision and Subgroups

It can be the case that for a subgroup of the population you do not need a model.
Suppose that we have a session-log dataset from "World of Warcraft". We know when
people logged in, whether they were part of a guild, and when they stopped playing. You
can create a machine learning model to predict which players are at risk of quitting
the game, but you might also be able to come up with some simple rules.

Here is one rule that might work out swell:

> "If any player was playing the video game at 24:00 on new-years eve,
odds are that this person is very invested in the game and won't stop playing."

This one rule will not cover the entire population, but for that subgroup it can be an
effective one.

![](wow.png)

As an illustrative example we'll implement this diagram as a `Classifier`.

```python
import numpy as np
from hulearn.outlier import InteractiveOutlierDetector
from hulearn.classification import FunctionClassifier, InteractiveClassifier


classifier = SomeScikitLearnModel()

def make_decision(dataf):
    # First we create a resulting array with all the predictions
    res = classifier.predict(dataf)

    # Override model prediction if a user is a heavy_user, no matter what
    res = np.where(dataf['heavy_user'], "stays", res)

    return res

fallback_model = FunctionClassifier(make_decision)
```
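
As a sketch of how this might be used downstream, assuming `df` holds the session-log features (including the `heavy_user` column) and `y` the churn labels, the fallback model behaves like any scikit-learn classifier:

```python
# Heavy users are always labelled "stays"; everyone else gets the
# underlying model's prediction.
fallback_model.fit(df, y)
preds = fallback_model.predict(df)
```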

## No Data No Problem

Let's say that we're interested in detecting fraud at a tax office. Even without
looking at the data we can already come up with some sensible rules.

- Any minor making over the median income is "suspicious".
- Any person who started more than 2 companies in a year is "suspicious".
- Any person who has more than 10 bank accounts is "suspicious".

The thing with these rules is that they are easy to explain, but they are not based
on data at all. In fact, the cases they describe may not occur in the data at all. This
means that a machine learning model may not have picked up the pattern that we're
interested in. Thankfully, the lack of data can be compensated for with business rules.

![](tree.png)
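
A rough sketch of how such rules could be encoded, using hypothetical column names (`age`, `income`, `companies_started`, `n_bank_accounts`) and a fixed threshold standing in for the median income:

```python
import numpy as np
from hulearn.classification import FunctionClassifier


def fraud_rules(dataf, income_threshold=40_000):
    # Each rule flags a case as "suspicious"; none of them needs training data.
    minor_high_income = (dataf['age'] < 18) & (dataf['income'] > income_threshold)
    many_companies = dataf['companies_started'] > 2
    many_accounts = dataf['n_bank_accounts'] > 10
    suspicious = minor_high_income | many_companies | many_accounts
    return np.where(suspicious, "suspicious", "not suspicious")


rule_model = FunctionClassifier(fraud_rules, income_threshold=40_000)
```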

## Comfort Zone

Models typically have a "comfort zone". If a new data point comes in that is
very different from what the model saw before, it should not be treated the same way.
You can also argue that points with a low `proba` score should not be
automated.

If you want to prevent predictions where the model is "unsure" then you
might want to follow this diagram;

![](../guide/finding-outliers/diagram.png)

You can construct such a system by creating a `FunctionClassifier` that
handles the logic you require. As an illustrative example we'll implement
this diagram as a `Classifier`.

```python
import numpy as np
@@ -55,5 +105,3 @@ For more information on why this tactic is helpful:

- [blogpost](https://koaning.io/posts/high-on-probability-low-on-certainty/)
- [pydata talk](https://www.youtube.com/watch?v=Z8MEFI7ZJlA)

9 changes: 9 additions & 0 deletions docs/examples/faq.md
@@ -2,6 +2,15 @@

Feel free to ask questions [here](https://github.com/koaning/human-learn/issues).

## What are the Lessons Learned?

If you're interested in some of the lessons the creators of this tool learned
while creating it, all you need to do is follow the Python tradition.

```python
from hulearn import this
```

## Why Make This?

Back in the old days, it was common to write rule-based systems. Systems that do;
Binary file added docs/examples/tree.png
Binary file added docs/examples/wow.png
57 changes: 57 additions & 0 deletions docs/guide/drawing-features/custom-features.md
@@ -0,0 +1,57 @@
So far we've explored drawing as a tool for models, but it can also be used
as a tool to generate features. To explore this, let's load in the penguins
dataset again.

```python
from sklego.datasets import load_penguins

df = load_penguins(as_frame=True).dropna()
```

## Drawing

We can draw over this dataset. It's like before, but with one crucial difference.

```python
from hulearn.experimental.interactive import InteractiveCharts

# Note that the `labels` argument here is a list, not a string! This
# tells the tool that we want to be able to add custom groups that are
# not defined by a column in the dataframe.
charts = InteractiveCharts(df, labels=['group_one', 'group_two'])
```

Let's make a custom drawing.

```python
charts.add_chart(x="flipper_length_mm", y="body_mass_g")
```

Let's assume the new drawing looks something like this.

![](drawing.png)

So far these drawn features have been used to construct models. But they
can also be used to help label data or generate extra features for machine
learning models.

## Features

This library makes it easy to add these features to scikit-learn
pipelines or to pandas. To get started, you'll want to import the
`InteractivePreprocessor`.

```python
from hulearn.preprocessing import InteractivePreprocessor
tfm = InteractivePreprocessor(json_desc=charts.data())
```

This `tfm` object can be used as a preprocessing step inside of
scikit-learn but it can also be used in a pandas pipeline.

```python
# The flow for scikit-learn
tfm.fit(df).transform(df)
# The flow for pandas
df.pipe(tfm.pandas_pipe)
```
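
As a rough follow-up (assuming the pandas route appends one feature column per drawn label, e.g. `group_one` and `group_two`), you can inspect what was added:

```python
# Compare the columns before and after applying the drawn features.
enriched = df.pipe(tfm.pandas_pipe)
print([c for c in enriched.columns if c not in df.columns])
```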
Binary file added docs/guide/drawing-features/drawing.png
