Adegorical is a Python package for performing advanced transformations on categorical data. It is particularly useful in regression analysis, but it can also be applied to other machine learning techniques (at your own peril).
`get_categorical` returns a data structure that matches the type of its input:
- pandas Series input returns a pandas DataFrame
- NumPy column input returns a NumPy array
- Python list input returns a list of lists
import adegorical as ad
encoding_types = ad.help()
print(encoding_types)
['dummy', 'binary', 'simple_contrast', 'simple_regression', 'backward_difference_contrast', 'forward_difference_contrast', 'simple_helmert']
- Dummy
- Binary
- Simple Contrast
- Simple Regression
- Forward Difference Contrast
- Backward Difference Contrast
- Simple Helmert
The encoding methods in this package build on UCLA's Advance Categorical Variable Encoding page and a presentation by Harris Holly. Unfortunately, UCLA has since removed the page from its website. An archived version can be found in this repository.
Dummy encoding is the standard approach to categorical variable encoding. It produces N-1 columns, where N is the number of unique categories. The omitted category is the reference value, and its rows are all zeros.
import pandas as pd
colors = ['yellow', 'red', 'green', 'wenge', 'orange', 'red', 'yellow', 'blue', 'magenta', 'wenge']
df = pd.DataFrame({'colors': colors})
categorial_frame = ad.get_categorical(df['colors'],
                                      encoding='dummy',
                                      column_name=None)
yellow_dummy | wenge_dummy | red_dummy | green_dummy | magenta_dummy | blue_dummy |
---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 1 |
0 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 0 | 0 |
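
The N-1 rule can be reproduced without the package. What follows is a minimal pure-Python sketch, not adegorical's implementation: the helper name `dummy_encode` is hypothetical, `orange` is passed as the reference because it is the all-zero row in the table above, and columns are emitted in first-seen order, which may differ from the package's column order.

```python
def dummy_encode(values, reference):
    """Return (column_names, rows) with one 0/1 column per non-reference category.

    Hypothetical helper for illustration only; not part of adegorical.
    """
    # dict.fromkeys keeps the first-seen order of the unique categories.
    categories = [c for c in dict.fromkeys(values) if c != reference]
    names = [f"{c}_dummy" for c in categories]
    # Each row has a single 1 in its category's column; the reference row is all zeros.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows


colors = ['yellow', 'red', 'green', 'wenge', 'orange',
          'red', 'yellow', 'blue', 'magenta', 'wenge']
names, rows = dummy_encode(colors, reference='orange')
print(len(names))  # 7 unique colors give 6 columns
print(rows[4])     # the 'orange' row contains only zeros
```

Dropping one category avoids perfectly collinear columns when the encoded data is used alongside an intercept in a regression.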
In binary encoding, the columns read together from left to right form a binary code for each category. The number of columns equals the number of digits in the binary representation of N, the number of unique categories. For example, 7 unique categories give 111 in binary, so 3 columns are expected.
colors = ['yellow', 'red', 'green', 'wenge', 'orange', 'red', 'yellow', 'blue', 'magenta', 'wenge']
df = pd.DataFrame({'colors': colors})
categorial_frame = ad.get_categorical(df['colors'],
                                      encoding='binary',
                                      reference='red',
                                      column_name='binary')
binary_1 | binary_2 | binary_3 |
---|---|---|
0 | 0 | 0 |
1 | 1 | 0 |
0 | 1 | 1 |
0 | 0 | 1 |
0 | 1 | 0 |
1 | 1 | 0 |
0 | 0 | 0 |
1 | 0 | 1 |
1 | 0 | 0 |
0 | 0 | 1 |
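
The column count and the bit layout can be sketched in plain Python. This is an illustration under stated assumptions rather than the package's code: the helper name `binary_encode` is hypothetical, and categories are numbered in first-seen order starting at zero, so the bit patterns will not necessarily match the table above (which does not document how codes are assigned).

```python
def binary_encode(values, column_name="binary"):
    """Return (column_names, rows) where each row is a category's binary code.

    Hypothetical helper for illustration only; not part of adegorical.
    """
    # Assumption: categories are numbered 0..N-1 in first-seen order.
    categories = list(dict.fromkeys(values))
    index = {c: i for i, c in enumerate(categories)}
    # Width is the number of binary digits of N, the unique category count.
    width = len(format(len(categories), "b"))
    names = [f"{column_name}_{k + 1}" for k in range(width)]
    # Zero-pad each index to the common width and split it into 0/1 integers.
    rows = [[int(bit) for bit in format(index[v], f"0{width}b")] for v in values]
    return names, rows


colors = ['yellow', 'red', 'green', 'wenge', 'orange',
          'red', 'yellow', 'blue', 'magenta', 'wenge']
names, rows = binary_encode(colors)
print(names)    # three columns for 7 unique colors
print(rows[1])  # code assigned to 'red' under the first-seen numbering
```

Compared with dummy encoding, the column count grows logarithmically with the number of categories, at the cost of columns that no longer correspond to a single category.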
In simple contrast encoding, the reference value's row is set to negative one in every column, rather than all zeros as in dummy encoding. N-1 columns are expected, where N is the number of unique categories.
colors = ['yellow', 'red', 'green', 'wenge', 'orange', 'red', 'yellow', 'blue', 'magenta', 'wenge']
df = pd.DataFrame({'colors': colors})
categorial_frame = ad.get_categorical(df['colors'],
                                      encoding='simple_contrast',
                                      reference='red',
                                      column_name='simple_contrast')
yellow_simple_contrast | wenge_simple_contrast | orange_simple_contrast | green_simple_contrast | magenta_simple_contrast | blue_simple_contrast |
---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 |
-1 | -1 | -1 | -1 | -1 | -1 |
0 | 0 | 0 | 1 | 0 | 0 |
0 | 1 | 0 | 0 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 |
-1 | -1 | -1 | -1 | -1 | -1 |
1 | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | 1 |
0 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 0 | 0 |
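
The difference from dummy encoding is confined to the reference rows, which a short pure-Python sketch makes visible. The helper name `simple_contrast_encode` is hypothetical and this is not adegorical's implementation. Columns are emitted in first-seen order, which may differ from the order shown in the table above.

```python
def simple_contrast_encode(values, reference):
    """Return (column_names, rows); reference rows are -1 in every column.

    Hypothetical helper for illustration only; not part of adegorical.
    """
    # One column per non-reference category, in first-seen order.
    categories = [c for c in dict.fromkeys(values) if c != reference]
    names = [f"{c}_simple_contrast" for c in categories]
    rows = []
    for v in values:
        if v == reference:
            # The reference category is marked with -1 across the whole row.
            rows.append([-1] * len(categories))
        else:
            # Every other category gets a single 1 in its own column.
            rows.append([1 if v == c else 0 for c in categories])
    return names, rows


colors = ['yellow', 'red', 'green', 'wenge', 'orange',
          'red', 'yellow', 'blue', 'magenta', 'wenge']
names, rows = simple_contrast_encode(colors, reference='red')
print(len(names))  # 7 unique colors give 6 columns
print(rows[1])     # the 'red' row contains only -1
```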
- Forward Difference Regression
- Backward Difference Regression
- Simple Helmert Regression
- Reverse Helmert
- Polynomial
- Regression Polynomial
- Deviation
- Deviation Regression
- Manipulate data in native format rather than converting to lists and back to native format (i.e. pandas data input, transforming via optimized pandas methods)
- Redo the column naming convention for binary encoding. Results are a combination of columns, so having a "blue" column doesn't make much sense
- Simple Regression
- Backward Difference Contrast
- Forward Difference Contrast
- Simple Helmert
- Remaining encoding methods found in todo encoding methods