pyjanitor is a Python implementation of the R package janitor, and provides a clean API for cleaning data.
- Installation: conda install -c conda-forge pyjanitor
- Check out the collection of general functions
Originally a port of the R package, pyjanitor has evolved from a set of convenient data cleaning routines into an experiment with the method chaining paradigm.
Data preprocessing usually consists of a series of steps that transform raw data into an understandable, usable format. These steps must run in a particular order to succeed. We take a base data file as the starting point and perform actions on it, such as removing null/empty rows or replacing the missing values with other values, adding/renaming/removing columns of data, filtering rows, and others. More formally, these steps, together with their relationships and dependencies, are commonly represented as a Directed Acyclic Graph (DAG).
The pandas API has been invaluable for the Python data science ecosystem, and implements method chaining for a subset of methods as part of the API. For example, resetting indexes (.reset_index()), dropping null values (.dropna()), and more are accomplished via the appropriate pd.DataFrame method calls.
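As a minimal sketch of that built-in chaining, the snippet below drops a row with a missing value and then resets the index in a single expression. The `raw` and `tidy` names and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data invented for this sketch: one month has no sales figure.
raw = pd.DataFrame(
    {
        "month": ["Jan", "Feb", "Mar"],
        "sales": [150.0, np.nan, 300.0],
    }
)

# Each method returns a new DataFrame, so the calls can be chained.
tidy = (
    raw
    .dropna(subset=["sales"])   # drop rows where 'sales' is null
    .reset_index(drop=True)     # renumber the remaining rows from 0
)
```

Because every step returns a DataFrame, the order of operations can be read from top to bottom.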
Inspired by the ease of use and expressiveness of the dplyr package of the R statistical language ecosystem, we have evolved pyjanitor into a language for expressing the data processing DAG for pandas users.
To accomplish this, actions that would otherwise require imperative-style statements can be replaced with method chains that let one read off the logical order of the actions taken. Let us look at the annotated example below. First, here is a textual description of a data cleaning pathway:
- Create a DataFrame.
- Delete one column.
- Drop rows with empty values in two particular columns.
- Rename another two columns.
- Add a new column.
Let's import some libraries and begin with some sample data for this example:
# Libraries
import numpy as np
import pandas as pd
import janitor
# Sample Data curated for this example
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}
In pandas
code, most users might type something like this:
# The Pandas Way
# 1. Create a pandas DataFrame from the company_sales dictionary
df = pd.DataFrame.from_dict(company_sales)
# 2. Delete a column from the DataFrame. Say 'Company1'
del df['Company1']
# 3. Drop rows that have empty values in columns 'Company2' and 'Company3'
df = df.dropna(subset=['Company2', 'Company3'])
# 4. Rename 'Company2' to 'Amazon' and 'Company3' to 'Facebook'
df = df.rename(
    {
        'Company2': 'Amazon',
        'Company3': 'Facebook',
    },
    axis=1,
)
# 5. Let's add some data for another company. Say 'Google'
df['Google'] = [450.0, 550.0, 800.0]
# Output looks like this:
# Out[15]:
# SalesMonth Amazon Facebook Google
# 0 Jan 180.0 400.0 450.0
# 1 Feb 250.0 500.0 550.0
# 3 April 500.0 675.0 800.0
Slightly more advanced users might take advantage of the functional API:
df = (
    pd.DataFrame(company_sales)
    .drop(columns="Company1")
    .dropna(subset=['Company2', 'Company3'])
    .rename(columns={"Company2": "Amazon", "Company3": "Facebook"})
    .assign(Google=[450.0, 550.0, 800.0])
)
# Output looks like this:
# Out[15]:
# SalesMonth Amazon Facebook Google
# 0 Jan 180.0 400.0 450.0
# 1 Feb 250.0 500.0 550.0
# 3 April 500.0 675.0 800.0
With pyjanitor, we enable method chaining with method names that are explicit verbs describing the action taken.
df = (
    pd.DataFrame.from_dict(company_sales)
    .remove_columns(['Company1'])
    .dropna(subset=['Company2', 'Company3'])
    .rename_column('Company2', 'Amazon')
    .rename_column('Company3', 'Facebook')
    .add_column('Google', [450.0, 550.0, 800.0])
)
# Output looks like this:
# Out[15]:
# SalesMonth Amazon Facebook Google
# 0 Jan 180.0 400.0 450.0
# 1 Feb 250.0 500.0 550.0
# 3 April 500.0 675.0 800.0
As such, pyjanitor's etymology has a two-fold relationship to "cleanliness". Firstly, it's about extending pandas with convenient data cleaning routines. Secondly, it's about providing a cleaner, method-chaining, verb-based API for common pandas routines.
pyjanitor is currently installable from PyPI:
pip install pyjanitor
pyjanitor can also be installed with the conda package manager:
conda install pyjanitor -c conda-forge
pyjanitor can be installed with the pipenv environment manager too. This requires enabling prerelease dependencies:
pipenv install --pre pyjanitor
pyjanitor requires Python 3.6+.
Current functionality includes:
- Cleaning column names (multi-indexes are possible!)
- Removing empty rows and columns
- Identifying duplicate entries
- Encoding columns as categorical
- Splitting your data into features and targets (for machine learning)
- Adding, removing, and renaming columns
- Coalescing multiple columns into a single column
- Date conversions (from MATLAB, Excel, Unix) to Python datetime format
- Expanding a single column that has delimited, categorical values into dummy-encoded variables
- Concatenating and deconcatenating columns, based on a delimiter
- Syntactic sugar for filtering the dataframe based on queries on a column
- Experimental submodules for finance, biology, chemistry, engineering, and pyspark
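To make two of the items above concrete, here is a rough sketch of what "coalescing" and "expanding a delimited column" mean, written in plain pandas. This is not pyjanitor's implementation or API. The column names and data are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented data: two partially filled phone columns and a delimited tag column.
df = pd.DataFrame(
    {
        "phone_home": ["555-0100", np.nan, np.nan],
        "phone_work": [np.nan, "555-0199", np.nan],
        "tags": ["red,blue", "blue", "green"],
    }
)

# Coalesce: take the first non-null value across the two phone columns.
df["phone"] = df["phone_home"].combine_first(df["phone_work"])

# Expand: turn the comma-delimited categorical column into dummy-encoded columns.
dummies = df["tags"].str.get_dummies(sep=",")
```

Here `phone` holds the home number where present and falls back to the work number, while `dummies` has one 0/1 column per distinct tag.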
The idea behind the API is two-fold:
- Copy the R package function names, but enable Pythonic use with method chaining or pandas piping.
- Add other utility functions that make it easy to do data cleaning/preprocessing in pandas.
Continuing with the company_sales dataframe previously used:
import pandas as pd
import numpy as np
company_sales = {
    'SalesMonth': ['Jan', 'Feb', 'Mar', 'April'],
    'Company1': [150.0, 200.0, 300.0, 400.0],
    'Company2': [180.0, 250.0, np.nan, 500.0],
    'Company3': [400.0, 500.0, 600.0, 675.0]
}
There are three ways to use the API.
The first, and most strongly recommended one, is to use pyjanitor's functions as if they were native to pandas.
import janitor # upon import, functions are registered as part of pandas.
# This cleans the column names and removes any completely empty rows and columns
df = pd.DataFrame.from_dict(company_sales).clean_names().remove_empty()
The second is the functional API.
from janitor import clean_names, remove_empty
df = pd.DataFrame.from_dict(company_sales)
df = clean_names(df)
df = remove_empty(df)
The final way is to use the pipe() method:
from janitor import clean_names, remove_empty
df = (
    pd.DataFrame.from_dict(company_sales)
    .pipe(clean_names)
    .pipe(remove_empty)
)
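The functional and pipe() styles are interchangeable for any function that takes a dataframe as its first argument, because pipe() simply passes the dataframe in as that argument. The sketch below shows this with plain pandas and a hypothetical helper, `add_total`, which is not part of pyjanitor.

```python
import pandas as pd


def add_total(df, columns, total_name="total"):
    """Hypothetical helper: return a copy with a row-wise sum column added."""
    return df.assign(**{total_name: df[columns].sum(axis=1)})


df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Functional style: call the function on the dataframe directly.
direct = add_total(df, ["a", "b"])

# pipe() style: pandas passes df as the first argument, then the extras.
piped = df.pipe(add_total, ["a", "b"])
```

Both calls produce the same result; pipe() only changes where the call sits, which lets it slot into a longer method chain.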
Follow the contribution docs for a full description of the process of contributing to pyjanitor.
Keeping in mind the etymology of pyjanitor, contributing a new function to pyjanitor is not difficult at all.
First off, you will need to define the function that expresses the data processing/cleaning routine, such that it accepts a dataframe as the first argument, and returns a modified dataframe:
.. code-block:: python

    import pandas_flavor as pf

    @pf.register_dataframe_method
    def my_data_cleaning_function(df, arg1, arg2, ...):
        # Put data processing function here.
        return df
We use pandas_flavor to register the function natively on a pandas.DataFrame.
Secondly, we ask that you contribute a test case, to ensure that it works as intended. Follow the contribution docs for further details.
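As a rough illustration of what such a test might look like, the sketch below pairs a hypothetical cleaning routine, `fill_missing_with_zero`, with a pytest-style test. The function name, its behaviour, and the "input is left unmodified" check are assumptions made for this example, not requirements taken from the contribution docs. The registration decorator is omitted so the snippet needs only pandas.

```python
import numpy as np
import pandas as pd


def fill_missing_with_zero(df, column_name):
    """Hypothetical cleaning routine: replace nulls in one column with 0."""
    return df.assign(**{column_name: df[column_name].fillna(0.0)})


def test_fill_missing_with_zero():
    df = pd.DataFrame({"sales": [1.0, np.nan]})

    result = fill_missing_with_zero(df, "sales")

    # Compare the whole frame against a hand-written expected result.
    expected = pd.DataFrame({"sales": [1.0, 0.0]})
    pd.testing.assert_frame_equal(result, expected)

    # The original dataframe should still contain its missing value.
    assert df["sales"].isna().sum() == 1


# pytest would discover the test by name; calling it directly also works.
test_fill_missing_with_zero()
```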
If you have a feature request, please post it as an issue on the GitHub repository issue tracker. Even better, put in a PR for it! We are more than happy to guide you through the codebase so that you can make a contribution.
Because pyjanitor is currently maintained by volunteers and has no fiscal support, any feature requests will be prioritized according to what maintainers encounter as a need in our day-to-day jobs. Please temper expectations accordingly.
pyjanitor only extends or aliases the pandas API (and other dataframe APIs), but will never fix or replace them. Undesirable pandas behaviour should be reported upstream in the pandas issue tracker. We explicitly do not fix the pandas API. If at some point the pandas devs decide to take something from pyjanitor and internalize it as part of the official pandas API, then we will deprecate it from pyjanitor, while acknowledging the original contributors' contribution as part of the official deprecation record.
Test data for the chemistry submodule can be found at Predictive Toxicology.