A data validation library for scientists, engineers, and analysts seeking correctness.
pandas data structures contain information that pandera explicitly validates at runtime.
This is useful in production-critical or reproducible research settings. With pandera, you can:
- Check the types and properties of columns in a DataFrame or values in a Series.
- Perform more complex statistical validation like hypothesis testing.
- Seamlessly integrate with existing data analysis/processing pipelines via function decorators.
pandera provides a flexible and expressive API for performing data validation
on tidy (long-form) and wide data to make data processing pipelines more
readable and robust.
The official documentation is hosted on ReadTheDocs: https://pandera.readthedocs.io
Using pip:
pip install pandera
Using conda:
conda install -c conda-forge pandera
import pandas as pd
import pandera as pa

# data to validate
df = pd.DataFrame({
    "column1": [1, 4, 0, 10, 9],
    "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"]
})

# define schema
schema = pa.DataFrameSchema({
    "column1": pa.Column(pa.Int, checks=pa.Check.less_than_or_equal_to(10)),
    "column2": pa.Column(pa.Float, checks=pa.Check.less_than(-1.2)),
    "column3": pa.Column(pa.String, checks=[
        pa.Check.str_startswith("value_"),
        # define custom checks as functions that take a Series as input and
        # output a boolean or boolean Series
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
    ]),
})

validated_df = schema.validate(df)
print(validated_df)

#    column1  column2  column3
# 0        1     -1.3  value_1
# 1        4     -1.4  value_2
# 2        0     -2.9  value_3
# 3       10    -10.1  value_2
# 4        9    -20.4  value_1
git clone https://github.com/pandera-dev/pandera.git
cd pandera
pip install -r requirements-dev.txt
pip install -e .
pip install pytest
pytest tests
All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.
A detailed overview on how to contribute can be found in the contributing guide on GitHub.
Feature requests and bug reports can be submitted through the project's GitHub issue tracker.
Here are a few other alternatives for validating Python data structures.
Generic Python object data validation
pandas-specific data validation
Other tools that include data validation
- pandas-centric data types, column nullability, and uniqueness are first-class concepts.
- check_input and check_output decorators enable seamless integration with existing code.
- Checks provide flexibility and performance by providing access to the pandas API by design.
- The Hypothesis class provides a tidy-first interface for statistical hypothesis testing.
- Checks and Hypothesis objects support both tidy and wide data validation.
- Comprehensive documentation on key functionality.
@software{niels_bantilan_2020_3926689,
author = {Niels Bantilan and
Nigel Markey and
Riccardo Albertazzi and
Nemanja Radojković and
chr1st1ank and
Aditya Singh and
Anthony Truchet - C3.AI and
Steve Taylor and
Sunho Kim and
Zachary Lawrence},
title = {{pandera-dev/pandera: 0.4.4: bugfixes in yaml
serialization, error reporting, refactor internals}},
month = jul,
year = 2020,
publisher = {Zenodo},
version = {0.4.4},
doi = {10.5281/zenodo.3926689},
url = {https://doi.org/10.5281/zenodo.3926689}
}