Automated Data Cleaning Tool.
The main goal is to develop a Python tool datacleanbot
such that:
Given a random parsed raw dataset representing a supervised learning problem, the Python tool is capable of automatically identifying the potential issues and reporting the results and recommendations to the end-user in an effective way.
$ pip install datacleanbot
OpenML is used to easily import datasets and share models and experiments.
$ pip install openml
For Windows, you need to have C++ Compiler installed.
>>> import openml as oml
>>> data = oml.datasets.get_dataset(id) # id: openml dataset id
>>> X, y, categorical_indicator, features = data.get_data(target=data.default_target_attribute, dataset_format='array')
>>> Xy = np.concatenate((X,y.reshape((y.shape[0],1))), axis=1)
>>> import datacleanbot.dataclean as dc
>>> Xy = dc.autoclean(Xy, data.name, features)
datacleanbot
is equipped with the following capabilities:
- Present an overview report of the given dataset
- The most important features
- Statistical information (e.g., mean, max, min)
- Data types of features
- Clean common data problems in the raw dataset
- Duplicated records
- Inconsistent column names
- Missing values
- Outliers
The three aspects datacleanbot
meaningfully automates are marked in bold.
The user's guide can be found at datacleanbot.