Ensure that PostgreSQL is installed on your machine.

Install the required Python packages.

pip install -r requirements.txt

If you encounter issues connecting to PostgreSQL, update the connection parameters in the main script (main.py), such as user and password, to match your PostgreSQL configuration.
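As a rough illustration of keeping connection parameters in one place, the sketch below builds a PostgreSQL connection URI from explicit arguments, falling back to the standard libpq environment variables PGUSER and PGPASSWORD. The helper name `postgres_uri` and its defaults are assumptions made for this example; the actual parameter names used in main.py are not shown here.

```python
import os
from urllib.parse import quote


def postgres_uri(user=None, password=None, host="localhost", port=5432, dbname="postgres"):
    """Build a postgresql:// URI (hypothetical helper, not part of the project)."""
    # Fall back to the libpq environment variables when no value is passed.
    user = user or os.environ.get("PGUSER", "postgres")
    if password is None:
        password = os.environ.get("PGPASSWORD", "")
    # Percent-encode credentials so characters such as '@' or ':' stay valid.
    auth = quote(user, safe="")
    if password:
        auth += ":" + quote(password, safe="")
    return f"postgresql://{auth}@{host}:{port}/{dbname}"


print(postgres_uri("alice", "p@ss", dbname="demo"))
```

Keeping the credentials out of the source file this way means a connection problem can usually be fixed by changing the environment rather than editing the script.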

Data Generation with pyAgrum: Simulating Incomplete Datasets

Project Overview

This project leverages pyAgrum, a Python library for Bayesian networks, to generate synthetic data with controlled missingness mechanisms. The input to the project is a Bayesian network (BN) that represents:

The causal structure of the attributes in a dataset.
The mechanism of missingness, encoding the probability of missing values.

By combining the causal relationships between variables with the missingness mechanism, the project generates a complete dataset and then an incomplete dataset by introducing missing values. It then computes a block-independent probabilistic database from the incomplete dataset, using the probabilities defined in the Bayesian network.

Inputs

Missingness Graph (Bayesian Network):
- A Bayesian network (BN) is provided as input, which defines the dependencies between variables and the missingness mechanism using indicator variables.
- Each indicator variable determines whether an attribute will have a missing value (NA) in the final dataset.
Database Size:
- The user specifies the number of tuples in the generated synthetic dataset.
Missingness Rate:
- The probability of missingness is determined by the indicator variables in the missingness graph.

Data Generation Process

1. Complete Data Generation

Using pyAgrum, a complete synthetic dataset is generated from the input BN, so that the dependencies between attributes follow the conditional and marginal (unconditional) probability distributions defined in the network.
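To make the idea concrete, here is a minimal sketch of forward (ancestral) sampling written in plain Python rather than with pyAgrum. The two-attribute network A -> B, its labels, and its probabilities are illustrative assumptions, not one of the project's generators.

```python
import random

# Hypothetical network A -> B; all names and numbers are assumptions.
P_A = {"a0": 0.6, "a1": 0.4}                 # marginal P(A)
P_B_GIVEN_A = {                              # conditional P(B | A)
    "a0": {"b0": 0.9, "b1": 0.1},
    "a1": {"b0": 0.2, "b1": 0.8},
}


def draw(dist, rng):
    """Draw one label from a {label: probability} table."""
    labels = list(dist)
    return rng.choices(labels, weights=[dist[l] for l in labels], k=1)[0]


def sample_complete(n, seed=0):
    """Sample n complete tuples, drawing each parent before its child."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a = draw(P_A, rng)                   # root attribute first
        b = draw(P_B_GIVEN_A[a], rng)        # child conditioned on its parent
        rows.append({"A": a, "B": b})
    return rows


complete = sample_complete(1000)
print(complete[:3])
```

Sampling in topological order is what keeps the generated tuples consistent with the dependencies encoded in the network.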

2. Handling Missing Data Using Indicators

Each partially observed attribute has a corresponding indicator variable that determines whether a missing value (NA) should be introduced in the dataset.
If the indicator variable equals 1, the target attribute is set to NA.
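The masking rule can be sketched in a few lines of plain Python. The indicator name `R_B`, the attribute names, and the use of `None` to stand for NA are assumptions made for this example.

```python
NA = None  # stands in for the NA marker in this sketch


def apply_indicators(rows, targets):
    """Return copies of rows where each attribute is NA if its indicator is 1.

    `targets` maps an indicator column to the attribute it controls,
    e.g. {"R_B": "B"}. Indicator columns are dropped from the result.
    """
    masked = []
    for row in rows:
        out = {k: v for k, v in row.items() if k not in targets}
        for indicator, attribute in targets.items():
            if row[indicator] == 1:
                out[attribute] = NA
        masked.append(out)
    return masked


rows = [
    {"A": "a0", "B": "b1", "R_B": 0},   # B stays observed
    {"A": "a1", "B": "b0", "R_B": 1},   # B becomes NA
]
incomplete = apply_indicators(rows, {"R_B": "B"})
print(incomplete)
```

Because the indicators are ordinary variables of the network, the same rule applies whatever the indicator depends on.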

3. Probabilistic Database Generation

Using the incomplete database and the conditional and marginal (unconditional) probabilities defined in the BN, we build the probabilistic database.
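One possible reading of this step, sketched below in plain Python, turns each tuple into a block of mutually exclusive alternatives: an observed tuple yields a single alternative with probability 1, and a tuple with a missing value yields one alternative per candidate value, weighted by the network's conditional probability. The table `P_B_GIVEN_A` and the choice to condition only on the observed parent A are assumptions; the exact conditioning used by the project is not described here.

```python
# Hypothetical conditional table P(B | A); values are illustrative.
P_B_GIVEN_A = {
    "a0": {"b0": 0.9, "b1": 0.1},
    "a1": {"b0": 0.2, "b1": 0.8},
}


def to_blocks(rows):
    """Build one block per tuple; alternatives within a block sum to 1."""
    blocks = []
    for key, row in enumerate(rows):
        if row["B"] is None:
            # Missing B: one alternative per possible value of B.
            alternatives = [
                ({"A": row["A"], "B": b}, p)
                for b, p in P_B_GIVEN_A[row["A"]].items()
            ]
        else:
            # Fully observed tuple: a single certain alternative.
            alternatives = [(dict(row), 1.0)]
        blocks.append({"block": key, "alternatives": alternatives})
    return blocks


incomplete = [{"A": "a0", "B": "b1"}, {"A": "a1", "B": None}]
blocks = to_blocks(incomplete)
for block in blocks:
    print(block)
```

Blocks are independent of one another, while the alternatives inside a block are mutually exclusive, which is the defining property of a block-independent database.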

Outputs

Complete Dataset: A synthetic dataset without missing values.
Incomplete Dataset: The same dataset with missing values (NA) introduced according to the missingness graph.
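Probabilistic Dataset: A block-independent probabilistic database computed from the incomplete dataset using the probabilities defined in the BN.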

How to Use

A variety of missingness graphs, covering the three missingness mechanisms (MCAR, MAR, and MNAR), are available in the /quantitative_bayesian_networks/generators directory and are ready to use.
In the main script, you can specify the desired missingness graph, the size of the database, and the missingness rate.
The quality of the generated datasets can be evaluated using metrics such as KL divergence, Wasserstein distance, and Euclidean distance, comparing the joint probability distribution from the Bayesian network with the empirical distribution of the generated data (methods available in /src/data_evaluation).
The data can be stored either as a CSV file, in a PostgreSQL database, or both, depending on the specified parameter.
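For the evaluation step, the following sketch shows how two of the listed metrics, KL divergence and Euclidean distance, could be computed between a joint distribution and the empirical distribution of generated tuples. It is a plain-Python illustration, not the code in /src/data_evaluation; the function names, the joint table, and the sample rows are assumptions.

```python
import math
from collections import Counter


def empirical(rows, attrs):
    """Relative frequency of each value combination of `attrs`."""
    counts = Counter(tuple(r[a] for a in attrs) for r in rows)
    return {k: c / len(rows) for k, c in counts.items()}


def kl_divergence(p, q):
    """KL(p || q) in bits; infinite if q is zero where p is positive."""
    total = 0.0
    for key, pk in p.items():
        if pk == 0:
            continue
        qk = q.get(key, 0.0)
        if qk == 0:
            return math.inf
        total += pk * math.log2(pk / qk)
    return total


def euclidean(p, q):
    """Euclidean distance between two distributions over the same keys."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


# Hypothetical joint P(A, B) and a small generated sample.
joint = {("a0", "b0"): 0.54, ("a0", "b1"): 0.06,
         ("a1", "b0"): 0.08, ("a1", "b1"): 0.32}
rows = ([{"A": "a0", "B": "b0"}] * 5 + [{"A": "a0", "B": "b1"}]
        + [{"A": "a1", "B": "b0"}] + [{"A": "a1", "B": "b1"}] * 3)

emp = empirical(rows, ["A", "B"])
print(kl_divergence(emp, joint), euclidean(emp, joint))
```

The divergence is taken from the empirical distribution to the network's joint here, so that combinations absent from a small sample do not produce an infinite value; the opposite direction is equally valid when the sample covers every combination.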

MOULAYID/data_generation_pyAgrum

Folders and files

Latest commit

History

Repository files navigation

Ensure that PostgreSQL is installed on your machine.

Install the required Python packages.

If you encounter connection issues to PostgreSQL, ensure to update the parameters in the main such as user and password according to your PostgreSQL configuration.

Data Generation with pyAgrum: Simulating Incomplete Datasets

Project Overview

Inputs

Data Generation Process

1. Complete Data Generation

2. Handling Missing Data Using Indicators

3. Probabilistic Database Generation

Outputs

How to Use

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages