MOULAYID/data_generation_pyAgrum

Ensure that PostgreSQL is installed on your machine.

Install the required Python packages.

pip install -r requirements.txt 

If you encounter PostgreSQL connection issues, update the connection parameters in the main script (such as user and password) to match your PostgreSQL configuration.

Data Generation with pyAgrum: Simulating Incomplete Datasets

Project Overview

This project leverages pyAgrum, a Python library for Bayesian networks, to generate synthetic data with controlled missingness mechanisms. The input to the project is a Bayesian network (BN) that represents:

  1. The causal structure of the attributes in a dataset.
  2. The mechanism of missingness, encoding the probability of missing values.

By combining the causal relationships between variables with the missingness mechanism, this project creates a complete dataset and then an incomplete dataset by introducing missing values. It then computes a block-independent probabilistic database from the incomplete dataset using the probabilities defined in the Bayesian network.

Inputs

  1. Missingness Graph (Bayesian Network):

    • A Bayesian network (BN) is provided as input, which defines the dependencies between variables and the missingness mechanism using indicator variables.
    • Each indicator variable determines whether an attribute will have a missing value (NA) in the final dataset.
  2. Database Size:

    • The user specifies the number of tuples in the generated synthetic dataset.
  3. Missingness Rate:

    • The probability of missingness is determined by the indicator variables in the missingness graph.

Data Generation Process

1. Complete Data Generation

  • Using pyAgrum, a complete synthetic dataset is generated based on the input BN, ensuring dependencies between attributes are respected according to the conditional/absolute probability distributions.
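As an illustration of what this step does, here is a minimal pure-Python sketch of forward (ancestral) sampling. It does not use the pyAgrum API; the network (Age -> Income), its probabilities, and the function names are hypothetical and chosen only for the example.

```python
import random

# Hypothetical two-variable network: Age -> Income (not taken from the repository).
PARENTS = {"Age": (), "Income": ("Age",)}
# For each variable: parent values -> distribution over the variable's values.
CPT = {
    "Age": {(): {"young": 0.4, "old": 0.6}},
    "Income": {
        ("young",): {"low": 0.7, "high": 0.3},
        ("old",): {"low": 0.2, "high": 0.8},
    },
}
ORDER = ["Age", "Income"]  # topological order: parents come before children


def sample_row(rng):
    """Draw one tuple by sampling each variable given its already-sampled parents."""
    row = {}
    for var in ORDER:
        key = tuple(row[p] for p in PARENTS[var])
        dist = CPT[var][key]
        values = list(dist)
        weights = [dist[v] for v in values]
        row[var] = rng.choices(values, weights=weights, k=1)[0]
    return row


def generate_complete(n, seed=0):
    """Generate n complete tuples; the seed makes the run reproducible."""
    rng = random.Random(seed)
    return [sample_row(rng) for _ in range(n)]


complete = generate_complete(1000)
```

Sampling in topological order is what keeps each attribute consistent with the values already drawn for its parents.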

2. Handling Missing Data Using Indicators

  • Each partially observed attribute has a corresponding indicator variable that determines whether a missing value (NA) should be introduced in the dataset.
  • If the indicator variable equals 1, the target attribute is set to NA.
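The masking rule above can be sketched as follows. This is an assumption-laden example, not the project's code: the column names (`Income`, `R_Income`), the use of `None` for NA, and the choice to drop indicator columns from the output are all hypothetical.

```python
NA = None  # stand-in for a missing value (assumption: the project may use another marker)


def apply_missingness(rows, indicator_of):
    """Mask attributes whose indicator equals 1.

    rows: list of dicts containing both attributes and indicator columns.
    indicator_of: maps a partially observed attribute to its indicator column.
    """
    indicator_cols = set(indicator_of.values())
    incomplete = []
    for row in rows:
        masked = {}
        for col, value in row.items():
            if col in indicator_cols:
                continue  # assumption: indicators are not kept in the final dataset
            ind = indicator_of.get(col)
            masked[col] = NA if ind is not None and row[ind] == 1 else value
        incomplete.append(masked)
    return incomplete


rows = [
    {"Age": "young", "Income": "low", "R_Income": 0},
    {"Age": "old", "Income": "high", "R_Income": 1},
]
incomplete = apply_missingness(rows, {"Income": "R_Income"})
```

Attributes without an indicator, such as `Age` here, are treated as fully observed and copied unchanged.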

3. Probabilistic Database Generation

  • Using the incomplete dataset and the conditional/absolute probabilities defined in the BN, a block-independent probabilistic database is built.
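One way to picture this step is shown below. In a block-independent probabilistic database, each block holds mutually exclusive alternatives for one tuple, and blocks are independent of each other. The sketch assumes a single partially observed attribute (`Income`) whose alternatives are weighted by a hypothetical P(Income | Age) table; it ignores the indicator variable, which may not match how the project derives its probabilities.

```python
# Hypothetical conditional distribution P(Income | Age); values are illustrative only.
INCOME_GIVEN_AGE = {
    "young": {"low": 0.7, "high": 0.3},
    "old": {"low": 0.2, "high": 0.8},
}


def to_block_independent(incomplete):
    """Turn each incomplete tuple into one block of (tuple, probability) alternatives."""
    blocks = []
    for block_id, row in enumerate(incomplete):
        if row["Income"] is None:
            # Missing value: one alternative per possible value, weighted by the CPT.
            dist = INCOME_GIVEN_AGE[row["Age"]]
            alternatives = [(dict(row, Income=v), p) for v, p in dist.items()]
        else:
            # Fully observed tuple: a single certain alternative.
            alternatives = [(dict(row), 1.0)]
        blocks.append({"block": block_id, "alternatives": alternatives})
    return blocks


incomplete = [
    {"Age": "young", "Income": "low"},
    {"Age": "old", "Income": None},
]
bid = to_block_independent(incomplete)
```

The probabilities within each block sum to 1, so every block describes exactly one real tuple whose value is uncertain.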

Outputs

  1. Complete Dataset: A synthetic dataset without missing values.
  2. Incomplete Dataset: The same dataset with missing values (NA) introduced according to the missingness graph.
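  3. Probabilistic Dataset: A block-independent probabilistic database computed from the incomplete dataset using the probabilities defined in the BN.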

How to Use

  • A variety of missingness graphs, representing the three missingness mechanisms, are available in the /quantitative_bayesian_network/generators directory and are ready to be used.
  • In the main script, you can specify the desired missingness graph, the size of the database, and the missingness rate.
  • The quality of the generated datasets can be evaluated using metrics such as KL divergence, Wasserstein distance, and Euclidean distance, comparing the joint probability distribution from the Bayesian network with the empirical distribution of the generated data (methods available in /src/data_evaluation).
  • The data can be stored either as a CSV file, in a PostgreSQL database, or both, depending on the specified parameter.
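To make the evaluation idea concrete, here is a small standard-library sketch of one of the metrics mentioned above, KL divergence between a model joint distribution and the empirical distribution of generated tuples. It is not the code in /src/data_evaluation; the function names, the example joint, and the direction KL(model || empirical) are assumptions made for the example.

```python
import math
from collections import Counter


def empirical_distribution(rows, columns):
    """Relative frequency of each value combination over the given columns."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid division by zero."""
    return sum(
        pv * math.log(pv / max(q.get(key, 0.0), eps))
        for key, pv in p.items()
        if pv > 0
    )


# Hypothetical joint distribution over (Age, Income), illustrative values only.
model_joint = {
    ("young", "low"): 0.28,
    ("young", "high"): 0.12,
    ("old", "low"): 0.12,
    ("old", "high"): 0.48,
}
# A toy dataset whose frequencies match the joint exactly.
rows = (
    [{"Age": "young", "Income": "low"}] * 28
    + [{"Age": "young", "Income": "high"}] * 12
    + [{"Age": "old", "Income": "low"}] * 12
    + [{"Age": "old", "Income": "high"}] * 48
)
empirical = empirical_distribution(rows, ["Age", "Income"])
kl = kl_divergence(model_joint, empirical)
```

A value near zero indicates that the generated data closely follows the distribution encoded in the network; larger values indicate a mismatch, which is expected to shrink as the database size grows.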
