SynthEHRella

A Python package for synthetic Electronic Health Records (EHR) data generation benchmarking.

Cite

	Citation
systematic review paper	Chen X, Wu Z, Shi X, Cho H, Mukherjee B $^\dagger$ (2024+). Generating Synthetic Electronic Health Record (EHR) Data: A Systematic Review and Benchmarking. [: correspdonding author ([email protected]); $^\dagger$: senior author ([email protected])].

Package Overview

synthEHRella/
│
├── synthEHRella/
│   ├── __init__.py              
│   ├── data/
|   |   ├── methods/
│   |   |   ├── cor-gan
│   |   |   ├── plasmode
│   |   |   ├── synthea
│   |   |   └── ...
│   │   ├── __init__.py
│   │   ├── data_generator.py
│   │   ├── utils.py    
│   ├── evaluation/
│   │   ├── __init__.py          
│   │   ├── fidelity.py
│   │   ├── utility.py
│   │   └── privacy.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── data_transform.py
│   │   └── ... (mapping files)
│   ├── run_generation.py           
│   ├── run_evaluation.py
│   ├── run_postprocessing.py
│   └── run_preprocessing.py
├── setup.py
├── LICENSE
├── README.md
└── environmental.yaml

Preliminary

You should have the following files before installing the package.

Obtain MIMIC-III/IV Data

MIMIC-III and MIMIC-IV data are open source and available through PhysioNet. Request access here:

MIMIC-III v1.4

MIMIC-IV v3.1

After obtaining the data, you can preprocess it for use with SynthEHRella. Note that in the paper, we used MIMIC-III v1.4 and MIMIC-IV v2.2 for the evaluation.

(Optional) Obtain Synthea Software

NOTE: Synthea software is optional and only required if you want to generate synthetic data using Synthea.

Synthea is an open-source synthetic EHR data generation method that generates synthetic data by simulating the history of synthetic patients. You can download the software from their Github Repository. The file you need to download is called synthea-with-dependencies.jar. After obtaining the software, you could place it in the synthEHRella/synthEHRella/data/methods/synthea folder.

Installation

Before installing the package, make sure you have the following dependencies installed:

conda env create -f environment.yaml

To install the synthEHRella package, run the following command:

pip install .

Usage

STEP 1: Preprocessing Real EHR Data

To preprocess real EHR data, run:

python -m synthEHRella.run_preprocessing <admissionFile> <diagnosisFile> <patientsFile> <outPath> binary <dataset>

Arguments:

<admissionFile>, <diagnosisFile>, <patientsFile>: Paths to the admission, diagnosis, and patients files in MIMIC-III/IV.
<outPath>: Path to save the preprocessed data.
<dataset>: Name of the dataset (i.e., mimic3, mimic4).

The preprocessing will output multiple files:

For mimic3:

processed_mimic3.matrix: Matrix-format EHR data coded by ICD-9.
processed_mimic3.pid: Patient IDs (most of the time we don't need this).
processed_mimic3.types: Disease types for each dimension.
mimic3-real-phecodex.npy: EHR data in PhecodeX coding (used for visualization in the paper).
mimic3-real-phecodexm.npy: EHR data in PhecodeX coding with higher hierachy (for quantitative evaluation in the paper).

For mimic4:

mimic4-real-phecodex.npy: EHR data in PhecodeX coding (used for visualization in the paper).
mimic4-real-phecodexm.npy: EHR data in PhecodeX coding with higher hierachy (for quantitative evaluation in the paper).

The processed MIMIC-III dataset will be used as the training data for all the methods except ehrdiff. For ehrdiff, the training data provided by their authors are used to train the model to guarantee only minor change on their codebase. The data can be downloaded from their Github Repository.

STEP 2: Generating Synthetic EHR Data

To generate synthetic EHR data, run:

python -m synthEHRella.run_generation <method> --params data.real_training_data_path=<path_to_real_data> generation.ckpt_dir=<path_to_save_ckpt> generation.params.num_gen_samples=<num_gen_samples>

Arguments:

<method>: The method for generating synthetic data.
<path_to_real_data>: Path to preprocessed real EHR data.
<path_to_save_ckpt>: Directory to save the generated files (e.g., model checkpoints, synthetic data).
<num_gen_samples>: Number of synthetic samples to generate.

Currently, the following methods are supported:

corgan: CorGAN
plasmode: Plasmode
synthea: Synthea
ehrdiff: EHRDiff
medgan: MedGAN
vae: VAE
promptehr: PromptEHR
resample: Resample (baseline methods)
prevalence-based-random: Prevalence-based Random (baseline methods)

STEP 3: Post-Processing

To transform synthetic data from ICD-9 / SNOMED-CT to PhecodeX, run:

python -m synthEHRella.run_postprocessing <method> --synthetic_data_path <path_to_synthetic_data> --output_path <path_to_save_results>

Arguments:

<method>: Method used to generate synthetic data.
<path_to_synthetic_data>: Path to the synthetic EHR data.
<path_to_save_results>: Path to save the post-processed results.

Instruction on finding synthetic data path for each method:

corgan, medgan, and vae: the synthetic data are contained in a .npy file under the {ckpt_dir} set in generation called synthetic-{gen_sample_size}.npy. gen_sample_size is the number of samples generated.
ehrdiff: the synthetic data are contained in a file called 'all_x.npy' under the '{ckpt_dir}/samples' folder.
plasmode: the synthetic data are all the files under the {ckpt_dir} folder.
synthea: the synthetic data are contained in a file called conditions.csv under the {ckpt_dir}/csv folder.
promptehr: the synthetic data are contained in a file called promptehr-synthetic.npy under the {ckpt_dir} folder.
resample and prevalence-based-random: the synthetic data are contained in a file called {resample/pbr}-synthetic.npy under the {ckpt_dir} folder.

STEP 4: Evaluation

To evaluate the fidelity, utility, and privacy of synthetic EHR data, run:

python -m SynthEHRella.run_evaluation <method> --synthetic_data_path <path_to_synthetic_data> --real_eval_data_path <path_to_real_eval_data> --output_dir <path_to_save_results>

Arguments:

<method>: Method used to generate synthetic data.
<path_to_synthetic_data>: Path to synthetic EHR data.
<path_to_real_eval_data>: Path to real evaluation data (should be .npy with PhecodeX coding, such as mimic3-real-phecodexm.npy or mimic4-real-phecodexm.npy).
<path_to_save_results>: Path to save evaluation results.

Contributing

SynthEHRella is designed to be an open-source benchmark for synthetic EHR generation methods. We welcome all kinds of suggestions, and contributions! Please feel free to open an issue or submit a pull request.

License

SynthEHRella is licensed under the MIT License. See LICENSE for more information.

Acknowledgments

This work was supported through grant DMS1712933 from the National Science Foundation and MI-CARES grant 1UG3CA267907 from the National Cancer Institute. The funders had no role in the design of the study; collection, analysis, or interpretation of the data; writing of the report; or the decision to submit the manuscript for publication.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SynthEHRella

Cite

Package Overview

Preliminary

Obtain MIMIC-III/IV Data

(Optional) Obtain Synthea Software

Installation

Usage

STEP 1: Preprocessing Real EHR Data

STEP 2: Generating Synthetic EHR Data

STEP 3: Post-Processing

STEP 4: Evaluation

Contributing

License

Acknowledgments

About

Releases

Packages

Contributors 3

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
example_commands		example_commands
synthEHRella		synthEHRella
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
environment.yaml		environment.yaml
setup.py		setup.py

License

chenxran/synthEHRella

Folders and files

Latest commit

History

Repository files navigation

SynthEHRella

Cite

Package Overview

Preliminary

Obtain MIMIC-III/IV Data

(Optional) Obtain Synthea Software

Installation

Usage

STEP 1: Preprocessing Real EHR Data

STEP 2: Generating Synthetic EHR Data

STEP 3: Post-Processing

STEP 4: Evaluation

Contributing

License

Acknowledgments

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages