A Python package for synthetic Electronic Health Records (EHR) data generation benchmarking.
Citation | |
---|---|
systematic review paper | Chen X, Wu Z*, Shi X, Cho H, Mukherjee B |
synthEHRella/
│
├── synthEHRella/
│ ├── __init__.py
│ ├── data/
| | ├── methods/
│ | | ├── cor-gan
│ | | ├── plasmode
│ | | ├── synthea
│ | | └── ...
│ │ ├── __init__.py
│ │ ├── data_generator.py
│ │ ├── utils.py
│ ├── evaluation/
│ │ ├── __init__.py
│ │ ├── fidelity.py
│ │ ├── utility.py
│ │ └── privacy.py
│ ├── utils/
│ │ ├── __init__.py
│ │ ├── data_transform.py
│ │ └── ... (mapping files)
│ ├── run_generation.py
│ ├── run_evaluation.py
│ ├── run_postprocessing.py
│ └── run_preprocessing.py
├── setup.py
├── LICENSE
├── README.md
└── environmental.yaml
You should have the following files before installing the package.
MIMIC-III and MIMIC-IV data are open source and available through PhysioNet. Request access here:
After obtaining the data, you can preprocess it for use with SynthEHRella. Note that in the paper, we used MIMIC-III v1.4 and MIMIC-IV v2.2 for the evaluation.
NOTE: Synthea software is optional and only required if you want to generate synthetic data using Synthea.
Synthea is an open-source synthetic EHR data generation method that generates synthetic data by simulating the history of synthetic patients. You can download the software from their Github Repository. The file you need to download is called synthea-with-dependencies.jar
. After obtaining the software, you could place it in the synthEHRella/synthEHRella/data/methods/synthea
folder.
Before installing the package, make sure you have the following dependencies installed:
conda env create -f environment.yaml
To install the synthEHRella
package, run the following command:
pip install .
To preprocess real EHR data, run:
python -m synthEHRella.run_preprocessing <admissionFile> <diagnosisFile> <patientsFile> <outPath> binary <dataset>
Arguments:
-
<admissionFile>
,<diagnosisFile>
,<patientsFile>
: Paths to the admission, diagnosis, and patients files in MIMIC-III/IV. -
<outPath>
: Path to save the preprocessed data. -
<dataset>
: Name of the dataset (i.e.,mimic3
,mimic4
).
The preprocessing will output multiple files:
For mimic3
:
processed_mimic3.matrix
: Matrix-format EHR data coded by ICD-9.processed_mimic3.pid
: Patient IDs (most of the time we don't need this).processed_mimic3.types
: Disease types for each dimension.mimic3-real-phecodex.npy
: EHR data in PhecodeX coding (used for visualization in the paper).mimic3-real-phecodexm.npy
: EHR data in PhecodeX coding with higher hierachy (for quantitative evaluation in the paper).
For mimic4
:
mimic4-real-phecodex.npy
: EHR data in PhecodeX coding (used for visualization in the paper).mimic4-real-phecodexm.npy
: EHR data in PhecodeX coding with higher hierachy (for quantitative evaluation in the paper).
The processed MIMIC-III dataset will be used as the training data for all the methods except ehrdiff
. For ehrdiff
, the training data provided by their authors are used to train the model to guarantee only minor change on their codebase. The data can be downloaded from their Github Repository.
To generate synthetic EHR data, run:
python -m synthEHRella.run_generation <method> --params data.real_training_data_path=<path_to_real_data> generation.ckpt_dir=<path_to_save_ckpt> generation.params.num_gen_samples=<num_gen_samples>
Arguments:
-
<method>
: The method for generating synthetic data. -
<path_to_real_data>
: Path to preprocessed real EHR data. -
<path_to_save_ckpt>
: Directory to save the generated files (e.g., model checkpoints, synthetic data). -
<num_gen_samples>
: Number of synthetic samples to generate.
Currently, the following methods are supported:
-
corgan
: CorGAN -
plasmode
: Plasmode -
synthea
: Synthea -
ehrdiff
: EHRDiff -
medgan
: MedGAN -
vae
: VAE -
promptehr
: PromptEHR -
resample
: Resample (baseline methods) -
prevalence-based-random
: Prevalence-based Random (baseline methods)
To transform synthetic data from ICD-9 / SNOMED-CT to PhecodeX, run:
python -m synthEHRella.run_postprocessing <method> --synthetic_data_path <path_to_synthetic_data> --output_path <path_to_save_results>
Arguments:
-
<method>
: Method used to generate synthetic data. -
<path_to_synthetic_data>
: Path to the synthetic EHR data. -
<path_to_save_results>
: Path to save the post-processed results.
Instruction on finding synthetic data path for each method:
-
corgan
,medgan
, andvae
: the synthetic data are contained in a.npy
file under the{ckpt_dir}
set in generation calledsynthetic-{gen_sample_size}.npy
.gen_sample_size
is the number of samples generated. -
ehrdiff
: the synthetic data are contained in a file called 'all_x.npy' under the '{ckpt_dir}/samples' folder. -
plasmode
: the synthetic data are all the files under the{ckpt_dir}
folder. -
synthea
: the synthetic data are contained in a file calledconditions.csv
under the{ckpt_dir}/csv
folder. -
promptehr
: the synthetic data are contained in a file calledpromptehr-synthetic.npy
under the{ckpt_dir}
folder. -
resample
andprevalence-based-random
: the synthetic data are contained in a file called{resample/pbr}-synthetic.npy
under the{ckpt_dir}
folder.
To evaluate the fidelity, utility, and privacy of synthetic EHR data, run:
python -m SynthEHRella.run_evaluation <method> --synthetic_data_path <path_to_synthetic_data> --real_eval_data_path <path_to_real_eval_data> --output_dir <path_to_save_results>
Arguments:
-
<method>
: Method used to generate synthetic data. -
<path_to_synthetic_data>
: Path to synthetic EHR data. -
<path_to_real_eval_data>
: Path to real evaluation data (should be .npy with PhecodeX coding, such as mimic3-real-phecodexm.npy or mimic4-real-phecodexm.npy). -
<path_to_save_results>
: Path to save evaluation results.
SynthEHRella is designed to be an open-source benchmark for synthetic EHR generation methods. We welcome all kinds of suggestions, and contributions! Please feel free to open an issue or submit a pull request.
SynthEHRella is licensed under the MIT License. See LICENSE for more information.
This work was supported through grant DMS1712933 from the National Science Foundation and MI-CARES grant 1UG3CA267907 from the National Cancer Institute. The funders had no role in the design of the study; collection, analysis, or interpretation of the data; writing of the report; or the decision to submit the manuscript for publication.