This demo code is for our manuscript "Machine Learning for Identifying Data-Driven Subphenotypes of Incident Post-Acute SARS-CoV-2 Infection Conditions with Large Scale Electronic Health Records: Findings from the RECOVER"

HaoZhangXidian/Subphenotyping-for-PASC

Subphenotyping-for-PASC

If you have any questions, please contact: [email protected]

Step 1 Data preparation

Because the dataset used in the study is privacy-protected, this demo code provides only a simulated dataset. For the real data, the INSIGHT data can be requested through https://insightcrn.org/ and the OneFlorida+ data can be requested through https://onefloridaconsortium.org. Both the INSIGHT and the OneFlorida+ data are HIPAA-limited, so data use agreements must be established with the INSIGHT and OneFlorida+ networks.

After data preparation from the raw EHR tables, we obtain a data matrix of size N*137, where N is the number of patients in the cohort and 137 is the number of PASC conditions. This is a binary matrix in which the element in the i-th row and j-th column indicates whether the i-th patient has the j-th PASC condition in the post-acute period of SARS-CoV-2 infection. The simulated dataset is available at "https://drive.google.com/file/d/1ZN_hIiDfazCHOGl1GHQMNBXEkwk5a_TH/view?usp=sharing". After downloading, please put it in the folder: "./dataset/"
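As a rough illustration of the expected input, the sketch below builds a random binary patient-by-condition matrix with NumPy and runs a few sanity checks. The cohort size, the 5% prevalence, and all variable names are assumptions made for this example only; the simulated dataset linked above may use a different file format and layout.

```python
import numpy as np

N_PATIENTS = 1000    # hypothetical cohort size (N)
N_CONDITIONS = 137   # number of PASC conditions, as described above

rng = np.random.default_rng(seed=0)

# Entry (i, j) is 1 if patient i has condition j in the post-acute period, else 0.
# The 5% prevalence is an arbitrary choice for this toy example.
X = (rng.random((N_PATIENTS, N_CONDITIONS)) < 0.05).astype(np.int8)

# Sanity checks that a prepared matrix of this kind should pass.
assert X.shape == (N_PATIENTS, N_CONDITIONS)
assert np.isin(X, (0, 1)).all()

# Simple summaries: how common each condition is, and how many each patient has.
prevalence = X.mean(axis=0)              # shape (137,)
conditions_per_patient = X.sum(axis=1)   # shape (N,)
```

The same checks can be applied to the downloaded simulated dataset once it has been loaded into an array.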

Step 2 Train the Topic model

We prepared both Python and Matlab code for training the topic model. The Python code is built on the Pydpm package (https://github.com/BoChenGroup/pydpm).

For Python, please run "./Python code for training topic modeling/Main_train_topic_model.py"

For Matlab, please run "./Matlab code for training topic modeling/main_PFA.m"

The parameter K in both codes is the number of topics. In our study, we set it to 10.

After training, the trained model (topics and topic proportions) is saved in "./trained_topic_model/".
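To make "topics" and "topic proportions" concrete, here is a minimal NumPy sketch that factorizes a binary patient-by-condition matrix into K non-negative topics. It is not the repository's training code and does not use Pydpm: it swaps in non-negative matrix factorization with multiplicative updates under a Poisson likelihood, a maximum-likelihood relative of Poisson factor analysis. The function names, iteration count, and toy data are assumptions for illustration.

```python
import numpy as np

def poisson_nll(X, theta, phi, eps=1e-10):
    """Negative Poisson log-likelihood of X given rates theta @ phi (up to a constant)."""
    rate = theta @ phi + eps
    return float(np.sum(rate - X * np.log(rate)))

def fit_poisson_nmf(X, K=10, n_iter=200, seed=0, eps=1e-10):
    """Factorize non-negative X (N x V) as theta (N x K) @ phi (K x V)."""
    rng = np.random.default_rng(seed)
    N, V = X.shape
    theta = rng.random((N, K)) + eps
    phi = rng.random((K, V)) + eps
    for _ in range(n_iter):
        # Multiplicative updates that do not increase the Poisson (KL) objective.
        ratio = X / (theta @ phi + eps)
        theta *= (ratio @ phi.T) / (phi.sum(axis=1) + eps)
        ratio = X / (theta @ phi + eps)
        phi *= (theta.T @ ratio) / (theta.sum(axis=0)[:, None] + eps)
    # Rescale so each topic (row of phi) is a distribution over conditions;
    # moving the scale into theta leaves the product theta @ phi unchanged.
    scale = phi.sum(axis=1)
    phi /= scale[:, None]
    theta *= scale
    return theta, phi

# Toy data: 200 hypothetical patients, 137 conditions, 5% prevalence.
rng = np.random.default_rng(1)
X_toy = (rng.random((200, 137)) < 0.05).astype(float)

theta, phi = fit_poisson_nmf(X_toy, K=10)

# Per-patient topic proportions: normalize each row of theta (guarding empty rows).
proportions = theta / np.maximum(theta.sum(axis=1, keepdims=True), 1e-10)
```

Each row of `phi` plays the role of a topic over the 137 conditions, and each row of `proportions` plays the role of one patient's topic proportions, which is the kind of input the later clustering step works on.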

Step 3 Visualize the topic

To reproduce Figure 2 in our manuscript, please refer to the code "./Python code for analysis/visualize_topic.py"

Step 4 Perform hierarchical clustering

To identify the subphenotypes based on topic proportions, please run the code "./Python code for analysis/perform_clustering.py". It will also reproduce Supplemental Figure 5 (UMAP and dendrogram).
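The clustering idea can be sketched with SciPy as follows. This is an illustration only, not the contents of perform_clustering.py: the Ward linkage, the number of clusters, and the randomly generated topic proportions are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Placeholder topic proportions: 300 hypothetical patients x 10 topics, rows sum to 1.
proportions = rng.dirichlet(alpha=np.full(10, 0.3), size=300)

# Agglomerative clustering; Ward linkage on Euclidean distance is an assumption here.
Z = linkage(proportions, method="ward")

# Cut the tree into a fixed number of clusters (4 is a hypothetical choice).
N_CLUSTERS = 4
labels = fcluster(Z, t=N_CLUSTERS, criterion="maxclust")

# Mean topic proportions per cluster give a first look at each candidate subphenotype.
for c in np.unique(labels):
    members = proportions[labels == c]
    print(c, len(members), members.mean(axis=0).round(2))
```

The linkage matrix `Z` can also be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree when a plotting backend is available.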

To reproduce Figure 3, please run "./Python code for analysis/show_circle_plot_PASC.py"

To reproduce Figure 4, please run "./Python code for analysis/show_circle_plot_med.py"
