
Commit fbc70a6

Merge pull request #6 from ZhouGengmo/add_data_process

add data preprocess file; add training workflow

2 parents: 0ad37fb + e1c5289

9 files changed: +368 −143 lines

README.md (+74 −3)
### Usage

#### Dependencies

The dependencies of Uni-p*K*<sub>a</sub> are the same as those of Uni-Mol:

- [Uni-Core](https://github.com/dptech-corp/Uni-Core); check its [Installation Documentation](https://github.com/dptech-corp/Uni-Core#installation).
- rdkit==2022.9.3, install via `pip install rdkit-pypi==2022.9.3`

The recommended environment is the docker image:

```
docker pull dptechnology/unimol:latest-pytorch1.11.0-cuda11.3
```

See details in the [Uni-Mol](https://github.com/dptech-corp/Uni-Mol/tree/main/unimol#dependencies) repository.
### Ready-to-run training workflow
#### Data

The raw data can be downloaded from [AISSquare](https://www.aissquare.com/datasets/detail?pageType=datasets&name=Uni-pKa-Dataset).

#### Pretrain with ChEMBL

First, preprocess the ChEMBL training and validation sets, and then pretrain the model:

```bash
# Preprocess the training set
python ./scripts/preprocess_pka.py --raw-csv-file Datasets/tsv/chembl_train.tsv --processed-lmdb-dir chembl --task-name train

# Preprocess the validation set
python ./scripts/preprocess_pka.py --raw-csv-file Datasets/tsv/chembl_valid.tsv --processed-lmdb-dir chembl --task-name valid

# Copy the necessary dict file
cp -r unimol/examples/* chembl

# Pretrain the model
bash pretrain_pka.sh
```

Note: the `head_name` in the subsequent scripts must match the `task_name` in `pretrain_pka.sh`.
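That naming constraint can be illustrated with a hypothetical value; the actual string is whatever `task_name` is set to in your copy of `pretrain_pka.sh`:

```shell
# Hypothetical names for illustration only -- check your own scripts.
task_name="unimol_pka"   # as set in pretrain_pka.sh (assumed value)
head_name="unimol_pka"   # must be identical in finetune_pka.sh and infer_pka.sh
```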
#### Finetune with dwar-iBond

Next, preprocess the dwar-iBond dataset and finetune the model:

```bash
# Preprocess
python ./scripts/preprocess_pka.py --raw-csv-file Datasets/tsv/dwar-iBond.tsv --processed-lmdb-dir dwar --task-name dwar-iBond

# Copy the necessary dict file
cp -r unimol/examples/* dwar

# Finetune the model
bash finetune_pka.sh
```
#### Infer p*K*<sub>a</sub>

Infer with the finetuned model, taking novartis_acid as an example:

```bash
# Preprocess
python ./scripts/preprocess_pka.py --raw-csv-file Datasets/tsv/novartis_acid.tsv --processed-lmdb-dir novartis_acid --task-name novartis_acid

# Copy the necessary examples from unimol
cp -r unimol/examples/* novartis_acid

# Run inference
bash infer_pka.sh
```

To test with other external test datasets, it may be necessary to modify `data_path`, `infer_task`, and `results_path` in `infer_pka.sh`.
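Those three variables can be sketched for a hypothetical external test set (`my_testset` is a placeholder, not a dataset shipped with the repository):

```shell
# Hypothetical edit inside infer_pka.sh -- the variable names come from the
# text above; the values are placeholders for your own dataset.
data_path="./my_testset"              # directory with the preprocessed LMDB files
infer_task="my_testset"               # should match the --task-name used in preprocessing
results_path="./my_testset_results"   # where the per-fold .pkl predictions are written
```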
#### Obtain the result files and calculate the metrics

After inference, extract the results to CSV files and calculate the performance metrics (e.g., MAE, RMSE) on the results:

```bash
python ./scripts/infer_mean_ensemble.py --task pka --nfolds 5 --results-path novartis_acid_results
```

The metrics are calculated using the average of the 5-fold model predictions.
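The averaging and metric computation can be sketched with synthetic numbers (the SMILES and values below are made up; real inputs come from the per-fold CSV files the script writes):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for five folds of predictions on two molecules.
folds = [
    pd.DataFrame({
        "smiles": ["CCO", "CC(=O)O"],
        "predict": [15.9 + 0.1 * i, 4.7 + 0.05 * i],
        "target": [15.9, 4.76],
    })
    for i in range(5)
]

# Stack all folds, then average the predictions per molecule,
# mirroring the groupby-mean step in infer_mean_ensemble.py.
combined = pd.concat(folds, ignore_index=True)
mean_results = combined.groupby("smiles", as_index=False).agg(
    {"predict": "mean", "target": "mean"}
)

# MAE and RMSE on the ensemble-averaged predictions.
err = mean_results["predict"] - mean_results["target"]
mae = np.abs(err).mean()
rmse = np.sqrt((err ** 2).mean())
print(f"MAE: {mae:.4f}, RMSE: {rmse:.4f}")
```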

scripts/finetune_pka.sh (−51)

This file was deleted.

scripts/infer_free_energy.sh (−27)

This file was deleted.

scripts/infer_mean_ensemble.py (+86)

```python
# Copyright (c) DP Technology.
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import glob
import os

import numpy as np
import pandas as pd


def cal_metrics(df):
    """Compute MAE and RMSE between the predict and target columns."""
    mae = np.abs(df["predict"] - df["target"]).mean()
    mse = ((df["predict"] - df["target"]) ** 2).mean()
    rmse = np.sqrt(mse)
    return mae, rmse


def get_csv_results(results_path, nfolds, task):
    all_smi_list, all_predict_list, all_target_list = [], [], []

    for fold_idx in range(nfolds):
        print(f"Processing fold {fold_idx}...")
        fold_path = os.path.join(results_path, f"fold_{fold_idx}")
        pkl_files = glob.glob(f"{fold_path}/*.pkl")
        fold_data = pd.read_pickle(pkl_files[0])

        smi_list, predict_list, target_list = [], [], []
        for batch in fold_data:
            sz = batch["bsz"]
            for i in range(sz):
                smi_list.append(batch["smi_name"][i])
                predict_list.append(batch["predict"][i].cpu().item())
                target_list.append(batch["target"][i].cpu().item())
        fold_df = pd.DataFrame({"smiles": smi_list, "predict": predict_list, "target": target_list})
        fold_df.to_csv(f"{fold_path}/fold_{fold_idx}.csv", index=False, sep="\t")

        # Accumulate for the final combined results
        all_smi_list.extend(smi_list)
        all_predict_list.extend(predict_list)
        all_target_list.extend(target_list)

    print(f"Combining results from {nfolds} folds into a single file...")
    combined_df = pd.DataFrame({"smiles": all_smi_list, "predict": all_predict_list, "target": all_target_list})
    combined_df.to_csv(f"{results_path}/all_results.csv", index=False, sep="\t")

    print("Calculating mean results for each SMILES...")
    mean_results = combined_df.groupby("smiles", as_index=False).agg({
        "predict": "mean",
        "target": "mean",
    })
    mean_results.to_csv(f"{results_path}/mean_results.csv", index=False, sep="\t")
    if task == "pka":
        print("MAE and RMSE for this task...")
        mae, rmse = cal_metrics(mean_results)
        print(f"MAE: {round(mae, 4)}, RMSE: {round(rmse, 4)}")
    print("Done!")


def main():
    parser = argparse.ArgumentParser(description="Model infer result mean ensemble")
    parser.add_argument(
        "--results-path",
        type=str,
        default="results",
        help="path to the saved infer results",
    )
    parser.add_argument(
        "--nfolds",
        default=5,
        type=int,
        help="cross validation split folds",
    )
    parser.add_argument(
        "--task",
        default="pka",
        type=str,
        choices=["pka", "free_energy"],
    )
    args = parser.parse_args()
    get_csv_results(args.results_path, args.nfolds, args.task)


if __name__ == "__main__":
    main()
```
