This project develops a machine learning pipeline to predict binding affinities between protein-ligand pairs, identified by the provided UniProt and PubChem IDs. The approach covers exploratory data analysis, feature extraction, model training, and evaluation. The goal is to accurately predict binding affinities on a held-out test set using open-source libraries, while keeping the code modular, reproducible, and well documented. The scientific project description is in the Summary folder.
The project aims to:
- Build a model to predict binding between protein/molecule pairs using UniProt IDs and PubChem IDs with confirmed binding affinity.
- Expand the dataset with synthetic negative examples to balance training.
- Extract additional features to enhance model performance.
- Use auxiliary datasets and cutting-edge ML techniques for optimal results.
- Perform exploratory data analysis (EDA) and clean the dataset.
- Generate synthetic negative examples by creating non-binding protein-ligand pairs.
- Extract features for proteins using UniProt UniRep and ligands using PubChem data.
- Generate low-dimensional representations of proteins using the Protein-Bert model and Morgan fingerprints for ligands.
- Save the processed dataset for downstream modeling.
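The synthetic negative generation described above can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: it assumes that any protein-ligand combination absent from the confirmed binders can be treated as non-binding, which may introduce some false negatives. The function name `sample_negative_pairs` and the example IDs are hypothetical placeholders, not real binding data.

```python
import random


def sample_negative_pairs(positive_pairs, n_negatives, seed=0):
    """Sample (protein, ligand) pairs that are not among the known binders.

    Assumption: a pair that is absent from the positives is non-binding.
    """
    positives = set(positive_pairs)
    proteins = sorted({protein for protein, _ in positives})
    ligands = sorted({ligand for _, ligand in positives})

    # Number of distinct protein-ligand combinations that are not positives.
    max_negatives = len(proteins) * len(ligands) - len(positives)
    if n_negatives > max_negatives:
        raise ValueError(
            f"Requested {n_negatives} negatives, only {max_negatives} available"
        )

    rng = random.Random(seed)  # fixed seed keeps the sampling reproducible
    negatives = set()
    # Rejection sampling: draw random combinations and keep the unseen ones.
    while len(negatives) < n_negatives:
        pair = (rng.choice(proteins), rng.choice(ligands))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)


# Placeholder IDs for illustration only.
positives = [("PROT_A", 101), ("PROT_A", 102), ("PROT_B", 102), ("PROT_C", 103)]
negatives = sample_negative_pairs(positives, n_negatives=len(positives))
```

Requesting as many negatives as positives gives a balanced training set. Rejection sampling is simple but slows down when most combinations are already positives, so a dense dataset may call for enumerating the unseen pairs instead.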
- EDA and Cleaning: Handle duplicates and inconsistencies in the dataset.
- Feature Extraction:
- Proteins: Extract embeddings using the Protein-Bert model.
- Ligands: Generate molecular fingerprints using RDKit and PubChemPy.
- Output:
- A cleaned and feature-enriched dataset ready for model training.
- Libraries: pandas, numpy, seaborn, matplotlib, RDKit, PubChemPy, transformers.
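As an illustration of the ligand featurization step, the sketch below turns a SMILES string into a Morgan fingerprint with RDKit. It assumes the SMILES has already been retrieved for each PubChem ID (for example via PubChemPy). The radius of 2 and the 2048-bit length are common defaults chosen here for illustration and are not confirmed settings of this project. The helper name `smiles_to_morgan` is hypothetical.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator


def smiles_to_morgan(smiles, radius=2, n_bits=2048):
    """Return a Morgan fingerprint as a 0/1 NumPy array, or None if parsing fails."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        # RDKit returns None for SMILES strings it cannot parse.
        return None
    generator = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    return generator.GetFingerprintAsNumPy(mol)


# Ethanol as a small, well-known example molecule.
fingerprint = smiles_to_morgan("CCO")
```

Returning `None` for unparseable input lets the caller drop or log those ligands during cleaning rather than failing mid-pipeline.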
- Train multiple models (Logistic Regression, Random Forest, XGBoost, and a Neural Network) on the prepared dataset.
- Evaluate model performance and identify the best-performing approach.
- Compare results as a function of dataset size and model complexity.
- Model Training:
- Train models using scikit-learn, XGBoost, and PyTorch on the low-dimensional protein embeddings and Morgan fingerprint representations.
- Evaluation:
- Use accuracy, precision, recall, and F1 score metrics.
- Visualize performance comparisons across models.
- Output:
- Trained models and performance metrics for each approach.
- Libraries: pandas, numpy, seaborn, matplotlib, transformers, torch, sklearn, xgboost.
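To make the evaluation metrics concrete, the dependency-free sketch below spells out how accuracy, precision, recall, and F1 follow from the confusion counts of a binary binder/non-binder classifier. In practice the `sklearn.metrics` functions would normally be used. The function name `binary_classification_metrics` and the example labels are illustrative assumptions, not results from this project.

```python
def binary_classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for 0/1 labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # binders found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-binders rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed binders

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = binding pair, 0 = synthetic negative.
metrics = binary_classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 1, 0],
    y_pred=[1, 0, 1, 0, 1, 0, 1, 0],
)
```

Because the negatives are synthetic, precision and recall are usually more informative than accuracy alone when the positive-to-negative ratio differs between the training and test sets.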
Install the dependencies before running the notebooks. Each notebook has its own requirements file (requirements_data_prep.txt, requirements_model_build.txt):
pip install -r requirements_data_prep.txt
pip install -r requirements_model_build.txt
- Open ML_protein_binnding_data_prep.ipynb.
- Execute all cells sequentially to generate the processed dataset.
- Open ML_protein_binnding_model_building.ipynb.
- Load the processed dataset from the previous step.
- Execute the notebook to train and evaluate models.
- Save your test set in the expected format.
- Load the trained model pipeline.
- Run the inference script to predict binding affinities.
- Feature Engineering: Automated feature extraction using Protein-Bert embeddings for proteins and RDKit Morgan fingerprints for ligands.
- Reproducibility: Clear separation of data preparation and modeling pipelines.
- Model Comparisons: Comprehensive performance analysis across multiple algorithms.
- Incorporate additional auxiliary datasets to enhance feature quality.
- Explore advanced architectures like transformer-based models for protein-ligand embeddings.
- Optimize the pipeline for large-scale datasets using distributed processing.