Protein-Ligand Binding Prediction

Project Description

This project develops a machine learning pipeline that predicts binding affinities for protein-ligand pairs identified by UniProt and PubChem IDs. The approach covers exploratory data analysis, feature extraction, model training, and evaluation. The goal is to predict binding affinities accurately on a held-out test set using open-source libraries, while keeping the code modular, reproducible, and well documented. The scientific project description is in the Summary folder.

Problem Statement

The project aims to:

Build a model that predicts binding between protein/molecule pairs, starting from UniProt ID and PubChem ID pairs with confirmed binding affinity.
Expand the dataset with synthetic negative examples to balance training.
Extract additional features to enhance model performance.
Use auxiliary datasets and cutting-edge ML techniques for optimal results.

Workflow Overview

Notebook 1: Data Preparation (`ML_protein_binnding_data_prep.ipynb`)

Objectives

Perform exploratory data analysis (EDA) and clean the dataset.
Generate synthetic negative examples by creating non-binding protein-ligand pairs.
Extract features for proteins using UniProt UniRep and ligands using PubChem data.
Generate low-dimensional representations of proteins using the Protein-Bert model, and Morgan fingerprints for ligands.
Save the processed dataset for downstream modeling.
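
A minimal sketch of the synthetic-negative objective above, assuming negatives are drawn by randomly pairing proteins and ligands that never appear together among the confirmed binders. The function name `sample_negative_pairs` and all IDs are made-up placeholders, not taken from the notebook; the notebook's actual sampling strategy may differ.

```python
import random


def sample_negative_pairs(positive_pairs, n_negatives, seed=0):
    """Draw random (protein, ligand) pairs that are absent from the positives.

    Hypothetical helper for illustration only.
    """
    positives = set(positive_pairs)
    proteins = sorted({protein for protein, _ in positives})
    ligands = sorted({ligand for _, ligand in positives})

    # Number of distinct protein-ligand combinations not observed as binders.
    max_negatives = len(proteins) * len(ligands) - len(positives)
    if n_negatives > max_negatives:
        raise ValueError(f"only {max_negatives} unobserved pairs are available")

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(proteins), rng.choice(ligands))
        if pair not in positives:
            negatives.add(pair)
    return sorted(negatives)


# Placeholder IDs standing in for UniProt accessions and PubChem CIDs.
positive_pairs = [("PROT_A", 101), ("PROT_A", 102), ("PROT_B", 101), ("PROT_C", 103)]
negative_pairs = sample_negative_pairs(positive_pairs, n_negatives=4)

# Label 1 = confirmed binder, label 0 = synthetic negative.
labelled = [(p, l, 1) for p, l in positive_pairs] + [(p, l, 0) for p, l in negative_pairs]
```

Note that an unobserved pair is not guaranteed to be a true non-binder, so labels produced this way carry some noise.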

Key Steps

EDA and Cleaning: Handle duplicates and inconsistencies in the dataset.
Feature Extraction:
- Proteins: Extract embeddings using the Protein-Bert model.
- Ligands: Generate molecular fingerprints using RDKit and PubChemPy.
Output:
- A cleaned and feature-enriched dataset ready for model training.
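
One way the feature-enriched dataset could be assembled is sketched below, assuming each pair's feature row is the protein embedding concatenated with the ligand fingerprint. That concatenation, the helper name `build_feature_rows`, and the tiny vectors are illustrative assumptions; the real embeddings and fingerprints are far longer.

```python
def build_feature_rows(pairs, protein_embeddings, ligand_fingerprints):
    """Concatenate protein and ligand features for each pair.

    Pairs with a missing embedding or fingerprint are skipped, so the
    kept pairs are returned alongside the rows to stay aligned.
    Hypothetical helper for illustration only.
    """
    kept_pairs, rows = [], []
    for protein_id, ligand_id in pairs:
        embedding = protein_embeddings.get(protein_id)
        fingerprint = ligand_fingerprints.get(ligand_id)
        if embedding is None or fingerprint is None:
            continue  # feature extraction failed for one side of the pair
        rows.append(list(embedding) + list(fingerprint))
        kept_pairs.append((protein_id, ligand_id))
    return kept_pairs, rows


# Toy stand-ins: 3-dimensional embeddings and 4-bit fingerprints.
protein_embeddings = {"PROT_A": [0.1, -0.4, 0.7], "PROT_B": [0.3, 0.2, -0.5]}
ligand_fingerprints = {101: [1, 0, 0, 1], 102: [0, 1, 1, 0]}
pairs = [("PROT_A", 101), ("PROT_B", 102), ("PROT_C", 101)]

kept_pairs, feature_rows = build_feature_rows(pairs, protein_embeddings, ligand_fingerprints)
```

Here `("PROT_C", 101)` is dropped because no embedding exists for `PROT_C`, which mirrors the cleaning step for pairs whose features cannot be retrieved.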

Dependencies

Libraries: pandas, numpy, seaborn, matplotlib, RDKit, PubChemPy, transformers.

Notebook 2: Model Training and Evaluation (`ML_protein_binnding_model_building.ipynb`)

Objectives

Train multiple models (Logistic Regression, Random Forest, XGBoost, and a Neural Network) on the prepared dataset.
Evaluate model performance and identify the best-performing approach.
Compare results as a function of dataset size and model complexity.

Key Steps

Model Training:
- Train models with scikit-learn, XGBoost, and PyTorch on the low-dimensional protein representations and Morgan fingerprints.
Evaluation:
- Use accuracy, precision, recall, and F1 score as metrics.
- Visualize performance comparisons across models.
Output:
- Trained models and performance metrics for each approach.
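
For reference, the four evaluation metrics listed above can be computed from confusion-matrix counts as in the sketch below. The helper `binary_metrics` and the toy labels are illustrative, not the notebook's code; scikit-learn's `sklearn.metrics` module provides equivalent functions (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`).

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = binder)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

Because the synthetic negatives set the class balance, precision and recall are usually more informative than accuracy alone when comparing the models.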

Dependencies

Libraries: pandas, numpy, seaborn, matplotlib, transformers, torch, sklearn, xgboost.

How to Use

1. Setup

Install the dependencies first. Each notebook has its own requirements file (requirements_data_prep.txt, requirements_model_build.txt); substitute the one you need for requirements.txt in the command below:

pip install -r requirements.txt

2. Run Data Preparation

Open ML_protein_binnding_data_prep.ipynb.
Execute all cells sequentially to generate the processed dataset.

3. Train Models

Open ML_protein_binnding_model_building.ipynb.
Load the processed dataset from the previous step.
Execute the notebook to train and evaluate models.

4. Inference on Test Set

Save your test set in the expected format.
Load the trained model pipeline.
Run the inference script to predict binding affinities.

Project Highlights

Feature Engineering: Automated feature extraction with Protein-Bert embeddings for proteins and RDKit Morgan fingerprints for ligands.
Reproducibility: Clear separation of data preparation and modeling pipelines.
Model Comparisons: Comprehensive performance analysis across multiple algorithms.

Future Directions

Incorporate additional auxiliary datasets to enhance feature quality.
Explore advanced architectures like transformer-based models for protein-ligand embeddings.
Optimize the pipeline for large-scale datasets using distributed processing.

Repository: abuchin/ML_protein_ligand_prediction
