ML Allergen Classifier

Objective

Build a machine learning model to identify protein allergens from amino acid sequences. The goal is to screen annotated genomes for organisms likely to contain many allergens, helping individuals, especially those who are immunocompromised or immunosuppressed, make informed decisions about what to eat.

Building the Allergens and Non-Allergens Sequence Datasets

Gather Allergen Amino Acid Sequences:
- Scrape NCBI for amino acid sequences of proteins annotated with the terms "allergen" or "allergenic."
Gather Non-Allergen Amino Acid Sequences:
- Calculate the minimum and maximum allergen sequence lengths.
- Use that length range to gather non-allergen amino acid sequences from NCBI's protein database.
Remove Redundant and Ambiguous Sequences:
- Remove redundant and ambiguous sequences.
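
The length-matching step can be sketched as follows. This is a minimal illustration, assuming the sequences are already loaded as plain strings; `length_range` and `within_range` are hypothetical helper names, not functions taken from the notebook, and the sequences are toy data.

```python
def length_range(sequences):
    """Return (shortest, longest) sequence length in the collection."""
    lengths = [len(seq) for seq in sequences]
    return min(lengths), max(lengths)


def within_range(sequences, min_len, max_len):
    """Keep only sequences whose length falls inside [min_len, max_len]."""
    return [seq for seq in sequences if min_len <= len(seq) <= max_len]


# Toy data for illustration only, not real allergen sequences.
allergens = ["MKTAYIAKQR", "MSE", "MKVLAAGIVGLCA"]
candidates = ["MA", "MKTLLVAG", "M" * 50]

min_len, max_len = length_range(allergens)
matched = within_range(candidates, min_len, max_len)
```

Matching the length range keeps sequence length alone from separating the two classes, which would otherwise let a model score well without learning anything about allergenicity.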
Data Mining (Allergens):
- Scrape NCBI for amino acid sequences of proteins with the terms "allergen" or "allergenic."
Data Cleaning (Allergens):
- Filter out proteins that are not explicitly allergens based on their header
- Visually examine the sequence headers
- Keep proteins that contain "allergen" or "allergenic" in their header and remove those with the word "precursor."
- Remove identical proteins with CD-HIT
- Filter out sequences that contain ambiguous or non-standard residue codes (X, B, Z, J, or O) or selenocysteine (U)
- Remove outliers by length
Data Mining (Non-Allergens):
- Scrape NCBI for amino acid sequences (lengths 11 to 460) of complete proteins from Swiss-Prot, excluding entries that mention "allergen," "allergenic," "partial," or "precursor."
Data Cleaning (Non-Allergens):
- Filter out sequences that contain ambiguous or non-standard residue codes (X, B, Z, J, or O) or selenocysteine (U)
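
The header and residue filters described above might look roughly like the sketch below. It is an assumption-laden illustration rather than the notebook's code: the function names are hypothetical, the FASTA records are made up, and exact-duplicate removal with a set only stands in for CD-HIT, which is an external tool that clusters by sequence identity. Length-based outlier removal is left out because the README does not state which rule was used.

```python
# Residue codes excluded during cleaning: X, B, Z, J, O and selenocysteine (U).
EXCLUDED_RESIDUES = set("XBZJOU")


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def is_explicit_allergen(header):
    """True if the header mentions 'allergen'/'allergenic' but not 'precursor'."""
    lowered = header.lower()
    return "allergen" in lowered and "precursor" not in lowered


def has_excluded_residue(sequence):
    """True if the sequence contains any excluded residue code."""
    return any(residue in EXCLUDED_RESIDUES for residue in sequence)


def clean_allergens(records):
    """Apply the header filter, the residue filter, and exact de-duplication."""
    seen, kept = set(), []
    for header, sequence in records:
        if not is_explicit_allergen(header):
            continue
        if has_excluded_residue(sequence):
            continue
        if sequence in seen:  # exact duplicates only; CD-HIT does more than this
            continue
        seen.add(sequence)
        kept.append((header, sequence))
    return kept


# Made-up records for illustration only.
toy_fasta = """>id1 example allergen A
MKTAYIAKQR
>id2 example allergen precursor
MKTLLVAG
>id3 example allergenic protein
MKXLLVAG
>id4 example allergen A copy
MKTAYIAKQR
>id5 hypothetical protein
MSEQNNTEMT
"""

cleaned = clean_allergens(parse_fasta(toy_fasta))
```

Here only the first record survives: the others are dropped for mentioning "precursor", containing X, duplicating an earlier sequence, or lacking an allergen keyword.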

Feature Extraction

Sequence length
Physicochemical properties
Amino acid proportions
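
As a rough sketch of these three feature groups, the example below computes sequence length, the proportion of each of the 20 standard amino acids, and one physicochemical property. The README does not list which physicochemical properties were used, so the Kyte-Doolittle average hydropathy (GRAVY) is chosen here purely as an assumed example; `extract_features` is a hypothetical name. The function expects a non-empty sequence that has already been cleaned of non-standard residues.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale, used here as one example property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def extract_features(sequence):
    """Return a dict with length, per-residue proportions, and GRAVY."""
    counts = Counter(sequence)
    length = len(sequence)
    features = {"length": length}
    for aa in AMINO_ACIDS:
        features[f"prop_{aa}"] = counts[aa] / length
    # GRAVY: mean hydropathy over all residues in the sequence.
    features["gravy"] = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length
    return features


features = extract_features("AAGL")
```

Using proportions rather than raw counts keeps the composition features comparable across sequences of different lengths.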

Exploratory Data Analysis

Sources of allergens
Check Occurrences of Each Category
Analysis by sequence length
Amino Acid Profile/Abundance of Each Dataset

Cleaning

Importance of Features
Remove Correlated Features
Plot PCA of the Final Variables
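
One common way to remove correlated features is a greedy pass over the Pearson correlation between columns, sketched below in plain Python. The 0.9 threshold, the function names, and the toy feature columns are all assumptions for illustration; the README does not say which method or cutoff the notebook uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy feature table: mol_weight is an exact multiple of length.
columns = {
    "length": [100, 150, 200, 250],
    "mol_weight": [11000, 16500, 22000, 27500],
    "gravy": [0.3, -0.2, 0.1, -0.4],
}
selected = drop_correlated(columns)
```

Because the pass is greedy, the result depends on column order: the first feature of a correlated pair is the one that is kept.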

Model Building

Preprocessing
Model Training
Optimizing for AUC, F1, and Recall
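
To make the three optimization targets concrete, the sketch below computes precision, recall, F1, and AUC from scratch on a small, made-up imbalanced set. It is not the notebook's training code, which presumably relies on a library; the function names, the 0.5 decision threshold, and the scores are illustrative assumptions.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores):
    """AUC as the chance a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = allergen) and model scores; 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Recall counts how many true allergens are caught, precision counts how many flagged proteins are truly allergens, and AUC is computed from the ranking of scores, so it does not depend on the decision threshold.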

Model Comparison

Conclusion

model_knn and tuned_AUC_model_xgboost perform the best.
Since the objective is to identify allergens in genomes in order to decide what not to eat, higher recall (tuned_AUC_model_xgboost) may be preferred over higher precision (model_knn): missing a true allergen is costlier than flagging a harmless protein. tuned_AUC_model_xgboost also has a slightly higher AUC, which is worth taking into account when the data are imbalanced.

To Do

Test the best models on out-of-sample sequences.
Collect genomes from foods and use the models on them.
Come up with an "allergen index" for each genome.

manuelgug/ML_Allergen_Classifier
