Build a machine learning model to identify protein allergens from amino acid sequences. The goal is to screen annotated genomes for organisms likely to contain many allergens, helping individuals, especially those who are immunocompromised or immunosuppressed, make informed decisions about what to eat.
-
Gather Allergen Amino Acid Sequences:
- Scrape NCBI for amino acid sequences of proteins with the terms "allergen" or "allergenic."
-
Remove Redundant and Ambiguous Sequences:
- Calculate the minimum and maximum allergen sequence lengths
- Use that length range to gather non-allergen amino acid sequences from NCBI's protein database
- Remove redundant and ambiguous sequences
-
Data Mining (Allergens):
- Scrape NCBI for amino acid sequences of proteins with the terms "allergen" or "allergenic."
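A minimal sketch of how the allergen search could be issued through NCBI's E-utilities ESearch endpoint rather than by scraping pages. The source does not say which tool was used, so the function name `allergen_search_url` and the paging defaults are assumptions for illustration; only the URL is built here, and no request is sent.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def allergen_search_url(retmax=500, retstart=0):
    """Build an ESearch URL for protein records matching either search term.

    retmax and retstart are hypothetical paging defaults, not values
    taken from the project.
    """
    params = {
        "db": "protein",
        "term": "allergen OR allergenic",
        "retmax": retmax,
        "retstart": retstart,
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


print(allergen_search_url())
```

The IDs returned by ESearch would then be passed to EFetch with `rettype=fasta` to download the sequences themselves.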
-
Data Cleaning (Allergens):
- Filter out proteins that are not explicitly allergens based on their header
- Visually examine the sequence headers
- Keep proteins that contain "allergen" or "allergenic" in their header and remove those with the word "precursor."
- Remove identical proteins with cd-hit
- Filter out sequences that contain ambiguous or non-standard residue codes (X, B, Z, J, or O) or selenocysteine (U)
- Remove outliers by length
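The cleaning steps above can be sketched in plain Python as follows. Two parts are assumptions rather than the project's method: exact-duplicate removal stands in for cd-hit (which is an external tool and does more than this), and the length-outlier rule uses a 1.5 × IQR fence because the source does not say which rule was applied. All function names are hypothetical.

```python
import statistics

# Residue codes to reject: X, B, Z, J, O, plus U (selenocysteine).
REJECTED_CODES = set("XBZJOU")


def keep_header(header):
    """Keep headers mentioning 'allergen' (this also covers 'allergenic')
    and drop any that mention 'precursor'."""
    h = header.lower()
    return "allergen" in h and "precursor" not in h


def is_unambiguous(seq):
    """True if the sequence contains none of the rejected codes."""
    return not (set(seq.upper()) & REJECTED_CODES)


def length_bounds(lengths, k=1.5):
    """Assumed outlier rule: keep lengths within k * IQR of the quartiles."""
    q1, _, q3 = statistics.quantiles(lengths, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def clean(records):
    """records is a list of (header, sequence) pairs."""
    kept = [(h, s) for h, s in records if keep_header(h)]

    # Stand-in for cd-hit: drop exact duplicate sequences only.
    seen, unique = set(), []
    for h, s in kept:
        if s not in seen:
            seen.add(s)
            unique.append((h, s))

    unique = [(h, s) for h, s in unique if is_unambiguous(s)]
    if len(unique) < 2:  # quantiles need at least two data points
        return unique

    lo, hi = length_bounds([len(s) for _, s in unique])
    return [(h, s) for h, s in unique if lo <= len(s) <= hi]
```

The steps run in the same order as the list: header filter, duplicate removal, residue-code filter, then length outliers.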
-
Data Mining (Non-Allergens):
- Scrape NCBI for amino acid sequences (lengths 11 to 460) of complete proteins from SwissProt without "allergen," "allergenic," "partial," or "precursor."
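One way the non-allergen query might be written as an Entrez search term, with the length range derived from the cleaned allergen set. This is a sketch under assumptions: the field tags (`[PROP]`, `[SLEN]`, `[Title]`) reflect my reading of Entrez query syntax and should be checked against NCBI's protein search help, restricting the excluded words to the title field is a guess at what "without" means here, and both function names are hypothetical.

```python
def length_range(allergen_sequences):
    """Minimum and maximum sequence length in the cleaned allergen set."""
    lengths = [len(s) for s in allergen_sequences]
    return min(lengths), max(lengths)


def non_allergen_term(min_len=11, max_len=460):
    """Build an Entrez term for SwissProt proteins in the given length
    range whose titles lack the excluded words.

    Field tags are assumptions; verify them before relying on the query.
    """
    excluded = ["allergen", "allergenic", "partial", "precursor"]
    exclusions = " ".join(f"NOT {word}[Title]" for word in excluded)
    return f"srcdb_swiss-prot[PROP] AND {min_len}:{max_len}[SLEN] {exclusions}"


print(non_allergen_term())
```

Matching the non-allergen length range to the allergen set keeps sequence length from becoming a trivial separator between the two classes.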
-
Data Cleaning (Non-Allergens):
- Filter out sequences that contain ambiguous or non-standard residue codes (X, B, Z, J, or O) or selenocysteine (U)
- Sequence length
- Physicochemical properties
- Amino acid proportions
- Sources of allergens
- Check occurrences of each category
- Analysis by sequence length
- Amino acid profile/abundance of each dataset
- Preprocessing
- Model Training
- Optimize for AUC, F1, and recall
- model_knn and tuned_AUC_model_xgboost perform best
- Since the objective is to identify allergens in genomes to decide what not to eat, a missed allergen costs more than a false alarm, so higher recall (tuned_AUC_model_xgboost) may be preferable to higher precision (model_knn). tuned_AUC_model_xgboost also has a slightly higher AUC, which is worth taking into account because the data are unbalanced
- Test the best models on out-of-sample sequences.
- Collect genomes from foods and use the models on them.
- Come up with an "allergen index" for each genome.
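The per-sequence features named earlier (sequence length, physicochemical properties, amino acid proportions) could be computed along these lines. The source does not list which physicochemical properties were used, so the GRAVY score (mean Kyte-Doolittle hydropathy) appears here only as one illustrative example; the function name and feature keys are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale, used only as an example property.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq):
    """Return length, per-residue proportions, and GRAVY for one sequence.

    Assumes the sequence contains only the 20 standard residues,
    as it would after the cleaning steps.
    """
    seq = seq.upper()
    n = len(seq)
    counts = Counter(seq)
    features = {"length": n}
    for aa in AMINO_ACIDS:
        features[f"frac_{aa}"] = counts[aa] / n if n else 0.0
    features["gravy"] = (
        sum(KYTE_DOOLITTLE[aa] for aa in seq) / n if n else 0.0
    )
    return features
```

Each sequence becomes a fixed-length row of 22 numbers, which is the tabular shape that models such as k-nearest neighbours or XGBoost expect.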
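The "allergen index" is still to be defined, so the following is only one candidate definition offered as a starting point: the fraction of a genome's proteins whose predicted allergen probability meets a threshold. The function name, the 0.5 default, and the choice of a simple fraction are all assumptions.

```python
def allergen_index(probabilities, threshold=0.5):
    """Candidate index: share of proteins predicted to be allergens.

    probabilities holds one predicted allergen probability per protein
    in the genome. Returns 0.0 for an empty genome.
    """
    if not probabilities:
        return 0.0
    flagged = sum(p >= threshold for p in probabilities)
    return flagged / len(probabilities)


# Toy example with made-up probabilities for a four-protein genome.
print(allergen_index([0.9, 0.2, 0.6, 0.1]))  # 0.5
```

Lowering `threshold` flags more proteins, which favours recall in line with the preference stated above, at the cost of inflating the index for every genome.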