
Cleaned Enzeptional #232

Closed · wants to merge 12 commits
Conversation

yvesnana (Contributor)

  • Cleaned Enzeptional
  • Refactored both Processing and Core files

@cla-bot added the cla-signed (CLA has been signed) label on Nov 22, 2023
@yvesnana requested a review from @drugilsberg on Nov 22, 2023 at 16:01
@drugilsberg (Contributor) left a comment:

We are getting there. Please address all remaining comments; it should be straightforward. The CI is failing on black, so make sure the styling is applied before committing.
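For reference, a typical way to run black locally before committing (the `src/` path is an assumption about the repository layout):

```bash
pip install black
black src/          # apply the formatting
black --check src/  # verify, as the CI does
```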

Comment on lines +1 to +24
<!--
MIT License

Copyright (c) 2023 GT4SD team

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
-->


We do not license README.md files.

Suggested change: remove the license header block.


Move this README.md into a dedicated examples folder where we show how to use the framework. You can get inspiration from here: https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer
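For example (all paths are assumptions, mirroring the layout of the linked regression_transformer example):

```bash
mkdir -p examples/enzeptional
git mv src/gt4sd/frameworks/enzeptional/README.md examples/enzeptional/README.md
```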

Comment on lines +30 to +43
## Requirements
- Python 3.6 or higher
- PyTorch
- Hugging Face's Transformers
- TAPE (Tasks Assessing Protein Embeddings)
- NumPy
- Joblib
- Logging module
- xgboost (optional)

## Installation
Ensure all required libraries are installed. You can install them using pip:
```bash
pip install torch transformers numpy joblib xgboost
```

No need; this needs to be covered by the toolkit installer.

Suggested change: remove the Requirements and Installation sections.

Comment on lines +61 to +137
### Example Usage
```python
# NOTE: import paths are assumed from this PR's processing and core modules
from gt4sd.frameworks.enzeptional.processing import HFandTAPEModelUtility
from gt4sd.frameworks.enzeptional.core import SequenceMutator, ProteinSequenceOptimizer

# Set up model paths
language_model_path = "Rostlab/prot_bert"
tokenizer_path = "Rostlab/prot_bert"
unmasking_model_path = "Rostlab/prot_bert"
chem_model_path = "Rostlab/prot_bert"
chem_tokenizer_path = "Rostlab/prot_bert"

protein_model = HFandTAPEModelUtility(
    embedding_model_path=language_model_path,
    tokenizer_path=tokenizer_path,
)

# Mutation configuration
mutation_config = {
    "type": "language-modeling",
    "embedding_model_path": language_model_path,
    "tokenizer_path": tokenizer_path,
    "unmasking_model_path": unmasking_model_path,
}

# Define parameters
intervals = [[5, 10], [20, 25]]
batch_size = 5
top_k = 3
substrate_smiles = "CCCO"  # Replace with the actual substrate SMILES
product_smiles = "CCCO"  # Replace with the actual product SMILES

# Initialize the sequence mutator
sample_sequence = "WLSNIDMILRSPYSHTGAVLIYKQPDNNEDNIHPSSSMYFDANILIEALSKALVP"
mutator = SequenceMutator(sequence=sample_sequence, mutation_config=mutation_config)

# Initialize the protein sequence optimizer
optimizer = ProteinSequenceOptimizer(
    sequence=sample_sequence,
    protein_model=protein_model,
    substrate_smiles=substrate_smiles,
    product_smiles=product_smiles,
    chem_model_path=chem_model_path,
    chem_tokenizer_path=chem_tokenizer_path,
    mutator=mutator,
    intervals=intervals,
    batch_size=batch_size,
    top_k=top_k,
    selection_ratio=0.5,
    perform_crossover=True,
    crossover_type="single_point",
    concat_order=["substrate", "sequence", "product"],
)

# Run the optimization
optimized_sequences, iteration_info = optimizer.optimize(
    num_iterations=5,
    num_sequences=50,
    num_mutations=5,
    time_budget=3600,
)

# Output results
for i in optimized_sequences:
    seq = i["sequence"]
    score = i["score"]
    print(f"Sequence: {seq}, Score: {score}")

print(iteration_info)
```

## Customization
- Modify `intervals` to specify mutation regions in the sequence.
- Adjust `batch_size`, `top_k`, `selection_ratio`, and `crossover_type` for different optimization strategies; see the sketch after this list.
- Change `concat_order` to alter the order of sequence, substrate, and product in the final embedding for scoring.
- Use `time_budget` to set a maximum time limit for each optimization iteration.
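
As referenced above, a sketch of a variant configuration exercising these knobs (names reused from the example; the `two_point` crossover value and all parameter values are illustrative assumptions):

```python
optimizer = ProteinSequenceOptimizer(
    sequence=sample_sequence,
    protein_model=protein_model,
    substrate_smiles=substrate_smiles,
    product_smiles=product_smiles,
    chem_model_path=chem_model_path,
    chem_tokenizer_path=chem_tokenizer_path,
    mutator=mutator,
    intervals=[[0, 15]],         # restrict mutations to the first 15 residues
    batch_size=10,
    top_k=5,
    selection_ratio=0.25,        # keep only the top quarter of candidates
    perform_crossover=True,
    crossover_type="two_point",  # assumption: supported alongside "single_point"
    concat_order=["sequence", "substrate", "product"],
)
```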

## Notes
- Ensure the paths to the models and tokenizers are correctly set.
- The script is designed for flexibility and can be adapted to different models and optimization strategies.
- For extensive usage, consider parallelizing or distributing the computation, especially for large-scale optimizations; a sketch follows this list.
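
On the parallelization note, a minimal joblib sketch (joblib is already in the requirements above; `run_one_optimization` is a hypothetical wrapper around the example):

```python
from joblib import Parallel, delayed

def run_one_optimization(seed: int):
    # Hypothetical wrapper: build the mutator and optimizer as in the
    # example above, then return optimizer.optimize(...) for this seed.
    ...

# Run four independent optimizations in parallel processes.
results = Parallel(n_jobs=4)(delayed(run_one_optimization)(s) for s in range(4))
```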


Consider moving the Python snippet into a dedicated example after moving this.

Comment on lines +135 to +137
```python
sequence (str): The original sequence to be mutated.
num_mutations (int): The number of mutations to introduce.
intervals (List[List[int]]): Intervals within the sequence
```

Do not type arguments in docstrings, only return types; follow the library standard.
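
A sketch of the corrected form for the snippet above (the function name and signature are hypothetical; the convention shown, argument types only in the signature plus a typed Returns block, is assumed to be the library standard):

```python
from typing import List

def mutate_sequence(
    sequence: str, num_mutations: int, intervals: List[List[int]]
) -> str:
    """Mutates a sequence within the given intervals.

    Args:
        sequence: the original sequence to be mutated.
        num_mutations: the number of mutations to introduce.
        intervals: intervals within the sequence eligible for mutation.

    Returns:
        str: the mutated sequence.
    """
    ...
```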

Comment on lines +88 to +102
```python
def get_device(device: Optional[Union[torch.device, str]] = None) -> torch.device:
    """
    Determines the appropriate torch device for computations.

    Args:
        device (Optional[Union[torch.device, str]]): The desired device
            (e.g., 'cpu' or 'cuda:0'). If None, automatically selects the device.

    Returns:
        torch.device: The determined torch device for computations.
    """
    return torch.device(
        "cuda:0" if torch.cuda.is_available() and device != "cpu" else "cpu"
    )
```
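
For reference, hypothetical usage of the helper above, illustrating its fallback logic:

```python
import torch

device = get_device()         # cuda:0 when a GPU is available, cpu otherwise
cpu_only = get_device("cpu")  # always cpu, even when a GPU is present
tensor = torch.zeros(3).to(device)
```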

```
@@ -84,6 +84,9 @@ gt4sd =
    training_pipelines/tests/*json
```

We need to add all enzeptional dependencies to install_requires without version pins, and update the requirements files with the corresponding versions, to ensure the needed packages are installed.
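
A sketch of what that could look like, with dependency names taken from the README's requirements list (the exact setup.cfg layout and the `tape-proteins` package name are assumptions); the requirements file would then pin each entry to a concrete version:

```ini
# setup.cfg (sketch): unpinned enzeptional dependencies
[options]
install_requires =
    joblib
    numpy
    tape-proteins
    torch
    transformers
    xgboost
```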

@yvesnana closed this on Mar 4, 2024
Labels: cla-signed (CLA has been signed)

3 participants