Cleaned Enzeptional #232
Conversation
yvesnana
commented
Nov 22, 2023
- Cleaned Enzeptional
- Refactored both Processing and Core files
Signed-off-by: nanayves <[email protected]>
We are getting there. Please address all remaining comments; it should be straightforward. The CI is failing on black, so ensure the styling is applied before committing.
<!--
MIT License

Copyright (c) 2023 GT4SD team

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
-->
We do not license README.md files.
Move this README.md into a dedicated example folder where we show how to use the framework. You can get inspiration from here: https://github.com/GT4SD/gt4sd-core/tree/main/examples/regression_transformer
## Requirements
- Python 3.6 or higher
- PyTorch
- Hugging Face's Transformers
- TAPE (Tasks Assessing Protein Embeddings)
- NumPy
- Joblib
- Logging module
- xgboost (optional)

## Installation
Ensure all required libraries are installed. You can install them using pip:
```bash
pip install torch transformers numpy joblib xgboost
```
Not needed, this needs to be covered by the toolkit installer.
### Example Usage
```python
# HFandTAPEModelUtility, SequenceMutator and ProteinSequenceOptimizer are
# provided by the Enzeptional modules introduced in this PR.

# Set up model paths
language_model_path = "Rostlab/prot_bert"
tokenizer_path = "Rostlab/prot_bert"
unmasking_model_path = "Rostlab/prot_bert"
chem_model_path = "Rostlab/prot_bert"
chem_tokenizer_path = "Rostlab/prot_bert"

protein_model = HFandTAPEModelUtility(
    embedding_model_path=language_model_path,
    tokenizer_path=tokenizer_path,
)

# Mutation configuration
mutation_config = {
    "type": "language-modeling",
    "embedding_model_path": language_model_path,
    "tokenizer_path": tokenizer_path,
    "unmasking_model_path": unmasking_model_path,
}

# Define parameters
intervals = [[5, 10], [20, 25]]
batch_size = 5
top_k = 3
substrate_smiles = "CCCO"  # Replace with the actual substrate SMILES
product_smiles = "CCCO"  # Replace with the actual product SMILES

# Initialize the sequence mutator
sample_sequence = "WLSNIDMILRSPYSHTGAVLIYKQPDNNEDNIHPSSSMYFDANILIEALSKALVP"
mutator = SequenceMutator(sequence=sample_sequence, mutation_config=mutation_config)

# Initialize the protein sequence optimizer
optimizer = ProteinSequenceOptimizer(
    sequence=sample_sequence,
    protein_model=protein_model,
    substrate_smiles=substrate_smiles,
    product_smiles=product_smiles,
    chem_model_path=chem_model_path,
    chem_tokenizer_path=chem_tokenizer_path,
    mutator=mutator,
    intervals=intervals,
    batch_size=batch_size,
    top_k=top_k,
    selection_ratio=0.5,
    perform_crossover=True,
    crossover_type="single_point",
    concat_order=["substrate", "sequence", "product"],
)

# Run the optimization
optimized_sequences, iteration_info = optimizer.optimize(
    num_iterations=5,
    num_sequences=50,
    num_mutations=5,
    time_budget=3600,
)

# Output the results
for entry in optimized_sequences:
    print(f"Sequence: {entry['sequence']}, Score: {entry['score']}")

print(iteration_info)
```
## Customization
- Modify `intervals` to specify mutation regions in the sequence.
- Adjust `batch_size`, `top_k`, `selection_ratio`, and `crossover_type` for different optimization strategies.
- Change `concat_order` to alter the order of sequence, substrate, and product in the final embedding used for scoring.
- Use `time_budget` to set a maximum time limit for each optimization iteration.

## Notes
- Ensure the paths to the models and tokenizers are correctly set.
- The script is designed for flexibility and can be adapted to different models and optimization strategies.
- For extensive usage, consider parallelizing or distributing the computation, especially for large-scale optimizations.
Consider moving the Python snippet into a dedicated example after moving this README.
sequence (str): The original sequence to be mutated.
num_mutations (int): The number of mutations to introduce.
intervals (List[List[int]]): Intervals within the sequence
Do not type arguments in docstrings, only return types; follow the library standard.
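For illustration, a docstring following that convention could look like the sketch below. The function itself is hypothetical (not part of this PR); the point is that argument types live only in the annotations, while the `Returns:` section keeps its type.

```python
from typing import List


def mutate_intervals(sequence: str, intervals: List[List[int]]) -> str:
    """Uppercase the residues inside the given intervals.

    Args:
        sequence: the original sequence to be mutated.
        intervals: intervals within the sequence, as [start, end) pairs.

    Returns:
        str: the sequence with the residues in the intervals uppercased.
    """
    chars = list(sequence.lower())
    for start, end in intervals:
        for index in range(start, end):
            chars[index] = chars[index].upper()
    return "".join(chars)
```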
def get_device(device: Optional[Union[torch.device, str]] = None) -> torch.device:
    """
    Determines the appropriate torch device for computations.

    Args:
        device (Optional[Union[torch.device, str]]): The desired device
            (e.g., 'cpu' or 'cuda:0'). If None, automatically selects
            the device.

    Returns:
        torch.device: The determined torch device for computations.
    """
    return torch.device(
        "cuda:0" if torch.cuda.is_available() and device != "cpu" else "cpu"
    )
use the methods we already have in the library, specifically this: https://github.com/GT4SD/gt4sd-core/blob/94f009132968ac4558e8256a4afcbf740065c5cd/src/gt4sd/frameworks/torch/__init__.py#L65C58-L65C59
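As a minimal sketch of the intended behavior, the local `get_device` would be replaced by the library helper, which resolves an optional device argument. The name `resolve_device` below is illustrative only (the actual helper lives in the linked `gt4sd.frameworks.torch` module):

```python
from typing import Optional, Union

import torch


def resolve_device(device: Optional[Union[torch.device, str]] = None) -> torch.device:
    """Resolve the computation device, mirroring the library helper's contract.

    Returns:
        torch.device: the resolved torch device.
    """
    if device is not None:
        # Honor an explicit request, given either as a string or a device.
        return torch.device(device)
    # Otherwise fall back to CUDA when available, CPU if not.
    return torch.device("cuda" if torch.cuda.is_available() else "cpu")
```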
@@ -84,6 +84,9 @@ gt4sd =
    training_pipelines/tests/*json
We need to add all Enzeptional dependencies to install_requires without version pins, and update the requirements files with pinned versions to ensure the needed packages are installed.
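A sketch of what the setup.cfg side of this could look like, with package names taken from the README's requirements list (`tape-proteins` is an assumed PyPI name for TAPE; the actual list and names should be confirmed against the code's imports):

```ini
# setup.cfg sketch: unpinned runtime dependencies (illustrative)
[options]
install_requires =
    joblib
    numpy
    tape-proteins
    torch
    transformers
    xgboost
```

The requirements files would then pin exact versions of these same packages for reproducible installs.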