Embeddings generation #254
Comments
Hi @AsyaOrlova, thanks for the interest, that is an interesting one. This is not supported by default; however, here is a recipe.
from gt4sd.algorithms.conditional_generation.regression_transformer import RegressionTransformer, RegressionTransformerMolecules
from selfies import encoder # This RT works with selfies so we need to convert
# Define your target molecule
smi = "CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1"
target = f"<esol>[MASK][MASK][MASK][MASK][MASK]|{encoder(smi)}"
# Set up model config (standard way)
config = RegressionTransformerMolecules(algorithm_version="solubility", search="greedy")
dummy_model = RegressionTransformer(configuration=config, target=target)
# That is the standard way; normally you would now run the model. Here we instead retrieve the underlying model directly:
model = config.generator.model.model
# Retrieve tokenizer and collator
tokenizer = config.generator.model.tokenizer
collator = config.generator.collator
inputs = collator([tokenizer(target)])
# Run model standalone
result = model(**inputs, output_hidden_states=True)
print(len(result), result.logits.shape)
Hope this helps! Closing this as complete, but feel free to reopen/comment if needed.
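The recipe above prints the logits, while the question asks for an embedding from the last hidden state. A minimal sketch of one common way to reduce per-token hidden vectors to a single embedding (masked mean pooling) follows. It is not part of GT4SD: `mean_pool` is a hypothetical helper, the toy numbers are made up for illustration, and it assumes the model output follows the Hugging Face convention of exposing `hidden_states` as a tuple of per-layer tensors laid out as (batch, sequence, hidden).

```python
from typing import List, Optional, Sequence


def mean_pool(
    token_vectors: Sequence[Sequence[float]],
    mask: Optional[Sequence[int]] = None,
) -> List[float]:
    """Average the token vectors of one sequence, skipping positions where mask is 0.

    Hypothetical helper for illustration; not a GT4SD or transformers function.
    """
    if mask is None:
        mask = [1] * len(token_vectors)
    if len(mask) != len(token_vectors):
        raise ValueError("mask and token_vectors must have the same length")
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every token")
    hidden_size = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(hidden_size)]


# Toy stand-in for one sequence of the last hidden state: 3 tokens, hidden size 2.
toy_last_hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]  # the last position is treated as padding and ignored
embedding = mean_pool(toy_last_hidden, toy_mask)
print(embedding)  # [2.0, 3.0]
```

Under the assumptions above, the last layer for the first sequence in the recipe would be something like `result.hidden_states[-1][0].detach().tolist()`, which could then be passed to `mean_pool`. Whether mean pooling or a single token position gives the more useful embedding for this model is not settled in this thread.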
Is there a way of obtaining embeddings with the fully trained model?
This is the fully trained model. You can control the checkpoint by selecting the
To learn more about the different versions, please read the paper or check the explanations in the GradIO app: https://huggingface.co/spaces/GT4SD/regression_transformer
I see now, thank you!
I have one more question. I use the "uspto" algorithm version and pass a reaction SMILES string to the model. It fails with the error: ValueError: The context CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1>>CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1 is not a valid SMILES string. Am I doing it wrong, or is there no way of calculating embeddings for reactions (not just molecules)?
For reaction fingerprints there are surely better methods, like RXNFP or DRFP. The reason for the error is that the algorithm is parsing a reaction in a position where it is expecting a molecule.
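To make the distinction concrete: a reaction SMILES has the form `reactants>agents>products`, with molecules inside each role separated by `.`, so it cannot validate as a single molecule. The sketch below shows one way to detect and split such a string before handing anything to a molecule-only input. Both functions are hypothetical helpers written for this thread, not GT4SD API, and they assume plain reaction SMILES with no extra annotations.

```python
from typing import Dict, List


def is_reaction_smiles(text: str) -> bool:
    """Return True if the string has the two '>' separators of a reaction SMILES."""
    return text.count(">") == 2


def split_reaction_smiles(reaction: str) -> Dict[str, List[str]]:
    """Split 'reactants>agents>products' into per-role lists of molecule SMILES.

    Hypothetical helper for illustration; it does not validate the molecules.
    """
    parts = reaction.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected 'reactants>agents>products', got {reaction!r}")
    roles = ("reactants", "agents", "products")
    return {
        role: [smi for smi in part.split(".") if smi]
        for role, part in zip(roles, parts)
    }


smi = "CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1"
reaction = f"{smi}>>{smi}"  # the string from the error message above

print(is_reaction_smiles(smi))       # False
print(is_reaction_smiles(reaction))  # True
components = split_reaction_smiles(reaction)
print(components["reactants"], components["agents"], components["products"])
```

Each molecule in `components` could then be checked or embedded individually, although that yields per-molecule embeddings rather than a reaction-level one.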
Ok, I see. The Nature Machine Intelligence paper reports experiments on predicting reaction yields. That’s why I thought it was possible to use the whole reaction as an input.
You can use whole reactions as input and predict, e.g., their yield, yes.
I’m sorry, I still do not really understand why generating reaction embeddings is not possible despite the ability to predict yields. Is it due to the RegressionTransformerMolecules config?
How can I obtain embeddings from the last hidden state of Regression Transformer?