
Embeddings generation #254

Closed
AsyaOrlova opened this issue Dec 5, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

@AsyaOrlova

How can I obtain embeddings from the last hidden state of Regression Transformer?

@AsyaOrlova AsyaOrlova added the enhancement New feature or request label Dec 5, 2024
@jannisborn
Contributor

Hi @AsyaOrlova, thanks for the interest, that is an interesting one.

By default this is not supported; however, here is a recipe.

from gt4sd.algorithms.conditional_generation.regression_transformer import RegressionTransformer, RegressionTransformerMolecules
from selfies import encoder  # this RT works with SELFIES, so we need to convert the SMILES

# Define your target molecule
smi = "CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1"
target = f"<esol>[MASK][MASK][MASK][MASK][MASK]|{encoder(smi)}"

# Set up model config (standard way)
config = RegressionTransformerMolecules(algorithm_version="solubility", search="greedy")
dummy_model = RegressionTransformer(configuration=config, target=target)

That is the standard way; normally you would now run the dummy_model, but that only supports sampling molecules with desired properties and runs many forward passes followed by filtering steps, so there is no way to access the embeddings. In essence you want to run a single forward pass, which happens inside the generate_batch_regression function, but that does not save the predictions/embeddings as an attribute either. Instead, access the model directly:

# Access the underlying model directly
model = config.generator.model.model

# Retrieve tokenizer and collator
tokenizer = config.generator.model.tokenizer
collator = config.generator.collator
inputs = collator([tokenizer(target)])

# Run model standalone
result = model(**inputs, output_hidden_states=True)
print(len(result), result.logits.shape)

result.logits is a tensor of shape BS x T x VocabSize, so there you go!
You can also access result.hidden_states and result.mems, for further details on those variables see the HF docs: https://huggingface.co/docs/transformers/en/model_doc/xlnet
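If you want a single fixed-size embedding per sequence from those outputs, a common approach is to mean-pool the last hidden state over the token dimension. Here is a minimal self-contained sketch with a stand-in tensor; in the real recipe you would use result.hidden_states[-1] and, if the collator provides one, inputs["attention_mask"] (both stand-ins below are assumptions for illustration):

```python
import torch

# Stand-in for result.hidden_states[-1], shape (batch, seq_len, hidden_dim).
# In the recipe above you would use the real tensor instead.
last_hidden = torch.randn(1, 42, 256)

# Stand-in attention mask (1 = real token, 0 = padding). If your batch
# contains an attention mask, use that here instead.
attention_mask = torch.ones(1, 42)

# Mean-pool over non-padded positions to get one vector per sequence
mask = attention_mask.unsqueeze(-1)                    # (1, 42, 1)
embedding = (last_hidden * mask).sum(1) / mask.sum(1)  # (1, 256)
print(embedding.shape)  # torch.Size([1, 256])
```

With no padding this reduces to last_hidden.mean(dim=1); the mask only matters once you batch sequences of different lengths.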

Hope this helps! Closing this as complete but feel free to reopen/comment if needed

@AsyaOrlova
Author

Is there a way of obtaining embeddings with the fully trained model?

@jannisborn
Contributor

This is the fully trained model. You can control the checkpoint by selecting the algorithm_version

@jannisborn
Contributor

To learn more about the different versions, please read the paper or check the explanations in the Gradio app: https://huggingface.co/spaces/GT4SD/regression_transformer

@AsyaOrlova
Author

I see now, thank you!

@AsyaOrlova
Author

I have one more question. I use the "uspto" algorithm version and pass a reaction SMILES string to the model. It fails with this error:

ValueError: The context CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1>>CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1 is not a valid SMILES string.

Am I doing something wrong, or is there no way of calculating embeddings for reactions (not just molecules)?

@jannisborn
Contributor

For reaction fingerprints there are surely better methods, like RXNFP or DRFP.

The reason for the error is that the algorithm is parsing a reaction in a position where it is expecting a molecule.
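For intuition, difference-style reaction fingerprints such as DRFP treat the reaction as a whole: they hash the symmetric difference of substructure sets between reactants and products into a fixed-length bit vector, so a reaction SMILES is a first-class input. A toy, self-contained sketch of that idea (using SMILES character trigrams instead of DRFP's real circular substructures; the function name is illustrative, not the drfp API):

```python
import hashlib

def toy_difference_fingerprint(reactants: str, products: str, n_bits: int = 64):
    """Toy difference fingerprint: hash the symmetric difference of
    substructure sets into a bit vector. Real DRFP extracts circular
    substructures; character trigrams are used here only for illustration."""
    def shingles(smiles: str) -> set:
        return {smiles[i:i + 3] for i in range(max(len(smiles) - 2, 1))}

    diff = shingles(reactants) ^ shingles(products)  # what changed in the reaction
    bits = [0] * n_bits
    for s in diff:
        # Stable hash so the fingerprint is reproducible across runs
        idx = int(hashlib.md5(s.encode()).hexdigest(), 16) % n_bits
        bits[idx] = 1
    return bits

# Identity "reaction": nothing changes, so the fingerprint is all zeros
smi = "CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1"
print(sum(toy_difference_fingerprint(smi, smi)))  # 0
```

Because the representation is built from the reactant/product difference, there is no point where a single molecule parser has to accept the ">>" separator, which is exactly where the RT pipeline fails.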

@AsyaOrlova
Author

Ok, I see. In the Nature Machine Intelligence paper, experiments on predicting reaction yields are provided. That's why I thought it was possible to use the whole reaction as an input.

@jannisborn
Contributor

You can use whole reactions as input and predict, e.g., their yield, yes.

@AsyaOrlova
Author

I'm sorry, I still do not really get why reaction embedding generation is not possible despite the ability to predict yields. Is it due to the RegressionTransformerMolecules config?
