
Embeddings generation #254

Closed
AsyaOrlova opened this issue Dec 5, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

@AsyaOrlova

How can I obtain embeddings from the last hidden state of Regression Transformer?

@AsyaOrlova AsyaOrlova added the enhancement New feature or request label Dec 5, 2024
@jannisborn
Contributor

Hi @AsyaOrlova, thanks for the interest, that is an interesting one.

By default this is not supported; however, here is a recipe.

from gt4sd.algorithms.conditional_generation.regression_transformer import RegressionTransformer, RegressionTransformerMolecules
from selfies import encoder  # this RT works with SELFIES, so we need to convert the SMILES

# Define your target molecule
smi = "CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1"
target = f"<esol>[MASK][MASK][MASK][MASK][MASK]|{encoder(smi)}"

# Set up model config (standard way)
config = RegressionTransformerMolecules(algorithm_version="solubility", search="greedy")
dummy_model = RegressionTransformer(configuration=config, target=target)

That is the standard way; normally you would now run the dummy_model, but that only supports sampling molecules with desired properties and runs many forward passes followed by filtering steps, so there is no way to access the embeddings. In essence you want to run a single forward pass, which happens inside the generate_batch_regression function, but that does not save the predictions/embeddings as an attribute either. Instead, access the model directly:

# Access the underlying model directly
model = config.generator.model.model

# Retrieve tokenizer and collator
tokenizer = config.generator.model.tokenizer
collator = config.generator.collator
inputs = collator([tokenizer(target)])

# Run model standalone
result = model(**inputs, output_hidden_states=True)
print(len(result), result.logits.shape)

result.logits is a tensor of shape BS x T x VocabSize, so there you go!
You can also access result.hidden_states and result.mems, for further details on those variables see the HF docs: https://huggingface.co/docs/transformers/en/model_doc/xlnet
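If you want a single fixed-size embedding per sequence from those outputs, a common approach is to mean-pool the last hidden state over the token dimension. Here is a minimal self-contained sketch with a stand-in tensor; in the real recipe you would use result.hidden_states[-1] and, if the collator provides one, inputs["attention_mask"] (both stand-ins below are assumptions for illustration):

```python
import torch

# Stand-in for result.hidden_states[-1], shape (batch, seq_len, hidden_dim).
# In the recipe above you would use the real tensor instead.
last_hidden = torch.randn(1, 42, 256)

# Stand-in attention mask (1 = real token, 0 = padding). If your batch
# contains an attention mask, use that here instead.
attention_mask = torch.ones(1, 42)

# Mean-pool over non-padded positions to get one vector per sequence
mask = attention_mask.unsqueeze(-1)                    # (1, 42, 1)
embedding = (last_hidden * mask).sum(1) / mask.sum(1)  # (1, 256)
print(embedding.shape)  # torch.Size([1, 256])
```

With no padding this reduces to last_hidden.mean(dim=1); the mask only matters once you batch sequences of different lengths.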

Hope this helps! Closing this as complete but feel free to reopen/comment if needed

@AsyaOrlova
Author

Is there a way of obtaining embeddings with the fully trained model?

@jannisborn
Contributor

This is the fully trained model. You can control the checkpoint by selecting the algorithm_version

@jannisborn
Contributor

To learn more about the different versions, please read the paper or check the explanations in the Gradio app: https://huggingface.co/spaces/GT4SD/regression_transformer

@AsyaOrlova
Author

I see now, thank you!

@AsyaOrlova
Author

I have one more question. I use the "uspto" algorithm version and pass a reaction SMILES string to the model. It fails with this error:

ValueError: The context CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1>>CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1 is not a valid SMILES string.

Am I doing something wrong, or is there no way of calculating embeddings for reactions (not just molecules)?

@jannisborn
Contributor

For reaction fingerprints there are surely better methods, like RXNFP or DRFP.

The reason for the error is that the algorithm is parsing a reaction in a position where it is expecting a molecule.
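For intuition, difference-style reaction fingerprints such as DRFP treat the reaction as a whole: they hash the symmetric difference of substructure sets between reactants and products into a fixed-length bit vector, so a reaction SMILES is a first-class input. A toy, self-contained sketch of that idea (using SMILES character trigrams instead of DRFP's real circular substructures; the function name is illustrative, not the drfp API):

```python
import hashlib

def toy_difference_fingerprint(reactants: str, products: str, n_bits: int = 64):
    """Toy difference fingerprint: hash the symmetric difference of
    substructure sets into a bit vector. Real DRFP extracts circular
    substructures; character trigrams are used here only for illustration."""
    def shingles(smiles: str) -> set:
        return {smiles[i:i + 3] for i in range(max(len(smiles) - 2, 1))}

    diff = shingles(reactants) ^ shingles(products)  # what changed in the reaction
    bits = [0] * n_bits
    for s in diff:
        # Stable hash so the fingerprint is reproducible across runs
        idx = int(hashlib.md5(s.encode()).hexdigest(), 16) % n_bits
        bits[idx] = 1
    return bits

# Identity "reaction": nothing changes, so the fingerprint is all zeros
smi = "CC(C#C)N(C)C(=O)NC1=CC=C(Cl)C=C1"
print(sum(toy_difference_fingerprint(smi, smi)))  # 0
```

Because the representation is built from the reactant/product difference, there is no point where a single molecule parser has to accept the ">>" separator, which is exactly where the RT pipeline fails.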

@AsyaOrlova
Author

Ok, I see. In the Nature Machine Intelligence paper, experiments on predicting reaction yields are provided. That's why I thought it was possible to use the whole reaction as an input.

@jannisborn
Contributor

You can use whole reactions as input and predict, e.g., their yield, yes.

@AsyaOrlova
Author

I'm sorry, I still do not really get why reaction embedding generation is not possible despite the ability to predict yields. Is it due to the RegressionTransformerMolecules config?
