Commit

changed version tracking
highlighted in v2 file
also reordered pics
Damien Coupry committed Jan 13, 2022
1 parent c1e2d57 commit cc96e47
Showing 6 changed files with 1 addition and 1 deletion.
File renamed without changes.
File renamed without changes.
File renamed without changes.
Binary file modified Paper/chemdist.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion Paper/chemdist.tex
@@ -308,7 +308,7 @@ \subsection*{Failure points of circular fingerprints}
% Adding more explanation
The usual dissimilarity cutoff values for ECFP4 fingerprints are between 0.2 and 0.4 (anything below this is considered similar). At these low values (structures 2 and 3 in Figure~\ref{fig:Similarity_study_cases}) the triplet embedding distance agrees well with the ECFP4 dissimilarity. Structures 6, 9-12, 17 and 20 are largely dissimilar according to ECFP4, with a dissimilarity of at least 0.8. As we can see, the triplet embedding distance discriminates between these structures much more than ECFP4 does. It generally prefers aromatic structures with similar arrangements over aliphatic rings, which is expected from the nature of reduced graphs. Structures with 5-membered aromatic rings (e.g. structures 13-16) are closer to the reference in the triplet embedding than the similarly arranged structures with at least 2 aliphatic rings (structures 9-12). This is not so clear in the case of ECFP4, which does not distinguish between structures 9 and 10 (both have an ECFP4 dissimilarity of 0.80, whereas the triplet embedding clearly shows that structures with more aliphatic rings are less similar: 42.03 vs. 62.87 for 9 and 10, respectively) and creates a large difference between the similar structures 7 and 8 (0.71 vs. 0.50 for ECFP4). The second most dissimilar structure based on ECFP4 is structure 6 (with a large dissimilarity of 0.92), whereas the triplet embedding shows only a moderate distance (12.32). This discrepancy is not surprising: although the arrangement of the ring systems is the same and the molecular shape is similar, the non-featurized ECFP4 only registers that the rings changed completely between the reference and structure 6, and it does not find much similarity between the benzene and triazine rings.

To show further differences between ECFP4 and the triplet embedding, a randomly selected set of 100,000 triplets not used in the training process was used to calculate both the ECFP4 dissimilarities and the triplet embedding distances of the positive and negative controls with respect to the reference (anchor). The experiment showed that both the ECFP4 dissimilarity and the triplet embedding determined the correct order (the positive control has a lower distance than the negative control) for 89,133 triplets, showing that both work well in most cases. Not surprisingly, ECFP4 failed more often (9,911 cases), whereas the triplet embedding failed in only 956 cases. There are 428 cases where both failed. Although this is not a quantitative performance investigation of the two distance metrics, it can give us insight into their weak points. In Figure~\ref{fig:Unseen_Fails} we show four examples (the whole list is in the GitHub repository) where one of the descriptors failed to give the correct order. In the case of triplet 1), ECFP4 predicted that the negative control (right-hand side) is closer to the reference than the positive one (middle). Since in the negative control the left-hand side of the molecule changes (a 4-membered ring is changed to a 6-membered ring), from a chemical series point of view this change is larger than the changes in the side chains seen in the positive control.
Triplet 2) shows a similar case, where ECFP4 fails to give the correct order. Here the failure is caused both by feature repetition and by a relatively small change in ring size. In the negative control, the piperidine ring appears twice in the molecule. The ECFP4 used here (and in many virtual screening and similarity searching experiments) does not contain feature counts, therefore its sensitivity to feature repetition is low (see Figure~2 in reference~\cite{flower1998properties}). Triplet 3) is also an example where the ring size changes, but here the ECFP4 dissimilarities are almost identical, even though the negative control has not only changes in the side chains but also a linker extension, a ring extension and a pyrazole ring changed to an imidazole ring.
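The feature-repetition effect described for triplet 2) can be illustrated with a minimal Python sketch. It assumes the dissimilarity is one minus the Tanimoto coefficient, and it uses plain strings as stand-ins for hashed ECFP4 environments; the feature lists and function names are hypothetical and are not taken from the paper or its repository.

```python
from collections import Counter


def binary_dissimilarity(a, b):
    """1 - Tanimoto on the sets of distinct features (presence/absence only)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def count_dissimilarity(a, b):
    """1 - Tanimoto on feature counts (min/max generalisation for multisets)."""
    ca, cb = Counter(a), Counter(b)
    keys = set(ca) | set(cb)
    denom = sum(max(ca[k], cb[k]) for k in keys)
    if denom == 0:
        return 0.0
    numer = sum(min(ca[k], cb[k]) for k in keys)
    return 1.0 - numer / denom


# Hypothetical feature lists: the second "molecule" repeats one ring feature.
one_ring = ["piperidine", "amide", "phenyl"]
two_rings = ["piperidine", "piperidine", "amide", "phenyl"]

print(binary_dissimilarity(one_ring, two_rings))  # 0.0
print(count_dissimilarity(one_ring, two_rings))   # 0.25
```

Without counts the repeated ring is invisible to the comparison, whereas the count-based variant registers it. Real circular fingerprints also encode atom environments around each ring, so the effect on actual molecules is less clear-cut than in this toy case.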
A different case is triplet 4), where the triplet embedding failed to determine the correct order. As can be seen, both the negative and the positive controls have larger changes: although the two rings on the right-hand side are the same for the positive control, their connection is different. The amide bond is reversed in both the positive and negative controls compared with the reference structure, the linker has the same length but different groups, and the left-hand side ring system is largely different in both controls. Both ECFP4 and the triplet embedding gave a larger distance for these two structures. The insensitivity to the orientation of amide groups is a well-known issue of reduced graphs. Triplet 4) can be considered a bad example, since there are large changes in the core of the structure for both the positive and the negative control. A large part of the structures for which the triplet embedding failed are similar to this, i.e. neither the positive nor the negative controls are particularly close to the reference, and in some cases they are more similar to each other than to the reference. Better preparation of the training set might solve part of the issues, but a small number of ``wrong'' examples might always get into the data set.
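The ordering check used in the held-out triplet experiment (a metric is correct when the positive control is closer to the anchor than the negative control) could be tallied roughly as follows. This is only a sketch: the record layout, the function name and the category labels are assumptions, ties are treated as failures, and the distances are made up for illustration rather than taken from the paper.

```python
from collections import Counter


def tally_triplet_orderings(records):
    """Count, per triplet, which metric ranks the positive control closer.

    Each record is a 4-tuple of distances to the anchor:
    (fp_pos, fp_neg, emb_pos, emb_neg). A metric is counted as correct
    only when its positive-control distance is strictly lower than its
    negative-control distance.
    """
    counts = Counter()
    for fp_pos, fp_neg, emb_pos, emb_neg in records:
        fp_ok = fp_pos < fp_neg
        emb_ok = emb_pos < emb_neg
        if fp_ok and emb_ok:
            counts["both_correct"] += 1
        elif emb_ok:
            counts["only_fingerprint_failed"] += 1
        elif fp_ok:
            counts["only_embedding_failed"] += 1
        else:
            counts["both_failed"] += 1
    return counts


# Made-up distances for illustration only (not values from the paper).
toy = [
    (0.30, 0.70, 5.0, 40.0),   # both metrics order the controls correctly
    (0.60, 0.55, 8.0, 35.0),   # only the fingerprint fails
    (0.40, 0.80, 30.0, 25.0),  # only the embedding fails
    (0.75, 0.70, 28.0, 22.0),  # both fail
]
print(dict(tally_triplet_orderings(toy)))
```

The four categories are mutually exclusive by construction, so their counts always sum to the number of triplets evaluated.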

Binary file modified Paper/chemdist_vtrack_2.pdf
Binary file not shown.
