-
Notifications
You must be signed in to change notification settings - Fork 39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Which CIGAR is returned when multiple alignments have equal scores? #59
Comments
Hi, Let me answer the question in a different order:
Usually, the traceback is implemented to retrieve only one of the optimal alignments (as there can be many). Each implementation takes arbitrary decisions on how to solve ties. Thus, each implementation "prefers" a different optimal alignment. In the case of WFA2, "priorities" are hardcoded here. As you see, WFA2 prefers deletions in the query from insertions.
It seems that you want to go beyond minimizing the edit distance. From all the edit-optimal alignments (let's say with distance No, I haven't implemented that here, sorry. Chances are that, inverting the query and text arguments, the returned WFA2 CIGARs are closer to what you are expecting (or redefining the priorities here). |
Thanks for the reply! That pointer to the code definitely helps me to understand. To clarify a little, for me "query" and "text" are interchangeable; I have a large number of sequences, and I want to find pairwise ID between all of them where the distance is less than a threshold. Thus the "alignment length" I am concerned with is defined symmetrically for the two sequences, If I understand correctly, the priorities you linked to are such that a higher number means that WFA2 prefers to select that operation during traceback? So when a mismatch is a valid option, it chooses that; if no mismatch is a valid option it chooses a deletion, and then if no deletion is possible it chooses an insertion? If I am reading the corresponding code in edlib correctly, this is exactly opposite to the prioritization of edlib, and that makes sense given the difference in the results I get between them. WFA2 prioritizes mismatches over indels, so at each step it is greedily "trying" to make the alignment as short as possible; edlib prioritizes indels over mismatches, so at each step it is greedily "trying" to make the alignment as long as possible. Neither priority is guaranteed to lead to the true minimum/maximum alignment length, since sometimes choosing an indel instead of a mismatch now will lead to an opportunity later in the traceback to take two mismatches instead of two indels; actually maximizing or minimizing the length would require visiting all cells that can be on any score-optimal alignment path, and WPA2 does not do that (nor edlib) because it is unnecessary for most uses. I can envision a modified algorithm that tracks the minimum and maximum length for each cell during the initial calculation, and could then report those along with the optimum score without needing to do a traceback. I may try to implement that at some point, but for now I think you've answered my question. Thanks! |
Yes.
In case of ties (i.e., the traceback, at a given position, can choose between multiple operations; I, D, X), the WFA backtrace chooses X first, then D, then I.
I don't know more than you. Perhaps @Martinsos had a reason to program those priorities. I don't know.
Both correct. |
It is generally the case that there may be multiple alignments with the same score, and of course all of these are equally correct. However, my main purpose of alignment is to calculate the pairwise identity score as used in, e.g., BLAST, which is equivalent to$1 - \frac{\text{edit distance}}{\text{alignment length}}$ . I am using end2end, i.e. fully global, alignment.
Actually maximizing this pairwise ID score would involve a variable gap penalty depending on the total number of gaps, which is not what I am interested in. However, it would be nice to, after calculating the edit distance, be able to calculate the pairwise identity using either the minimum or maximum alignment length corresponding to that edit distance. I have noticed in my testing that WFA2 usually, but not always, gives shorter alignments than edlib; this suggests that both are using traceback algorithms which simply try to find one alignment, and do not worry about any limits on the length.
The text was updated successfully, but these errors were encountered: