Inconsistent matrix generation in odgi paths
#484
Comments
Hi Sivico, can you subset your GFA so that it contains just one path that triggers your issue? Compressed with zstd, it might become small enough to be shareable. Also, to be 100% sure I am following, it would help me to see the exact steps in your code/pipeline/commands that give you the two numbers that should be identical (but aren't, for whatever reason).
While preparing the files to be shared, I realized the issue was in the way I was parsing the matrix. In short, I assumed the output of The only thing that is perplexing to me is how this problem did not arise before. The only way would be that all the previous graphs I tried produced binary matrices (which seems to be the case). Biologically, this is hard to believe, especially because a couple of previous graphs were a subset of the current graph. This leads me to ask you the following: was the behavior of
Oh, that is so reassuring! Thanks @AndreaGuarracino. Now everything squares again. The only thing I would point out is that the current help of Other than that, I think we can consider this a solved issue. Thanks again for the fast response and, furthermore, for developing the pangenomics tools ecosystem. Cheers
Oops, I've checked the code and you're right, it hasn't been put as an option again! I guess you are already easily parsing the haplotype matrix by putting 1 if the value is greater than 0. Glad to have helped! 🖖
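For readers following along, the workaround mentioned above (put 1 wherever the value is greater than 0) can be sketched as below. This is only an illustration, assuming each matrix entry counts how many times a path visits a node; the function name `binarize_row` and the example row are hypothetical and not part of odgi.

```python
def binarize_row(counts):
    """Convert per-node visit counts into presence/absence values.

    Assumption: each entry is the number of times the path visits that node.
    """
    return [1 if c > 0 else 0 for c in counts]


# Hypothetical row: the path visits node 2 three times and never visits node 3.
row = [1, 3, 0, 1]
binary = binarize_row(row)

# Summing the raw counts gives the number of steps on the path,
# while summing the binarized row gives the number of unique nodes.
steps = sum(row)
unique_nodes = sum(binary)
```

Under that assumption, summing the raw row and summing the binarized row answer two different questions, which is consistent with the mismatch discussed in this thread.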
Hi odgi team,
I have been getting into the world of pangenomes for an ongoing research project. Part of the way we are trying to analyze our pangenome is by looking at shared nodes among the paths of the graph. We found the binary matrix generated by `odgi paths` particularly useful in this regard:
In a previous discussion (see #444), I thought the elements of the matrix along the path should sum up to the `node.counts` number reported in the output. @subwaystation explained to me that this is not the case, given that a path can go through the same node several times (which would not change the matrix), so you have to take the unique nodes the path goes through. The `node.counts` field was then changed to `path.step.count`, which is more transparent. I adjusted my code accordingly to remove the duplicates and count only the unique elements, and it worked like a charm.
Now, recently we scaled up and generated our largest pangenome so far (getting very close to what we actually need for our project). This pangenome is composed of 7 plant species of the same genus, and their genome size is ~4 Gb. We built the pangenome using: `cactus` $\rightarrow$ `hal2vg` $\rightarrow$ `vg construct` $\rightarrow$ `smoothxg` $\rightarrow$ `gfaffix` $\rightarrow$ `odgi build`. The last step was just used to optimize the node numbering.
In any case, for this pangenome, I observe that `odgi paths` is giving inconsistent results, similar to what I thought the problem was in #444, but this time the problem is real. In other words, when I sum the number of unique nodes traversed by a path, the total is not equal to the number of ones I see in the matrix produced by `odgi paths` for that path.
To give you an idea, here are some numbers:
So the difference amounts to several million nodes for each path (in base space, each path is missing ~20 Mb). Curiously enough, the first lines of the matrix (`path.length` and `path.step.count`) are correct: the lengths correspond to the chromosome sizes, and when I calculate the `step.counts` myself, I get the same numbers.
Finally, I would also like to point out that this did not happen for any of my smaller graphs. For all the previous ones I have tried, the counts from the matrix and from the paths matched perfectly (as they should). However, all of them were smaller, so this phenomenon is definitely specific to this pangenome or to this scale.
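The path-side half of the comparison described above (steps versus unique nodes on a path) can be sketched as follows. This is a minimal illustration, assuming a GFA 1 `P` line with tab-separated fields and a comma-separated list of oriented segment names; the function name `path_node_stats` and the example line are hypothetical.

```python
def path_node_stats(p_line):
    """Return (step_count, unique_node_count) for one GFA 1 path line.

    Assumption: the line looks like 'P<TAB>name<TAB>1+,2-,3+<TAB>*'.
    """
    fields = p_line.rstrip("\n").split("\t")
    if fields[0] != "P":
        raise ValueError("not a GFA path (P) line")
    # Each step is a segment name followed by its orientation (+ or -).
    steps = [step[:-1] for step in fields[2].split(",")]
    return len(steps), len(set(steps))


# Hypothetical path that visits node 2 twice.
step_count, unique_count = path_node_stats("P\tchr1\t1+,2+,3-,2+\t*")
```

The step count is the number to compare with `path.step.count`, and the unique count is the number to compare with the nonzero entries of that path's row.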
Do you have any idea what may be causing it? Do you think there is something flawed with the pangenome that can explain it? What could that be? On the other hand, can you think of something that could go wrong with the way `odgi paths` is filling the matrix at this scale? It is clearly traversing the paths just fine but seems to be forgetting to mark some nodes as True.
Your thoughts are very much appreciated.
Cheers,
Sivico
P.S.: I am using `odgi` version v0.8.2-0-g8715c55