Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent matrix generation in odgi paths #484

Closed
sivico26 opened this issue Mar 11, 2023 · 5 comments
Closed

Inconsistent matrix generation in odgi paths #484

sivico26 opened this issue Mar 11, 2023 · 5 comments

Comments

@sivico26
Copy link

sivico26 commented Mar 11, 2023

Hi odgi team,

I have been playing and getting into the world of pangenomes for an ongoing research project. Part of the way we are trying to analyze our pangenome is by looking at shared nodes among the paths of the graph. We found the binary matrix generated by odgi paths particularly useful in this regard:

odgi paths -H -i $graph > out_depths.tsv

In a previous discussion (see #444 ), I thought the elements of the matrix along the path should sum up to the node.counts number reported in the output. @subwaystation explained to me that this is not the case given that a path can go through the same node several times (which would not change the matrix), so you have to take the unique nodes the path goes through. The node.counts was then changed to path.step.count which is more transparent. I adjusted my code accordingly to remove the duplicates and count only the unique elements and it worked like a charm.

Now, recently we scaled up and generated our largest pangenome so far (getting very close to what we actually need for our project). This pangenome is composed of 7 plant species of the same genera and their genome size is ~ 4Gb. We built the pangenome using: cactus $\rightarrow$ hal2vg $\rightarrow$ vg construct $\rightarrow$ smoothxg $\rightarrow$ gfaffix $\rightarrow$ odgi build. The last step was just used to optimize the node numbering.

In any case, for this pangenome, I observe that odgi paths is giving inconsistent results, similar to what I thought the problem was in #444, but this time the problem is real. In other words, when I sum the number of unique nodes traversed by a path they are not equal to the number of ones I see in the matrix produced by odgi paths for that path.

To give you an idea, here are some numbers:

metric \ path Hbog_1_chr1H Hbog_1_chr2H Hbog_1_chr3H Hbog_1_chr4H Hbog_1_chr5H Hbog_1_chr6H Hbog_1_chr7H Hbul_1_chr1H Hbul_1_chr2H Hbul_1_chr3H Hbul_1_chr4H Hbul_1_chr5H Hbul_1_chr6H Hbul_1_chr7H Hcom_1_chr1H Hcom_1_chr2H Hcom_1_chr3H Hcom_1_chr4H Hcom_1_chr5H Hcom_1_chr6H Hcom_1_chr7H Hpat_1_chr1H Hpat_1_chr2H Hpat_1_chr3H Hpat_1_chr4H Hpat_1_chr5H Hpat_1_chr6H Hpat_1_chr7H Hpub_1_chr1H Hpub_1_chr2H Hpub_1_chr3H Hpub_1_chr4H Hpub_1_chr5H Hpub_1_chr6H Hpub_1_chr7H Hros_1_chr1H Hros_1_chr2H Hros_1_chr3H Hros_1_chr4H Hros_1_chr5H Hros_1_chr6H Hros_1_chr7H Hvul_1_chr1H Hvul_1_chr2H Hvul_1_chr3H Hvul_1_chr4H Hvul_1_chr5H Hvul_1_chr6H Hvul_1_chr7H
counts_from_matrix 40013635 51080242 47308280 48409226 45205705 43126193 47786226 27145526 33295379 32046408 28985375 30384263 27000338 31841931 49846490 62844605 57514560 58607022 54966157 51131936 56796324 50602652 64329344 58854243 60060591 55095802 53159627 59801282 50517900 64443906 59256654 60538411 54665891 52769978 60092384 41648229 53190780 49130247 50571148 47491601 45810267 49241330 30096603 36435678 34724820 32597701 33829583 31659356 34976865
counts from paths 47610903 61041120 55894462 56577495 54123465 51481961 57280960 33876035 42533879 40915547 36502045 38294435 34276830 41298023 59716154 76294540 68512746 69463292 66068752 60449919 67832685 61029400 77879102 70075096 70926707 65602059 62808902 71750276 60067392 77465006 70290999 71439540 64359258 62186862 72023288 50217304 64627423 58984477 59539378 57275698 55408686 60049533 38417929 48155750 45438779 43055644 43848907 41774623 46337590
Path - matrix count 7597268 9960878 8586182 8168269 8917760 8355768 9494734 6730509 9238500 8869139 7516670 7910172 7276492 9456092 9869664 13449935 10998186 10856270 11102595 9317983 11036361 10426748 13549758 11220853 10866116 10506257 9649275 11948994 9549492 13021100 11034345 10901129 9693367 9416884 11930904 8569075 11436643 9854230 8968230 9784097 9598419 10808203 8321326 11720072 10713959 10457943 10019324 10115267 11360725

So each difference goes up for several million nodes for each path (If I go to base space, each path is missing ~ 20 Mb). Curiously enough, the first lines of the matrix (path.length and path.step.count) are correct: the lengths correspond to the chr sizes and when I calculate the step.counts myself, I get the same numbers.

path.name       Hbog.Hbog_1_chr1H       Hbog.Hbog_1_chr2H       Hbog.Hbog_1_chr3H       Hbog.Hbog_1_chr4H       Hbog.Hbog_1_chr5H       Hbog.Hbog_1_chr6H       Hbog.Hbog_1_chr7H   Hbul.Hbul_1_chr1H        Hbul.Hbul_1_chr2H       Hbul.Hbul_1_chr3H       Hbul.Hbul_1_chr4H       Hbul.Hbul_1_chr5H       Hbul.Hbul_1_chr6H       Hbul.Hbul_1_chr7H       Hcom.Hcom_1_chr1H    Hcom.Hcom_1_chr2H       Hcom.Hcom_1_chr3H       Hcom.Hcom_1_chr4H       Hcom.Hcom_1_chr5H       Hcom.Hcom_1_chr6H       Hcom.Hcom_1_chr7H       Hpat.Hpat_1_chr1H       Hpat.Hpat_1_chr2H    Hpat.Hpat_1_chr3H       Hpat.Hpat_1_chr4H       Hpat.Hpat_1_chr5H       Hpat.Hpat_1_chr6H       Hpat.Hpat_1_chr7H       Hpub.Hpub_1_chr1H       Hpub.Hpub_1_chr2H   Hpub.Hpub_1_chr3H        Hpub.Hpub_1_chr4H       Hpub.Hpub_1_chr5H       Hpub.Hpub_1_chr6H       Hpub.Hpub_1_chr7H       Hros.Hros_1_chr1H       Hros.Hros_1_chr2H       Hros.Hros_1_chr3H    Hros.Hros_1_chr4H       Hros.Hros_1_chr5H       Hros.Hros_1_chr6H       Hros.Hros_1_chr7H       Hvul.Hvul_1_chr1H       Hvul.Hvul_1_chr2H       Hvul.Hvul_1_chr3H       Hvul.Hvul_1_chr4H    Hvul.Hvul_1_chr5H       Hvul.Hvul_1_chr6H       Hvul.Hvul_1_chr7H
path.length     516861028       625193401       566529538       582845390       584658029       518243093       595411020       454441500       579809636       550374318       490948109    518445003       441130372       551897936       483188487       626454363       553180402       532550445       546151533       476679221       545836583       491473501   635029434        558261035       546780088       534214442       493425742       572269863       451220449       576782196       527833585       519710227       481793244       459253727    540421200       508915870       624952175       572232012       538789501       566580883       547329317       582254920       516505932       665585731       621516506   610333535        588218686       561794515       632540561
path.step.count 58360225        75265465        67762690        67931753        66585693        63034319        70795854        45039157        58026011        55847208        48638558     51323088        46236725        57348201        75045755        97796031        85090596        85388899        83843124        74345031        84325238        76747989    99242179 86838234        86784703        81539690        76855330        89806594        73908601        97405943        86810196        87249401        78619958        75997847    89974405 62522183        81211957        72831785        71697394        70865995        69488407        75566504        52570037        69564794        64903508        62303667    61350013 59789961        67294721

Finally, I would also like to point out that this did not happen for any of my smaller graphs. All the previous ones I have tried have matched counts from the matrix and paths perfectly (as they should). Nevertheless, all have been smaller too, so this phenomenon is definitively specific to this pangenome or to this scale.

Do you have any idea what may be causing it? Do you think there is something flawed with the pangenome that can explain it? what could that be? On the other hand, can you think on something that could go wrong with the way odgi paths is filling the matrix at this scale? It is clearly traversing the paths just fine but seems to be forgetting to mark some nodes as True.

Your thoughts are very much appreciated

Cheers,
Sivico

P.S: I am using odgi version v0.8.2-0-g8715c55

@AndreaGuarracino
Copy link
Member

Hi Sivico, can you parse your GFA in order to put in there just one path that triggers your issue? Compressing it with zstd it might become small enough to be 'shareable'.

Also, to be 100% sure I am following, it would be helpful for me to see all exact steps in your codes/pipeline/commands that give you the two numbers that should be identical (but aren't for whatever reason).

@sivico26
Copy link
Author

Hi @AndreaGuarracino

While preparing the files to be shared, I realized the issue was in the way I was parsing the matrix. In short, I assumed the output of odgi paths -H was a binary matrix. However, it actually is an integer matrix where the values are proportional to the depth of each path on each node. The problem was that I was ignoring values different than 1 (this should be fine for a binary matrix).

The only thing that is perplexing to me is how this problem did not arise before. The only way would be that all the previous graphs I tried produced binary matrices (which seems to be the case). Biologically, this is hard to believe, especially because a couple of previous graphs were a subset of the current graph. This leads me to ask you the following: was the behavior of odgi paths -H to produce binary matrices in previous versions of odgi?

@AndreaGuarracino
Copy link
Member

Hi @sivico26, here there is a change that could totally explain your experience! c5b545b

Before, the output of odgi paths -H was a binary matrix by default, now it is a path coverage matrix by default, and optionally a binary matrix.

@sivico26
Copy link
Author

Oh, that is so reassuring! Thanks @AndreaGuarracino. Now everything squares again.

The only thing I would point out, is that the current help of odgi paths's help does not say how to activate the binary matrix output.

Other than that I think we can considered a solved issue. Thanks again for the fast response and furthermore for developing the pangenomics tools ecosystem.

Cheers

@AndreaGuarracino
Copy link
Member

AndreaGuarracino commented Mar 14, 2023

Oops, I've checked the code and you're right, it hasn't been put as an option again! I guess you are already easily parsing the haplotype matrix by putting 1 if the value is greater than 0.

Glad to have helped! 🖖

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants