odgi extract error #516

GeorgeBGM · 2023-07-03T18:35:09Z

Hi,I got the following error with the command（odgi extract -i GRAPH -b BED -o - -c 0 -d 10 --full-range -t 30 -P ）, how should I fix it?

Besides, how can I keep more additional information (not just location information) in the BED file to the sequence ID in using the B command?

AndreaGuarracino · 2023-07-03T21:21:17Z

Hi @George-du, can you share the input graph and BED files to reproduce the issue?

Which kind of information do you want to keep and where?

GeorgeBGM · 2023-07-04T12:54:34Z

  Hi, the **input graph file** comes from project **_HPRC_**(https://s3-us-west-2.amazonaws.com/human-pangenomics /pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.gfa.gz)，**the BED files** can be downloaded via this link（https://www.aliyundrive.com/s/2zvSR8iXaUH）.

The operational process is as follows:

Step1:
vg convert -gfW hprc-v1.0-mc-grch38.gfa(input graph file) -t 40 >hprc-v1.0-mc-grch38_Non-W.gfa
odgi build -g hprc-v1.0-mc-grch38_Non-W.gfa -o hprc-v1.0-mc-grch38_Non-W.og --threads 40 -P
Step2:
odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b BED(the BED files) -o - -c 0 -d 10 --full-range -t 30 -P|odgi paths -t 30 -i - -f >need.fa

The first question, the program produces the above error, how should I solve it?

The second question, how can I keep the fourth column of annotation information(Bolded parts) in the BED file into the Reads ID of the finally extracted FASTA sequence?

The BED file format is as follows：
GRCh38.chr1 38256 40294 GRCh38.1.chr1:38256-40294
GRCh38.chr1 41394 42273 GRCh38.2.chr1:41394-42273

The finally extracted FASTA sequence:
>sample:1-1000 GRCh38.1.chr1:38256-40294
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

I'm looking forward for your reply. Thanks.

GeorgeBGM · 2023-07-10T06:25:48Z

Hi, I'm curious if I described the problem clearly and if there are some suggestions about the solution to this problem?

GeorgeBGM · 2023-07-17T17:11:16Z

Hi, I'm concerned about a question and would like to know if there are any suggestions? Is this error being reported because of a haplotype breakpoint region?

Thank you in advance.

subwaystation · 2023-07-19T13:58:38Z

Hi @George-du,

you have a mc graph as input, I am not sure it will work with our tools, but I am investigating.
Regarding your 2nd question, we don't support this functionality at the moment. odgi extract will format the path names according to your BED query. What you want as annotation should already be the final path name, if I am not mistaken. So no need to worry.

GeorgeBGM · 2023-07-20T07:15:17Z

Hi, I'm glad to hear from you.
I also used the PGGB graph as input（_https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/hprc-v1.0-pggb.gfa.gz_），but it‘s outputting similar error message. Is it true that maybe this error reporting is related to some haplotype breakpoint region and how should we fix this issue, thanks a lot for some attempts being made from your side.
As for the 2nd question, I'm aiming for sequential de-redundancy for specific regions, so I want to keep some annotation information. For example, for chr1:1-200 and chr1:300-500 regions, I would like to perform de-redundancy on the pan-genomic sequences extracted from chr1:1-200 and chr1:300-500 regions of the genome respectively, but the extraction of each region is very slow, and if all the regions are extracted together, I can't distinguish which pan-genomic sequences belong to the chr1:1-200 region and those sequences belong to chr1:300-500 region.

Thank you very much for your efforts in trying to solve this problem.
Best,Du

GeorgeBGM · 2023-07-30T16:33:20Z

hi @subwaystation @AndreaGuarracino ,

Dear Developer，
I would like to ask if there is some progress on this issue as of now? Or is there another solution suggested?

subwaystation · 2023-07-31T14:58:59Z

Hi @George-du,

sorry for the long wait.
What I was able to do so far:

wget https://s3-us-west-2.amazonaws.com/human-pangenomics /pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.gfa.gz
gunzip hprc-v1.0-mc-grch38.gfa.gz
vg convert -gfW hprc-v1.0-mc-grch38.gfa -t 28 > hprc-v1.0-mc-grch38_Non-W.gfa
odgi build -g hprc-v1.0-mc-grch38_Non-W.gfa -o hprc-v1.0-mc-grch38_Non-W.og -t 28 -P

However, I can't download your BED file, since the font is an Asian language and don't know where to click.

Regarding your 2nd question: I still can't follow. Could you please prove a tiny step-by-step example of what you want to do and what your expectations are? Thanks!

GeorgeBGM · 2023-08-02T04:21:30Z

Hi @subwaystation @AndreaGuarracino,

I'm glad to hear from you.

As for the first question, the BED file can be downloaded via this link (https://github.com/George-du/Human_PAN/blob/main/Tissue_All_Non-Coding-region.bed).It is worth noting that some changes were made to the chromosome names in the BED file for matching with the GRAPH file.

As for the second question, my expected results are shown in the following steps.

Step1: head -n 1 Tissue_All_Non-Coding-region.bed(https://github.com/George-du/Human_PAN/blob/main/Tissue_All_Non-Coding-region.bed)

  The BED file format is as follows：
   GRCh38.chr1	38256	40294	GRCh38.1.chr1:38256-40294
   GRCh38.chr1	41394	42273	GRCh38.2.chr1:41394-42273

Step2:
odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b Tissue_All_Non-Coding-region.bed -o - -c 0 -d 10 --full-range -t 30 -P |odgi paths -t 30 -i - -f >need.fa

   **The finally extracted FASTA sequence:**
   >**GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:1-1000 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:4-3000 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1000-2000 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >**GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:4-1090 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:1-3020 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1010-2100 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    
    For the subsequent subinterval de-redundant sequences, I hope the output format as follows（keep annotation information beyond the third column in the BED file to the Fasta file）.
    
   **The expected finally extracted FASTA sequence:**
   >**GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:1-1000  **GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:4-3000  **GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1000-2000  **GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >**GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:4-1090  **GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:1-3020  **GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1010-2100  **GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Thank you in advance.

Best,Du

GeorgeBGM · 2023-08-08T15:26:14Z

Hi, developers! @subwaystation

Has there been any progress on this issue so far? Or some suggestions?

I'm looking forward for your reply. Thanks!

subwaystation · 2023-08-17T12:58:35Z

Hi @George-du,
I am sorry you have to wait so long. Vacation time.... I will come back to you next week!

GeorgeBGM · 2023-08-18T05:17:15Z

Hi, developers! @subwaystation,
Nice to hear from you! Thanks!

subwaystation · 2023-08-22T09:50:04Z

Dear @George-du,

First question.

When running odgi extract one runs into a Segmentation fault (core dumped):

odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b Tissue_All_Non-Coding-region.bed -o hprc-v1.0-mc-grch38_Non-W.og.extract.og -c 0 -d 10 --full-range -t 28 -P
[odgi::extract] extracting path ranges 100.00% @ 4.34e+01 bp/s elapsed: 00:00:11:21 remain: 00:00:00:00
[odgi::extract] collecting all nodes in the path range 100.00% @ 2.40e+06 bp/s elapsed: 00:00:00:03 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 1 (max 3) 100.00% @ 4.70e+02 bp/s elapsed: 00:00:00:57 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 2 (max 3) 100.00% @ 4.66e+02 bp/s elapsed: 00:00:00:58 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 3 (max 3) 100.00% @ 4.70e+02 bp/s elapsed: 00:00:00:57 remain: 00:00:00:00
Segmentation fault (core dumped)

I am re-running with GDB now.
@AndreaGuarracino Do you have any idea why? Would a duplicate entry in the BED file cause such a problem?
@George-du Since you are extracting from 2 chromosomes only, you can try odgi explode and then work with smaller graphs.

I finally understood your 2nd question. You need this because you have two BED entries which target the same region in the reference, but you annotated them from different samples.

grep 77707167 Tissue_All_Non-Coding-region.bed
GRCh38.chr1     77707167        77707582        GRCh38.9835.chr1:77707167-77707582
GRCh38.chr1     77707167        77707582        GRCh38.9836.chr1:77707167-77707582

However, odgi does only store the path name which usually corresponds to the FASTA identifier. We don't store additional meta information. I am not sure how odgi extract behaves when you enter the same identifier twice. Maybe @AndreaGuarracino knows more?
So after the extraction and the FASTA conversion, you have to add your meta information in the BED manually to the FASTA identifier. I am afraid there is currently no other way. Else we would have to change the whole odgi data structure model.

subwaystation · 2023-08-22T12:19:01Z

The full stack trace.

odgi: /home/ubuntu/software/odgi/git/master/src/odgi.cpp:201: odgi::graph_t::path_metadata_t& odgi::graph_t::get_path_metadata(const handlegraph::path_handle_t&) const: Assertion `false' failed.
--Type <RET> for more, q to quit, c to continue without paging--

Thread 27 "odgi" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fe93b5e6700 (LWP 715612)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff774c859 in __GI_abort () at abort.c:79
#2  0x00007ffff774c729 in __assert_fail_base (fmt=0x7ffff78e2588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x555555bc29e3 "false", file=0x555555bc29b0 "/home/ubuntu/software/odgi/git/master/src/odgi.cpp", line=201,
    function=<optimized out>) at assert.c:92
#3  0x00007ffff775dfd6 in __GI___assert_fail (assertion=0x555555bc29e3 "false", file=0x555555bc29b0 "/home/ubuntu/software/odgi/git/master/src/odgi.cpp", line=201,
    function=0x555555bc2940 "odgi::graph_t::path_metadata_t& odgi::graph_t::get_path_metadata(const handlegraph::path_handle_t&) const") at assert.c:101
#4  0x000055555559df02 in odgi::graph_t::get_path_metadata (this=0x7fffffffb540, path=...) at /home/ubuntu/software/odgi/git/master/src/odgi.cpp:201
#5  0x00005555555a4c88 in odgi::graph_t::append_step (this=0x7fffffffb540, path=..., to_append=...) at /home/ubuntu/software/odgi/git/master/src/odgi.cpp:1326
#6  0x00005555558a387f in odgi::<lambda(odgi::graph_t&, std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >*, const std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >&, odgi::graph_t&, std::vector<odgi::path_range_t, std::allocator<odgi::path_range_t> >, std::vector<std::pair<long unsigned int, long unsigned int>, std::allocator<std::pair<long unsigned int, long unsigned int> > >, uint64_t, uint64_t, bool, bool, uint64_t, uint64_t, uint64_t, bool, bool)>::<lambda(const handlegraph::handle_t&)>::operator()(const handlegraph::handle_t &) const (__closure=0x7fe939600000, handle=...)
    at /home/ubuntu/software/odgi/git/master/src/subcommand/extract_main.cpp:635
#7  0x00005555558ac575 in std::_Function_handler<void(const handlegraph::handle_t&), odgi::main_extract(int, char**)::<lambda(odgi::graph_t&, std::vector<handlegraph::path_handle_t>*, const std::vector<handlegraph::path_handle_t>&, odgi::graph_t&, std::vector<odgi::path_range_t>, std::vector<std::pair<long unsigned int, long unsigned int> >, uint64_t, uint64_t, bool, bool, uint64_t, uint64_t, uint64_t, bool, bool)>::<lambda(const handlegraph::handle_t&)> >::_M_invoke(const std::_Any_data &, const handlegraph::handle_t &) (__functor=..., __args#0=...) at /usr/include/c++/9/bits/std_function.h:300
#8  0x00005555555eb183 in std::function<void (handlegraph::handle_t const&)>::operator()(handlegraph::handle_t const&) const (this=0x7fe93b5e4440, __args#0=...) at /usr/include/c++/9/bits/std_function.h:688
#9  0x00005555555e0379 in odgi::algorithms::for_handle_in_path_range(odgi::graph_t const&, handlegraph::path_handle_t, long, long, std::function<void (handlegraph::handle_t const&)> const&) (source=..., path_handle=..., start=77707167,
    end=77707582, lambda=...) at /home/ubuntu/software/odgi/git/master/src/algorithms/subgraph/extract.cpp:207
#10 0x00005555558aeb26 in odgi::<lambda(odgi::graph_t&, std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >*, const std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >&, odgi::graph_t&, std::vector<odgi::path_range_t, std::allocator<odgi::path_range_t> >, std::vector<std::pair<long unsigned int, long unsigned int>, std::allocator<std::pair<long unsigned int, long unsigned int> > >, uint64_t, uint64_t, bool, bool, uint64_t, uint64_t, uint64_t, bool, bool)>::_ZZN4odgi12main_extractEiPPcENKUlRNS_7graph_tEPSt6vectorIN11handlegraph13path_handle_tESaIS6_EERKS8_S3_S4_INS_12path_range_tESaISC_EES4_ISt4pairImmESaISG_EEmmbbmmmbbE1_clES3_S9_SB_S3_SE_SI_mmbbmmmbb._omp_fn.2(void) () at /home/ubuntu/software/odgi/git/master/src/subcommand/extract_main.cpp:628
#11 0x00007ffff797686e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
--Type <RET> for more, q to quit, c to continue without paging--
#12 0x00007ffff7924609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007ffff7849133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

AndreaGuarracino · 2023-08-22T13:08:45Z

Duplicated path ranges are not supported. Just pushed a fix to check them #525

subwaystation · 2023-08-22T13:15:21Z

Indeed, after removing the duplicated BED entries it works.

odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b Tissue_All_Non-Coding-region.uniq.bed -o hprc-v1.0-mc-grch38_Non-W.og.extract.og -c 0 -d 10 --full-range -t 28 -P
[odgi::extract] extracting path ranges 100.00% @ 4.22e+01 bp/s elapsed: 00:00:11:41 remain: 00:00:00:00
[odgi::extract] collecting all nodes in the path range 100.00% @ 2.40e+06 bp/s elapsed: 00:00:00:03 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 1 (max 3) 100.00% @ 4.43e+02 bp/s elapsed: 00:00:01:01 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 2 (max 3) 100.00% @ 4.43e+02 bp/s elapsed: 00:00:01:01 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 3 (max 3) 100.00% @ 4.43e+02 bp/s elapsed: 00:00:01:01 remain: 00:00:00:00
[odgi::extract] adding connecting edges 100.00% @ 1.31e+06 bp/s elapsed: 00:00:00:05 remain: 00:00:00:00
[odgi::extract] adding subpaths 100.00% @ 9.48e+02 bp/s elapsed: 00:00:01:25 remain: 00:00:00:00
[odgi::extract] checking missing edges and empty subpaths 100.00% @ 3.77e+03 bp/s elapsed: 00:00:00:09 remain: 00:00:00:00

GeorgeBGM · 2023-08-22T14:32:18Z

Dear @subwaystation @AndreaGuarracino,

Wow, that sounds good! I'll try it as soon as possible. I will also try the odgi explode method. Thanks！

Best ,Du

GeorgeBGM · 2023-08-31T03:55:37Z

Dear @subwaystation @AndreaGuarracino,

I would like to ask if extracting the sequence based on intervals (-b BED file) will extract the excess sequence?(#526)

When there are more intervals, a large number of sequences will be extracted, how can I de-redundancy these sequences? Is it possible to divide the data into several parts and redundant them separately or is there any other suggestion? Or do you recommend using VCF files directly?

subwaystation · 2023-08-31T08:57:32Z

Dear @George-du,

as far as I understood only with -E you will extract excess sequence.

I think there are several options for you:

You select a larger value for -d, --max-distance-subpaths
You specify -s, --split-subgraphs. This ensures that all your intervals are written to their own graph.

How would like the VCF file to interact with the graph? Or would you like to project the graph into a VCF and then work with it? If you want to take a look at the variants within your ranges, I would rather first project into a VCF using vg deconstruct and then take a look at the variants. So you don't miss something.

GeorgeBGM · 2023-09-01T10:52:34Z

Dear @subwaystation @AndreaGuarracino,

I will test it as you suggest.
Best, Du

ekg · 2023-09-01T11:09:58Z

There is odgi procbed to project path annotations into the final path names after extract.

…

On Fri, Sep 1, 2023, 05:52 George-du ***@***.***> wrote: Dear @subwaystation <https://github.com/subwaystation> @AndreaGuarracino <https://github.com/AndreaGuarracino>, I will test it as you suggest. Best, Du — Reply to this email directly, view it on GitHub <#516 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AABDQELHRTSHLQ4YKYIE6VDXYG473ANCNFSM6AAAAAAZ4ZCSVU> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

GeorgeBGM · 2023-09-05T14:04:34Z

Dear @subwaystation @AndreaGuarracino @ekg

I tried it as your suggested and found that this is much slower, which seems to extract one region at a time.
Using the following command,1289 regions can be extracted in 20h.
odgi extract -i $GRAPH -b Tissue_All_Non-HERV_Coding-region_Merge_Anno_need_shuffle.bed -c 0 -d 0 -t 30 -P -s -E
Maybe splitting chromosomes regions (odgi explode) or even cutting the region BED file will speed up this operation, but It will generate a lot of intermediate processing files.

Best, Du

subwaystation · 2023-10-09T12:12:19Z

@AndreaGuarracino Any ideas how to speed this up?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

odgi extract error #516

odgi extract error #516

GeorgeBGM commented Jul 3, 2023

AndreaGuarracino commented Jul 3, 2023

GeorgeBGM commented Jul 4, 2023

GeorgeBGM commented Jul 10, 2023

GeorgeBGM commented Jul 17, 2023

subwaystation commented Jul 19, 2023

GeorgeBGM commented Jul 20, 2023

GeorgeBGM commented Jul 30, 2023

subwaystation commented Jul 31, 2023

GeorgeBGM commented Aug 2, 2023

GeorgeBGM commented Aug 8, 2023 •

edited

Loading

subwaystation commented Aug 17, 2023

GeorgeBGM commented Aug 18, 2023

subwaystation commented Aug 22, 2023

subwaystation commented Aug 22, 2023 •

edited

Loading

AndreaGuarracino commented Aug 22, 2023

subwaystation commented Aug 22, 2023

GeorgeBGM commented Aug 22, 2023 •

edited

Loading

GeorgeBGM commented Aug 31, 2023 •

edited

Loading

subwaystation commented Aug 31, 2023

GeorgeBGM commented Sep 1, 2023

ekg commented Sep 1, 2023 via email

GeorgeBGM commented Sep 5, 2023 •

edited

Loading

subwaystation commented Oct 9, 2023

odgi extract error #516

odgi extract error #516

Comments

GeorgeBGM commented Jul 3, 2023

AndreaGuarracino commented Jul 3, 2023

GeorgeBGM commented Jul 4, 2023

GeorgeBGM commented Jul 10, 2023

GeorgeBGM commented Jul 17, 2023

subwaystation commented Jul 19, 2023

GeorgeBGM commented Jul 20, 2023

GeorgeBGM commented Jul 30, 2023

subwaystation commented Jul 31, 2023

GeorgeBGM commented Aug 2, 2023

GeorgeBGM commented Aug 8, 2023 • edited Loading

subwaystation commented Aug 17, 2023

GeorgeBGM commented Aug 18, 2023

subwaystation commented Aug 22, 2023

subwaystation commented Aug 22, 2023 • edited Loading

AndreaGuarracino commented Aug 22, 2023

subwaystation commented Aug 22, 2023

GeorgeBGM commented Aug 22, 2023 • edited Loading

GeorgeBGM commented Aug 31, 2023 • edited Loading

subwaystation commented Aug 31, 2023

GeorgeBGM commented Sep 1, 2023

ekg commented Sep 1, 2023 via email

GeorgeBGM commented Sep 5, 2023 • edited Loading

subwaystation commented Oct 9, 2023

GeorgeBGM commented Aug 8, 2023 •

edited

Loading

subwaystation commented Aug 22, 2023 •

edited

Loading

GeorgeBGM commented Aug 22, 2023 •

edited

Loading

GeorgeBGM commented Aug 31, 2023 •

edited

Loading

GeorgeBGM commented Sep 5, 2023 •

edited

Loading