Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

odgi extract error #516

Open
GeorgeBGM opened this issue Jul 3, 2023 · 23 comments
Open

odgi extract error #516

GeorgeBGM opened this issue Jul 3, 2023 · 23 comments

Comments

@GeorgeBGM
Copy link

Hi,I got the following error with the command(odgi extract -i GRAPH -b BED -o - -c 0 -d 10 --full-range -t 30 -P ), how should I fix it?

image

Besides, how can I keep more additional information (not just location information) in the BED file to the sequence ID in using the B command?

@AndreaGuarracino
Copy link
Member

Hi @George-du, can you share the input graph and BED files to reproduce the issue?

Which kind of information do you want to keep and where?

@GeorgeBGM
Copy link
Author

  Hi, the **input graph file** comes from project **_HPRC_**(https://s3-us-west-2.amazonaws.com/human-pangenomics /pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.gfa.gz),**the BED files** can be downloaded via this link(https://www.aliyundrive.com/s/2zvSR8iXaUH).

The operational process is as follows:

Step1:
vg convert -gfW hprc-v1.0-mc-grch38.gfa(input graph file) -t 40 >hprc-v1.0-mc-grch38_Non-W.gfa
odgi build -g hprc-v1.0-mc-grch38_Non-W.gfa -o hprc-v1.0-mc-grch38_Non-W.og --threads 40 -P
Step2:
odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b BED(the BED files) -o - -c 0 -d 10 --full-range -t 30 -P|odgi paths -t 30 -i - -f >need.fa

The first question, the program produces the above error, how should I solve it?

The second question, how can I keep the fourth column of annotation information(Bolded parts) in the BED file into the Reads ID of the finally extracted FASTA sequence?

The BED file format is as follows:
GRCh38.chr1 38256 40294 GRCh38.1.chr1:38256-40294
GRCh38.chr1 41394 42273 GRCh38.2.chr1:41394-42273

The finally extracted FASTA sequence:
>sample:1-1000 GRCh38.1.chr1:38256-40294
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

I'm looking forward for your reply. Thanks.

@GeorgeBGM
Copy link
Author

Hi, I'm curious if I described the problem clearly and if there are some suggestions about the solution to this problem?

@GeorgeBGM
Copy link
Author

Hi, I'm concerned about a question and would like to know if there are any suggestions? Is this error being reported because of a haplotype breakpoint region?

Thank you in advance.

@subwaystation
Copy link
Member

Hi @George-du,

you have a mc graph as input, I am not sure it will work with our tools, but I am investigating.
Regarding your 2nd question, we don't support this functionality at the moment. odgi extract will format the path names according to your BED query. What you want as annotation should already be the final path name, if I am not mistaken. So no need to worry.

@GeorgeBGM
Copy link
Author

Hi, I'm glad to hear from you.
I also used the PGGB graph as input(_https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/hprc-v1.0-pggb.gfa.gz_),but it‘s outputting similar error message. Is it true that maybe this error reporting is related to some haplotype breakpoint region and how should we fix this issue, thanks a lot for some attempts being made from your side.
As for the 2nd question, I'm aiming for sequential de-redundancy for specific regions, so I want to keep some annotation information. For example, for chr1:1-200 and chr1:300-500 regions, I would like to perform de-redundancy on the pan-genomic sequences extracted from chr1:1-200 and chr1:300-500 regions of the genome respectively, but the extraction of each region is very slow, and if all the regions are extracted together, I can't distinguish which pan-genomic sequences belong to the chr1:1-200 region and those sequences belong to chr1:300-500 region.

Thank you very much for your efforts in trying to solve this problem.
Best,Du

@GeorgeBGM
Copy link
Author

hi @subwaystation @AndreaGuarracino ,

Dear Developer,
I would like to ask if there is some progress on this issue as of now? Or is there another solution suggested?

@subwaystation
Copy link
Member

Hi @George-du,

sorry for the long wait.
What I was able to do so far:

wget https://s3-us-west-2.amazonaws.com/human-pangenomics /pangenomes/freeze/freeze1/minigraph-cactus/hprc-v1.0-mc-grch38.gfa.gz
gunzip hprc-v1.0-mc-grch38.gfa.gz
vg convert -gfW hprc-v1.0-mc-grch38.gfa -t 28 > hprc-v1.0-mc-grch38_Non-W.gfa
odgi build -g hprc-v1.0-mc-grch38_Non-W.gfa -o hprc-v1.0-mc-grch38_Non-W.og -t 28 -P

However, I can't download your BED file, since the font is an Asian language and don't know where to click.

Regarding your 2nd question: I still can't follow. Could you please prove a tiny step-by-step example of what you want to do and what your expectations are? Thanks!

@GeorgeBGM
Copy link
Author

Hi @subwaystation @AndreaGuarracino,

I'm glad to hear from you.

As for the first question, the BED file can be downloaded via this link (https://github.com/George-du/Human_PAN/blob/main/Tissue_All_Non-Coding-region.bed).It is worth noting that some changes were made to the chromosome names in the BED file for matching with the GRAPH file.

As for the second question, my expected results are shown in the following steps.

Step1: head -n 1 Tissue_All_Non-Coding-region.bed(https://github.com/George-du/Human_PAN/blob/main/Tissue_All_Non-Coding-region.bed)

  The BED file format is as follows:
   GRCh38.chr1	38256	40294	GRCh38.1.chr1:38256-40294
   GRCh38.chr1	41394	42273	GRCh38.2.chr1:41394-42273

Step2:
odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b Tissue_All_Non-Coding-region.bed -o - -c 0 -d 10 --full-range -t 30 -P |odgi paths -t 30 -i - -f >need.fa

   **The finally extracted FASTA sequence:**
   >**GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:1-1000 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:4-3000 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1000-2000 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >**GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:4-1090 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:1-3020 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1010-2100 
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    
    For the subsequent subinterval de-redundant sequences, I hope the output format as follows(keep annotation information beyond the third column in the BED file to the Fasta file).
    
   **The expected finally extracted FASTA sequence:**
   >**GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:1-1000  **GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:4-3000  **GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1000-2000  **GRCh38.1.chr1:38256-40294**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >**GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample:4-1090  **GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample2:1-3020  **GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   >sample3:1010-2100  **GRCh38.2.chr1:41394-42273**
    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Thank you in advance.

Best,Du

@GeorgeBGM
Copy link
Author

GeorgeBGM commented Aug 8, 2023

Hi, developers! @subwaystation

Has there been any progress on this issue so far? Or some suggestions?

I'm looking forward for your reply. Thanks!

@subwaystation
Copy link
Member

Hi @George-du,
I am sorry you have to wait so long. Vacation time.... I will come back to you next week!

@GeorgeBGM
Copy link
Author

Hi, developers! @subwaystation,
Nice to hear from you! Thanks!

@subwaystation
Copy link
Member

Dear @George-du,

First question.

When running odgi extract one runs into a Segmentation fault (core dumped):

odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b Tissue_All_Non-Coding-region.bed -o hprc-v1.0-mc-grch38_Non-W.og.extract.og -c 0 -d 10 --full-range -t 28 -P
[odgi::extract] extracting path ranges 100.00% @ 4.34e+01 bp/s elapsed: 00:00:11:21 remain: 00:00:00:00
[odgi::extract] collecting all nodes in the path range 100.00% @ 2.40e+06 bp/s elapsed: 00:00:00:03 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 1 (max 3) 100.00% @ 4.70e+02 bp/s elapsed: 00:00:00:57 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 2 (max 3) 100.00% @ 4.66e+02 bp/s elapsed: 00:00:00:58 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 3 (max 3) 100.00% @ 4.70e+02 bp/s elapsed: 00:00:00:57 remain: 00:00:00:00
Segmentation fault (core dumped)

I am re-running with GDB now.
@AndreaGuarracino Do you have any idea why? Would a duplicate entry in the BED file cause such a problem?
@George-du Since you are extracting from 2 chromosomes only, you can try odgi explode and then work with smaller graphs.

I finally understood your 2nd question. You need this because you have two BED entries which target the same region in the reference, but you annotated them from different samples.

grep 77707167 Tissue_All_Non-Coding-region.bed
GRCh38.chr1     77707167        77707582        GRCh38.9835.chr1:77707167-77707582
GRCh38.chr1     77707167        77707582        GRCh38.9836.chr1:77707167-77707582

However, odgi does only store the path name which usually corresponds to the FASTA identifier. We don't store additional meta information. I am not sure how odgi extract behaves when you enter the same identifier twice. Maybe @AndreaGuarracino knows more?
So after the extraction and the FASTA conversion, you have to add your meta information in the BED manually to the FASTA identifier. I am afraid there is currently no other way. Else we would have to change the whole odgi data structure model.

@subwaystation
Copy link
Member

subwaystation commented Aug 22, 2023

The full stack trace.

odgi: /home/ubuntu/software/odgi/git/master/src/odgi.cpp:201: odgi::graph_t::path_metadata_t& odgi::graph_t::get_path_metadata(const handlegraph::path_handle_t&) const: Assertion `false' failed.
--Type <RET> for more, q to quit, c to continue without paging--

Thread 27 "odgi" received signal SIGABRT, Aborted.
[Switching to Thread 0x7fe93b5e6700 (LWP 715612)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff774c859 in __GI_abort () at abort.c:79
#2  0x00007ffff774c729 in __assert_fail_base (fmt=0x7ffff78e2588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x555555bc29e3 "false", file=0x555555bc29b0 "/home/ubuntu/software/odgi/git/master/src/odgi.cpp", line=201,
    function=<optimized out>) at assert.c:92
#3  0x00007ffff775dfd6 in __GI___assert_fail (assertion=0x555555bc29e3 "false", file=0x555555bc29b0 "/home/ubuntu/software/odgi/git/master/src/odgi.cpp", line=201,
    function=0x555555bc2940 "odgi::graph_t::path_metadata_t& odgi::graph_t::get_path_metadata(const handlegraph::path_handle_t&) const") at assert.c:101
#4  0x000055555559df02 in odgi::graph_t::get_path_metadata (this=0x7fffffffb540, path=...) at /home/ubuntu/software/odgi/git/master/src/odgi.cpp:201
#5  0x00005555555a4c88 in odgi::graph_t::append_step (this=0x7fffffffb540, path=..., to_append=...) at /home/ubuntu/software/odgi/git/master/src/odgi.cpp:1326
#6  0x00005555558a387f in odgi::<lambda(odgi::graph_t&, std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >*, const std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >&, odgi::graph_t&, std::vector<odgi::path_range_t, std::allocator<odgi::path_range_t> >, std::vector<std::pair<long unsigned int, long unsigned int>, std::allocator<std::pair<long unsigned int, long unsigned int> > >, uint64_t, uint64_t, bool, bool, uint64_t, uint64_t, uint64_t, bool, bool)>::<lambda(const handlegraph::handle_t&)>::operator()(const handlegraph::handle_t &) const (__closure=0x7fe939600000, handle=...)
    at /home/ubuntu/software/odgi/git/master/src/subcommand/extract_main.cpp:635
#7  0x00005555558ac575 in std::_Function_handler<void(const handlegraph::handle_t&), odgi::main_extract(int, char**)::<lambda(odgi::graph_t&, std::vector<handlegraph::path_handle_t>*, const std::vector<handlegraph::path_handle_t>&, odgi::graph_t&, std::vector<odgi::path_range_t>, std::vector<std::pair<long unsigned int, long unsigned int> >, uint64_t, uint64_t, bool, bool, uint64_t, uint64_t, uint64_t, bool, bool)>::<lambda(const handlegraph::handle_t&)> >::_M_invoke(const std::_Any_data &, const handlegraph::handle_t &) (__functor=..., __args#0=...) at /usr/include/c++/9/bits/std_function.h:300
#8  0x00005555555eb183 in std::function<void (handlegraph::handle_t const&)>::operator()(handlegraph::handle_t const&) const (this=0x7fe93b5e4440, __args#0=...) at /usr/include/c++/9/bits/std_function.h:688
#9  0x00005555555e0379 in odgi::algorithms::for_handle_in_path_range(odgi::graph_t const&, handlegraph::path_handle_t, long, long, std::function<void (handlegraph::handle_t const&)> const&) (source=..., path_handle=..., start=77707167,
    end=77707582, lambda=...) at /home/ubuntu/software/odgi/git/master/src/algorithms/subgraph/extract.cpp:207
#10 0x00005555558aeb26 in odgi::<lambda(odgi::graph_t&, std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >*, const std::vector<handlegraph::path_handle_t, std::allocator<handlegraph::path_handle_t> >&, odgi::graph_t&, std::vector<odgi::path_range_t, std::allocator<odgi::path_range_t> >, std::vector<std::pair<long unsigned int, long unsigned int>, std::allocator<std::pair<long unsigned int, long unsigned int> > >, uint64_t, uint64_t, bool, bool, uint64_t, uint64_t, uint64_t, bool, bool)>::_ZZN4odgi12main_extractEiPPcENKUlRNS_7graph_tEPSt6vectorIN11handlegraph13path_handle_tESaIS6_EERKS8_S3_S4_INS_12path_range_tESaISC_EES4_ISt4pairImmESaISG_EEmmbbmmmbbE1_clES3_S9_SB_S3_SE_SI_mmbbmmmbb._omp_fn.2(void) () at /home/ubuntu/software/odgi/git/master/src/subcommand/extract_main.cpp:628
#11 0x00007ffff797686e in ?? () from /lib/x86_64-linux-gnu/libgomp.so.1
--Type <RET> for more, q to quit, c to continue without paging--
#12 0x00007ffff7924609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#13 0x00007ffff7849133 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

@AndreaGuarracino
Copy link
Member

Duplicated path ranges are not supported. Just pushed a fix to check them #525

@subwaystation
Copy link
Member

Indeed, after removing the duplicated BED entries it works.

odgi extract -i hprc-v1.0-mc-grch38_Non-W.og -b Tissue_All_Non-Coding-region.uniq.bed -o hprc-v1.0-mc-grch38_Non-W.og.extract.og -c 0 -d 10 --full-range -t 28 -P
[odgi::extract] extracting path ranges 100.00% @ 4.22e+01 bp/s elapsed: 00:00:11:41 remain: 00:00:00:00
[odgi::extract] collecting all nodes in the path range 100.00% @ 2.40e+06 bp/s elapsed: 00:00:00:03 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 1 (max 3) 100.00% @ 4.43e+02 bp/s elapsed: 00:00:01:01 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 2 (max 3) 100.00% @ 4.43e+02 bp/s elapsed: 00:00:01:01 remain: 00:00:00:00
[odgi::extract] merge subpaths closer than 10 bps - iteration 3 (max 3) 100.00% @ 4.43e+02 bp/s elapsed: 00:00:01:01 remain: 00:00:00:00
[odgi::extract] adding connecting edges 100.00% @ 1.31e+06 bp/s elapsed: 00:00:00:05 remain: 00:00:00:00
[odgi::extract] adding subpaths 100.00% @ 9.48e+02 bp/s elapsed: 00:00:01:25 remain: 00:00:00:00
[odgi::extract] checking missing edges and empty subpaths 100.00% @ 3.77e+03 bp/s elapsed: 00:00:00:09 remain: 00:00:00:00

@GeorgeBGM
Copy link
Author

GeorgeBGM commented Aug 22, 2023

Dear @subwaystation @AndreaGuarracino,

Wow, that sounds good! I'll try it as soon as possible. I will also try the odgi explode method. Thanks!

Best ,Du

@GeorgeBGM
Copy link
Author

GeorgeBGM commented Aug 31, 2023

Dear @subwaystation @AndreaGuarracino,

I would like to ask if extracting the sequence based on intervals (-b BED file) will extract the excess sequence?(#526)

When there are more intervals, a large number of sequences will be extracted, how can I de-redundancy these sequences? Is it possible to divide the data into several parts and redundant them separately or is there any other suggestion? Or do you recommend using VCF files directly?

@subwaystation
Copy link
Member

Dear @George-du,

as far as I understood only with -E you will extract excess sequence.

I think there are several options for you:

  • You select a larger value for -d, --max-distance-subpaths
  • You specify -s, --split-subgraphs. This ensures that all your intervals are written to their own graph.

How would like the VCF file to interact with the graph? Or would you like to project the graph into a VCF and then work with it? If you want to take a look at the variants within your ranges, I would rather first project into a VCF using vg deconstruct and then take a look at the variants. So you don't miss something.

@GeorgeBGM
Copy link
Author

Dear @subwaystation @AndreaGuarracino,

I will test it as you suggest.
Best, Du

@ekg
Copy link
Contributor

ekg commented Sep 1, 2023 via email

@GeorgeBGM
Copy link
Author

GeorgeBGM commented Sep 5, 2023

Dear @subwaystation @AndreaGuarracino @ekg

I tried it as your suggested and found that this is much slower, which seems to extract one region at a time.
Using the following command,1289 regions can be extracted in 20h.
odgi extract -i $GRAPH -b Tissue_All_Non-HERV_Coding-region_Merge_Anno_need_shuffle.bed -c 0 -d 0 -t 30 -P -s -E
Maybe splitting chromosomes regions (odgi explode) or even cutting the region BED file will speed up this operation, but It will generate a lot of intermediate processing files.

Best, Du

@subwaystation
Copy link
Member

@AndreaGuarracino Any ideas how to speed this up?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants