Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

minigraph requires GFA 1.0 overlaps, which are optional in the spec #6

Open
edawson opened this issue Oct 13, 2019 · 4 comments
Open
Labels
question Further information is requested

Comments

@edawson
Copy link

edawson commented Oct 13, 2019

The GFA specs state that overlaps for L lines are optional, but minigraph seems to require these. I think this happens because there is no way to avoid parsing a CIGAR string in this code block.

For GFA 2.0, the "*" placeholder is used to denote a lack of an overlap CIGAR. GFA 1.0 doesn't specify a placeholder, just that the field is optional. I have always assumed that a GFA with no CIGAR in the L/E lines implies a non-overlap (i.e., a CIGAR of 0M).

Would it be possible to adjust the default condition so that the parser can handle all valid GFA 1 files?

@lh3 lh3 added the duplicate This issue or pull request already exists label Oct 13, 2019
@lh3
Copy link
Owner

lh3 commented Oct 13, 2019

Duplicate of #1.

Minigraph is more for mapping to a reference graph. I don't think a reference graph should allow overlaps. Also, it is tricky to work with overlaps. It will take time to implement the feature. Minigraph may support overlaps in future, but that won't happen soon unfortunately.

@edawson
Copy link
Author

edawson commented Oct 13, 2019

I think these two issues are related but maybe not duplicates. The issue isn't the graph structure but its format in this case.

I have a graph with no overlaps (CIGAR 0M), but the CIGAR is not included in the Link lines:

L       2112999 +       2113002 +     

This is valid GFA1 and my graph is a reference graph (i.e., constructed from a reference genome backbone with reference-relative variation added). minigraph just refused to parse it because it expects the sixth field (CIGAR) to be present.

Edit: I agree though that reference graphs should not have overlaps. That would make everything a lot harder!

@lh3 lh3 added question Further information is requested and removed duplicate This issue or pull request already exists labels Oct 22, 2019
@lh3
Copy link
Owner

lh3 commented Oct 22, 2019

Sorry for misreading your question. My initial intention was to require CIGAR because L-lines may have tags. Making CIGAR optional will complicate tags. In addition, 0M doesn't waste much space anyway. I didn't realize that GFA1 makes this field optional, which IMO is not ideal.

@edawson
Copy link
Author

edawson commented Oct 24, 2019

Making CIGAR optional will complicate tags

Yes - I can file an issue on the spec to clarify whether "CIGAR optional" means no field present or CIGAR field == empty string. I have only ever had GFA files where, if tags are present, there is also a CIGAR. There's some room for clarification on that.

I think a good fix would be to note in the README/docs:

  1. That input is GFA 1.0 (or rGFA), and
  2. That CIGAR strings are required and overlaps (i.e., CIGAR != "0M") are discouraged
  3. That tags are permitted IFF a CIGAR is present.

I think that would sufficient to prevent any users from falling into traps, without having to change the source code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants