CLI: Adding subcommand structure and unifying file suffixes #87

Open
karel-brinda opened this issue Feb 6, 2025 · 6 comments · May be fixed by #90

@karel-brinda
Collaborator

karel-brinda commented Feb 6, 2025

This is an updated version of ticket #78, incorporating all the associated discussion into the proposal. We need to restructure the CLI into well-separated subcommands and make some associated format-related changes. This will be a major update of KmerCamel.

File formats

Consistent suffixes:

  • masked-case MS: FASTA .msfa
  • superstring: TXT .s
  • mask: TXT .m
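
To make the relationship between the three files concrete, here is a minimal Python sketch. It assumes that in the mask-cased form an uppercase letter means the mask bit is 1 and a lowercase letter means 0, and that the .s and .m files hold a plain uppercase superstring and a string of 0/1 characters. The function names are hypothetical and are not part of KmerCamel.

```python
from __future__ import annotations


def split_mask_cased(ms: str) -> tuple[str, str]:
    """Split a mask-cased masked superstring into (superstring, mask).

    Assumption: uppercase letter = mask bit 1, lowercase letter = mask bit 0.
    """
    superstring = ms.upper()
    mask = "".join("1" if ch.isupper() else "0" for ch in ms)
    return superstring, mask


def join_mask_cased(superstring: str, mask: str) -> str:
    """Inverse of split_mask_cased: re-apply the mask as letter case."""
    if len(superstring) != len(mask):
        raise ValueError("superstring and mask must have the same length")
    return "".join(
        ch.upper() if bit == "1" else ch.lower()
        for ch, bit in zip(superstring, mask)
    )
```

Under these assumptions the two functions are inverses of each other, which is the property the proposed format-conversion subcommands would need to preserve.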

CLI improvement

Usage should be simpler, e.g.:

# MS computation
kmercamel ms -k 31 input.fa > dataset.msfa # (global) greedy default
kmercamel ms -k 31 -a streaming input.fa > dataset.msfa # -a switches between global / local / streaming
kmercamel ms -k 31 -m dataset.minone_mask -M dataset.maxone_mask input.fa > dataset.msfa # option for additionally adding max-one/min-one/both masks in separate files


# mask optimization
kmercamel maskopt -a maxone dataset.msfa > dataset_m1.msfa
kmercamel maskopt -a minone dataset.msfa > dataset_m2.msfa
kmercamel maskopt -a minrun dataset.msfa > dataset_m3.msfa

# format conversion
kmercamel ms2msfa -m dataset.m -s dataset.s > dataset.msfa # M and S -> mask-cased MS in msfa
kmercamel msfa2ms -m dataset.m -s dataset.s < dataset.msfa # mask-cased MS -> M and S
kmercamel msfa2spss < dataset.msfa > dataset.rspss # splitting MS in msfa into rSPSS in fa
kmercamel spss2msfa -k 31 < dataset.fa > dataset.msfa # rspss/general fasta to its corresponding ms

# lower bound
kmercamel lowerbound -k 31 input.fa # compute the lower bound on the length of any MS representation
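
The subcommand layout above can be modelled in a few lines. The following is only a Python sketch of the proposed interface using argparse; it is not KmerCamel's actual parser, and only two of the subcommands are shown. The option names are taken from the proposal, while the help texts and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Model of the proposed CLI layout (illustrative, not the real parser)."""
    parser = argparse.ArgumentParser(prog="kmercamel")
    sub = parser.add_subparsers(dest="command", required=True)

    # kmercamel ms -k 31 [-a ALGO] [-m FILE] [-M FILE] input.fa
    ms = sub.add_parser("ms", help="compute a masked superstring")
    ms.add_argument("-k", type=int, required=True, help="k-mer size")
    ms.add_argument("-a", choices=["global", "local", "streaming"],
                    default="global", help="algorithm (global greedy by default)")
    ms.add_argument("-m", metavar="FILE", help="also write a min-one mask")
    ms.add_argument("-M", metavar="FILE", help="also write a max-one mask")
    ms.add_argument("input")

    # kmercamel maskopt -a {maxone,minone,minrun} dataset.msfa
    maskopt = sub.add_parser("maskopt", help="optimize the mask")
    maskopt.add_argument("-a", choices=["maxone", "minone", "minrun"],
                         required=True, help="optimization criterion")
    maskopt.add_argument("input")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Each subcommand gets its own option namespace, so the same letter (here -a) can carry a different set of choices per subcommand without ambiguity.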

Notes

  • -c should always be on by default (canonical)
  • streaming even for maxone can be done in one pass
  • -k should be parsed automatically from the superstring (by default), and this extraction should be reported in a clearly visible message (e.g., "KMER SIZE EXTRACTED: 31")
  • if -k is provided and is not consistent with the suffix-encoded k, the program should fail (an optional check, if the user wants it)
  • as @PavelVesely noted, the previous point may already cause failures with min-one masks; this is not an issue, as we can always omit the parameter; in the future, we will want min-one optimization to always keep our masks consistent with our convention (i.e., always have the last k-mer switched on)
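
The k-extraction and consistency check described in these notes could look roughly as follows. This Python sketch assumes the mask-cased convention (uppercase = 1, lowercase = 0) and the convention mentioned above that the last k-mer is always switched on, so that the superstring ends with exactly k-1 lowercase letters. The names `infer_k` and `resolve_k` are hypothetical.

```python
import sys
from typing import Optional


def infer_k(mask_cased: str) -> int:
    """Infer k as (number of trailing lowercase letters) + 1.

    Assumption: the last k-mer is switched on, so exactly the final k-1
    positions are lowercase.
    """
    trailing_off = 0
    for ch in reversed(mask_cased):
        if ch.isupper():
            break
        trailing_off += 1
    else:
        # Loop finished without finding an uppercase letter (or input is empty).
        raise ValueError("no switched-on position: k cannot be inferred")
    return trailing_off + 1


def resolve_k(mask_cased: str, k: Optional[int] = None) -> int:
    """Return the inferred k; fail if a user-supplied k disagrees with it."""
    inferred = infer_k(mask_cased)
    print(f"KMER SIZE EXTRACTED: {inferred}", file=sys.stderr)
    if k is not None and k != inferred:
        raise ValueError(f"-k {k} is inconsistent with suffix-encoded k={inferred}")
    return inferred
```

A mask that switches off the last k-mer would make `infer_k` return a larger value than the true k, which is the failure mode with min-one masks raised in the last note.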
@karel-brinda karel-brinda changed the title CLI: Adding subcommand structure and improving file formats CLI: Adding subcommand structure and unifying file suffixes Feb 6, 2025
@OndrejSladky
Owner

kmercamel ms -k 31 --maxone dataset.m_alt input.fa > dataset.msfa # option for additionally adding max-one mask

I'm unsure if the library for input parsing I'm using supports double dashed full name arguments. Would need to check. Alternatively we can have it as -M maxone

@karel-brinda
Collaborator Author

karel-brinda commented Feb 6, 2025

-M is probably OK, even though --maxone or --max-one would be more explicit. If it's getopt, it should support long option names (see, e.g., https://suchprogramming.com/command-line-c/).
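
As an illustration of short and long spellings of the same option, here is a small sketch using Python's standard getopt module, which follows the same conventions as the C interface. The thread does not say which parsing library KmerCamel uses, so this is only an analogue; the long names --minone and --maxone and the function name are assumptions.

```python
import getopt


def parse_ms_options(argv):
    """Parse -k, -m/--minone and -M/--maxone; return (options, positionals)."""
    opts, positional = getopt.getopt(argv, "k:m:M:", ["minone=", "maxone="])
    parsed = {"k": None, "minone": None, "maxone": None}
    for name, value in opts:
        if name == "-k":
            parsed["k"] = int(value)
        elif name in ("-m", "--minone"):
            parsed["minone"] = value
        elif name in ("-M", "--maxone"):
            parsed["maxone"] = value
    return parsed, positional
```

With this setup, `-M dataset.maxone_mask` and `--maxone dataset.maxone_mask` are interchangeable, so the short form can stay compact while the long form remains self-explanatory.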

@PavelVesely
Collaborator

On second thought, using 'streaming' instead of 'online' would look better to people working on applications. Actually, it is a streaming (sublinear-space) algorithm in the sense that we use space proportional to # distinct k-mers, not the input length (if implemented in one pass).
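
The space argument can be illustrated with a one-pass scan that remembers only the distinct canonical k-mers seen so far. This Python sketch is not the streaming algorithm discussed here; it only shows why memory can scale with the number of distinct k-mers rather than with the input length. The function names are hypothetical, and only the letters A, C, G, T are handled.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def distinct_kmers(sequences, k: int) -> set:
    """Collect distinct canonical k-mers in a single pass over the sequences.

    `sequences` can be any iterable (e.g. a generator reading a file record by
    record), so the input itself never has to be held in memory.
    """
    seen = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            seen.add(canonical(seq[i:i + k]))
    return seen
```

Feeding the same sequence many times makes the input longer but leaves the size of `seen` unchanged.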

@karel-brinda
Collaborator Author

On second thought, using 'streaming' instead of 'online' would look better to people working on applications.

I've incorporated this into the proposal.

@karel-brinda
Collaborator Author

karel-brinda commented Feb 6, 2025

kmercamel ms -k 31 --maxone dataset.m_alt input.fa > dataset.msfa # option for additionally adding max-one mask

I'm unsure if the library for input parsing I'm using supports double dashed full name arguments. Would need to check. Alternatively we can have it as -M maxone

I've incorporated it into the proposal with the following additional change: the user may actually require exporting both basic types of masks - min-one or max-one - right during the MS computation (it will be actually extremely helpful even for us as in many situations we want to have both masks explicitly stored and this will quite significantly reduce the number of commands to execute).
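
To show what the two mask types mean, here is a small Python sketch that derives a max-one and a min-one mask from a superstring and any valid mask. It is a simplified model: k-mers are treated as plain strings (no canonical handling), the mask is a string of 0/1 characters of the same length as the superstring, and the min-one mask is obtained by keeping the first occurrence of each represented k-mer. The function names are hypothetical, and this is not KmerCamel's implementation.

```python
def represented_kmers(superstring: str, mask: str, k: int) -> set:
    """Set of k-mers whose starting position is switched on in the mask."""
    return {
        superstring[i:i + k]
        for i in range(len(superstring) - k + 1)
        if mask[i] == "1"
    }


def maxone_mask(superstring: str, mask: str, k: int) -> str:
    """Switch on every occurrence of every represented k-mer."""
    kmers = represented_kmers(superstring, mask, k)
    n = len(superstring)
    return "".join(
        "1" if i + k <= n and superstring[i:i + k] in kmers else "0"
        for i in range(n)
    )


def minone_mask(superstring: str, mask: str, k: int) -> str:
    """Switch on only the first occurrence of each represented k-mer."""
    remaining = represented_kmers(superstring, mask, k)
    n = len(superstring)
    bits = []
    for i in range(n):
        kmer = superstring[i:i + k]
        if i + k <= n and kmer in remaining:
            remaining.discard(kmer)  # later occurrences stay switched off
            bits.append("1")
        else:
            bits.append("0")
    return "".join(bits)
```

For the superstring ACACA with k = 2 and mask 10010, this gives the max-one mask 11110 and the min-one mask 11000. Note that the min-one result switches off the last k-mer, which is the conflict with the "last k-mer switched on" convention mentioned in the notes.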

@OndrejSladky
Owner

kmercamel ms -k 31 --maxone dataset.m_alt input.fa > dataset.msfa # option for additionally adding max-one mask

I'm unsure if the library for input parsing I'm using supports double dashed full name arguments. Would need to check. Alternatively we can have it as -M maxone

I've incorporated it into the proposal with the following additional change: the user may actually require exporting both basic types of masks - min-one or max-one - right during the MS computation (it will be actually extremely helpful even for us as in many situations we want to have both masks explicitly stored and this will quite significantly reduce the number of commands to execute).

Thanks, that's actually a pretty good idea. Does it make sense to have a parameter for minimizing the number of ones in the mask during construction, when the default mask already minimizes the number of ones?
Also, I'd avoid using the -m argument with two different meanings in different subcommands (i.e., once for minimizing ones and once for writing the mask to a separate file).

@OndrejSladky OndrejSladky linked a pull request Feb 12, 2025 that will close this issue

3 participants