Automatic filename generation and making Drake even more cool #26

Open
aboytsov opened this issue Jan 28, 2013 · 10 comments

@aboytsov
Contributor

Would like to hear everyone's thoughts on this one.

Design, spec out, and implement automated filename generation for cases where filenames are not important. We can use the _ symbol to specify it. The filenames would still be persistent: they should be a function of information in the step, for example (probably in that order) the method used, other (named) outputs, the tags used, or the step's numeric position (worse). Even though this scheme can never guarantee that changing the workflow won't change the filenames, we should try to minimize such cases. Example:

_ <- input
  grep -v BAD_ENTRY $INPUT > $OUTPUT

_ <- _
  sort $INPUT > $OUTPUT

output <- _
  uniq $INPUT > $OUTPUT
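
The persistent-name rule described above could be sketched in Python as follows. This is only an illustration, not Drake's implementation: the function name `auto_filename`, the `drake_auto_` prefix, the `occurrence` counter (used to tell apart repeated uses of the same method), and the choice of a SHA-1 digest are all assumptions made for the example. The step's numeric position is consulted only when nothing more stable is available.

```python
import hashlib


def auto_filename(method=None, named_outputs=(), tags=(), position=0,
                  occurrence=0, prefix="drake_auto_"):
    """Derive a stable filename for an anonymous `_` output (hypothetical)."""
    parts = []
    # Prefer information that survives unrelated edits to the workflow.
    if method:
        parts.append("method=" + method)
    if named_outputs:
        parts.append("outputs=" + ",".join(sorted(named_outputs)))
    if tags:
        parts.append("tags=" + ",".join(sorted(tags)))
    if parts:
        # Assumed disambiguator for a method that is used several times.
        parts.append("occurrence=%d" % occurrence)
    else:
        # Last resort: the step's numeric position (the "worse" option).
        parts.append("position=%d" % position)
    digest = hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()[:12]
    return prefix + digest
```

Under these assumptions, inserting an unrelated step earlier in the workflow changes `position` but leaves the name of a step identified by its method unchanged. Adding another use of the same method earlier would still shift `occurrence`, so the scheme reduces renames rather than eliminating them.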

Or in combination with methods:

filter()
  grep -v BAD_ENTRY $INPUT > $OUTPUT

sort()
  sort $INPUT > $OUTPUT

uniq()
  uniq $INPUT > $OUTPUT

_ <- input            filter()
_ <- _ [retries:5]    sort()
_ <- _ [my-option:66] uniq()
output <- _           filter()          ; can be used several times, why not?

It is mostly useful for very simple relationships (single input, single output), but can be used in a more complicated context as well:

output1, _ <- input        ; two outputs, don't much care about naming of the second one
   ....

_ <- _
   ....

result <- output1, _       ; referring to the output1 directly
   ....
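
To make the binding of `_` in the example above concrete, here is a minimal Python sketch. It assumes that an input `_` refers to the most recent anonymous output, which is how the example reads but is not spelled out in the proposal. The function name `resolve_placeholders` and the `auto_N` names are hypothetical; a counter is used only to keep the sketch short, even though positional naming is the least stable option.

```python
def resolve_placeholders(steps):
    """Replace '_' in a list of (outputs, inputs) steps with generated names."""
    resolved = []
    last_auto = None  # name of the most recent anonymous output
    counter = 0
    for outputs, inputs in steps:
        # Resolve inputs first, so that `_ <- _` reads the previous output.
        new_inputs = []
        for name in inputs:
            if name == "_":
                if last_auto is None:
                    raise ValueError("input '_' has no preceding anonymous output")
                new_inputs.append(last_auto)
            else:
                new_inputs.append(name)
        new_outputs = []
        for name in outputs:
            if name == "_":
                counter += 1
                last_auto = "auto_%d" % counter
                new_outputs.append(last_auto)
            else:
                new_outputs.append(name)
        resolved.append((new_outputs, new_inputs))
    return resolved


# The three steps from the example above.
workflow = [
    (["output1", "_"], ["input"]),
    (["_"], ["_"]),
    (["result"], ["output1", "_"]),
]
resolved_workflow = resolve_placeholders(workflow)
```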

We could even add a special symbol (+) as a shortcut for (_ <- _):

+
  grep -v BAD_ENTRY $INPUT > $OUTPUT

+ 
  sort $INPUT > $OUTPUT

+ 
  uniq $INPUT > $OUTPUT

And if we relax the requirement for each step to begin on a new line (which is only important when the body is defined), then in combination with methods we could arrive at the following equivalent:

filter()
  grep -v BAD_ENTRY $INPUT > $OUTPUT

sort()
  sort $INPUT > $OUTPUT

unique()
  uniq $INPUT > $OUTPUT

+ filter + sort + unique
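
As a rough illustration of the one-line chain above, the following Python sketch expands it into one `_ <- _` step per method. The function name `expand_chain` and the tuple layout `(output, input, method)` are assumptions for the example; the proposal does not define a grammar for this form.

```python
def expand_chain(line):
    """Expand '+ a + b' into [('_', '_', 'a'), ('_', '_', 'b')] (hypothetical)."""
    methods = [part.strip() for part in line.split("+")]
    # Nothing may precede the first '+', and every '+' needs a method name.
    if len(methods) < 2 or methods[0] != "" or any(m == "" for m in methods[1:]):
        raise ValueError("expected a chain like '+ filter + sort'")
    return [("_", "_", m) for m in methods[1:]]


chain = expand_chain("+ filter + sort + unique")
```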

We can also introduce a rule that the very first input _ is replaced with the $in environment variable, and the very last output _ with the (optional) $out environment variable. The script above could then be invoked as:

drake -v in=my_input,out=my_output

and we can use Drake to create quick ad-hoc data processing pipelines without caring about naming intermediate data files.
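
The endpoint rule could look roughly like this in Python. It is a sketch under assumptions: `parse_vars` and `bind_endpoints` are hypothetical helpers, steps are modeled as `(outputs, inputs)` pairs, and "first input" and "last output" are taken to mean the first and last steps of the workflow.

```python
def parse_vars(spec):
    """Parse a string such as 'in=my_input,out=my_output' into a dict."""
    return dict(pair.split("=", 1) for pair in spec.split(",") if pair)


def bind_endpoints(steps, variables):
    """Bind the first input '_' to 'in' and the last output '_' to 'out'."""
    steps = [(list(outs), list(ins)) for outs, ins in steps]
    if steps and "in" in variables:
        outs, ins = steps[0]
        steps[0] = (outs, [variables["in"] if n == "_" else n for n in ins])
    if steps and "out" in variables:  # 'out' is optional
        outs, ins = steps[-1]
        steps[-1] = ([variables["out"] if n == "_" else n for n in outs], ins)
    return steps


# Three anonymous steps, as in the filter/sort/unique chain.
pipeline = [(["_"], ["_"]), (["_"], ["_"]), (["_"], ["_"])]
bound = bind_endpoints(pipeline, parse_vars("in=my_input,out=my_output"))
```

Intermediate `_` placeholders are left untouched here; they would still need generated filenames.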

For truly temporary files that should be deleted, we can use _?. The benefit of this is less obvious, because if the file is truly temporary, Drake will always run steps linked through such files together (there would never be a state where only one of them is up-to-date). It could still be convenient if you want a temporary file anyway, but want something else (Drake) to take care of its creation and deletion.
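
The lifecycle of such a temporary file might be sketched as below. This is an assumption about how `_?` could behave, not a description of Drake: the helper `run_linked` and the `drake_tmp_` prefix are hypothetical. The two linked steps always run together, and the file is removed afterwards even if a step fails.

```python
import os
import tempfile


def run_linked(producer, consumer):
    """Run two steps linked through a temporary file, then delete the file."""
    fd, path = tempfile.mkstemp(prefix="drake_tmp_")
    os.close(fd)  # the steps open the file by path themselves
    try:
        producer(path)
        return consumer(path)
    finally:
        os.remove(path)


def write_lines(path):
    with open(path, "w") as f:
        f.write("b\na\n")


def read_sorted(path):
    with open(path) as f:
        return sorted(f.read().split())


result = run_linked(write_lines, read_sorted)
```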

+1 if you like. Your feedback is appreciated.

@ghost ghost assigned aboytsov Jan 28, 2013
@dirtyvagabond
Contributor

Is there any value in supporting multiple such steps in one workflow, that are explicitly unique? E.g., _1, _2, etc.?

@aboytsov
Contributor Author

I thought about it and it doesn't seem so. Maybe with "truly" temporary filenames, i.e. _?1, _?2 it makes sense, but not with automatically generated ones. I couldn't find a reason why one would want to use _1 instead of simply 1. :)

@larsyencken

I know it's less pretty, but it would feel like less magic if this were a kind of suffix rule instead. Then it would be completely predictable at a glance (and would incidentally solve a different problem too, namely families of dependencies).

_.filtered <- _.input
  grep -v BAD_ENTRY $INPUT > $OUTPUT

_.sorted <- _.filtered
  sort $INPUT > $OUTPUT

_.output <- _.sorted
  uniq $INPUT > $OUTPUT

Too Makefile-ish?

@aboytsov
Contributor Author

I think it looks fine, but I'm not sure exactly what it accomplishes. If you can give each input or output a unique identifier, you can give it a filename as well - there's not much difference. The only difference would be if you want these files to be gone (deleted) when the workflow is finished, but as mentioned above, this is often undesirable, and also this is what _? is proposed to accomplish.

The very idea of _ is that you don't have to come up with any name. This allows for quick chaining like:

; Drake handles filenames for you!

_ <- in
  echo Step1

+
  echo Step2

+
  echo Step3

out <- _
  echo Step4

Otherwise, it seems to become a mechanism that only facilitates coming up with filenames rather than replacing it. For example, the condition of uniqueness would still hold. But it's arguable whether we need a separate mechanism for facilitation at all, because we already have variables as well as the $BASE variable:

; Equivalent to your example

BASE=/tmp/my-prefix.

filtered <- input

; or
BASE=
p=/tmp/my-prefix.

$[p]filtered <- $[p]input

Please let me know if you feel I'm missing something or wrong somewhere. Or did I simply misunderstand you?

@aboytsov
Contributor Author

Maybe you were referring to suffix rules in the sense that Drake would automatically apply them to the existing files?

@larsyencken

Yeah, that's what I meant. Your example works for a single file, but not for multiple input files, each having the full workflow to transform it to an output.
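
A rough Python sketch of that idea, applying one set of suffix rules to every matching input file, might look like this. The function `expand_suffix_rules` and the rule format `(output_suffix, input_suffix)` are hypothetical and only illustrate the "family of files" behavior; nothing here is existing Drake functionality.

```python
def expand_suffix_rules(existing_files, rules):
    """Instantiate (output_suffix, input_suffix) rules for each input file."""
    first_input_suffix = rules[0][1]
    # Every existing file with the first rule's input suffix starts a family.
    stems = sorted(
        name[:-len(first_input_suffix)]
        for name in existing_files
        if name.endswith(first_input_suffix)
    )
    return [
        (stem + out_suffix, stem + in_suffix)
        for stem in stems
        for out_suffix, in_suffix in rules
    ]


rules = [(".filtered", ".input"), (".sorted", ".filtered"), (".output", ".sorted")]
steps = expand_suffix_rules(["b.input", "a.input", "notes.txt"], rules)
```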

But also, I find having a minimal naming scheme for the intermediate files useful, so that you know what they are by filename alone.

Apologies if I'm hijacking this feature suggestion though. If you like, I can write a separate feature request for working with families of files.

@aboytsov
Contributor Author

I see. Yes, if you meant automatic application to the existing files, it seems to be a separate feature, related to the recent discussion on the mailing list. Take a look at this thread and see if that's what you had in mind? I think it definitely deserves a feature request.

As for the minimal naming scheme for the intermediate files (a somewhat unrelated thing, correct?), I still can't see how it's different from using filenames - in both cases you have to come up with some unique identifiers that would later be translated into filenames, either directly or indirectly. So not sure what adding indirection could bring to the table?

@larsyencken

The mailing list discussion is right on. I'll think about a separate feature proposal.

As for suffixes for the single-file case, you're basically saying that if you go to the effort of writing suffix rules, you may as well have named your intermediate files (or named your steps). Now that I've thought about it, I understand and agree. Any alternative syntax I can think of for this kind of checkpointing seems worse. I'm on board :)

@aboytsov
Contributor Author

You're spot on. Thanks for the productive discussion! Would be quite interested to hear your thoughts on the templates/globbing as well (in another ticket, probably).

@aboytsov
Contributor Author

aboytsov commented Feb 3, 2013

For the (not strictly related) issue of automatically creating rules for existing files, a feature request has been filed: #41
