Automatic filename generation and making Drake even more cool #26

aboytsov · 2013-01-28T23:27:24Z

Would like to hear everyone's thoughts on this one.

Design, spec out and implement automated filename generation for cases where filenames are not important. We can use the _ symbol to specify it. The filenames would still be persistent: they should be a function of information in the step, for example (probably in that order) the method used, other (named) outputs, tags used, or the step's numeric position (worse). Even though this scheme can never guarantee that changing the workflow won't change the filenames, we should try to minimize such cases. Example:

_ <- input
  grep -v BAD_ENTRY $INPUT > $OUTPUT

_ <- _
  sort $INPUT > $OUTPUT

output <- _
  uniq $INPUT > $OUTPUT

Or in combination with methods:

filter()
  grep -v BAD_ENTRY $INPUT > $OUTPUT

sort()
  sort $INPUT > $OUTPUT

uniq()
  uniq $INPUT > $OUTPUT

_ <- input            filter()
_ <- _ [retries:5]    sort()
_ <- _ [my-option:66] uniq()
output <- _           filter()          ; can be used several times, why not?
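A rough sketch of how such persistent names might be derived, written in Python for illustration only and not taken from Drake's code. The function name `auto_names`, the dict keys and the `_label.digest` naming pattern are all assumptions. The key is built from the method, named outputs and tags; an occurrence counter (rather than the step's absolute position) separates repeated uses of the same method, so inserting an unrelated step does not rename the others.

```python
import hashlib
from collections import Counter


def auto_names(steps):
    """Return one generated filename per step.

    Each step is a dict with optional 'method', 'outputs' and 'tags' keys
    (a hypothetical representation, not Drake's internal one).
    """
    seen = Counter()
    names = []
    for step in steps:
        # Build the key from step information only, never from its position.
        key = "|".join([
            step.get("method", ""),
            ",".join(sorted(step.get("outputs", ()))),
            ",".join(sorted(step.get("tags", ()))),
        ])
        seen[key] += 1
        # The per-key counter disambiguates a method that is used several times.
        raw = "%s#%d" % (key, seen[key])
        digest = hashlib.sha1(raw.encode("utf-8")).hexdigest()[:10]
        label = step.get("method") or "step"
        names.append("_%s.%s" % (label, digest))
    return names


workflow = [
    {"method": "filter"},
    {"method": "sort"},
    {"method": "uniq"},
    {"method": "filter"},  # same method used a second time
]
names = auto_names(workflow)
```

Names still change when a step's own method, outputs or tags change, or when an identical step is inserted before it, which matches the caveat that no scheme can fully guarantee stability.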

It is mostly useful for very simple relationships (single input, single output), but can be used in a more complicated context as well:

output1, _ <- input        ; two outputs, don't much care about naming of the second one
   ....

_ <- _
   ....

result <- output1, _       ; referring to the output1 directly
   ....

We could even add a special symbol (+) as a shortcut for (_ <- _):

+
  grep -v BAD_ENTRY $INPUT > $OUTPUT

+
  sort $INPUT > $OUTPUT

+
  uniq $INPUT > $OUTPUT

And if we relax the requirement for each step to begin on a new line (which is only important when the body is defined), then in combination with methods we could arrive at the following equivalent:

filter()
  grep -v BAD_ENTRY $INPUT > $OUTPUT

sort()
  sort $INPUT > $OUTPUT

unique()
  uniq $INPUT > $OUTPUT

+ filter + sort + unique

We can also introduce rules so that the very first input _ is replaced with the $in environment variable, and the very last output _ with the (optional) $out environment variable. Then the script above could be invoked as:

drake -v in=my_input,out=my_output

and we can use Drake to create quick ad-hoc data processing pipelines without caring about naming intermediate data files.
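To make the expansion concrete, here is a small Python sketch of how a `+ filter + sort + unique` line might be desugared into explicit steps, with the first input and last output taken from `in` and `out`. The function `expand_chain` and the tuple layout are hypothetical; this is one possible reading of the proposal, not how Drake parses workflows.

```python
def expand_chain(line, variables):
    """Expand '+ a + b + c' into (output, input, method) steps.

    Assumptions: every '+' stands for '_ <- _'; the first input '_' is
    replaced by variables['in'] and the last output '_' by the optional
    variables['out'].
    """
    methods = [m.strip() for m in line.split("+") if m.strip()]
    steps = [["_", "_", m] for m in methods]
    if steps:
        if "in" in variables:
            steps[0][1] = variables["in"]
        if "out" in variables:
            steps[-1][0] = variables["out"]
    return [tuple(s) for s in steps]


steps = expand_chain(
    "+ filter + sort + unique",
    {"in": "my_input", "out": "my_output"},
)
# steps[0] reads my_input, steps[-1] writes my_output,
# everything in between stays an auto-named '_'.
```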

For truly temporary files that should be deleted, we can use _?. The benefit of this is less obvious, because if the file is truly temporary, Drake will always run steps linked through such files together (there would never be a state where only one of them is up-to-date). It could still be convenient if you want a temporary file anyway, but want something else (Drake) to take care of its creation and deletion.
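As an illustration of what "taking care of creation and deletion" could look like, here is a minimal Python sketch that runs two linked steps as one unit and always removes the intermediate file. `run_linked_steps` and the callable-based step interface are assumptions made for this example, not Drake behaviour.

```python
import os
import tempfile


def run_linked_steps(first_step, second_step, input_path, output_path):
    """Run two steps joined by a truly temporary file.

    Each step is a callable taking (input_path, output_path). The temporary
    file is created before the first step and removed afterwards, even if
    one of the steps raises. Returns the (now deleted) temporary path.
    """
    fd, tmp_path = tempfile.mkstemp(prefix="drake_tmp_")
    os.close(fd)
    try:
        first_step(input_path, tmp_path)
        second_step(tmp_path, output_path)
    finally:
        os.remove(tmp_path)
    return tmp_path
```

Because the intermediate file never survives the run, the two steps can only ever be up-to-date or out-of-date together.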

+1 if you like. Your feedback is appreciated.

dirtyvagabond · 2013-01-29T14:44:10Z

Is there any value in supporting multiple such steps in one workflow, that are explicitly unique? E.g., _1, _2, etc.?

aboytsov · 2013-01-29T21:35:52Z

I thought about it and it doesn't seem so. Maybe with "truly" temporary filenames, i.e. _?1, _?2 it makes sense, but not with automatically generated ones. I couldn't find a reason why one would want to use _1 instead of simply 1. :)

larsyencken · 2013-01-31T00:27:45Z

I know it's less pretty, but it would feel like less magic if this were a kind of suffix rule instead. Then it would be completely predictable at a glance (and would incidentally solve a different problem too, namely families of dependencies).

_.filtered <- _.input
  grep -v BAD_ENTRY $INPUT > $OUTPUT

_.sorted <- _.filtered
  sort $INPUT > $OUTPUT

_.output <- _.sorted
  uniq $INPUT > $OUTPUT

Too Makefile-ish?

aboytsov · 2013-01-31T00:40:14Z

I think it looks fine, but I'm not sure exactly what it accomplishes. If you can give each input or output a unique identifier, you can give it a filename as well - there's not much difference. The only difference would be if you want these files to be gone (deleted) when the workflow is finished, but as mentioned above, this is often undesirable, and also this is what _? is proposed to accomplish.

The very idea of _ is that you don't have to come up with any name. This allows for quick chaining like:

; Drake handles filenames for you!

_ <- in
  echo Step1

+
  echo Step2

+
  echo Step3

out <- _
  echo Step4

Otherwise, it seems to become a mechanism that only facilitates coming up with filenames, rather than replacing it. For example, the condition of uniqueness would still hold. But it's arguable whether we need a separate mechanism for facilitation at all, because we already have variables as well as the $BASE variable:

; Equivalent to your example

BASE=/tmp/my-prefix.

filtered <- input

; or
BASE=
p=/tmp/my-prefix.

$[p]filtered <- $[p]input

Please let me know if you feel I'm missing something or wrong somewhere. Or did I simply misunderstand you?

aboytsov · 2013-01-31T00:44:17Z

Maybe you were referring to suffix rules in the sense that Drake would automatically apply them to the existing files?

larsyencken · 2013-01-31T00:50:12Z

Yeah, that's what I meant. Your example works for a single file, but not for multiple input files, each having the full workflow to transform it to an output.

But also, I find having a minimal naming scheme for the intermediate files useful, so that you know what they are by filename alone.

Apologies if I'm hijacking this feature suggestion though. If you like, I can write a separate feature request for working with families of files.
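For reference, the "families of files" idea could be sketched roughly as follows in Python: each suffix rule is applied to every existing file that matches the first input suffix. `expand_suffix_rules` and the `(output_suffix, input_suffix)` rule format are hypothetical and only meant to show the intent, not a proposed implementation.

```python
def expand_suffix_rules(filenames, rules):
    """Instantiate a chain of suffix rules for every matching file.

    rules is an ordered list of (output_suffix, input_suffix) pairs; the
    input suffix of the first rule selects which existing files start a
    chain. Returns concrete (output, input) pairs, one per rule per file.
    """
    first_input_suffix = rules[0][1]
    steps = []
    for name in sorted(filenames):
        if not name.endswith(first_input_suffix):
            continue
        stem = name[: -len(first_input_suffix)]
        for out_suffix, in_suffix in rules:
            steps.append((stem + out_suffix, stem + in_suffix))
    return steps


rules = [
    (".filtered", ".input"),
    (".sorted", ".filtered"),
    (".output", ".sorted"),
]
steps = expand_suffix_rules(["a.input", "b.input", "notes.txt"], rules)
```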

aboytsov · 2013-01-31T01:13:34Z

I see. Yes, if you meant automatic application to the existing files, it seems to be a separate feature, related to the recent discussion on the mailing list. Take a look at this thread and see if that's what you had in mind? I think it definitely deserves a feature request.

As for the minimal naming scheme for the intermediate files (a somewhat unrelated thing, correct?), I still can't see how it's different from using filenames - in both cases you have to come up with some unique identifiers that would later be translated into filenames, either directly or indirectly. So not sure what adding indirection could bring to the table?

larsyencken · 2013-01-31T04:05:22Z

The mailing list discussion is right on. I'll think about a separate feature proposal.

As for suffixes for the single-file case, you're basically saying that if you go to the effort of writing suffix rules, you may as well have named your intermediate files (or named your steps). Now that I've thought about it, I understand and agree. Any alternative syntax I can think of for this kind of checkpointing seems worse. I'm on board :)

aboytsov · 2013-01-31T05:13:43Z

You're spot on. Thanks for the productive discussion! Would be quite interested to hear your thoughts on the templates/globbing as well (in another ticket, probably).

aboytsov · 2013-02-03T07:40:19Z

For the (not strictly related) issue of automatically creating rules for existing files, a feature request has been filed: #41

ghost assigned aboytsov Jan 28, 2013

aboytsov mentioned this issue Feb 3, 2013

Consider using () for method invocation #23

Open
