Automatic filename generation and making Drake even more cool #26
Comments
Is there any value in supporting multiple such steps in one workflow, that are explicitly unique? E.g., _1, _2, etc.?
I thought about it and it doesn't seem so. Maybe with "truly" temporary filenames, i.e.
I know it's less pretty, but it would feel like less magic if this were a kind of suffix rule instead. Then it would be completely predictable at a glance (and would incidentally solve a different problem too, namely families of dependencies).
Too Makefile-ish?
I think it looks fine, but I'm not sure exactly what it accomplishes. If you can give each input or output a unique identifier, you can give it a filename as well - there's not much difference. The only difference would be if you want these files to be gone (deleted) when the workflow is finished, but as mentioned above, this is often undesirable, and also this is what The very idea of
Otherwise, it seems to become a mechanism that only facilitates coming up with filenames rather than replacing it. For example, the condition of uniqueness would still hold. But it's arguable whether we need a separate mechanism for facilitation at all, because we already have variables as well as
Please let me know if you feel I'm missing something or wrong somewhere. Or did I simply misunderstand you?
Maybe you were referring to suffix rules in the sense that Drake would automatically apply them to the existing files?
Yeah, that's what I meant. Your example works for a single file, but not for multiple input files, each having the full workflow to transform it to an output. But also, I find having a minimal naming scheme for the intermediate files useful, so that you know what they are by filename alone. Apologies if I'm hijacking this feature suggestion though. If you like, I can write a separate feature request for working with families of files.
I see. Yes, if you meant automatic application to the existing files, it seems to be a separate feature, related to the recent discussion on the mailing list. Take a look at this thread and see if that's what you had in mind? I think it definitely deserves a feature request. As for the minimal naming scheme for the intermediate files (a somewhat unrelated thing, correct?), I still can't see how it's different from using filenames - in both cases you have to come up with some unique identifiers that would later be translated into filenames, either directly or indirectly. So I'm not sure what adding indirection could bring to the table?
The mailing list discussion is right on. I'll think about a separate feature proposal. As for suffixes for the single-file case, you're basically saying that if you go to the effort of writing suffix rules, you may as well have named your intermediate files (or named your steps). Now that I've thought about it, I understand and agree. Any alternative syntax I can think of for this kind of checkpointing seems worse. I'm on board :)
You're spot on. Thanks for the productive discussion! Would be quite interested to hear your thoughts on the templates/globbing as well (in another ticket, probably).
For the (not strictly related) issue of automatically creating rules for existing files, a feature request has been filed: #41
Would like to hear everyone's thoughts on this one.
Design, spec out and implement automated filename generation for cases where filenames are not important. We can use the _ symbol to specify it. The filenames would still be persistent: they should be a function of information in the step, for example (probably in that order) the method used, other (named) outputs, tags used, or the step's numeric position (worse). Even though these schemes can never guarantee that changing the workflow won't change the filenames, we should try to minimize such cases. Example:
Or in combination with methods:
It is mostly useful for very simple relationships (single input, single output), but can be used in a more complicated context as well:
We could even add a special symbol (+) as a shortcut for (_ <- _):
And if we relax the requirement for each step to begin on a new line (which is only important when the body is defined), in combination with methods we could arrive at the following equivalent:
We can also introduce rules that the very first input _ is replaced with the $in environment variable, and the very last output _ with the (optional) $out environment variable. The script above could then be invoked as:
and we can use Drake to create quick ad-hoc data processing pipelines without caring about naming intermediate data files.
For truly temporary files that should be deleted, we can use _?. The benefit of this is less obvious, because if the file is truly temporary, Drake will always run steps linked through such files together (there would never be a state where only one of them is up-to-date). It could still be convenient if you want a temporary file anyway and just want something else (Drake) to take care of its creation and deletion.
+1 if you like. Your feedback is appreciated.