Checksum-based dependency evaluation #30

Open
aboytsov opened this issue Jan 29, 2013 · 9 comments

@aboytsov
Contributor

There was a request for this.

Related to #11 (support general evaluator hooks).

It could be baked into Drake directly. This evaluator would ignore the timestamps of the input and output files and re-run the step only if the MD5 of the step's inputs has changed since the last run. The MD5s would probably have to be stored alongside the files (input.md5-drake or something like that) and would need to be moved/renamed with the files during branching, backups, etc. A forced rebuild should probably update the MD5s?
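
To make the idea concrete, here is a rough shell sketch of the check such an evaluator might perform. It is not Drake code: the function names are hypothetical, the .md5-drake sidecar suffix is just the naming floated above, and it assumes either md5sum (GNU) or md5 (BSD/macOS) is installed.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Drake. They compare an input's current
# MD5 with one stored in a sidecar file named "<input>.md5-drake".

checksum_of() {
    # Print the MD5 of the file's contents, reading from stdin so the
    # output does not depend on the file's name or path.
    if command -v md5sum >/dev/null 2>&1; then
        md5sum < "$1"
    else
        md5 < "$1"
    fi
}

needs_rerun() {
    # Succeeds (exit 0) when the step should run: either no checksum has
    # been recorded yet, or the content differs from the recorded one.
    # Timestamps are never consulted.
    input=$1
    sidecar="$input.md5-drake"
    [ -f "$sidecar" ] || return 0
    [ "$(checksum_of "$input")" != "$(cat "$sidecar")" ]
}

record_checksum() {
    # Store the current checksum next to the input after a successful
    # (or forced) run.
    checksum_of "$1" > "$1.md5-drake"
}
```

A caller would run the step only when needs_rerun succeeds and call record_checksum afterwards; touching the input without changing its contents would then not trigger a rebuild.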

@aboytsov
Contributor Author

We might not have the time to implement it right now. I'd be more than happy to review other people's contributions. If you need this feature, please chime in and say +1.

@larsyencken

Thanks for considering this. Actually, content-based dependencies are a means of getting general evaluation hooks into a system like Drake. That's the main use case.

For example, your workflow depends on code in a source tree, so you emit the current version to a file on every run, but only redo the downstream dependencies from that file if the content of that file changes (i.e. if the version changes).
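
As a rough sketch of that pattern in plain shell (the helper name write_if_changed is made up for illustration and is not a Drake feature), the version file is rewritten only when its content actually differs, so its timestamp stays put otherwise:

```shell
#!/bin/sh
# Hypothetical helper: read new content from stdin and replace the target
# only when the content differs from what is already there.
write_if_changed() {
    target=$1
    tmp="$target.tmp.$$"
    cat > "$tmp"
    if [ -f "$target" ] && cmp -s "$tmp" "$target"; then
        # Same content: drop the temp file and leave the target (and its
        # timestamp) untouched, so downstream steps stay up to date.
        rm -f "$tmp"
    else
        # New or changed content: replace the target, which makes
        # timestamp-based downstream steps rebuild.
        mv -f "$tmp" "$target"
    fi
}
```

For instance, something like `git describe --always | write_if_changed version.txt` could be run on every build, assuming the source tree is a git checkout; version.txt is an illustrative filename.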

@aboytsov
Contributor Author

Then probably we should implement #11 first.

But in the meantime, in simple cases you can probably just add your binary/JAR as a step's dependency. For more complicated ones, you can simulate the desired behavior the way I suggested on the mailing list:

should-i-run()
   .....
  # do all sorts of evaluation on inputs
  # if yes, touch the output
  touch $OUTPUT

input.trigger <- /some_source_or_whatever [method:should-i-run -timecheck]  
output <- input, input.trigger
  # real work
  ...

It's not ideal, because Drake would consider it a legitimate step and ask you to build it every time, instead of knowing that it's just an evaluation and reporting "Nothing to do" if everything is up-to-date. But it should work.

@larsyencken

It does sound like solving the general case is more important. Thanks, I'll try the workaround.

@larsyencken

Just a quick update. The workaround recipe above doesn't quite work, since a dependency like:

myfile.md5 <- myfile [method:calc-md5 -timecheck]

will only run when the md5 does not yet exist. However, the point of content-based dependency (rather than time-based) is to recalculate it every time. Here's a full working example:

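;
;  recalculate md5 every time, overwrite the old one only on change
;  XXX we have to use an implicit input name
;
maybe-md5()
    input="$(dirname "$OUTPUT")/$(basename "$OUTPUT" .md5)"
    md5 "$input" >"$OUTPUT.tmp"
    if [ ! -f "$OUTPUT" ]; then
        # no previous checksum: install the new one
        mv -f "$OUTPUT.tmp" "$OUTPUT"
    elif [ "$(cat "$OUTPUT")" != "$(cat "$OUTPUT.tmp")" ]; then
        # checksum changed: replace it so downstream steps rebuild
        mv -f "$OUTPUT.tmp" "$OUTPUT"
    else
        # unchanged: keep the old file (and its timestamp)
        rm -f "$OUTPUT.tmp"
    fi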

somefile.in.md5 <- [method:maybe-md5]

somefile.out <- somefile.in.md5
    cat somefile.in >somefile.out

Is there a flag that can be used to force a rebuild of a particular step every time? That'd make this nicer, giving access to the $INPUT family of special variables.

@aboytsov
Contributor Author

aboytsov commented Feb 2, 2013

Sorry, I was under the wrong impression that the timecheck option was what you were looking for. Apparently, it's not. I don't think there's currently an option to always re-run some targets, and we should add one. Also, it seems the check and timecheck option names might be a bit confusing, if even I made this mistake. We'll probably have to reconsider the naming.

@aboytsov
Contributor Author

aboytsov commented Feb 2, 2013

In the meantime, you can work around it by telling drake to force-rebuild every step that uses your md5 method, e.g.:

drake +=maybe-md5() your-other-targets

@aboytsov
Contributor Author

aboytsov commented Feb 2, 2013

I was just about to advise that you rewrite this:

somefile.out <- somefile.in.md5
    cat somefile.in >somefile.out

as this:

somefile.out <- somefile.in, somefile.in.md5
    cat $INPUT >$OUTPUT

But then I realized that it would make Drake run this step whenever somefile.in's timestamp changed, even if the file's contents did not. Removing somefile.in from the list of inputs is OK, but that forces you to hardcode the filename into the step's body, which is no good. Maybe this calls for some syntax to include filenames as dependencies but exclude them from timestamp-based (or any other) evaluation.

@aboytsov
Copy link
Contributor Author

aboytsov commented Feb 2, 2013

I've submitted a couple of related issues: #38, #40
