Checksum-based dependency evaluation #30

aboytsov · 2013-01-29T07:04:44Z

Related to #11 (support general evaluator hooks).

It could be baked into Drake directly. This evaluator would ignore the timestamps of the input and output files and re-run the step only if the MD5 of the step's inputs has changed since the last run. The MD5s would probably have to be stored alongside the files (input.md5-drake or something like that), and would need to be moved/renamed with the files when branching, making backups, etc. Presumably a forced rebuild should update the MD5s as well?
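To make the proposal concrete, here is a minimal shell sketch of the decision such an evaluator would make. It is not Drake code: the function names `file_md5`, `input_changed`, and `record_checksum` are hypothetical, and the `.md5-drake` sidecar suffix is just the placeholder name suggested above.

```shell
#!/bin/sh
# Hypothetical helper: print the MD5 of a file with whichever tool is installed
# (md5sum on most Linux systems, md5 on macOS/BSD).
file_md5() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum < "$1" | cut -d ' ' -f 1
    else
        md5 -q "$1"
    fi
}

# Succeed (exit 0) if the file's content differs from the checksum recorded
# in its sidecar, or if no sidecar exists yet. Timestamps are never consulted.
input_changed() {
    sidecar="$1.md5-drake"   # assumed sidecar name, not a Drake convention
    if [ ! -f "$sidecar" ]; then
        return 0
    fi
    [ "$(file_md5 "$1")" != "$(cat "$sidecar")" ]
}

# Record the current checksum after a successful (or forced) run.
record_checksum() {
    file_md5 "$1" > "$1.md5-drake"
}
```

A step would be re-run when `input_changed` succeeds for any of its inputs, and `record_checksum` would be called for each input once the step completes.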

aboytsov · 2013-01-29T07:05:28Z

We might not have the time to implement it right now. I'd be more than happy to review other people's contributions. If you need this feature, please chime in and say +1.

larsyencken · 2013-01-29T12:28:28Z

Thanks for considering this. Actually, content-based dependencies are a means of getting general evaluation hooks into a system like Drake. That's the main use case.

For example, your workflow depends on code in a source tree, so you emit the current version to a file on every run, but only redo the steps downstream of that file if its content changes (i.e. if the version changes).
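A minimal shell sketch of that pattern, assuming the version is first written to a temporary file. The helper name `replace_if_changed` and the file names in the usage comment are hypothetical; `cmp -s` is the standard POSIX silent byte comparison.

```shell
#!/bin/sh
# Hypothetical helper: replace target ($2) with the new file ($1) only when
# their contents differ, so the target's timestamp moves only on a real change.
replace_if_changed() {
    new="$1"
    target="$2"
    if [ -f "$target" ] && cmp -s "$new" "$target"; then
        # Same content: discard the new copy and leave the target untouched.
        rm -f "$new"
    else
        # Missing or different: install the new content.
        mv -f "$new" "$target"
    fi
}

# Example use ("git describe" stands in for whatever reports your code version):
#   git describe --always > version.txt.tmp
#   replace_if_changed version.txt.tmp version.txt
```

Because the target is left alone when nothing changed, a timestamp-based tool sees it as up to date and skips the downstream steps.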

aboytsov · 2013-01-29T21:42:15Z

Then probably we should implement #11 first.

But in the meantime: in simple cases, you can probably just add your binary/JAR as a step's dependency. For more complicated ones, you can simulate the desired behavior the way I suggested on the mailing list:

should-i-run()
   .....
  # do all sorts of evaluation on inputs
  # if yes, touch the output
  touch $OUTPUT

input.trigger <- /some_source_or_whatever [method:should-i-run -timecheck]  
output <- input, input.trigger
  # real work
  ...

It's not ideal, because Drake would consider it a legitimate step and ask you to build it every time, instead of knowing that it's just an evaluation and reporting "Nothing to do" when everything is up to date. But it should work.

larsyencken · 2013-01-29T22:40:14Z

It does sound like solving the general case is more important. Thanks, I'll try the workaround.

larsyencken · 2013-02-02T02:57:36Z

Just a quick update. The workaround recipe above doesn't quite work, since a dependency like:

myfile.md5 <- myfile [method:calc-md5 -timecheck]

will only run when the md5 does not yet exist. However, the point of content-based dependency (rather than time-based) is to recalculate it every time. Here's a full working example:

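;
;  recalculate the md5 every time, overwrite the old one only on change
;  XXX we have to use an implicit input name
;
maybe-md5()
    input="$(dirname "$OUTPUT")/$(basename "$OUTPUT" .md5)"
    md5 "$input" >"$OUTPUT.tmp"
    if [ ! -f "$OUTPUT" ]; then
        # first run: no previous checksum to compare against
        mv -f "$OUTPUT.tmp" "$OUTPUT"
    elif [ "$(cat "$OUTPUT")" != "$(cat "$OUTPUT.tmp")" ]; then
        # content changed: replace the checksum so downstream steps rerun
        mv -f "$OUTPUT.tmp" "$OUTPUT"
    else
        # unchanged: keep the old checksum (and its timestamp)
        rm -f "$OUTPUT.tmp"
    fi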

somefile.in.md5 <- [method:maybe-md5]

somefile.out <- somefile.in.md5
    cat somefile.in >somefile.out

Is there a flag that can be used to force a rebuild of a particular step every time? That'd make this nicer, giving access to the $INPUT family of special variables.

aboytsov · 2013-02-02T04:46:47Z

Sorry, I was under the wrong impression that the timecheck option was what you were looking for. Apparently it's not. I don't think there's currently an option to always re-run some targets, and we should add one. Also, the check and timecheck option names seem a bit confusing, if even I made this mistake. We'll probably have to reconsider the naming.

aboytsov · 2013-02-02T04:49:27Z

In the meantime, you can work around it by telling drake to force-rebuild every step that uses your md5 method, e.g.:

drake +=maybe-md5() your-other-targets

aboytsov · 2013-02-02T04:59:00Z

I was just about to advise that you rewrite this:

somefile.out <- somefile.in.md5
    cat somefile.in >somefile.out

as this:

somefile.out <- somefile.in, somefile.in.md5
    cat $INPUT >$OUTPUT

But then I realized that it would make Drake run this step whenever somefile.in's timestamp changed, even if the file's contents didn't. Removing somefile.in from the list of inputs is OK, but it forces you to hardcode the filename into the step's body, which is no good. Maybe this calls for some syntax to include filenames as dependencies but exclude them from timestamp-based (or any other) evaluation.

aboytsov · 2013-02-02T05:59:16Z

I've submitted a couple of related issues: #38, #40

This was referenced Feb 2, 2013

Add a step option to require force-rebuild #38

Open

File-level evaluators #40

Open
