Limit reads from the store #162

Open
YPares opened this issue Mar 10, 2020 · 1 comment

Comments

@YPares
Contributor

YPares commented Mar 10, 2020

To my knowledge, in the current state of funflow there are two kinds of hashes for step/stepIO: those obtained from ContentHashable and those obtained from the binary representation of a serialized result (and they have no reason to coincide).
In a cached step a :: (ContentHashable m x, Store y) => Flow ex eff x y, y is indexed in the store by the hash of its binary representation. That means that if the pure step a is used in this flow:

a >>> b

Then, to know whether to re-execute b, we have to read y from the store and let b hash it in order to decide whether b should run. Am I correct on that? This might be different for external tasks, as I understand it, but I think it is the case for step/stepIO.
If we were to require y to be ContentHashable too, we would use only one type of hash, and unless some step inserted between a and b modifies y or takes a subset of it, b would not need to read anything, since we would already know the hash of its input (the hashes could be passed along the flow transparently).
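
To make the two hash sources concrete, here is a self-contained toy sketch (the classes below are simplified stand-ins for ContentHashable and Store, not the real signatures):

import Data.ByteString (ByteString)

newtype ContentHash = ContentHash ByteString deriving (Eq, Show)

class ContentHashable a where
  contentHash :: a -> ContentHash   -- hash computed from the value itself

class Store a where
  encode :: a -> ByteString         -- binary serialization

-- Hash used to decide whether a cached step needs to run, computed from its input:
inputHash :: ContentHashable x => x -> ContentHash
inputHash = contentHash

-- Hash under which a step's output is indexed in the store, computed from its serialized form:
outputHash :: Store y => y -> ContentHash
outputHash = ContentHash . encode   -- a real version would digest these bytes

-- Nothing forces inputHash v == outputHash v for the same value v, hence the
-- extra read + re-hash before b can be skipped.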

We have a potential use case that would rely heavily on remote caching, whose feasibility I want to assess, and limiting reads would become important in such a remote context.

The same could be achieved by requiring the input x to be Store and always hashing the Store representation (on the logic that a step's input is often another step's output and needs to be serialized at some point anyway).
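
As a rough sketch of that alternative, reusing the toy Store class above (cacheKey is a hypothetical helper, not an existing funflow function):

-- Derive the cache key for any step input from its Store encoding, so the same
-- key works whether the value was just computed upstream or read back from the store:
cacheKey :: Store x => x -> ContentHash
cacheKey = ContentHash . encode   -- a real version would digest these bytes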

cc @nc6 @regnat

@nc6
Member

nc6 commented Apr 14, 2020

Yes, I see the issue here. I'm not sure that changing the hashing mechanism (by itself) would help that much, because we still actually try to return the value from the store for the computation. What we could do instead is return something like:

data Cached o = Cached {
  -- | Deliberately lazy (the ~ annotation needs the StrictData extension)
  cachedValue :: ~o,
  cachedHash :: ContentHash
}

The runner could then take advantage of this to "skip" steps if not needed. We would want to ensure that the hash:

  • Is deducible from the store path/metadata, and
  • Matches the hash given from serialisation

This should hopefully be the case, but we would need to check.
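
A very rough sketch of what the runner side could look like, assuming the Cached type above (lookupCache, readFromStore and runStep are hypothetical stand-ins for store access and step execution, not existing funflow code):

import System.IO.Unsafe (unsafeInterleaveIO)

runCachedStep
  :: (ContentHash -> IO (Maybe ContentHash))  -- input hash -> cached output hash, if any
  -> (ContentHash -> IO o)                    -- read an output back from the store
  -> (i -> IO (o, ContentHash))               -- actually run the step and hash its output
  -> Cached i
  -> IO (Cached o)
runCachedStep lookupCache readFromStore runStep (Cached input inHash) = do
  hit <- lookupCache inHash
  case hit of
    Just outHash -> do
      -- Cache hit: defer the store read; it only happens if something later
      -- forces cachedValue, so a chain of hits never touches the store.
      value <- unsafeInterleaveIO (readFromStore outHash)
      pure (Cached value outHash)
    Nothing -> do
      (value, outHash) <- runStep input
      pure (Cached value outHash)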
