From my knowledge, in the current state of funflow there are two kinds of hashes for `step`/`stepIO`: those obtained from `ContentHashable`, and those obtained from the binary representation of a serialized result (and they have no reason to converge).
In a cached step `a :: (ContentHashable m x, Store y) => Flow ex eff x y`, the output `y` is indexed in the store by the hash of its binary representation. That means that if pure step `a` is used in this flow:
a >>> b
then to know whether to re-execute `b` or not, we have to read `y` from the store and let `b` hash it in order to know if `b` should run. Am I correct on that? This might be different for external tasks, as I understand, but I think for `step`/`stepIO` it's the case.
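A minimal, self-contained sketch of that read-then-hash cycle (plain GHC with `containers`; every name here is illustrative, not funflow's actual API):

```haskell
import qualified Data.Map.Strict as Map

-- The two distinct hash notions described above.
newtype SerializedHash = SerializedHash Int deriving (Eq, Ord, Show)
newtype ContentHash    = ContentHash    Int deriving (Eq, Ord, Show)

-- Toy stand-ins for the Store-based and ContentHashable-based hashes;
-- deliberately different, so the two never converge.
serializedHash :: String -> SerializedHash
serializedHash = SerializedHash . foldl (\h c -> h * 31 + fromEnum c) 17

contentHash :: String -> ContentHash
contentHash = ContentHash . foldl (\h c -> h * 33 + fromEnum c) 5

-- a's output y is indexed by the hash of its serialized form...
type Store = Map.Map SerializedHash String

-- ...while b's cache is keyed by the ContentHash of b's input.
type BCache = Map.Map ContentHash String

-- Deciding whether to re-run b forces a store read first:
shouldRunB :: Store -> BCache -> SerializedHash -> Bool
shouldRunB store bCache yKey =
  case Map.lookup yKey store of   -- (1) read y back out of the store
    Nothing -> True               --     y absent: must run b anyway
    Just y  -> contentHash y `Map.notMember` bCache  -- (2) hash it, check b's cache
```

Over a remote store, step (1) is a full round-trip for the value itself, even when the answer turns out to be "b is already cached".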
If we were to require `y` to be `ContentHashable` too, we would use only one type of hash, and unless some step inserted between `a` and `b` modifies `y` or takes a subset of it, `b` would not need to read anything, since we would already know the hash of its input (automatically passing the hashes along the flow could be done transparently).
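Under that scheme the hash could travel with the value, and `b`'s cache check needs no read at all. A hypothetical sketch (again, these names are mine, not funflow's):

```haskell
import qualified Data.Map.Strict as Map

newtype ContentHash = ContentHash Int deriving (Eq, Ord, Show)

-- Toy stand-in for a ContentHashable hash.
contentHash :: String -> ContentHash
contentHash = ContentHash . foldl (\h c -> h * 33 + fromEnum c) 5

-- A value paired with its content hash, computed once when a ran
-- and threaded along the flow.
data Hashed a = Hashed
  { hValue :: a
  , hHash  :: ContentHash
  }

-- b's cache check: a pure lookup on the hash already in hand;
-- y itself is never fetched.
lookupB :: Map.Map ContentHash r -> Hashed String -> Maybe r
lookupB bCache hy = Map.lookup (hHash hy) bCache
```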
We have a potential use case that would rely heavily on remote caching, whose feasibility I want to assess, and limiting reads in such a remote context would become important.
The same could be obtained by requiring the input `x` to be `Store`, and always hashing the store representation (on the logic that a step's input is often another step's output, and needs to be serialized at some point anyway).
Yes, I see the issue here. I'm not sure that changing the hashing mechanism (by itself) would help that much, because we still actually try to return the value from the store for the computation. What we could do instead is return something like:
data Cached o = Cached {
  -- | Deliberately lazy (with StrictData, the '~' marks this field lazy)
  cachedValue :: ~o,
  cachedHash :: ContentHash
}
The runner could then take advantage of this to "skip" steps when not needed. We would want to ensure that the hash:

- is deducible from the store path/metadata, and
- matches the hash given by serialisation.

This should hopefully be the case, but we would need to check.
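As a sketch of how a runner might exploit such a `Cached` value (illustrative only, not the actual funflow runner): compare the hash against the downstream cache first, and force the lazy value only on a miss.

```haskell
import qualified Data.Map.Strict as Map

newtype ContentHash = ContentHash Int deriving (Eq, Ord, Show)

-- Simplified version of the Cached record above (fields are lazy by
-- default here, so no '~' annotation is needed).
data Cached o = Cached
  { cachedValue :: o
  , cachedHash  :: ContentHash
  }

-- Run the next step only if its result for this hash is not cached;
-- on a hit, cachedValue is never demanded, so the store read it
-- stands for never happens.
runOrSkip :: Map.Map ContentHash r -> (o -> r) -> Cached o -> r
runOrSkip cache step c =
  case Map.lookup (cachedHash c) cache of
    Just r  -> r                     -- hit: skip, value untouched
    Nothing -> step (cachedValue c)  -- miss: force the value, run the step
```

Because the field is lazy, `runOrSkip` can be given a `Cached` whose value thunk would trigger a (possibly remote) read, and that read happens only on a cache miss.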
cc @nc6 @regnat