Hi all,

I think it would be helpful if some documentation were provided on how to monitor, diagnose, and restart an ongoing pipeline, particularly when using the Redis coordinator.
Currently I connect to the Redis coordinator with redis-cli and watch the job_running and jobs_queue keys to get a sense of what is happening.
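Concretely, the sort of thing I run looks like the following (this assumes both keys are plain Redis lists, which seems consistent with LTRIM/LPOP working on job_running):

redis-cli LLEN jobs_queue    # how many jobs are still waiting
redis-cli LLEN job_running   # how many jobs the executors have picked up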
If things break, I attempt a clean restart with:
chmod +w store  # give write access to the store
rm store/lock/*  # remove all locks
ls -d store/pending* | parallel -j<ncores> 'chmod -R +w {}; rm -r {}'  # remove pending jobs, which can prevent anything from running
redis-cli LTRIM "job_running" 0 0; redis-cli LPOP "job_running"

As an aside, I suspect I should be using DEL (or even FLUSHALL) instead; that is just my Redis inexperience showing. If I skip this step, the job_running key grows with each run, which can trigger the same job to run multiple times and has caused failures for me.
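For completeness, the DEL-based cleanup I have in mind would be something like the following (only a sketch; FLUSHALL would also wipe any other keys in the same Redis database, so DEL on the two known keys seems safer):

redis-cli DEL job_running   # drop the running-jobs list entirely
redis-cli DEL jobs_queue    # drop the queued-jobs list entirely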
This strategy was acquired through somewhat painful trial and error.
Documentation that would help me (and I suspect others):
How to inspect the contents of job_running. GET returns some encoded data that I don't know how to decode (see the sketch after this list for the kind of inspection I mean).
What jobs_queue is for; everything seems to go straight to job_running after start-up.
How logging works; my stdout and stderr end up in my cluster log files rather than in store/metadata/hash-<hash>/{stdout,stderr}, although this may just be a Torque cluster idiosyncrasy.
What's in metadata.db; I'm happy to poke at the SQLite tables if that's what's required (see the sqlite3 sketch after this list). Alternatively, a pointer to a human-readable view of the stages, preferably divided into queued, running, completed, and failed.
Garbage-collection and caching examples. I don't really have the disk space for multiple copies of my pipeline's outputs. Alternatively, I could avoid caching certain steps, but how to do so isn't immediately clear.
Tips for avoiding unnecessary re-runs. I've had to run the same pipeline on several hardware configurations, and that seems to have triggered re-runs (or perhaps it was my store-manipulation tomfoolery; I'm not sure).
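For reference, the kind of inspection I've been attempting on job_running (assuming it is a plain Redis list, since LTRIM/LPOP work on it; the entries themselves still come back as opaque encoded blobs):

redis-cli TYPE job_running          # confirm what kind of key it actually is
redis-cli LRANGE job_running 0 -1   # dump every entry in the running list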
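Similarly, for metadata.db the most I know to do is open it blind with sqlite3 and look around; documented table and column meanings would save a lot of guessing (the path below is just a guess at where the file lives in my store):

sqlite3 store/metadata.db '.tables'   # list the tables
sqlite3 store/metadata.db '.schema'   # dump their schemas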
Thanks for raising this issue! We are no longer using external-executor, but I think the points here are important to keep in mind if we end up implementing distributed execution in future versions of funflow. As such I'll tag this issue and leave it up for reference.