Ideally, a job can be run on one big machine. For example, "sort | uniq | wc" does a word count on a large file. However, with super large files, doing the same thing in a distributed way requires a lot of plumbing effort.
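The single-machine version can be sketched like this (the sample text stands in for a large file; "wc -l" is used so only the final count is printed):

```shell
# Distinct-word count in Unix pipe style:
# one word per line, sort so duplicates are adjacent, drop duplicates, count.
distinct=$(printf 'the quick brown fox jumps over the lazy dog\n' |
  tr -s ' ' '\n' |  # split into one word per line
  sort |            # bring identical words together
  uniq |            # collapse duplicates
  wc -l)            # count what is left
echo "$distinct"    # 8 distinct words ("the" appears twice)
```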
Glow aims to make it easy to distribute work to remote machines.
What's special?
- If a task can be done in Unix pipe style, it can be done by Glow.
- Glow provides tools to distribute tasks: map by hashing, reduce by grouping.
- Each command is a JSON message sent to the leader, so more powerful tools can be built on top of it.
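The "map by hashing, reduce by grouping" idea can be illustrated in plain shell, with no Glow involved (the 3-way split and the sum-of-character-codes hash are arbitrary choices for this sketch):

```shell
# Map step: tag each record with a partition chosen by hashing its key,
# so the same key always lands on the same "worker".
# Reduce step: within each partition, group identical keys and count them.
counts=$(printf 'apple\nbanana\napple\ncherry\nbanana\napple\n' |
  awk '{
    h = 0
    for (i = 1; i <= length($0); i++)        # toy hash: sum of letter positions
      h += index("abcdefghijklmnopqrstuvwxyz", substr($0, i, 1))
    print h % 3, $0                          # partition id, then the key
  }' |
  sort |      # group by partition and key
  uniq -c)    # reduce: count per (partition, key) group
echo "$counts"
```

The same key always hashes to the same partition, so each partition can be reduced independently, which is what makes the work distributable.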
What is the architecture?
- Each participating machine runs an agent that talks to a leader.
- From any computer, the user submits a job to the leader.
- Each agent picks up one role of the job, starts the process, and sets up its input/output. All participating agents thus form a pipeline.
- The user feeds data into the pipeline.
- When all results have arrived, the user shuts down the pipeline, or keeps it running continuously by feeding in more data.
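The steps above can be simulated on one machine with named pipes (a toy stand-in, not Glow's actual mechanism; each background process plays one agent role):

```shell
# Each "agent" is a process wired to the next through a named pipe,
# and the user feeds data into the front of the pipeline.
tmp=$(mktemp -d)
mkfifo "$tmp/stage1" "$tmp/stage2"

tr ' ' '\n' < "$tmp/stage1" > "$tmp/stage2" &   # agent role 1: split lines into words
wc -l < "$tmp/stage2" > "$tmp/result" &         # agent role 2: count the words

printf 'hello glow world\nhello again\n' > "$tmp/stage1"   # user feeds data in
wait                                                       # all results have arrived
total=$(cat "$tmp/result")
echo "total words: $total"                                 # 5
rm -r "$tmp"
```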
What is the pipeline?
- A pipeline can flow 1 to 1, 1 to m, m to 1, or m to n.
- A pipeline can adjust its flow ratio to scale up or down on demand, e.g. from 1 to m up to 1 to m*n.
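A 1-to-n-to-1 flow can be sketched in plain shell (three partitions chosen arbitrarily; summing integers stands in for any reduce step):

```shell
# 1 -> n: one input stream is partitioned across 3 "workers";
# n -> 1: each worker's partial result is merged back into one answer.
seq 1 9 | awk '{ print > ("part." NR % 3) }'    # fan out: round-robin into 3 partitions
total=$(for p in part.0 part.1 part.2; do
    awk '{ s += $1 } END { print s }' "$p"      # each worker sums its own partition
  done | awk '{ s += $1 } END { print s }')     # fan in: combine the partial sums
echo "$total"                                   # 45 = 1 + 2 + ... + 9
rm part.0 part.1 part.2
```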