Skip to content

Commit

Permalink
docs: design
Browse files Browse the repository at this point in the history
Signed-off-by: vsoch <[email protected]>
  • Loading branch information
vsoch committed Oct 23, 2024
1 parent 7b52ccf commit 31bf039
Showing 1 changed file with 3 additions and 1 deletion.
4 changes: 3 additions & 1 deletion docs/getting_started/design.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,9 @@ This current design moves the responsibility to service one or more ensembles fr

![img/design.png](img/design.png)

In the above, an ensemble member (a Flux Framework MiniCluster) is deployed as a single member ensemble. The ensemble member will be running ensemble-python on the lead broker (index 0 of the indexed job), where it is installed on the fly, akin to how Flux is added to the application container on the fly. The ensemble follows the work (jobs) and rules that are defined in the user-provided ensemble.yaml file. The ensemble-python library provides a simple state machine that receives job events from flux, and also uses a heartbeat (at a user defined frequency) to look for changes in metric models that might warrant an action. As an example, a rule might say to grow the cluster if the pending time for a job group goes above a threshold. We need a heartbeat to check that. For Kubernetes logic, the ensemble service is a deployment that runs a GRPC service following the same protocol (gRPC) as ensemble python knows how to interact with. It can receive events from multiple ensemble members (not shown here) and eventually handle things like fair share, etc. A headless service was explicitly not chosen because ensemble members should not share a network. Rather, the GRPC service is provided via its own exposed ClusterIP that is provided to ensemble members. For the GRPC service to make changes to ensemble members (grow/shrink) it has a paired Role and Role Binding with a Service Account to control MiniClusters in the same namespace. This is a huge improvement on the first design (discussed below) because ensemble-python works outside of Kubernetes, and there is not a huge load on the operator to interact with ensemble members.
In the above, an ensemble member (a Flux Framework MiniCluster) is deployed as a single member ensemble. The ensemble member will be running ensemble-python on the lead broker (index 0 of the indexed job), where it is installed on the fly, akin to how Flux is added to the application container on the fly. The ensemble follows the work (jobs) and rules that are defined in the user-provided ensemble.yaml file. The ensemble-python library provides a simple state machine that receives job events from flux, and also uses a heartbeat (at a user defined frequency) to look for changes in metric models that might warrant an action. As an example, a rule might say to grow the cluster if the pending time for a job group goes above a threshold. We need a heartbeat to check that.

For Kubernetes logic, the ensemble service is a deployment that runs a GRPC service following the same protocol (gRPC) as ensemble python knows how to interact with. It can receive events from multiple ensemble members (not shown here) and eventually handle things like fair share, etc. A headless service was explicitly not chosen because ensemble members should not share a network. Rather, the GRPC service is provided via its own exposed ClusterIP that is provided to ensemble members. For the GRPC service to make changes to ensemble members (grow/shrink) it has a paired Role and Role Binding with a Service Account to control MiniClusters in the same namespace. This is a huge improvement on the first design (discussed below) because ensemble-python works outside of Kubernetes, and there is not a huge load on the operator to interact with ensemble members.

Note that while this is running in Kubernetes, it does not need to be - it works on "bare metal" Flux, but not all features can be supported. For features that are in the queue (see what I did there) please see the [ensemble-python](https://github.com/converged-computing/ensemble-python) README. Most development will happen there, as the operator doesn't need to do much aside from running it!

Expand Down

0 comments on commit 31bf039

Please sign in to comment.