This document lists planned features for SkyShift in 2024 and early 2025. However, this roadmap is not comprehensive (feel free to propose new features or directions by opening a GitHub issue, discussion, or PR!).
- Aggregate resource summarization.
- List jobs.
- General support for job inspection via exec.
- UI to view each cluster and the aggregate available/unavailable resources.
- Monitor status of jobs and services submitted through SkyShift.
- Accumulate usage statistics.
- Track jobs over time.
- Track SkyShift events.
- Track wait times by resource-group.
- Clean, transparent install, start, stop, and uninstall (#84).
- Tests for SkyShift controllers.
- Tests for SkyShift compatibility layers (Slurm, Ray).
- Tests for SkyShift services.
- Automatic test generation.
- GitHub Actions integration.
- Autoscaler controller that provisions SkyPilot clusters when jobs are unschedulable.
- Fine-grained `update` operation on all SkyShift objects (e.g. a user modifies the image in a SkyShift job).
- Key and secrets management.
- Automatic detection of K8 clusters.
- Automatic provisioning option.
- Support for additional variants of K8, such as OpenShift.
- Support and test automation for enterprise-grade flavors (e.g. OpenShift).
- Automatic cluster detection.
- Automatic provisioning option.
- Support for container managers (Docker, Podman, Singularity, etc.) on top of Slurm.
- Service feature/networking layer for Slurm (e.g. using reverse proxy).
- Storage feature for Slurm.
- Exec feature for Slurm.
- Automatic cluster detection.
- Automatic provisioning for Ray clusters (native SkyPilot support).
We invite the community to contribute new SkyShift compatibility layers, such as ones for LSF, Nomad, and Docker Swarm!
- Adaptive scheduling optimizations based on tracked workloads.
- Explicit waiting policies for job launch (how long a job waits until it fails over to another cluster).
- Leverage usage statistics and/or historic waiting times in placement decisions (see Observability and Monitoring).
- Blob storage support (MinIO, GCS, S3) for K8 and Slurm clusters.
- Distributed file synchronization across clusters.
- Automated fail-over option based on workload configuration and resource availability.
- Model serving with vLLM.
- Model fine-tuning with Llama 3.
- Workspace Deployment with SkyShift.
- Multi-cluster Data ETL pipeline (potentially with Spark).
- Multi-agent Deployment.