This is an HPC machine-learning challenge/experiment: parallel and distributed vectorization/embedding of 1 petabyte of raw data from storage into a vector database (VectorDB).
Allowed resources:
- On-prem: Ubuntu with 10x NVIDIA A100 GPUs
- GCP: a2-ultragpu-8g / g2-standard-96 / nvidia-tesla-v100 (https://cloud.google.com/compute/docs/gpus#a100-gpus)
- AWS: EKS / EC2 P4, G4, and G4ad instances (https://aws.amazon.com/blogs/aws/now-available-ec2-instances-g4-with-nvidia-t4-tensor-core-gpus/)
- Kubernetes (autoscaled pods and nodes)
- Ray (see the distributed-embedding sketch after this list)
- Slurm (https://github.com/SchedMD/slurm)
- Python
- TensorFlow
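Ray is the most direct route to fanning embedding batches out across the GPUs. Below is a minimal sketch, assuming a local or pre-started Ray cluster with GPUs visible to it; the toy Keras `Embedding` model and the dummy batches are placeholders for illustration, not part of the challenge materials.

```python
import ray

ray.init()  # connects to an existing cluster if RAY_ADDRESS is set, else starts one locally

@ray.remote(num_gpus=1)  # pin each actor to one GPU (e.g., one A100)
class EmbeddingWorker:
    def __init__(self):
        # Hypothetical model: swap in the real TensorFlow embedding model here.
        import tensorflow as tf
        self.model = tf.keras.Sequential([tf.keras.layers.Embedding(10_000, 128)])

    def embed_batch(self, token_batch):
        # Returns one 128-d vector per input sequence (mean-pooled over tokens).
        import numpy as np
        vectors = self.model(np.array(token_batch)).numpy()
        return vectors.mean(axis=1)

# One worker per GPU; 10 matches the on-prem A100 box.
workers = [EmbeddingWorker.remote() for _ in range(10)]

# Round-robin dummy batches (32 sequences of 16 token ids each) across workers.
batches = [[[i % 10_000] * 16] * 32 for i in range(100)]
futures = [workers[i % len(workers)].embed_batch.remote(b)
           for i, b in enumerate(batches)]
results = ray.get(futures)
```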
Slurm Helm chart: https://github.com/stackhpc/slurm-k8s-cluster/tree/main/slurm-cluster-chart
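If the Slurm path is chosen (via the Helm chart above or bare metal), each task can shard the input by rank using the environment variables Slurm sets for every task. A minimal sketch, launched with e.g. `srun python worker.py`; the `./storage` file layout is an assumption for illustration.

```python
import os
from pathlib import Path

# Slurm injects these into every task launched by srun/sbatch.
rank = int(os.environ.get("SLURM_PROCID", 0))   # this task's index
world = int(os.environ.get("SLURM_NTASKS", 1))  # total tasks in the job

# Assumed layout: raw files live under ./storage; each task takes
# every world-th file starting at its own rank, so shards are disjoint.
files = sorted(Path("./storage").glob("**/*"))
my_files = [f for f in files[rank::world] if f.is_file()]

for f in my_files:
    print(f"task {rank}/{world} would embed {f}")
```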
Use Python decorators, multithreading, and asyncio; a minimal sketch combining the three follows.
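The three requirements compose naturally: a decorator for instrumentation, a thread pool for blocking file/object-store I/O, and asyncio to drive the concurrency. In this sketch, `read_chunk` is a hypothetical stand-in for the real reader in ./storage/read.py.

```python
import asyncio
import functools
import time
from concurrent.futures import ThreadPoolExecutor

def timed(fn):
    """Decorator: log the wall-clock time of each call."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            print(f"{fn.__name__} took {time.perf_counter() - start:.3f}s")
    return wrapper

@timed
def read_chunk(chunk_id: int) -> bytes:
    # Hypothetical blocking read; replace with the ./storage/read.py logic.
    time.sleep(0.1)
    return b"x" * 1024

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=8) as pool:
        # Run the blocking reads in worker threads, awaited concurrently.
        chunks = await asyncio.gather(
            *(loop.run_in_executor(pool, read_chunk, i) for i in range(32))
        )
    print(f"read {sum(len(c) for c in chunks)} bytes")

asyncio.run(main())
```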
Pipeline stages (an end-to-end sketch follows this list):
- Read storage (./storage/read.py)
- Train
- Store
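To make the three stages concrete, here is a minimal end-to-end sketch. The reader stands in for ./storage/read.py, the `embed` step fakes a model with a hash-seeded random vector, and the vector database is stubbed as an in-memory dict; all three are assumptions for illustration.

```python
import numpy as np

def read_records(path="./storage"):
    # Stand-in for ./storage/read.py: yield (record_id, raw_bytes) pairs.
    for i in range(10):
        yield f"rec-{i}", f"raw payload {i}".encode("utf-8")

def embed(raw: bytes, dim: int = 128) -> np.ndarray:
    # Placeholder "train/embed" step: a pseudo-random vector seeded from the
    # payload hash, standing in for a real TensorFlow embedding model.
    rng = np.random.default_rng(abs(hash(raw)) % (2**32))
    return rng.standard_normal(dim).astype(np.float32)

vector_db = {}  # stub for a real vector database client

for record_id, raw in read_records():
    vector_db[record_id] = embed(raw)  # store: upsert id -> vector

print(f"stored {len(vector_db)} vectors of dim {embed(b'probe').shape[0]}")
```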
A draw.io diagram base is available for your use.
Background on integrating Slurm with Kubernetes: https://medium.com/@55_learning/integrate-slurm-with-kubernetes-2637d9250fdd