remote

Aug 6, 2021

labml_remote is a very simple tool that lets you set up Python and run Python code on remote computers. It's mainly intended for deep learning training. It doesn't use layers and technologies such as Docker, Terraform or Slurm. It simply SSHes into the remote computers, runs commands, starts jobs with nohup, and synchronises files using rsync. labml_remote comes with an easy-to-use CLI. You can also use the API to launch customized distributed training sessions. Here is a sample.

Install with pip

pip install labml_remote

Initialization

Go to your project folder.

cd [PATH TO YOUR PROJECT FOLDER]

Initialize for remote execution

labml_remote init

Configurations

labml_remote init asks for your SSH credentials and creates two files: .remote/configs.yaml and .remote/exclude.txt. The first, .remote/configs.yaml, holds the remote configuration for the project.

Here's a sample .remote/configs.yaml:

name: sample
servers:
  primary:
    hostname: 3.19.32.53
    private_key: ./.remote/private_key
    username: ubuntu
  secondary:
    hostname: ec2-3-20-234-50.us-east-2.compute.amazonaws.com
    private_key: ./.remote/private_key

.remote/exclude.txt works like .gitignore: it lists the files and folders that you don't need to sync to the remote server. The exclude list generated by labml_remote init covers things like .git, .remote, logs and __pycache__. Edit this file if there is anything else you don't want synced to your remote computers.
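To make the role of these fields concrete, here is a minimal Python sketch of how one server entry could be turned into ssh and rsync invocations. It is not labml_remote's actual code: the function names are hypothetical, and it assumes the project is synced to a folder named after the project in the remote home directory and that .remote/exclude.txt is handed to rsync through its --exclude-from option. It only builds argument lists; you would pass them to subprocess.run to execute them.

```python
import shlex
from typing import List, Optional


def ssh_target(hostname: str, username: Optional[str] = None) -> str:
    # Without a username, ssh falls back to its own default (usually your local user name).
    return f"{username}@{hostname}" if username else hostname


def ssh_command(server: dict, remote_cmd: str) -> List[str]:
    # `server` mirrors one entry under `servers:` in .remote/configs.yaml.
    return [
        "ssh",
        "-i", server["private_key"],
        ssh_target(server["hostname"], server.get("username")),
        remote_cmd,
    ]


def rsync_command(server: dict, project_name: str,
                  exclude_file: str = ".remote/exclude.txt") -> List[str]:
    target = ssh_target(server["hostname"], server.get("username"))
    return [
        "rsync", "-az",                      # -a: archive mode, -z: compress in transit
        f"--exclude-from={exclude_file}",    # skip everything listed in exclude.txt
        "-e", f"ssh -i {shlex.quote(server['private_key'])}",
        "./",                                # contents of the local project folder
        f"{target}:{project_name}/",         # relative path: resolved under the remote home
    ]
```

Because rsync only transfers the differences between the two sides, rerunning the same rsync command after a small local edit is cheap.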

CLI

Get help for the command-line interface with:

labml_remote --help

Add the --help flag to any command to get help for that command.

Prepare the servers

labml_remote prepare

This will install Conda on the servers, rsync your project content and install the pip packages, based on your requirements.txt or Pipfile.
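For a sense of what preparing a server involves, this Python sketch generates a bash script that is safe to run repeatedly: it installs Miniconda only when the conda binary is missing and creates the project environment only when it does not exist yet. The install prefix, the Python version, the use of wget and the function name are assumptions for illustration, not labml_remote's actual settings; the installer URL and its -b (batch) and -p (prefix) options are those of the official Miniconda installer.

```python
import shlex

# Official "latest" Miniconda installer for 64-bit Linux.
MINICONDA_URL = "https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"


def conda_setup_script(env_name: str, prefix: str = "$HOME/miniconda",
                       python_version: str = "3.8") -> str:
    """Return a bash script that can be run again without redoing finished steps."""
    conda = f"{prefix}/bin/conda"
    return "\n".join([
        "set -e",
        # Install Miniconda only if the conda binary is missing.
        f'if [ ! -x "{conda}" ]; then',
        f"  wget -q {MINICONDA_URL} -O /tmp/miniconda.sh",
        f'  bash /tmp/miniconda.sh -b -p "{prefix}"',  # -b: no prompts, -p: install prefix
        "fi",
        # Create the project environment only if `conda env list` does not show it.
        f'if ! "{conda}" env list | grep -q "^{env_name} "; then',
        f'  "{conda}" create -y -n {shlex.quote(env_name)} python={python_version}',
        "fi",
    ])
```

The script could then be sent as the remote command of an ssh call, for example with bash -s reading it from standard input.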

Run a command

labml_remote run --cmd 'python my_script.py'

This executes the command on the server and streams its output back to you.
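Streaming output back needs little more than a pipe from the ssh process. The following Python sketch (stream_output is a hypothetical helper, not labml_remote's documented API) yields each output line as soon as it arrives and raises if the command fails. The demo runs a local command so it works without a server, but the same function accepts an ssh argument list.

```python
import subprocess
import sys
from typing import Iterator, List


def stream_output(argv: List[str]) -> Iterator[str]:
    """Run argv and yield its output line by line while it is still running."""
    # stderr is merged into stdout so warnings and tracebacks show up in order.
    with subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True, bufsize=1) as proc:
        for line in proc.stdout:
            yield line.rstrip("\n")
        returncode = proc.wait()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, argv)


if __name__ == "__main__":
    # Locally runnable demo; against a server you might pass something like
    # ["ssh", "-i", "./.remote/private_key", "ubuntu@3.19.32.53", "cd ~/sample && python my_script.py"]
    for out in stream_output([sys.executable, "-c", "print('epoch 1'); print('epoch 2')"]):
        print(out)
```

Python block-buffers stdout when it is not attached to a terminal, so running the remote script with python -u (or PYTHONUNBUFFERED=1) helps lines arrive promptly instead of all at once at the end.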

Start a job

labml_remote job-run --cmd 'python my_script.py' --tag my-job
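Unlike run, a job is meant to keep going after you disconnect. As a rough sketch of the nohup approach described in the introduction, the Python function below builds the shell line that could be sent over ssh to start a tagged job in the project folder. The .jobs/<tag>.log and .jobs/<tag>.pid layout and the function name are hypothetical choices for illustration; labml_remote's own file layout may differ.

```python
import shlex


def nohup_job_command(cmd: str, tag: str, project_name: str) -> str:
    """Build a shell line that starts `cmd` in the background on the server.

    Hypothetical layout: each job writes <tag>.log and <tag>.pid under
    ~/<project_name>/.jobs/.
    """
    log_file = shlex.quote(f".jobs/{tag}.log")
    pid_file = shlex.quote(f".jobs/{tag}.pid")
    return (
        f"cd ~/{shlex.quote(project_name)} || exit 1; "
        "mkdir -p .jobs; "
        # `&` backgrounds only the nohup command, so $! is its process id.
        f"nohup sh -c {shlex.quote(cmd)} > {log_file} 2>&1 & "
        f"echo $! > {pid_file}"
    )
```

With this layout, something close to job-tail would be running tail -f .jobs/my-job.log over ssh in the project folder.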

List jobs

labml_remote job-list --rsync

The --rsync flag syncs the job information from the server to your local computer before listing.

Tail a job output

labml_remote job-tail --tag my-job

This keeps following the job's output as it is written.

Kill jobs

labml_remote job-kill --tag my-job

Launch a PyTorch distributed training session

labml_remote helper-torch-launch --cmd 'train.py' --nproc-per-node 2 --env GLOO_SOCKET_IFNAME enp1s0

Here train.py is the training script. We are using computers with 2 GPUs and want two processes per computer, so --nproc-per-node is 2. --env GLOO_SOCKET_IFNAME enp1s0 sets the environment variable GLOO_SOCKET_IFNAME to enp1s0. You can specify multiple environment variables with --env.
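For intuition about what a multi-node launch involves, here is a Python sketch that builds one command per server, assuming the helper wraps PyTorch's torch.distributed.launch module (an assumption; the actual helper may work differently). Every node gets the same --nproc_per_node, --nnodes, --master_addr and --master_port values, its own --node_rank, and the --env pairs as a VAR=value prefix. torch_launch_commands is a hypothetical name.

```python
import shlex
from typing import Dict, List, Tuple


def torch_launch_commands(script: str, hosts: List[str], nproc_per_node: int,
                          env: Dict[str, str],
                          master_port: int = 29500) -> List[Tuple[str, str]]:
    """Return (host, command) pairs, one per node; hosts[0] is the rendezvous master."""
    # --env NAME VALUE pairs become a VAR=value prefix understood by sh/bash.
    env_prefix = " ".join(f"{name}={shlex.quote(value)}" for name, value in env.items())
    pairs = []
    for node_rank, host in enumerate(hosts):
        launch = (
            "python -m torch.distributed.launch "
            f"--nproc_per_node={nproc_per_node} --nnodes={len(hosts)} "
            f"--node_rank={node_rank} --master_addr={hosts[0]} "
            f"--master_port={master_port} {script}"
        )
        pairs.append((host, f"{env_prefix} {launch}".strip()))
    return pairs
```

Here the first host doubles as the rendezvous master, so its address must be reachable from every other node; on cloud instances that is often a private IP rather than the public hostname in configs.yaml. 29500 is PyTorch's default master port.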

How it works

It sets up Miniconda if it is not already installed and creates a new environment for the project. Then it creates a folder named after the project inside the home folder and synchronises the contents of your local folder with the remote computer. It syncs using rsync, so subsequent synchronisations only need to send the changes. Then it installs packages from requirements.txt, or with pipenv if a Pipfile is found; when a Pipfile is present, it also uses pipenv to run your commands. The outputs of commands are streamed back to the local computer, while the outputs of jobs are redirected to files on the server, which are synchronised back to the local computer using rsync.
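The package-manager choice described above comes down to a couple of file checks. Here is a minimal Python sketch, assuming a Pipfile wins when both files exist (the text does not say which takes precedence); install_command and wrap_run_command are hypothetical helpers, and on a real server the pip call would need to use the project's conda environment.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def install_command(project_dir: Path) -> Optional[List[str]]:
    # Assumption: a Pipfile takes precedence if both files exist.
    if (project_dir / "Pipfile").exists():
        return ["pipenv", "install"]
    if (project_dir / "requirements.txt").exists():
        return ["pip", "install", "-r", "requirements.txt"]
    return None  # nothing to install


def wrap_run_command(project_dir: Path, cmd: List[str]) -> List[str]:
    # With a Pipfile, commands run inside the pipenv-managed environment.
    if (project_dir / "Pipfile").exists():
        return ["pipenv", "run", *cmd]
    return cmd


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp)
        (project / "requirements.txt").write_text("numpy\n")
        print(install_command(project))  # ['pip', 'install', '-r', 'requirements.txt']
```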

What it doesn't do

It won't install things like drivers or CUDA, so if you need them, pick an instance image that already comes with them. For example, on AWS pick a Deep Learning AMI if you want to use an instance with GPUs.

Hope this helps!
