Job config file
Prepare a job config file, for example the exampleJob.json used in the submit command below.
HTTP POST your username and password to http://restserver/api/v1/token to get an access token.
For example, with curl, you can execute the following command line:
    curl -H "Content-Type: application/x-www-form-urlencoded" \
         -X POST http://restserver/api/v1/token \
         -d "username=YOUR_USERNAME" -d "password=YOUR_PASSWORD"
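The same token request can be scripted. The following is a minimal Python sketch under stated assumptions, not part of the REST server: `build_token_request` is a hypothetical helper name, `http://restserver` is the placeholder host from the curl example, and the request is only built, not sent.

```python
import urllib.parse
import urllib.request

REST_SERVER = "http://restserver"  # placeholder host taken from the curl example


def build_token_request(username, password, expiration=None):
    """Build (but do not send) the form-encoded POST /api/v1/token request."""
    fields = {"username": username, "password": password}
    if expiration is not None:
        fields["expiration"] = str(expiration)  # expiration time in seconds
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        REST_SERVER + "/api/v1/token",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_token_request("YOUR_USERNAME", "YOUR_PASSWORD")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; on success the JSON body carries the token in its `token` field.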
Submit a job
HTTP POST the config file as JSON, with the access token in the header, to http://restserver/api/v1/user/:username/jobs.
For example, you can execute the following command line:
    curl -H "Content-Type: application/json" \
         -H "Authorization: Bearer YOUR_ACCESS_TOKEN" \
         -X POST http://restserver/api/v1/user/:username/jobs \
         -d @exampleJob.json
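As a scripted counterpart to the curl command, here is a small Python sketch that builds the submission request. `build_submit_request` is a hypothetical helper, the host is the placeholder from the curl example, and the two-field config is only a stand-in: a real job config needs more fields than shown here.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "http://restserver"  # placeholder host taken from the curl example


def build_submit_request(username, token, job_config):
    """Build (but do not send) the POST /api/v1/user/:username/jobs request."""
    url = "{}/api/v1/user/{}/jobs".format(
        REST_SERVER, urllib.parse.quote(username, safe="")
    )
    return urllib.request.Request(
        url,
        data=json.dumps(job_config).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


# Stand-in config; in practice load the full exampleJob.json with json.load().
req = build_submit_request(
    "YOUR_USERNAME", "YOUR_ACCESS_TOKEN",
    {"jobName": "test", "image": "pai.run.tensorflow"},
)
print(req.get_method(), req.full_url)
```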
Monitor the job
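Check the list of jobs at http://restserver/api/v1/jobs, or at http://restserver/api/v1/user/:username/jobs for a single user.
Check your exampleJob status at http://restserver/api/v1/user/:username/jobs/exampleJob.
Get the job config JSON content at http://restserver/api/v1/user/:username/jobs/exampleJob/config.
Get the job's SSH info at http://restserver/api/v1/user/:username/jobs/exampleJob/ssh.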
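The monitoring addresses all derive from the user name and job name. A minimal sketch of that mapping follows; `job_urls` is a hypothetical helper and `http://restserver` is the placeholder host used in the curl examples.

```python
import urllib.parse

REST_SERVER = "http://restserver"  # placeholder host taken from the curl examples


def job_urls(username, job_name):
    """Return the monitoring URLs for one job, keyed by purpose."""
    user = urllib.parse.quote(username, safe="")
    job = urllib.parse.quote(job_name, safe="")
    base = f"{REST_SERVER}/api/v1/user/{user}/jobs"
    return {
        "list": base,
        "status": f"{base}/{job}",
        "config": f"{base}/{job}/config",
        "ssh": f"{base}/{job}/ssh",
    }


for purpose, url in job_urls("YOUR_USERNAME", "exampleJob").items():
    print(purpose, url)
```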
Configure the rest server port in services-configuration.yaml.
Authenticate and get an access token in the system.
POST /api/v1/token
"username": "your username",
"password": "your password",
"expiration": "expiration time in seconds"
Response if succeeded
Status: 200
"token": "your access token",
"user": "username",
"admin": true if user is admin
Response if user does not exist
Status: 400
"code": "NoUserError",
"message": "User $username is not found."
Response if password is incorrect
Status: 400
"code": "IncorrectPasswordError",
"message": "Password is incorrect."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Update a user in the system. An administrator can add a user or change another user's password; a user can change their own password.
PUT /api/v1/user
Authorization: Bearer <ACCESS_TOKEN>
"username": "username in [_A-Za-z0-9]+ format",
"password": "password at least 6 characters",
"admin": true | false,
"modify": true | false
Response if succeeded
Status: 201
"message": "update successfully"
Response if not authorized
Status: 401
"code": "UnauthorizedUserError",
"message": "Guest is not allowed to do this operation."
Response if current user has no permission
Status: 403
"code": "ForbiddenUserError",
"message": "Non-admin is not allow to do this operation."
Response if the user to update does not exist
Status: 404
"code": "NoUserError",
"message": "User $username is not found."
Response if created user has a duplicate name
Status: 409
"code": "ConflictUserError",
"message": "User name $username already exists."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Remove a user in the system.
DELETE /api/v1/user
Authorization: Bearer <ACCESS_TOKEN>
"username": "username to be removed"
Response if succeeded
Status: 200
"message": "remove successfully"
Response if not authorized
Status: 401
"code": "UnauthorizedUserError",
"message": "Guest is not allowed to do this operation."
Response if user has no permission
Status: 403
"code": "ForbiddenUserError",
"message": "Non-admin is not allow to do this operation."
Response if the user to be removed is an admin
Status: 403
"code": "RemoveAdminError",
"message": "Admin $username is not allowed to remove."
Response if the user to be removed does not exist
Status: 404
"code": "NoUserError",
"message": "User $username is not found."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Administrators can update a user's virtual cluster list. Administrators can access all virtual clusters; all users can access the default virtual cluster.
PUT /api/v1/user/:username/virtualClusters
Authorization: Bearer <ACCESS_TOKEN>
"virtualClusters": "virtual cluster list separated by commas (e.g. vc1,vc2)"
Response if succeeded
Status: 201
"message": "update user virtual clusters successfully"
Response if the virtual cluster does not exist.
Status: 400
"code": "NoVirtualClusterError",
"message": "Virtual cluster $vcname is not found."
Response if not authorized
Status: 401
"code": "UnauthorizedUserError",
"message": "Guest is not allowed to do this operation."
Response if user has no permission
Status: 403
"code": "ForbiddenUserError",
"message": "Non-admin is not allow to do this operation."
Response if user does not exist.
Status: 404
"code": "NoUserError",
"message": "User $username is not found."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Get the list of jobs.
GET /api/v1/jobs
"username": "filter jobs with username"
Response if succeeded
Status: 200
[ ... ]
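A short sketch of how the optional `username` filter could be attached to the job list address. It assumes the filter is passed as a query-string parameter, which the text above does not state explicitly; `jobs_url` is a hypothetical helper and `http://restserver` is the placeholder host from the curl examples.

```python
import urllib.parse

REST_SERVER = "http://restserver"  # placeholder host taken from the curl examples


def jobs_url(username=None):
    """Return the job list URL, optionally filtered by username."""
    url = REST_SERVER + "/api/v1/jobs"
    if username:
        # Assumption: the filter travels in the query string.
        url += "?" + urllib.parse.urlencode({"username": username})
    return url


print(jobs_url("YOUR_USERNAME"))
```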
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Get the list of jobs of a user.
GET /api/v1/user/:username/jobs
Response if succeeded
Status: 200
[ ... ]
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Get job status in the system.
GET /api/v1/user/:username/jobs/:jobName
Response if succeeded
Status: 200
{
  name: "jobName",
  jobStatus: {
    username: "username",
    virtualCluster: "virtualCluster",
    state: "jobState",
    // raw frameworkState from frameworklauncher
    subState: "frameworkState",
    createdTime: "createdTimestamp",
    completedTime: "completedTimestamp",
    executionType: "executionType",
    // sum of retries
    retries: retries,
    retryDetails: {
      // Job failed due to user or unknown error
      user: userRetries,
      // Job failed due to platform error
      platform: platformRetries,
      // Job cannot get required resource to run within timeout
      resource: resourceRetries,
    },
    appId: "applicationId",
    appProgress: "applicationProgress",
    appTrackingUrl: "applicationTrackingUrl",
    appLaunchedTime: "applicationLaunchedTimestamp",
    appCompletedTime: "applicationCompletedTimestamp",
    appExitCode: applicationExitCode,
    appExitDiagnostics: "applicationExitDiagnostics",
    appExitType: "applicationExitType"
  },
  taskRoles: {
    // Name-details map
    "taskRoleName": {
      taskRoleStatus: {
        name: "taskRoleName"
      },
      taskStatuses: {
        taskIndex: taskIndex,
        taskState: taskState,
        containerId: "containerId",
        containerIp: "containerIp",
        containerPorts: {
          // Protocol-port map
          "protocol": "portNumber"
        },
        containerGpus: containerGpus,
        containerLog: containerLogHttpAddress,
      }
    }
  }
}
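To illustrate how the `retries` and `retryDetails` fields relate, the sketch below summarizes a parsed status response and checks that `retries` equals the sum of the three detail counters. `summarize_job` is a hypothetical helper, and the sample values (including the state name) are made up for illustration.

```python
def summarize_job(response):
    """Pick out the job state and check the retry counters for consistency."""
    job = response["jobStatus"]
    details = job.get("retryDetails", {})
    detail_sum = sum(details.get(kind, 0) for kind in ("user", "platform", "resource"))
    return {
        "name": response["name"],
        "state": job["state"],
        "retries": job["retries"],
        "retries_consistent": job["retries"] == detail_sum,
    }


sample = {
    "name": "exampleJob",
    "jobStatus": {
        "state": "RUNNING",  # sample value only; state names are not listed here
        "retries": 3,
        "retryDetails": {"user": 1, "platform": 2, "resource": 0},
    },
}
print(summarize_job(sample))
```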
Response if the job does not exist
Status: 404
"code": "NoJobError",
"message": "Job $jobname is not found."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Submit a job in the system.
POST /api/v1/user/:username/jobs
Authorization: Bearer <ACCESS_TOKEN>
Response if succeeded
Status: 202
"message": "update job $jobName successfully"
Response if the virtual cluster does not exist.
Status: 400
"code": "NoVirtualClusterError",
"message": "Virtual cluster $vcname is not found."
Response if user has no permission
Status: 403
"code": "ForbiddenUserError",
"message": "User $username is not allowed to add job to $vcname"
Response if there is a duplicated job submission
Status: 409
"code": "ConflictJobError",
"message": "Job name $jobname already exists."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Get job config JSON content.
GET /api/v1/user/:username/jobs/:jobName/config
Response if succeeded
Status: 200
"jobName": "test",
"image": "pai.run.tensorflow",
Response if the job does not exist
Status: 404
"code": "NoJobError",
"message": "Job $jobname is not found."
Response if the job config does not exist
Status: 404
"code": "NoJobConfigError",
"message": "Config of job $jobname is not found."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Get job SSH info.
GET /api/v1/user/:username/jobs/:jobName/ssh
Response if succeeded
Status: 200
{
  "containers": [
    {
      "id": "<container id>",
      "sshIp": "<ip to access the container's ssh service>",
      "sshPort": "<port to access the container's ssh service>"
    }
  ],
  "keyPair": {
    "folderPath": "HDFS path to the job's ssh folder",
    "publicKeyFileName": "file name of the public key file",
    "privateKeyFileName": "file name of the private key file",
    "privateKeyDirectDownloadLink": "HTTP URL to download the private key file"
  }
}
Response if the job does not exist
Status: 404
"code": "NoJobError",
"message": "Job $jobname is not found."
Response if the job SSH info does not exist
Status: 404
"code": "NoJobSshInfoError",
"message": "SSH info of job $jobname is not found."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Start or stop a job.
PUT /api/v1/user/:username/jobs/:jobName/executionType
Authorization: Bearer <ACCESS_TOKEN>
"value": "START" | "STOP"
Response if succeeded
Status: 200
"message": "execute job $jobName successfully"
Response if the job does not exist
Status: 404
"code": "NoJobError",
"message": "Job $jobname is not found."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Get the list of virtual clusters.
GET /api/v1/virtual-clusters
Response if succeeded
Status: 200
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Get virtual cluster status in the system.
GET /api/v1/virtual-clusters/:vcName
Response if succeeded
Status: 200
// capacity percentage of the entire cluster that this virtual cluster can use
// max capacity percentage of the entire cluster that this virtual cluster can use
// capacity percentage of the entire cluster that this virtual cluster currently uses
Response if the virtual cluster does not exist
Status: 404
"code": "NoVirtualClusterError",
"message": "Virtual cluster $vcname is not found."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Add a virtual cluster or update its quota. The "default" vc cannot be changed this way.
PUT /api/v1/virtual-clusters/:vcName
Authorization: Bearer <ACCESS_TOKEN>
"vcCapacity": new capacity,
"vcMaxCapacity": new max capacity, range of [vcCapacity, 100]
Response if succeeded
Status: 201
"message": "Update vc: $vcName to capacity: $vcCapacity successfully."
Response if trying to update the "default" vc
Status: 403
"code": "ForbiddenUserError",
"message": "Don't allow to update default vc"
Response if current user has no permission
Status: 403
"code": "ForbiddenUserError",
"message": "Non-admin is not allow to do this operation."
Response if there is not enough quota
Status: 403
"code": "NoEnoughQuotaError",
"message": "No enough quota in default vc."
Response if "default" virtual cluster does not exist
Status: 404
"code": "NoVirtualClusterError",
"message": "Default virtual cluster is not found, can't allocate or free resource."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Remove a virtual cluster from the system. The "default" vc cannot be removed.
DELETE /api/v1/virtual-clusters/:vcName
Authorization: Bearer <ACCESS_TOKEN>
Response if succeeded
Status: 201
"message": "Remove vc: $vcName successfully."
Response if current user has no permission
Status: 403
"code": "ForbiddenUserError",
"message": "Non-admin is not allow to do this operation."
Response if trying to remove the "default" vc
Status: 403
"code": "ForbiddenUserError",
"message": "Don't allow to remove default vc."
Response if the virtual cluster does not exist
Status: 404
"code": "NoVirtualClusterError",
"message": "Virtual cluster $vcname is not found."
Response if "default" virtual cluster does not exist
Status: 404
"code": "NoVirtualClusterError",
"message": "Default virtual cluster is not found, can't allocate or free resource."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Change the status of a virtual cluster. The "default" vc cannot be changed this way.
PUT /api/v1/virtual-clusters/:vcName/status
Authorization: Bearer <ACCESS_TOKEN>
"vcStatus": "running" | "stopped"
Response if succeeded
Status: 201
"message": "Update vc: $vcName to status: $vcStatus successfully."
Response if trying to update the "default" vc
Status: 403
"code": "ForbiddenUserError",
"message": "Don't allow to update default vc"
Response if current user has no permission
Status: 403
"code": "ForbiddenUserError",
"message": "Non-admin is not allow to do this operation."
Response if the virtual cluster does not exist
Status: 404
"code": "NoVirtualClusterError",
"message": "Virtual cluster $vcname is not found."
Response if a server error occurred
Status: 500
"code": "UnknownError",
"message": "*Upstream error messages*"
Framework ACL is enabled as of this version, so every job now lives in a namespace named after its creator's username. Jobs created before the version upgrade have no namespace. These "legacy jobs" can still be retrieved and stopped, but new ones cannot be created. To identify them, look for the "legacy: true" field that the list APIs attach to them.
Future versions may disable all operations on legacy jobs, so please re-create them as namespaced jobs as soon as possible.
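To find the jobs that need re-creating, the entries of a job list response can be split on the `legacy` field. This is a small sketch: `split_legacy` is a hypothetical helper, and the `name` key in the sample entries is an assumption, since the list item fields are not spelled out here.

```python
def split_legacy(jobs):
    """Split a job list into (legacy, namespaced) by the "legacy" field."""
    legacy = [job for job in jobs if job.get("legacy") is True]
    namespaced = [job for job in jobs if job.get("legacy") is not True]
    return legacy, namespaced


sample_jobs = [
    {"name": "oldJob", "legacy": True},  # "name" is an assumed field
    {"name": "newJob"},
]
legacy, namespaced = split_legacy(sample_jobs)
print([job["name"] for job in legacy])
```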