There are three basic types of cloud storages:
- Import Storages (aka Source Cloud Storages)
- Export Storages (aka Target Cloud Storages)
- Dataset Storages (available in enterprise)
Label Studio also has Persistent Storage, where it stores export files, user avatars, and UI uploads. Do not confuse Cloud Storages with Persistent Storage: they have completely different codebases and purposes. Cloud Storages are implemented in io_storages, while Persistent Storage uses django-storages and is configured through environment variables in the Django settings (see base.py).
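As an illustration, a django-storages GCS backend is typically wired into Django settings as in the sketch below; the exact environment variable names that Label Studio reads in base.py may differ.

```python
# settings/base.py - illustrative django-storages configuration for a GCS
# persistent storage backend; Label Studio's actual base.py may use
# different environment variable names.
import os

DEFAULT_FILE_STORAGE = 'storages.backends.gcloud.GoogleCloudStorage'
GS_BUCKET_NAME = os.environ.get('GS_BUCKET_NAME')   # bucket for export files, avatars, UI uploads
GS_PROJECT_ID = os.environ.get('GS_PROJECT_ID')     # GCP project that owns the bucket
```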
The diagram below is based on Google Cloud Storage (GCS); other storages are implemented in the same way.
```mermaid
graph TD;
    Storage-->ImportStorage;
    ProjectStorageMixin-->GCSImportStorage;
    ImportStorage-->GCSImportStorageBase;
    GCSImportStorageBase-->GCSImportStorage;
    GCSImportStorageBase-->GCSDatasetStorage;
    DatasetStorageMixin-->GCSDatasetStorage;

    subgraph Google Cloud Storage
        GCSImportStorage;
        GCSImportStorageBase;
        GCSDatasetStorage;
    end
```
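In Python terms, the hierarchy in the diagram corresponds roughly to the following composition. This is a simplified sketch with stub bases; the real classes in io_storages define many more fields and methods.

```python
# Simplified sketch of the class relationships shown in the diagram above;
# the stub bases stand in for the real models defined in io_storages.
class Storage: ...
class ImportStorage(Storage): ...
class ProjectStorageMixin: ...
class DatasetStorageMixin: ...

class GCSImportStorageBase(ImportStorage):
    """Shared GCS import logic (client, bucket, prefix, object iteration)."""

class GCSImportStorage(ProjectStorageMixin, GCSImportStorageBase):
    """Project-bound source (import) storage for GCS."""

class GCSDatasetStorage(DatasetStorageMixin, GCSImportStorageBase):
    """Dataset storage (enterprise) reusing the same GCS base."""
```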
The validate_connection() check runs the following steps inside a try/except block:
- Get the client
- Get the bucket
- For source storages only: retrieve any single file from the specified prefix
- For target storages: the prefix does not need to be checked, because it is created automatically when an annotation is written

Target storages use the same validate_connection() function, but without any prefix check.
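A minimal sketch of such a check for GCS, using the google-cloud-storage client, could look like the following; the real implementation in io_storages differs in details such as error handling and logging.

```python
# Hedged sketch of validate_connection() for a GCS storage; method names follow
# the google-cloud-storage client, not the exact Label Studio implementation.
from google.cloud import storage

def validate_connection(bucket_name, prefix=None):
    client = storage.Client()                     # 1. get client
    bucket = client.get_bucket(bucket_name)       # 2. get bucket (raises if missing or forbidden)
    if prefix:                                    # 3. source storages only: probe the prefix
        blobs = list(client.list_blobs(bucket, prefix=prefix, max_results=1))
        if not blobs:
            raise ValueError(f'No objects found under prefix "{prefix}"')
```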
There are two methods for setting GCS credentials:
1. Through the Project => Cloud Storage settings in the Label Studio user interface.
2. Through Google Application Default Credentials (ADC). This involves the following steps:
2.1. Leave the Google Application Credentials field in the Label Studio UI blank.
2.2. Set an environment variable which will apply to all Cloud Storages. This can be done using the following command:
export GOOGLE_APPLICATION_CREDENTIALS=google_credentials.json
2.3. Alternatively, use the following command:
gcloud auth application-default login
2.4. Another option is to use credentials provided by the Google App Engine or Google Compute Engine metadata server, if the code is running on either GAE or GCE.
Note: If Cloud Storage credentials are set in the Label Studio UI, these will take precedence over other methods.
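The precedence described above could be sketched as follows; the parameter and helper names are illustrative rather than the exact Label Studio code.

```python
# Hedged sketch of GCS credential resolution: UI-provided credentials first,
# then Application Default Credentials. Names are illustrative.
import json
from google.cloud import storage
from google.oauth2 import service_account

def get_gcs_client(ui_credentials_json=None):
    if ui_credentials_json:
        # Credentials pasted into Project => Cloud Storage settings take precedence
        info = json.loads(ui_credentials_json)
        credentials = service_account.Credentials.from_service_account_info(info)
        return storage.Client(project=info.get('project_id'), credentials=credentials)
    # Fall back to ADC: GOOGLE_APPLICATION_CREDENTIALS, gcloud ADC,
    # or the GAE/GCE metadata server.
    return storage.Client()
```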
Storages (Import and Export) have different synchronization statuses (see the StorageInfo.Status class):
- Initialized: the storage was added but never synced; this is sufficient to start URI link resolving
- Queued: the sync job is in the queue but has not started yet
- In progress: the sync job is running
- Failed: the sync job stopped because errors occurred
- Completed: the sync job completed successfully
```mermaid
graph TD;
    Initialized-->Queued;
    Queued-->InProgress;
    InProgress-->Failed;
    InProgress-->Completed;
```
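The statuses above roughly correspond to a Django choices enum like this sketch; the actual StorageInfo.Status definition in io_storages may differ in its exact values.

```python
# Hedged sketch of StorageInfo.Status as a Django TextChoices enum;
# the real definition lives in io_storages and may differ.
from django.db import models

class Status(models.TextChoices):
    INITIALIZED = 'initialized', 'Initialized'
    QUEUED = 'queued', 'Queued'
    IN_PROGRESS = 'in_progress', 'In Progress'
    FAILED = 'failed', 'Failed'
    COMPLETED = 'completed', 'Completed'
```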
Additionally, StorageInfo contains counters and debug information that are displayed in the storage settings:
- last_sync - time of the last successful sync
- last_sync_count - number of objects that were successfully synced
- last_sync_job - RQ worker job ID
- status - string from StorageInfo.Status.choices
- traceback - traceback of the last error
- meta - dictionary with advanced information, including:
  - tasks_existed - pre-existing tasks in the project that will not be synced
  - time_last_ping - the sync process can be lengthy, but it should update time_last_ping every 10 seconds (settings.STORAGE_IN_PROGRESS_TIMER). When the Storage API fetches the storage, it checks time_last_ping and marks the sync process as failed if there have been no updates (see the section "Expected and Unexpected Sync Terminations" and the sketch below).
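For illustration, the time_last_ping heartbeat during a long sync could be refreshed roughly as in this sketch; the helper name and field handling are assumptions, not the exact Label Studio code.

```python
# Illustrative sketch of updating meta['time_last_ping'] during a long sync;
# the real update logic in Label Studio's storages is more involved.
from django.conf import settings
from django.utils import timezone

def maybe_update_ping(storage, last_ping_at):
    """Refresh the heartbeat if more than STORAGE_IN_PROGRESS_TIMER seconds passed."""
    now = timezone.now()
    if (now - last_ping_at).total_seconds() >= settings.STORAGE_IN_PROGRESS_TIMER:
        storage.meta['time_last_ping'] = now.isoformat()
        storage.save(update_fields=['meta'])
        return now
    return last_ping_at
```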
All these states are present in both the open-source and enterprise editions for code compatibility. Status processing can be challenging, especially when the sync process is terminated unexpectedly. Typical situations when this happens include:
- An exception occurred: this is a soft termination, and in this case the sync job gets the Failed status.
- An OOM error happened => the RQ worker job was killed => storage_background_failure wasn't called.
- RQ workers were redeployed => storage_background_failure wasn't called.
- RQ workers were killed manually => storage_background_failure wasn't called.
- The job was removed from the RQ queue => this is not a failure, but the storage status still needs to be updated somehow.

To handle these cases correctly, all of these conditions must be checked in ensure_storage_status() when the Storage List API is called.
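A simplified version of such a staleness check might look like the following sketch; the real ensure_storage_status covers more cases, for example inspecting the RQ job itself.

```python
# Hedged sketch of detecting an unexpectedly terminated sync via its stale
# time_last_ping heartbeat; names and thresholds are illustrative.
from datetime import datetime, timezone
from django.conf import settings

def sync_looks_dead(storage):
    """Return True if an 'in_progress' sync stopped updating time_last_ping."""
    if storage.status != 'in_progress':
        return False
    ping = storage.meta.get('time_last_ping')
    if not ping:
        return True  # in progress but never pinged
    last_ping = datetime.fromisoformat(ping)
    elapsed = (datetime.now(timezone.utc) - last_ping).total_seconds()
    # Allow a few missed pings before declaring the job failed
    return elapsed > settings.STORAGE_IN_PROGRESS_TIMER * 3
```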