Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Continue storage deletion when some fail #4454

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

andylizf
Copy link
Contributor

Fix #4050.

Make sky storage delete continue deleting remaining storages when some fail, instead of aborting the entire operation. Failed deletions are reported individually.

We Implemented this feature via adding a continue_on_error parameter in run_in_parallel

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Any manual or new tests for this PR (please specify below)
  • All smoke tests: pytest tests/test_smoke.py
  • Relevant individual smoke tests: pytest tests/test_smoke.py::test_fill_in_the_name
  • Backward compatibility tests: conda deactivate; bash -i tests/backward_compatibility_tests.sh

@andylizf
Copy link
Contributor Author

@cblmemo PTAL, thanks!

@andylizf
Copy link
Contributor Author

andylizf commented Dec 10, 2024

I inserted five new records into the 'global_user_state' database: three records containing the previous storage handle value, and two records with null handle values.

andyl@DESKTOP-7FP6SMO ~/skypilot (fix-storage-recover) [1]> sky storage dele
te -a
Deleting 6 storages: skypilot-workdir-andyl-74d418df, test-storage-1, test-storage-2, test-storage-3, test-storage-4, test-storage-5. Proceed? [Y/n]: 
Error deleting storage test-storage-3: In global_user_state, storage name 'test-storage-3' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-2: In global_user_state, storage name 'test-storage-2' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-4: Storage name 'test-storage-4' not found.
Error deleting storage test-storage-1: In global_user_state, storage name 'test-storage-1' does not match handle.storage_name 'skypilot-workdir-andyl-577894a6'
Error deleting storage test-storage-5: Storage name 'test-storage-5' not found.
Deleted GCS bucket skypilot-workdir-andyl-74d418df.

Copy link
Collaborator

@cblmemo cblmemo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the prompt fix @andylizf ! left some discussions.

sky/cli.py Outdated Show resolved Hide resolved
sky/cli.py Outdated Show resolved Hide resolved
sky/global_user_state.py Outdated Show resolved Hide resolved
"""
# Reference: https://stackoverflow.com/questions/25790279/python-multiprocessing-early-termination # pylint: disable=line-too-long
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not removing this reference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You mean why remove the reference? I think it's trivial now, and the reference doesn't add much value to the code.

def run_in_parallel(func: Callable,
args: Iterable[Any],
num_threads: Optional[int] = None) -> List[Any]:
num_threads: Optional[int] = None,
continue_on_error: bool = False) -> List[Any]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

adding this seems change the return value type from Value to Union[Value, Exception], which looks a little bit strange to me. Why do we need this anyway? aren't we already add the try-except in the func?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. I thought about not adding a function to catch and log these exceptions, but instead handle all exceptions in run_in_parallel. Anyway, it made sense to have this function have an argument like continue_on_error.
But then I realized it violated the abstraction concept and would make it harder to print custom logs. So I ended up wrapping the function and logging it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Storage] sky storage delete -a aborted when deletion of one storage failed
2 participants