
Could not save checkpoint (python3.7), due to [enforce fail at inline_container.cc:445] . PytorchStreamWriter failed (EDIT not just on docker) #301

Open
plijnzaad opened this issue Oct 31, 2023 · 5 comments

plijnzaad commented Oct 31, 2023

Dear all,

we are having great difficulty installing and running CellBender (see also issues #212, #275 and #296).

I had hoped that the Docker image would be fail-safe, but unfortunately that is not the case either. Using this image:

us.gcr.io/broad-dsde-methods/cellbender   latest    56439f37d58e   2 months ago   4.98GB

and converting it to a Singularity image (we are not root on our HPC) results in the crash below (full log appended). Does anyone know a combination of versions of (1) cellbender, (2) torch and (3) python that is likely to work? And is this an issue specific to the cellbender remove-background invocation?

cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 423, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 650, in _save
    zip_file.write_record(name, storage.data_ptr(), num_bytes)
RuntimeError: [enforce fail at inline_container.cc:445] . PytorchStreamWriter failed writing file data/9: file write failed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 424, in save
    return
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 290, in __exit__
    self.file_like.write_end_of_file()
RuntimeError: [enforce fail at inline_container.cc:325] . unexpected pos 90329472 vs 90329404
@plijnzaad (Author)

[LX385-err.txt](https://github.com/broadinstitute/CellBender/files/13215090/LX385-err.txt)
(forgot to append the log)

@tilofrei

Dear @plijnzaad, I got the same error running from a Singularity container - did you manage to work around that issue in the meantime? Thanks!


maxozo commented Oct 8, 2024

same! any solution?


maxozo commented Oct 8, 2024

OK, the issue in the Singularity container is that it is trying to use TMP for checkpointing, and in Singularity you do not have permission to write there unless you have specifically mounted a writable directory.
So to avoid this you can set TMPDIR within the container to the current directory:

    export TMPDIR=$PWD
    cellbender remove-background --input txd_input ${gpu_text_info} ${option1} --output ${outfile} --expected-cells \$(cat ${expected_cells}) --total-droplets-included \$(cat ${total_droplets_include}) --model full --z-dim ${zdims} --z-layers ${zlayers} --low-count-threshold ${low_count_threshold} --epochs ${epochs} --learning-rate ${learning_rate} --fpr ${fpr}
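
An alternative with the same effect (just a sketch, untested here; the paths and image name are placeholders for your own setup) is to bind a writable host directory onto /tmp when launching the container:

    # bind a writable scratch directory from the host onto /tmp inside the container
    # (paths and image name are examples - adjust to your HPC setup)
    mkdir -p /scratch/$USER/cellbender_tmp
    singularity exec --nv \
        --bind /scratch/$USER/cellbender_tmp:/tmp \
        cellbender.sif \
        cellbender remove-background --input raw_feature_bc_matrix.h5 --output output.h5 --cuda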

@plijnzaad plijnzaad changed the title Docker image: Could not save checkpoint (python3.7), due to [enforce fail at inline_container.cc:445] . PytorchStreamWriter failed Could not save checkpoint (python3.7), due to [enforce fail at inline_container.cc:445] . PytorchStreamWriter failed (EDIT not just on docker) Dec 3, 2024

plijnzaad commented Dec 3, 2024

Another data point: this also happens outside Docker (hence my title change), and appears to be related to running it with --cuda. At the time of the `Could not save checkpoint` error, the very spacious $TMPDIR is completely empty.

With --cuda, no ckpt.tar.gz file is made, but without --cuda, a ckpt.tar.gz file is created (but then it takes way too long to run).
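
A quick sanity check from inside the job/container would be to confirm that the directory used for checkpointing is actually writable and has free space (a generic sketch, not taken from the log):

    # confirm the temp dir used for checkpointing is writable and has space
    echo "TMPDIR=${TMPDIR:-/tmp}"
    touch "${TMPDIR:-/tmp}/cb_write_test" && echo "writable" || echo "NOT writable"
    df -h "${TMPDIR:-/tmp}"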

A puzzling error in the error log (appended) is:

AssertionError: Checkpoint file ckpt.tar.gz does not exist, presumably because saving of the checkpoint file has been manually interrupted. load_or_compute_posterior_and_save() will not work proper>

The 'manual interruption' is odd; could this be something done by the SLURM queueing system where the job runs?
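
One way to check whether the queueing system itself interrupted the job (a generic SLURM accounting query; the job ID is a placeholder):

    sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed,MaxRSS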

BTW this is CellBender 0.3.2, Workflow hash 7c6e08fbec.

Any ideas, anyone?

LX385-err4.txt
