Report issue saving checkpoint #386

Open
acerdenno opened this issue Oct 1, 2024 · 5 comments

@acerdenno

When running cellbender under Slurm, two different errors appear:
1. Right after "Saving a checkpoint...":
cellbender:remove-background: Could not save checkpoint
2. TypeError: cannot pickle 'weakref' object
Any clues on how to solve them? Thanks!

@JThomasWatson

JThomasWatson commented Oct 22, 2024

I'm encountering the same error as item 2 above (the TypeError). Below is the error message, in case it's helpful.

cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/projects/b1169/thomas/CellbenderEnv/env/Cellbender/lib/python3.9/site-packages/torch/serialization.py", line 652, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol, _disable_byteorder_record)
  File "/projects/b1169/thomas/CellbenderEnv/env/Cellbender/lib/python3.9/site-packages/torch/serialization.py", line 864, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref' object

cellbender:remove-background: 2024-10-22 14:36:38
cellbender:remove-background: Inference procedure complete.
Traceback (most recent call last):
  File "/projects/b1169/thomas/CellbenderEnv/env/Cellbender/bin/cellbender", line 8, in <module>
    sys.exit(main())
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/base_cli.py", line 118, in main
    cli_dict[args.tool].run(args)
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/cli.py", line 193, in run
    return main(args)
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/cli.py", line 227, in main
    posterior = run_remove_background(args)
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/run.py", line 123, in run_remove_background
    posterior = load_or_compute_posterior_and_save(
  File "/projects/b1169/thomas/CellbenderEnv/CellBender/cellbender/remove_background/posterior.py", line 59, in load_or_compute_posterior_and_save
    assert os.path.exists(args.input_checkpoint_tarball), \
AssertionError: Checkpoint file ckpt.tar.gz does not exist, presumably because saving of the checkpoint file has been manually interrupted. load_or_compute_posterior_and_save() will not work properly without an existing checkpoint file. Please re-run and allow a checkpoint file to be saved.

Could this be an issue with torch version?
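To help answer that question, it can be useful to post the exact package versions alongside the traceback. The snippet below is a minimal sketch that prints them. The distribution names (`cellbender`, `torch`, `pyro-ppl`) are the PyPI names as I understand them, and `installed_versions` is a hypothetical helper, not part of CellBender.

```python
from importlib import metadata

# PyPI distribution names (assumed); adjust if your environment differs.
PACKAGES = ("cellbender", "torch", "pyro-ppl")


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Run it with the same Python interpreter that runs cellbender (for example, inside the same conda environment), otherwise it reports a different set of packages.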

@mcsimenc

mcsimenc commented Nov 2, 2024

I ran cellbender for the first time, on CPU and not on a cluster, and got the same error. No output was produced, although the end of the log says "Inference procedure complete." The call and the log file output are below.

cellbender remove-background \
        --input raw_feature_bc_matrix.h5 \
        --output raw_feature_bc_matrix.nuclei.h5 \
        --cpu-threads 24 \
        >cb.out 2>cb.err
(base) [msimenc@KIWI outs]$ cat raw_feature_bc_matrix.nuclei.log 
cellbender:remove-background: Command:
cellbender remove-background --input raw_feature_bc_matrix.h5 --output raw_feature_bc_matrix.nuclei.h5 --cpu-threads 24
cellbender:remove-background: CellBender 0.3.0
cellbender:remove-background: (Workflow hash 8ebc86ffdb)
cellbender:remove-background: 2024-11-01 17:16:03
cellbender:remove-background: Running remove-background
cellbender:remove-background: Loading data from raw_feature_bc_matrix.h5
cellbender:remove-background: CellRanger v3 format
cellbender:remove-background: Features in dataset: 30940 Gene Expression
cellbender:remove-background: Trimming features for inference.
cellbender:remove-background: 24319 features have nonzero counts.
cellbender:remove-background: Prior on counts for cells is 911
cellbender:remove-background: Prior on counts for empty droplets is 198
cellbender:remove-background: Excluding 1976 features that are estimated to have <= 0.1 background counts in cells.
cellbender:remove-background: Including 22343 features in the analysis.
cellbender:remove-background: Trimming barcodes for inference.
cellbender:remove-background: Excluding barcodes with counts below 99
cellbender:remove-background: Using 3155 probable cell barcodes, plus an additional 9078 barcodes, and 49577 empty droplets.
cellbender:remove-background: Largest surely-empty droplet has 343 UMI counts.
cellbender:remove-background: Attempting to unpack tarball "ckpt.tar.gz" to /tmp/tmphjs6xrze
cellbender:remove-background: No saved checkpoint.
cellbender:remove-background: No checkpoint loaded.
cellbender:remove-background: Running inference...
cellbender:remove-background: [epoch 001]  average training loss: 2895.4787
cellbender:remove-background: [epoch 002]  average training loss: 2773.4995  (100.7 seconds per epoch)
cellbender:remove-background: Will checkpoint every 5 epochs
cellbender:remove-background: [epoch 003]  average training loss: 2684.2793
cellbender:remove-background: [epoch 004]  average training loss: 2610.8373
cellbender:remove-background: [epoch 005]  average training loss: 2557.5633
cellbender:remove-background: [epoch 005] average test loss: 2566.8680
cellbender:remove-background: Saving a checkpoint...
cellbender:remove-background: Could not save checkpoint
cellbender:remove-background: Traceback (most recent call last):
  File "/home/msimenc/software/miniforge3/envs/snake-cellranger/lib/python3.12/site-packages/cellbender/remove_background/checkpoint.py", line 115, in save_checkpoint
    torch.save(model_obj, filebase + '_model.torch')
  File "/home/msimenc/software/miniforge3/envs/snake-cellranger/lib/python3.12/site-packages/torch/serialization.py", line 850, in save
    _save(
  File "/home/msimenc/software/miniforge3/envs/snake-cellranger/lib/python3.12/site-packages/torch/serialization.py", line 1088, in _save
    pickler.dump(obj)
TypeError: cannot pickle 'weakref.ReferenceType' object
.
.
.
more epochs reports, more of the same error,
.
.
.
TypeError: cannot pickle 'weakref.ReferenceType' object

cellbender:remove-background: 2024-11-01 20:00:22
cellbender:remove-background: Inference procedure complete.

The /tmp dir is writable:

(base) [msimenc@KIWI outs]$ ls -l /
drwxrwxrwt.   16 root root    20480 Nov  1 22:13 tmp

I just installed cellbender using pip this afternoon. Any ideas?
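For anyone trying to understand the error itself: the TypeError comes from Python's pickle module, which `torch.save` uses in the tracebacks above, and it is raised whenever the object being saved contains a weak reference. The sketch below reproduces it without torch or cellbender. `Target` and `pickle_error` are hypothetical names used only for illustration, and the dict stands in for whatever part of the model object holds the weak reference (this thread does not identify which part that is).

```python
import pickle
import weakref


class Target:
    """Any ordinary object that can be weakly referenced."""


def pickle_error(obj):
    """Return the TypeError message raised by pickling obj, or None on success."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


target = Target()
# Stand-in for model state that happens to contain a weak reference.
state = {"ref": weakref.ref(target)}

message = pickle_error(state)
print(message)
```

The printed message names the type as 'weakref' or 'weakref.ReferenceType' depending on the Python version, which would be consistent with the Python 3.9 and Python 3.12 tracebacks in this thread showing the same underlying failure.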

@ezgiisenn

I've been successfully running cellbender version 0.3.0 and 0.3.2 on our LSF-based computing cluster without issues until recently. However, in the past month, I’ve also started encountering the same error: TypeError: cannot pickle 'weakref.ReferenceType' object. Suggestions are appreciated to tackle the issue, thank you in advance!

@GFrosi

GFrosi commented Nov 7, 2024

Hi,

I am getting the same error using cellbender 0.3.0. I installed it via pip (Python 3.11.5) on the HPC. I did not run it on my own data; I am just using the example data from the GitHub repository, and the error still occurs.

Any updates about the issue? It would be super helpful.

AssertionError: Checkpoint file ckpt.tar.gz does not exist, presumably because saving of the checkpoint file has been manually interrupted. load_or_compute_posterior_and_save() will not work properly without an existing checkpoint file. Please re-run and allow a checkpoint file to be saved.

Thank you.

@davidaguilaratx

Installing this cellbender commit (the hash in the command below) worked for me.

I used the following command to install it:
pip install --no-cache-dir -U git+https://github.com/broadinstitute/CellBender.git@4334e8966217c3591bf7c545f31ab979cdc6590d

Versions:
Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
attrs 24.2.0 pypi_0 pypi
blas 1.0 mkl anaconda
blosc 1.21.3 h6a678d5_0 anaconda
bzip2 1.0.8 h5eee18b_6
c-ares 1.19.1 h5eee18b_0 anaconda
c-blosc2 2.12.0 h80c7b02_0 anaconda
ca-certificates 2024.9.24 h06a4308_0
cellbender 0.3.2 pypi_0 pypi
filelock 3.16.1 pypi_0 pypi
fsspec 2024.10.0 pypi_0 pypi
hdf5 1.12.1 h2b7332f_3 anaconda
intel-openmp 2023.1.0 hdb19cb5_46306 anaconda
jinja2 3.1.4 pypi_0 pypi
krb5 1.20.1 h143b758_1 anaconda
ld_impl_linux-64 2.40 h12ee557_0
libcurl 7.88.1 h251f7ec_2 anaconda
libedit 3.1.20230828 h5eee18b_0 anaconda
libev 4.33 h7f8727e_1 anaconda
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libgfortran-ng 11.2.0 h00389a5_1 anaconda
libgfortran5 11.2.0 h1234567_1 anaconda
libgomp 11.2.0 h1234567_1
libnghttp2 1.57.0 h2d74bed_0 anaconda
libssh2 1.11.0 h251f7ec_0 anaconda
libstdcxx-ng 11.2.0 h1234567_1
libuuid 1.41.5 h5eee18b_0
lxml-html-clean 0.4.1 pypi_0 pypi
lz4-c 1.9.4 h6a678d5_1 anaconda
lzo 2.10 h7b6447c_2 anaconda
markupsafe 3.0.2 pypi_0 pypi
mkl 2023.1.0 h213fc3f_46344 anaconda
mkl-service 2.4.0 py311h5eee18b_1 anaconda
mkl_fft 1.3.11 py311h5eee18b_0 anaconda
mkl_random 1.2.8 py311ha02d727_0 anaconda
mpmath 1.3.0 pypi_0 pypi
ncurses 6.4 h6a678d5_0
networkx 3.4.2 pypi_0 pypi
numexpr 2.10.1 py311h3c60e43_0 anaconda
numpy 1.26.4 py311h08b1b3b_0 anaconda
numpy-base 1.26.4 py311hf175353_0 anaconda
nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
nvidia-nccl-cu12 2.21.5 pypi_0 pypi
nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
openssl 3.0.15 h5eee18b_0
packaging 24.1 py311h06a4308_0 anaconda
pip 24.2 py311h06a4308_0
platformdirs 4.3.6 pypi_0 pypi
py-cpuinfo 9.0.0 py311h06a4308_0 anaconda
pytables 3.10.1 py311h9d13977_0 anaconda
python 3.11.5 h955ad1f_0
readline 8.2 h5eee18b_0
setuptools 75.1.0 py311h06a4308_0
sqlite 3.45.3 h5eee18b_0
sympy 1.13.1 pypi_0 pypi
tbb 2021.8.0 hdb19cb5_0 anaconda
tk 8.6.14 h39e8969_0
torch 2.5.1 pypi_0 pypi
triton 3.1.0 pypi_0 pypi
typing-extensions 4.11.0 py311h06a4308_0 anaconda
typing_extensions 4.11.0 py311h06a4308_0 anaconda
tzdata 2024b h04d1e81_0
wheel 0.44.0 py311h06a4308_0
xz 5.4.6 h5eee18b_1
zlib 1.2.13 h5eee18b_1
zlib-ng 2.0.7 h5eee18b_0 anaconda
zstd 1.5.6 hc292b87_0 anaconda
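To confirm that an environment really contains the commit from the pip command above rather than a cached release, one option is to read the `direct_url.json` record that recent pip versions write for installs from a VCS URL (PEP 610). This is a sketch under that assumption; `installed_commit` is a hypothetical helper, and it returns None when the package is missing or was installed from a regular release.

```python
import json
from importlib import metadata


def installed_commit(package):
    """Return the VCS commit pip recorded for package, or None if unavailable."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return None
    raw = dist.read_text("direct_url.json")  # PEP 610 record, absent for index installs
    if raw is None:
        return None
    return json.loads(raw).get("vcs_info", {}).get("commit_id")


if __name__ == "__main__":
    print(installed_commit("cellbender"))
```

If the printed value matches the hash in the install command, the environment is using that commit.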
