Merged in feature/ultra_lazy_dicom_stack (pull request #460)

Feature/ultra lazy dicom stack

Approved-by: Randy Taylor
jrkerns committed Oct 28, 2024
2 parents 374e48e + c404bfa commit 617d687
Showing 7 changed files with 282 additions and 22 deletions.
7 changes: 7 additions & 0 deletions docs/source/changelog.rst
@@ -30,6 +30,13 @@ TRS-398
This change will affect absorbed dose TRS-398 calculations if you rely on the ``k_tp`` function. If you are using TRS-398, please verify that your
results are still accurate. We apologize for this oversight.

CT
^^

* There is a new parameter for CT-like constructor classes: ``is_zip``. This is mostly an internal
  flag and is used when calling the ``.from_zip`` method. The default is ``False``. This is backwards-compatible
  and should not affect users. This was done for internal refactoring reasons.

v 3.28.0
--------

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -56,6 +56,7 @@
   topics/nps
   topics/mtf
   topics/scale
   topics/memory
   topics/profiles
   topics/image_metrics
   topics/fluence
102 changes: 102 additions & 0 deletions docs/source/topics/memory.rst
@@ -0,0 +1,102 @@
=================
Memory Efficiency
=================

Generally speaking, Python is not the most memory-efficient language.
Most of the time this doesn't make a significant difference, but in
memory-constrained environments it can be a problem.
It's unlikely that you'll actually need to worry about this, but this
section provides some context. Within the RadMachine environment, memory can be
a precious commodity, so some work has been done to optimize memory usage.
It is explained here for clarity and completeness.

.. note::

    Memory optimization is only relevant here for DICOM stacks (CT, MR, etc).

.. note::

    "Space" here refers to both memory **and** disk space. In the RadMachine cloud environment,
    disk space is the same as memory, so for the following discussion there is no benefit to using memory vs disk.

There are 3 uses of space when analyzing a CT dataset:

* The original dataset as passed (e.g. a zip archive or a list of files)
* The extracted dataset (if applicable)
* The array(s) in memory

In the worst-case scenario all three uses are employed.
E.g. a 5MB zip archive is passed.
It is then extracted to a 30MB directory,
and then each image is loaded into numpy as an array (usually uint16 or float).
Depending on the CT scanner this can be ~2.5MB/image.
So in total we might be using 5 + 30 + (60 * 2.5) = 185MB of space,
assuming 60 images.
This doesn't sound like much, but it can be considerably larger depending on the
scanner and number of slices.

We can’t ever get rid of the first dataset, but we can in theory get rid of the latter two.

The loaded arrays in memory are the easiest to get rid of. To do so, we can simply
load each image on demand, i.e. "lazily". This results in only 1
image in memory at any given time. This approach is referred to as "lazy".
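The idea can be sketched in a few lines. This is a simplified illustration, not pylinac's actual class: the ``LazyStack`` name is hypothetical and generic byte-reading stands in for the real DICOM parsing.

```python
from pathlib import Path


class LazyStack:
    """Hold only file paths; read a file's contents when an index is requested."""

    def __init__(self, folder: str, pattern: str = "*.dcm"):
        self.paths = sorted(Path(folder).glob(pattern))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, i: int) -> bytes:
        # Only one file's contents are materialized at a time; nothing is cached.
        return self.paths[i].read_bytes()
```

The trade-off is repeated disk reads: every index access re-reads the file, which is where the slowdown in the table below comes from.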

The extracted dataset can be more difficult. Assuming the original dataset is also
counted toward the space used, we want to minimize what we can. This means
we ideally want to be passed a compressed dataset (ZIP archive) and also load one
image lazily as requested. This can be done in vanilla Python via the ``zipfile`` module but suffers from
two problems:

- It can modify the data in-place. I.e. performing a modification actually modifies
the original data. This is undesirable.
- Changing or removing data from the archive is difficult and slow. Python can easily
read from a ZIP archive but writing to it or removing data requires rewriting the
entire archive. This is extremely slow.
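To make the second point concrete, here is a standard-library sketch of what "removing" a member from a ZIP archive entails (the ``remove_member`` helper is illustrative, not part of any library): every surviving member must be copied into a brand-new archive.

```python
import zipfile


def remove_member(src: str, dst: str, name: str) -> None:
    """Drop one member by rewriting the whole archive: zipfile has no delete API."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for item in zin.infolist():
            if item.filename != name:
                # every surviving member is read out and written into the new archive
                zout.writestr(item, zin.read(item.filename))
```

For a multi-hundred-MB dataset this full rewrite on every mutation is prohibitively slow.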

The best solution so far is to create a "shadow" object that holds the data in memory,
but is also compressed. This is a compromise between the two. Generally speaking,
the total space usage for this method is around 2x the size of the compressed dataset.
This is referred to as "shadow object".
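A stripped-down sketch of the idea (the ``ShadowStack`` name is illustrative; the real implementation also parses DICOM metadata and sorts slices): each member of the archive is read once and kept as a zlib-compressed blob, and only the requested member is ever decompressed.

```python
import io
import zipfile
import zlib


class ShadowStack:
    """Keep a compressed in-memory copy of each archive member; expand one on demand."""

    def __init__(self, zip_path: str):
        self._blobs: dict[str, bytes] = {}
        with zipfile.ZipFile(zip_path) as zf:
            for name in zf.namelist():
                # recompress each member into memory; the extracted form never hits disk
                self._blobs[name] = zlib.compress(zf.read(name))
        self._names = sorted(self._blobs)

    def __len__(self) -> int:
        return len(self._names)

    def __getitem__(self, i: int) -> io.BytesIO:
        # decompress only the requested member
        return io.BytesIO(zlib.decompress(self._blobs[self._names[i]]))
```

Because the shadow copy is roughly the same size as the original archive, total space sits near 2x the compressed dataset, as stated above.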

See the table below for a comparison:

.. note::

    This assumes a ZIP archive is passed to the constructor. Numbers will vary depending on
    the dataset and the computer.

+-----------------------------------+-------------------------+----------------+---------------+
| | non-optimized (default) | lazy | shadow object |
+===================================+=========================+================+===============+
| Slowdown (%) | | 5-15% | 10-20% |
+-----------------------------------+-------------------------+----------------+---------------+
| zip archive (MB) | 13 | 13 | 13 |
+-----------------------------------+-------------------------+----------------+---------------+
| extracted zip size (MB) | 49 | 49 | 0 |
+-----------------------------------+-------------------------+----------------+---------------+
| Shadow object (MB) | 0 | 0 | 13 |
+-----------------------------------+-------------------------+----------------+---------------+
| DICOM stack size at any time (MB) | 195 | 2.5 | 2.5 |
+-----------------------------------+-------------------------+----------------+---------------+
| Max usage (MB) | 257 | 64.5 | 28.5 |
+-----------------------------------+-------------------------+----------------+---------------+
| Memory usage vs dataset (x) | 19.8 | 5.0 | 2.2 |
+-----------------------------------+-------------------------+----------------+---------------+
| Memory optimizations (x) | | 4.0 | 9.0 |
+-----------------------------------+-------------------------+----------------+---------------+
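The bottom three rows follow directly from the rows above them; a quick check of the arithmetic for this example dataset:

```python
# Space uses (MB) for the example dataset in the table above.
zip_mb, extracted_mb = 13, 49
stack_full, stack_lazy, shadow_mb = 195, 2.5, 13

# "Max usage" sums each column's space uses.
non_optimized = zip_mb + extracted_mb + stack_full  # 257
lazy = zip_mb + extracted_mb + stack_lazy           # 64.5
shadow = zip_mb + shadow_mb + stack_lazy            # 28.5

# "Memory usage vs dataset" divides max usage by the 13MB archive;
# "Memory optimizations" divides the default's max usage by each method's.
assert round(non_optimized / zip_mb, 1) == 19.8
assert round(lazy / zip_mb, 1) == 5.0
assert round(shadow / zip_mb, 1) == 2.2
assert round(non_optimized / lazy, 1) == 4.0
assert round(non_optimized / shadow, 1) == 9.0
```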

Which one should you use? It depends on your environment. If memory is no concern,
the default is fine. If you are memory-constrained but not disk-constrained, the extracted lazy
method is best. If you are both memory- and disk-constrained, the shadow object is best.

How to enable memory optimizations
----------------------------------

Memory efficiency is not enabled by default for historical reasons. You generally
don't have to worry about the implementation details. If you want to enable it,
you can do so in the constructor of any of the DICOM stack classes (CatPhan, ACR MRI/CT, Cheese, etc.)
by passing ``memory_efficient_mode=True``.

* When ``memory_efficient_mode=False`` (default), the non-optimized method is used and everything is loaded into memory.
* When ``memory_efficient_mode=True`` and the dataset is not a ZIP archive (a list of files already on disk), the lazy method is used.
* When ``memory_efficient_mode=True`` and the dataset is a ZIP archive, the shadow object method is used.
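For example (the file name is a placeholder, and ``CatPhan504`` stands in for any of the DICOM stack classes):

```python
from pylinac import CatPhan504

# Passing a ZIP archive with memory_efficient_mode=True uses the shadow
# object method; nothing is extracted to disk.
ct = CatPhan504.from_zip("my_catphan_scan.zip", memory_efficient_mode=True)
ct.analyze()
print(ct.results())
```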
108 changes: 105 additions & 3 deletions pylinac/core/image.py
@@ -9,13 +9,15 @@
 import os.path as osp
 import re
 import warnings
+import zlib
 from collections import Counter
 from collections.abc import Iterable, Sequence
 from datetime import datetime
 from functools import cached_property
 from io import BufferedReader, BytesIO
 from pathlib import Path
 from typing import Any, BinaryIO, Union
+from zipfile import ZipFile

import argue
import matplotlib.pyplot as plt
@@ -1692,7 +1694,7 @@ def __sub__(self, other):


 class LazyDicomImageStack:
-    _image_path_keys: list[Path]
+    _image_path_keys: list[Path | str]
     metadatas: list[pydicom.Dataset]

     def __init__(
@@ -1743,7 +1745,7 @@ def __init__(
         self._image_path_keys = [paths[i] for i in order]

     @classmethod
-    def from_zip(cls, zip_path: str | Path, dtype: np.dtype | None = None):
+    def from_zip(cls, zip_path: str | Path, dtype: np.dtype | None = None, **kwargs):
         """Load a DICOM ZIP archive.

         Parameters
@@ -1754,7 +1756,7 @@ def from_zip(cls, zip_path: str | Path, dtype: np.dtype | None = None):
             The data type to cast the image data as. If None, will use whatever raw image format is.
         """
         with TemporaryZipDirectory(zip_path, delete=False) as tmpzip:
-            obj = cls(tmpzip, dtype)
+            obj = cls(tmpzip, dtype, **kwargs)
             return obj

     def _get_common_uid_imgs(
@@ -1825,6 +1827,106 @@ def __len__(self):
         return len(self._image_path_keys)


class LazyZipDicomImageStack(LazyDicomImageStack):
    """A variant of the lazy stack where a .zip archive is passed and
    the archive is NOT extracted to disk. The most memory-efficient for use cases
    like Cloud Run where disk=memory."""

    def __init__(
        self,
        folder: str | Path,
        dtype: np.dtype | None = None,
        min_number: int = 39,
        check_uid: bool = True,
    ):
        """Load a ZIP archive of DICOM CT images. This variant is more memory-efficient
        than the standard DicomImageStack. Rather than extracting the archive to disk,
        each image is held in memory as a compressed blob and decompressed on the fly
        when requested.

        See the documentation for DicomImageStack for parameter descriptions.
        """
        self.dtype = dtype
        self.zip_archive = Path(folder)
        self.shadow_images = {}
        with ZipFile(self.zip_archive) as zfile:
            paths = zfile.namelist()
        # we only want to read the metadata once
        # so we read it here and then filter and sort
        metadatas, paths = self._get_path_metadatas(paths)

        # check that at least 1 image was loaded
        if len(paths) < 1:
            raise FileNotFoundError(
                f"No files were found in the specified location: {folder}"
            )

        # error checking
        if check_uid:
            most_common_uid = self._get_common_uid_imgs(metadatas, min_number)
            # filter the paths first, while they are still aligned with the metadatas
            paths = [
                p
                for p, m in zip(paths, metadatas)
                if m.SeriesInstanceUID == most_common_uid
            ]
            metadatas = [m for m in metadatas if m.SeriesInstanceUID == most_common_uid]
        # sort according to physical order
        order = np.argsort([m.ImagePositionPatient[-1] for m in metadatas])
        self.metadatas = [metadatas[i] for i in order]
        self._image_path_keys = [paths[i] for i in order]
        self.create_shadow(paths)

    @classmethod
    def from_zip(cls, zip_path: str | Path, dtype: np.dtype | None = None, **kwargs):
        # the zip lazy stack assumes a zip file is passed
        return cls(zip_path, dtype, **kwargs)

    def create_shadow(self, paths: list[str]):
        """Read each image out of the archive once and keep a zlib-compressed copy in memory."""
        with ZipFile(self.zip_archive) as zfile:
            for path in paths:
                with zfile.open(path) as file:
                    data = file.read()
                self.shadow_images[path] = {
                    "data": zlib.compress(data),
                }

    def _get_path_metadatas(
        self, names: list[str]
    ) -> tuple[list[pydicom.Dataset], list[str]]:
        """Get the metadata for the images. This also filters out non-image files."""
        metadata = []
        matched_paths = []
        with ZipFile(self.zip_archive) as zfile:
            for name in names:
                with zfile.open(name) as f:
                    try:
                        ds = pydicom.dcmread(f, force=True, stop_before_pixels=True)
                        if "Image Storage" in ds.SOPClassUID.name:
                            metadata.append(ds)
                            matched_paths.append(name)
                    except (InvalidDicomError, AttributeError, MemoryError):
                        pass
        return metadata, matched_paths

    def __getitem__(self, item: int) -> DicomImage:
        data = self.shadow_images[self._image_path_keys[item]]
        image = io.BytesIO(zlib.decompress(data["data"]))
        return DicomImage(image, dtype=self.dtype)

    def __setitem__(self, key: int, value: DicomImage):
        """Save the passed image to the shadow object in place of the current image."""
        with io.BytesIO() as stream:
            value.save(stream)
            self.shadow_images[self._image_path_keys[key]] = {
                "data": zlib.compress(stream.getvalue()),
            }

    def __delitem__(self, key: int):
        """Delete the image from the shadow object."""
        full_key = self._image_path_keys.pop(key)
        self.shadow_images.pop(full_key)

class DicomImageStack(LazyDicomImageStack):
"""A class that loads and holds a stack of DICOM images (e.g. a CT dataset). The class can take
a folder or zip file and will read CT images. The images must all be the same size. Supports
42 changes: 23 additions & 19 deletions pylinac/ct.py
@@ -43,7 +43,7 @@
 from .core.contrast import Contrast
 from .core.geometry import Line, Point
 from .core.image import ArrayImage, DicomImageStack, ImageLike, z_position
-from .core.io import TemporaryZipDirectory, get_url, retrieve_demo_file
+from .core.io import get_url, retrieve_demo_file
 from .core.mtf import MTF
 from .core.nps import (
     average_power,
@@ -1753,6 +1753,7 @@ def __init__(
         folderpath: str | Sequence[str] | Path | Sequence[Path] | Sequence[BytesIO],
         check_uid: bool = True,
         memory_efficient_mode: bool = False,
+        is_zip: bool = False,
     ):
         """
         Parameters
@@ -1774,17 +1775,23 @@
         """
         self.origin_slice = 0
         self.catphan_roll = 0
-        if isinstance(folderpath, (str, Path)):
+        if isinstance(folderpath, (str, Path)) and not is_zip:
             if not osp.isdir(folderpath):
                 raise NotADirectoryError("Path given was not a Directory/Folder")
-        stack = (
-            image.DicomImageStack
-            if not memory_efficient_mode
-            else image.LazyDicomImageStack
-        )
-        self.dicom_stack = stack(
-            folderpath, check_uid=check_uid, min_number=self.min_num_images
-        )
+        if not memory_efficient_mode:
+            stack = image.DicomImageStack
+        elif memory_efficient_mode and is_zip:
+            stack = image.LazyZipDicomImageStack
+        else:
+            stack = image.LazyDicomImageStack
+        if is_zip:
+            self.dicom_stack = stack.from_zip(
+                folderpath, check_uid=check_uid, min_number=self.min_num_images
+            )
+        else:
+            self.dicom_stack = stack(
+                folderpath, check_uid=check_uid, min_number=self.min_num_images
+            )

@classmethod
def from_demo_images(cls):
@@ -1830,15 +1837,12 @@ def from_zip(
         FileExistsError : If zip_file passed was not a legitimate zip file.
         FileNotFoundError : If no CT images are found in the folder
         """
-        delete = not memory_efficient_mode
-        with TemporaryZipDirectory(zip_file, delete=delete) as temp_zip:
-            obj = cls(
-                temp_zip,
-                check_uid=check_uid,
-                memory_efficient_mode=memory_efficient_mode,
-            )
-            obj.was_from_zip = True
-            return obj
+        return cls(
+            folderpath=zip_file,
+            check_uid=check_uid,
+            memory_efficient_mode=memory_efficient_mode,
+            is_zip=True,
+        )

def plotly_analyzed_images(
self,
20 changes: 20 additions & 0 deletions tests_basic/core/test_image.py
@@ -27,6 +27,7 @@
     DicomImageStack,
     FileImage,
     LazyDicomImageStack,
+    LazyZipDicomImageStack,
     LinacDicomImage,
     _rescale_dicom_values,
     _unscale_dicom_values,
@@ -844,6 +845,25 @@ def test_writing_back_to_lazy_stack(self):
         # assert that writing back to the stack works
         assert_array_almost_equal(manipulated_offset, original_offset)

     def test_writing_back_to_zip_lazy_stack(self):
         dstack_lazy = LazyZipDicomImageStack.from_zip(self.stack_location)
         original_length = len(dstack_lazy)
         original_offset = np.copy(dstack_lazy[0].array) + 50
         first_img = dstack_lazy[0]
         first_img.array += 50
         dstack_lazy[0] = first_img
         manipulated_offset = np.copy(dstack_lazy[0].array)
         # assert that writing back to the stack works
         assert_array_almost_equal(manipulated_offset, original_offset)
         # also check the number of files has not changed
         self.assertEqual(len(dstack_lazy), original_length)

     def test_deleting_zip_lazy_from_stack(self):
         lazy_stack = LazyZipDicomImageStack.from_zip(self.stack_location)
         original_length = len(lazy_stack)
         del lazy_stack[0]
         self.assertEqual(len(lazy_stack), original_length - 1)

     def test_slice_spacing_ct(self):
         dstack = DicomImageStack.from_zip(self.stack_location)
         self.assertAlmostEqual(dstack.slice_spacing, 2.5, delta=0.001)