Merged in feature/ultra_lazy_dicom_stack (pull request #460)

Feature/ultra lazy dicom stack

Approved-by: Randy Taylor
jrkerns committed Oct 28, 2024
2 parents 374e48e + c404bfa commit 617d687
Showing 7 changed files with 282 additions and 22 deletions.
7 changes: 7 additions & 0 deletions docs/source/changelog.rst
@@ -30,6 +30,13 @@ TRS-398
This change will affect absorbed dose TRS-398 calculations if you rely on the ``k_tp`` function. If you are using TRS-398, please verify that your
results are still accurate. We apologize for this oversight.

CT
^^

* There is a new parameter for CT-like constructor classes: ``is_zip``. This is mostly an internal
  flag and is used when calling the ``.from_zip`` method. The default is ``False``. This is backwards-compatible
  and should not affect users. This was done for internal refactoring reasons.

v 3.28.0
--------

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -56,6 +56,7 @@
   topics/nps
   topics/mtf
   topics/scale
   topics/memory
   topics/profiles
   topics/image_metrics
   topics/fluence
102 changes: 102 additions & 0 deletions docs/source/topics/memory.rst
@@ -0,0 +1,102 @@
=================
Memory Efficiency
=================

Generally speaking, Python is not the most memory-efficient language.
Most of the time this doesn't make a significant difference, but in
memory-constrained environments it can be a problem.
It's unlikely that you'll actually need to worry about this, but this
section provides some context. Within the RadMachine environment, memory can be
a precious commodity, so some work has been done to optimize memory usage.
It is explained here for clarity and completeness.

.. note::

    Memory optimization is only relevant here for DICOM stacks (CT, MR, etc).

.. note::

    "Space" here refers to both memory **and** disk space. In the RadMachine cloud environment,
    disk space is the same as memory, so for the following discussion there is no benefit to using memory vs disk.

There are 3 uses of space when analyzing a CT dataset:

* The original dataset as passed (e.g. a zip archive or a list of files)
* The extracted dataset (if applicable)
* The array(s) in memory

In the worst-case scenario all three uses are employed.
E.g. a 5MB zip archive is passed.
It is then extracted to a 30MB directory,
and then each image is loaded into numpy as an array (usually uint16 or float).
Depending on the CT scanner this can be ~2.5MB/image.
So in total we might be using 5 + 30 + (60 * 2.5) = 185MB of space,
assuming 60 images.
This doesn't sound like much, but it can be considerably larger depending on the
scanner and number of slices.

We can’t ever get rid of the first dataset, but we can in theory get rid of the latter two.

The loaded arrays in memory are the easiest to get rid of. To do so, we can simply
load each image on demand, i.e. "lazily". This results in only 1
image in memory at any given time. This approach is referred to as "lazy".
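The idea can be sketched in a few lines. This is a simplified illustration, not pylinac's actual class: the ``LazyStack`` name is hypothetical and generic byte-reading stands in for the real DICOM parsing.

```python
from pathlib import Path


class LazyStack:
    """Hold only file paths; read a file's contents when an index is requested."""

    def __init__(self, folder: str, pattern: str = "*.dcm"):
        self.paths = sorted(Path(folder).glob(pattern))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, i: int) -> bytes:
        # Only one file's contents are materialized at a time; nothing is cached.
        return self.paths[i].read_bytes()
```

The trade-off is repeated disk reads: every index access re-reads the file, which is where the slowdown in the table below comes from.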

The extracted dataset can be more difficult. Assuming the original dataset is also
counted toward the space used, we want to minimize what we can. This means
we ideally want to be passed a compressed dataset (ZIP archive) and also load one
image lazily as requested. This can be done in vanilla Python via the ``zipfile`` module but suffers from
two problems:

- It can modify the data in-place. I.e. performing a modification actually modifies
the original data. This is undesirable.
- Changing or removing data from the archive is difficult and slow. Python can easily
read from a ZIP archive but writing to it or removing data requires rewriting the
entire archive. This is extremely slow.
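To make the second point concrete, here is a standard-library sketch of what "removing" a member from a ZIP archive entails (the ``remove_member`` helper is illustrative, not part of any library): every surviving member must be copied into a brand-new archive.

```python
import zipfile


def remove_member(src: str, dst: str, name: str) -> None:
    """Drop one member by rewriting the whole archive: zipfile has no delete API."""
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(dst, "w") as zout:
        for item in zin.infolist():
            if item.filename != name:
                # every surviving member is read out and written into the new archive
                zout.writestr(item, zin.read(item.filename))
```

For a multi-hundred-MB dataset this full rewrite on every mutation is prohibitively slow.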

The best solution so far is to create a "shadow" object that holds the data in memory,
but is also compressed. This is a compromise between the two. Generally speaking,
the total space usage for this method is around 2x the size of the compressed dataset.
This is referred to as "shadow object".
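A stripped-down sketch of the idea (the ``ShadowStack`` name is illustrative; the real implementation also parses DICOM metadata and sorts slices): each member of the archive is read once and kept as a zlib-compressed blob, and only the requested member is ever decompressed.

```python
import io
import zipfile
import zlib


class ShadowStack:
    """Keep a compressed in-memory copy of each archive member; expand one on demand."""

    def __init__(self, zip_path: str):
        self._blobs: dict[str, bytes] = {}
        with zipfile.ZipFile(zip_path) as zf:
            for name in zf.namelist():
                # recompress each member into memory; the extracted form never hits disk
                self._blobs[name] = zlib.compress(zf.read(name))
        self._names = sorted(self._blobs)

    def __len__(self) -> int:
        return len(self._names)

    def __getitem__(self, i: int) -> io.BytesIO:
        # decompress only the requested member
        return io.BytesIO(zlib.decompress(self._blobs[self._names[i]]))
```

Because the shadow copy is roughly the same size as the original archive, total space sits near 2x the compressed dataset, as stated above.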

See the table below for a comparison:

.. note::

    This assumes a ZIP archive is passed to the constructor. Numbers will vary depending on
    the dataset and the computer.

+-----------------------------------+-------------------------+----------------+---------------+
| | non-optimized (default) | lazy | shadow object |
+===================================+=========================+================+===============+
| Slowdown (%) | | 5-15% | 10-20% |
+-----------------------------------+-------------------------+----------------+---------------+
| zip archive (MB) | 13 | 13 | 13 |
+-----------------------------------+-------------------------+----------------+---------------+
| extracted zip size (MB) | 49 | 49 | 0 |
+-----------------------------------+-------------------------+----------------+---------------+
| Shadow object (MB) | 0 | 0 | 13 |
+-----------------------------------+-------------------------+----------------+---------------+
| DICOM stack size at any time (MB) | 195 | 2.5 | 2.5 |
+-----------------------------------+-------------------------+----------------+---------------+
| Max usage (MB) | 257 | 64.5 | 28.5 |
+-----------------------------------+-------------------------+----------------+---------------+
| Memory usage vs dataset (x) | 19.8 | 5.0 | 2.2 |
+-----------------------------------+-------------------------+----------------+---------------+
| Memory optimizations (x) | | 4.0 | 9.0 |
+-----------------------------------+-------------------------+----------------+---------------+
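The bottom three rows follow directly from the rows above them; a quick check of the arithmetic for this example dataset:

```python
# Space uses (MB) for the example dataset in the table above.
zip_mb, extracted_mb = 13, 49
stack_full, stack_lazy, shadow_mb = 195, 2.5, 13

# "Max usage" sums each column's space uses.
non_optimized = zip_mb + extracted_mb + stack_full  # 257
lazy = zip_mb + extracted_mb + stack_lazy           # 64.5
shadow = zip_mb + shadow_mb + stack_lazy            # 28.5

# "Memory usage vs dataset" divides max usage by the 13MB archive;
# "Memory optimizations" divides the default's max usage by each method's.
assert round(non_optimized / zip_mb, 1) == 19.8
assert round(lazy / zip_mb, 1) == 5.0
assert round(shadow / zip_mb, 1) == 2.2
assert round(non_optimized / lazy, 1) == 4.0
assert round(non_optimized / shadow, 1) == 9.0
```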

Which one should you use? It depends on your environment. If memory is no concern,
the default is fine. If you are memory-constrained but not disk-constrained, the extracted lazy
method is best. If you are both memory- and disk-constrained, the shadow object is best.

How to enable memory optimizations
----------------------------------

Memory efficiency is not enabled by default for historical reasons. You generally
don't have to worry about the implementation details. If you want to enable it,
you can do so in the constructor of any of the DICOM stack classes (CatPhan, ACR MRI/CT, Cheese, etc.)
by passing ``memory_efficient_mode=True``.

* When ``memory_efficient_mode=False`` (default), the non-optimized method is used and everything is loaded into memory.
* When ``memory_efficient_mode=True`` and the dataset is not a ZIP archive (a list of files already on disk), the lazy method is used.
* When ``memory_efficient_mode=True`` and the dataset is a ZIP archive, the shadow object method is used.
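For example (the file name is a placeholder, and ``CatPhan504`` stands in for any of the DICOM stack classes):

```python
from pylinac import CatPhan504

# Passing a ZIP archive with memory_efficient_mode=True uses the shadow
# object method; nothing is extracted to disk.
ct = CatPhan504.from_zip("my_catphan_scan.zip", memory_efficient_mode=True)
ct.analyze()
print(ct.results())
```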
108 changes: 105 additions & 3 deletions pylinac/core/image.py
@@ -9,13 +9,15 @@
 import os.path as osp
 import re
 import warnings
+import zlib
 from collections import Counter
 from collections.abc import Iterable, Sequence
 from datetime import datetime
 from functools import cached_property
 from io import BufferedReader, BytesIO
 from pathlib import Path
 from typing import Any, BinaryIO, Union
+from zipfile import ZipFile

import argue
import matplotlib.pyplot as plt
@@ -1692,7 +1694,7 @@ def __sub__(self, other):


 class LazyDicomImageStack:
-    _image_path_keys: list[Path]
+    _image_path_keys: list[Path | str]
     metadatas: list[pydicom.Dataset]

     def __init__(
@@ -1743,7 +1745,7 @@ def __init__(
         self._image_path_keys = [paths[i] for i in order]

     @classmethod
-    def from_zip(cls, zip_path: str | Path, dtype: np.dtype | None = None):
+    def from_zip(cls, zip_path: str | Path, dtype: np.dtype | None = None, **kwargs):
         """Load a DICOM ZIP archive.

         Parameters
@@ -1754,7 +1756,7 @@ def from_zip(cls, zip_path: str | Path, dtype: np.dtype | None = None):
             The data type to cast the image data as. If None, will use whatever raw image format is.
         """
         with TemporaryZipDirectory(zip_path, delete=False) as tmpzip:
-            obj = cls(tmpzip, dtype)
+            obj = cls(tmpzip, dtype, **kwargs)
             return obj

     def _get_common_uid_imgs(
@@ -1825,6 +1827,106 @@ def __len__(self):
         return len(self._image_path_keys)


class LazyZipDicomImageStack(LazyDicomImageStack):
    """A variant of the lazy stack where a .zip archive is passed and
    the archive is NOT extracted to disk. The most memory-efficient for use cases
    like Cloud Run where disk=memory."""

    def __init__(
        self,
        folder: str | Path,
        dtype: np.dtype | None = None,
        min_number: int = 39,
        check_uid: bool = True,
    ):
        """Load a ZIP archive of DICOM CT images. This variant is more memory-efficient
        than the standard DicomImageStack. Rather than extracting the archive to disk,
        each image is held in memory as a compressed blob and decompressed on the fly
        when requested.

        See the documentation for DicomImageStack for parameter descriptions.
        """
        self.dtype = dtype
        self.zip_archive = Path(folder)
        self.shadow_images = {}
        with ZipFile(self.zip_archive) as zfile:
            paths = zfile.namelist()
        # we only want to read the metadata once
        # so we read it here and then filter and sort
        metadatas, paths = self._get_path_metadatas(paths)

        # check that at least 1 image was loaded
        if len(paths) < 1:
            raise FileNotFoundError(
                f"No files were found in the specified location: {folder}"
            )

        # error checking
        if check_uid:
            most_common_uid = self._get_common_uid_imgs(metadatas, min_number)
            # filter the paths first, while they are still aligned with the metadatas
            paths = [
                p
                for p, m in zip(paths, metadatas)
                if m.SeriesInstanceUID == most_common_uid
            ]
            metadatas = [m for m in metadatas if m.SeriesInstanceUID == most_common_uid]
        # sort according to physical order
        order = np.argsort([m.ImagePositionPatient[-1] for m in metadatas])
        self.metadatas = [metadatas[i] for i in order]
        self._image_path_keys = [paths[i] for i in order]
        self.create_shadow(paths)

    @classmethod
    def from_zip(cls, zip_path: str | Path, dtype: np.dtype | None = None, **kwargs):
        # the zip lazy stack assumes a zip file is passed
        return cls(zip_path, dtype, **kwargs)

    def create_shadow(self, paths: list[str]):
        """Read each image out of the archive once and keep a zlib-compressed copy in memory."""
        with ZipFile(self.zip_archive) as zfile:
            for path in paths:
                with zfile.open(path) as file:
                    data = file.read()
                self.shadow_images[path] = {
                    "data": zlib.compress(data),
                }

    def _get_path_metadatas(
        self, names: list[str]
    ) -> tuple[list[pydicom.Dataset], list[str]]:
        """Get the metadata for the images. This also filters out non-image files."""
        metadata = []
        matched_paths = []
        with ZipFile(self.zip_archive) as zfile:
            for name in names:
                with zfile.open(name) as f:
                    try:
                        ds = pydicom.dcmread(f, force=True, stop_before_pixels=True)
                        if "Image Storage" in ds.SOPClassUID.name:
                            metadata.append(ds)
                            matched_paths.append(name)
                    except (InvalidDicomError, AttributeError, MemoryError):
                        pass
        return metadata, matched_paths

    def __getitem__(self, item: int) -> DicomImage:
        data = self.shadow_images[self._image_path_keys[item]]
        image = io.BytesIO(zlib.decompress(data["data"]))
        return DicomImage(image, dtype=self.dtype)

    def __setitem__(self, key: int, value: DicomImage):
        """Save the passed image to the shadow object in place of the current image."""
        with io.BytesIO() as stream:
            value.save(stream)
            self.shadow_images[self._image_path_keys[key]] = {
                "data": zlib.compress(stream.getvalue()),
            }

    def __delitem__(self, key: int):
        """Delete the image from the shadow object."""
        full_key = self._image_path_keys.pop(key)
        self.shadow_images.pop(full_key)

class DicomImageStack(LazyDicomImageStack):
"""A class that loads and holds a stack of DICOM images (e.g. a CT dataset). The class can take
a folder or zip file and will read CT images. The images must all be the same size. Supports
42 changes: 23 additions & 19 deletions pylinac/ct.py
@@ -43,7 +43,7 @@
 from .core.contrast import Contrast
 from .core.geometry import Line, Point
 from .core.image import ArrayImage, DicomImageStack, ImageLike, z_position
-from .core.io import TemporaryZipDirectory, get_url, retrieve_demo_file
+from .core.io import get_url, retrieve_demo_file
 from .core.mtf import MTF
 from .core.nps import (
     average_power,
@@ -1753,6 +1753,7 @@ def __init__(
         folderpath: str | Sequence[str] | Path | Sequence[Path] | Sequence[BytesIO],
         check_uid: bool = True,
         memory_efficient_mode: bool = False,
+        is_zip: bool = False,
     ):
         """
         Parameters
@@ -1774,17 +1775,23 @@
         """
         self.origin_slice = 0
         self.catphan_roll = 0
-        if isinstance(folderpath, (str, Path)):
+        if isinstance(folderpath, (str, Path)) and not is_zip:
             if not osp.isdir(folderpath):
                 raise NotADirectoryError("Path given was not a Directory/Folder")
-        stack = (
-            image.DicomImageStack
-            if not memory_efficient_mode
-            else image.LazyDicomImageStack
-        )
-        self.dicom_stack = stack(
-            folderpath, check_uid=check_uid, min_number=self.min_num_images
-        )
+        if not memory_efficient_mode:
+            stack = image.DicomImageStack
+        elif memory_efficient_mode and is_zip:
+            stack = image.LazyZipDicomImageStack
+        else:
+            stack = image.LazyDicomImageStack
+        if is_zip:
+            self.dicom_stack = stack.from_zip(
+                folderpath, check_uid=check_uid, min_number=self.min_num_images
+            )
+        else:
+            self.dicom_stack = stack(
+                folderpath, check_uid=check_uid, min_number=self.min_num_images
+            )

@classmethod
def from_demo_images(cls):
@@ -1830,15 +1837,12 @@ def from_zip(
         FileExistsError : If zip_file passed was not a legitimate zip file.
         FileNotFoundError : If no CT images are found in the folder
         """
-        delete = not memory_efficient_mode
-        with TemporaryZipDirectory(zip_file, delete=delete) as temp_zip:
-            obj = cls(
-                temp_zip,
-                check_uid=check_uid,
-                memory_efficient_mode=memory_efficient_mode,
-            )
-            obj.was_from_zip = True
-            return obj
+        return cls(
+            folderpath=zip_file,
+            check_uid=check_uid,
+            memory_efficient_mode=memory_efficient_mode,
+            is_zip=True,
+        )

def plotly_analyzed_images(
self,
20 changes: 20 additions & 0 deletions tests_basic/core/test_image.py
@@ -27,6 +27,7 @@
     DicomImageStack,
     FileImage,
     LazyDicomImageStack,
+    LazyZipDicomImageStack,
     LinacDicomImage,
     _rescale_dicom_values,
     _unscale_dicom_values,
@@ -844,6 +845,25 @@ def test_writing_back_to_lazy_stack(self):
         # assert that writing back to the stack works
         assert_array_almost_equal(manipulated_offset, original_offset)

     def test_writing_back_to_zip_lazy_stack(self):
         dstack_lazy = LazyZipDicomImageStack.from_zip(self.stack_location)
         original_length = len(dstack_lazy)
         original_offset = np.copy(dstack_lazy[0].array) + 50
         first_img = dstack_lazy[0]
         first_img.array += 50
         dstack_lazy[0] = first_img
         manipulated_offset = np.copy(dstack_lazy[0].array)
         # assert that writing back to the stack works
         assert_array_almost_equal(manipulated_offset, original_offset)
         # also check the number of files has not changed
         self.assertEqual(len(dstack_lazy), original_length)

     def test_deleting_zip_lazy_from_stack(self):
         lazy_stack = LazyZipDicomImageStack.from_zip(self.stack_location)
         original_length = len(lazy_stack)
         del lazy_stack[0]
         self.assertEqual(len(lazy_stack), original_length - 1)

     def test_slice_spacing_ct(self):
         dstack = DicomImageStack.from_zip(self.stack_location)
         self.assertAlmostEqual(dstack.slice_spacing, 2.5, delta=0.001)