Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimisation of directory type checking in file list #1381

Draft
wants to merge 3 commits into
base: dev
Choose a base branch
from

Conversation

confluence
Copy link
Collaborator

This is an attempt to improve the performance of generating a file list for directories with lots of subdirectories. The main performance bottlenecks in the current code are:

  1. Checking whether a directory is a known image format, and should be treated as a file,
  2. Counting the children of a subdirectory.

There are three options for displaying file lists: filtering by content, filtering by file extension, and no filtering. Filtering by content additionally slows down the file list generation because it reads the signature of each ordinary file to determine the type. Guessing the type from the file extension speeds up the processing of ordinary files, but not subdirectories, which are always checked with casacore's ImageOpener. Turning off filtering entirely has no additional performance impact on the image file list (because detection of directory-based images is still required), but speeds up the region file list (because if no filtering is required, all directories can be shown without additional checks).

This PR is an attempt to add a less expensive heuristic for directory-based images, to be used when the option to filter by content is not selected. It performs almost the same checks as ImageOpener, but only for directories, by looking for files inside the directory with fs::exists, and without distinguishing between different CASA image subtypes.

The existing code assumes that there are directory image formats which we do not support, and handles them differently. However, it's clear from the ImageOpener code that the GIPSY format is a pair of files, not a directory (so ImageOpener would never return that type for a directory), and the CAIPS and NEWSTAR types are obsolete and never returned by ImageOpener (at all). So I have removed this option from the code, and not implemented it in the alternative code.

The result: the alternative code appears to be slightly faster, but I don't know if it's faster enough for it to make sense to add it as an alternative to the casacore code. If this implementation is sufficient for our purposes (e.g. we don't need to read the table.info file because we don't need do distinguish between CASA sub-types here), then perhaps we should replace the casacore check with this (for a modest speed improvement in all cases).

Other optimizations we discussed:

  • Not counting subdirectory children
  • Disabling directory type-checking entirely (if not filtering by content?), and changing the frontend UI to fetch the type information when a directory is clicked on
  • Progressively requesting info for batches of directories as the user scrolls (this would require a more substantial redesign, with ICD changes)

I think we're planning to use the last option as our long-term solution, and I would suggest applying that strategy to all files: instead of loading file information up-front when generating the file list, we could initially return just a bare list of files and directories, and then return information for lists of files and directories as the frontend requests them.

Checklist

  • changelog updated / no changelog update needed
  • e2e test passing / corresponding fix added / new e2e test created
  • ICD test passing / corresponding fix added / new ICD test created
  • protobuf updated to the latest dev commit / no protobuf update needed
  • protobuf version bumped / protobuf version not bumped
  • added reviewers and assignee
  • GitHub Project estimate added

@confluence confluence self-assigned this Jun 14, 2024
@kswang1029
Copy link
Contributor

kswang1029 commented Jul 1, 2024

I did a test with a directory with 20000 fits image files, and a directory with 20000 casa image directories. The text computer is a local build with a fast SSD. I did purge disk cache for each test case.

The test results (in seconds) are summarized in the following table:
Screenshot 2024-07-01 at 11 06 30

We see improvements of ~2x for the casa image with the "filter by extension" and the "all files" modes. Is this expected in this improvement and consistent with what you tested and observed?

The psrecord plots are attached below as a record. The first plateau is casa image and the second is fits image.

old - content
old_content

new - content
new_content

old - extension
old_extension

new - extension
new_extension

old - all
old_all

new - all
new_all

It appears to me that once we access the context of a file, the CPU usage won't be 100% in my test case.

Copy link

Code Coverage

Package Line Rate Health
src.Cache 72%
src.DataStream 44%
src.FileList 67%
src.Frame 36%
src.HttpServer 42%
src.ImageData 28%
src.ImageFitter 83%
src.ImageGenerators 43%
src.ImageStats 75%
src.Logger 37%
src.Main 52%
src.Region 69%
src.Session 4%
src.Table 52%
src.ThreadingManager 67%
src.Timer 85%
src.Util 38%
Summary 46% (8638 / 18963)

Copy link
Collaborator

@pford pford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just found a place in ProgramSettings.cc:338 which uses casacore::ImageOpener to check whether the positional folder argument is a CASA image to be opened. This should also use GuessImageDirectoryType.

@@ -10,6 +10,7 @@
#include <fstream>
#include <regex>

#include "Casacore.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this include necessary?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asking because I am looking at a file list bug and might include File.h in Casacore.h.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants