
Data table directories are walked at startup #15307

Open
natefoo opened this issue Jan 13, 2023 · 17 comments
Comments

@natefoo
Member

natefoo commented Jan 13, 2023

Describe the bug
On Galaxy load, the paths referenced in the location files of data tables are fully walked. This can delay startup by minutes to hours if the reference data are on a slow filesystem and contain a huge number of files or directories. Walking just a subset of this data takes 94 minutes on the instance in question.

Does anyone know why we do this? Dynamic options? If it's necessary, can we put some kind of limitation on it or allow tool data tables to define that they should be excluded from this discovery process?

Galaxy Version and/or server at which you observed the bug
Galaxy Version: 23.0.dev
Commit: latest

To Reproduce
Steps to reproduce the behavior:

  1. Install data_manager_gtdbtk_database_installer
  2. Run the DM and select release 207.
  3. Upon completion, restart Galaxy. There will be a long wait on startup, depending on IO performance.

Expected behavior
Data table load only inspects files referenced in the location file or limits its traversal in some way.

Screenshots
Here's the stack during walking:

root@galaxy-psu-prod:~# ~glxyhpc/py-spy/bin/py-spy dump --pid 2065390 --full-filenames
Process 2065390: /corral/projects/Galaxy-PSU/lrn/galaxy/venv/bin/python3 /corral/projects/Galaxy-PSU/lrn/galaxy/venv/bin/gunicorn galaxy.webapps.galaxy.fast_factory:factory() --timeout 600 --pythonpath lib -k galaxy.webapps.galaxy.workers.Worker -b localhost:8080 --workers=4 --config python:galaxy.web_stack.gunicorn_config --preload
Python v3.9.15 (/corral/projects/Galaxy-PSU/lrn/conda/envs/[email protected]/bin/python3.9)

Thread 2065390 (idle): "MainThread"
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/[email protected]/lib/python3.9/os.py:357)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/[email protected]/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/[email protected]/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/[email protected]/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/[email protected]/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/[email protected]/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/[email protected]/lib/python3.9/os.py:418)
    _walk (/corral/projects/Galaxy-PSU/lrn/conda/envs/[email protected]/lib/python3.9/os.py:418)
    <listcomp> (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:60)
    update_files (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:60)
    tool_data_path_files (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:51)
    exists (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:75)
    configure_and_load (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:518)
    __init__ (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:438)
    from_elem (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:295)
    load_from_config_file (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:150)
    __init__ (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/tools/data/__init__.py:97)
    _configure_tool_data_tables (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/app.py:305)
    __init__ (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/app.py:620)
    app_pair (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/webapps/galaxy/buildapp.py:59)
    factory (/corral/projects/Galaxy-PSU/lrn/galaxy/server/lib/galaxy/webapps/galaxy/fast_factory.py:63)
    import_app (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/util.py:412)
    load_wsgiapp (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py:48)
    load (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py:58)
    wsgi (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/base.py:67)
    setup (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/arbiter.py:118)
    __init__ (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/arbiter.py:58)
    run (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/base.py:72)
    run (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/base.py:231)
    run (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py:67)
    <module> (/corral/projects/Galaxy-PSU/lrn/galaxy/venv/bin/gunicorn:8)

Additional context
It looks like this goes way back. The process was improved in #3909 but a full walk is still performed.

@natefoo natefoo added this to the 23.0 milestone Jan 13, 2023
@bernt-matthias
Contributor

Might it be that this is not from the latest version? The stack seems to fit 22.05?

Regardless, if I interpret the code correctly, the tool_data dir is walked to find all the .loc and .loc.sample files. Unfortunately, the tool_data dir also contains the data...

If I remember correctly, loc files are actually only in the root of the tool_data dir and in one specific subdir (for the shed tools), right? So we could limit the walk. Alternatively, we could separate the dir containing the loc files from the dir containing the data...

@natefoo
Member Author

natefoo commented Jan 16, 2023

Yes, apologies, I put 23.0 in the report since it's still present, but you're right, this is on a 22.05 instance.

Do we need to locate .loc files at all? They should be directly referenced by tool data table config files. Do we support .loc files that aren't referenced by a tool data table?

@natefoo
Member Author

natefoo commented Jan 16, 2023

Ok, taking a minute to actually look at the stack, it looks like the walk is performed when:

  1. the location file path in the tool data table conf is not absolute,
  2. the part of the path preceding the location file is not equal to tool_data_path, and
  3. the location file is not at the root of tool_data_path.

Kind of funny we need to do this when Galaxy created the location file and should know where it is...
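
In other words, something like this (a rough sketch of the decision as I read it; a hypothetical simplification, not the actual configure_and_load code):

```python
import os


def loc_file_triggers_walk(loc_path: str, tool_data_path: str) -> bool:
    """Hypothetical sketch of the three conditions above, not the real code."""
    if os.path.isabs(loc_path):
        return False  # 1. absolute paths are used as-is
    if os.path.dirname(loc_path) == tool_data_path:
        return False  # 2. path prefix already equals tool_data_path
    if os.path.exists(os.path.join(tool_data_path, os.path.basename(loc_path))):
        return False  # 3. the loc file sits at the root of tool_data_path
    return True  # otherwise the loc file is searched for via a full walk
```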

@mvdbeek
Member

mvdbeek commented Jan 16, 2023

We do also allow for bare loc files in tools, but we could break that and see who screams (probably @blankenberg)?

@natefoo
Member Author

natefoo commented Jan 16, 2023

If nothing else we should make it a switch that we eventually default to off, but I guess that is unrelated to this issue.

@bernt-matthias
Contributor

I guess I would do a small scream :)

I manage my data_tables for which no data manager exists by creating an xyz.loc file in the tool-data/ dir. For instance, I create ./gene_sets.loc and completely ignore all the other copies:

./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/1ebd14235b92/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/eba36e001f45/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/dd4df992d93d/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/76d290331481/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/stringtie/333a6e13b622/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/46cccc52be5f/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/a37612abf7f9/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/de2c5e5a206c/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/af814359d244/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/9301937c9037/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/ea04b737afa0/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/1759d845181e/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/featurecounts/38b6d12edc68/gene_sets.loc
./toolshed.g2.bx.psu.edu/repos/iuc/gffcompare/2bb86e2c417f/gene_sets.loc

I never understood what these are good for anyway. Their content is merged on startup anyway, right? Would be really cool to understand this.

The reason for doing so is that I wanted to manage the data tables automatically with Ansible, and referring to the "versioned" data tables seemed to have no additional value.

@natefoo
Member Author

natefoo commented Jan 16, 2023

Yeah, they are merged. Honestly, this feature is really pointless, since the loc files installed with toolshed tools are basically never populated, yet they all have to be read on startup. If you do end up populating one by hand, you will probably manually add it to tool_data_table_conf.xml and manage all entries in a single loc file somewhere else. And if there's a DM, it updates the loc file in the DM's repo dir on disk, not any of the tool loc files.

@nuwang
Member

nuwang commented Jan 16, 2023

IMO, the problem with Galaxy startup in general is that it optimizes for the dev use case and not so much the admin use case. In a dev use case, sure, parsing all tool XMLs, rebuilding the navigator from scratch, etc. may be useful. But in admin use cases, and possibly even in dev use cases, it's more likely to be a waste of time. A restart is more likely to be due to some minor config change or to deal with a failure, and less frequently because the navigator menu or the loc files changed. Yet the startup penalty must be paid by admins each time regardless. This is especially noticeable on larger systems, where it can take forever to start up with a fully loaded navigator.

It would be much nicer to have some management commands like:

galaxy rebuild navigator
galaxy rebuild tool_cache
galaxy rebuild tool_data
galaxy rebuild all

or similar, and maybe a dev mode that would do all these things automatically for those who prefer it.
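
Entirely hypothetical, but a sketch of what such a command group could look like (none of these subcommands exist in Galaxy today):

```python
import argparse


def main():
    # Hypothetical "galaxy rebuild <target>" management command.
    parser = argparse.ArgumentParser(prog="galaxy")
    sub = parser.add_subparsers(dest="command", required=True)
    rebuild = sub.add_parser("rebuild", help="rebuild cached startup state")
    rebuild.add_argument(
        "target",
        choices=["navigator", "tool_cache", "tool_data", "all"],
        help="which cache to rebuild",
    )
    args = parser.parse_args()
    print(f"would rebuild: {args.target}")  # placeholder for the real rebuild logic


if __name__ == "__main__":
    main()
```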

The tool cache, for example, dropped startup time drastically (from 3 mins in a CVMFS setup to < 30 seconds), but was unfortunately removed due, as I recall, to the extra burden of maintaining a central cache. Instead, I think it would be better to explicitly hand the task of building the required caches over to the admin as needed. It would be even nicer if we could go a step further and barely touch the filesystem on restarts.

@mvdbeek
Member

mvdbeek commented Jan 17, 2023

with toolshed tools are basically never populated

That may be true now, maybe! But back when I was using data managers through the UI they would write to the loc file in their tool directory. Proceed with a giant warning here.

, the problem with Galaxy startup in general is that it optimizes for the dev use case and not so much the admin use case

I don't think that's the right way to look at it. We optimize for correctness, and dealing with a cache is, with the current architecture and availability of solutions (sqlite on NFS, looking at you), really complicated. That is my takeaway from having added different flavors of broken caches over the years to address this.

@bernt-matthias
Contributor

But back when I was using data managers through the UI they would write to the loc file in their tool directory.

This is still the case on my system. Still, the question remains what the versioned data tables are good for. One thing that comes to my mind is that it is immediately clear which data manager version produced the data. I guess that's already a sufficient reason.

Having all the loc files etc. in the corresponding dirs of the tools consuming the data tables is probably just a technicality.

Do you think it would be enough (i.e. correct) to traverse only

  • the files in the root of tool-data and
  • the complete subtrees of the tool-shed subdirs contained in tool-data

I guess this would exclude all the data and yield some speedup, right?
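
Something along these lines, maybe (a sketch that assumes the shed subdirs are the ones named after a tool shed host, e.g. toolshed.g2.bx.psu.edu; not the current update_files implementation):

```python
import os


def find_loc_files(tool_data_path, shed_dir_names=("toolshed.g2.bx.psu.edu",)):
    """Restricted walk: root of tool-data plus the tool shed subtrees only."""
    loc_files = set()
    # files directly in the root of tool-data
    for entry in os.scandir(tool_data_path):
        if entry.is_file() and entry.name.endswith((".loc", ".loc.sample")):
            loc_files.add(entry.path)
    # full subtrees only for the tool shed directories
    for shed_dir in shed_dir_names:
        shed_path = os.path.join(tool_data_path, shed_dir)
        if not os.path.isdir(shed_path):
            continue
        for dirpath, _dirnames, filenames in os.walk(shed_path):
            for fn in filenames:
                if fn.endswith((".loc", ".loc.sample")):
                    loc_files.add(os.path.join(dirpath, fn))
    return loc_files
```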

@mvdbeek
Member

mvdbeek commented Jan 17, 2023

I guess this would exclude all the data and yield some speedup, right?

Yes, if carefully done that could work.

@natefoo
Member Author

natefoo commented Jan 18, 2023

But back when I was using data managers through the UI they would write to the loc file in their tool directory.

The DM loc file in the tool dir is used, yes, but all of the empty loc files for all the tools themselves that reference a data table are also read and will almost surely never contain anything unless an admin decides to populate one by hand for some reason.

@dannon dannon modified the milestones: 23.0, 23.1 Feb 7, 2023
@abretaud
Contributor

abretaud commented May 9, 2023

Just hit this issue with a veeeery long startup.

Ok, taking a minute to actually look at the stack it looks like the walk is performed when:

1. the location file path in the tool data table conf is not absolute,

2. the part of the path preceding the location file is not equal to `tool_data_path`, and

3. the location file is not at the root of `tool_data_path`.

Kind of funny we need to do this when Galaxy created the location file and should know where it is...

Looking at https://github.com/galaxyproject/galaxy/blob/ca77e1841115991ee56c85dcf065d77dc23b1962/lib/galaxy/tools/data/__init__.py#LL505C37-L505C37, the walk is performed every time, right?

There's a cache, but with a 1 second lifetime; not sure if that's intended?
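
For context, the behaviour being discussed boils down to roughly this (a simplified sketch of ToolDataPathFiles, not the literal Galaxy code):

```python
import os
import time


class ToolDataPathFiles:
    """Simplified sketch: exists() refreshes the cache whenever it is older than
    one second, and refreshing means a full os.walk of tool_data_path."""

    update_interval = 1.0  # the one-second lifetime mentioned above

    def __init__(self, tool_data_path):
        self.tool_data_path = tool_data_path
        self.update_time = 0.0
        self._files = set()

    def update_files(self):
        self._files = {
            os.path.join(dirpath, fn)
            for dirpath, _dirs, filenames in os.walk(self.tool_data_path)
            for fn in filenames
        }
        self.update_time = time.time()

    def exists(self, path):
        # Any two calls more than update_interval seconds apart repeat the full walk.
        if time.time() - self.update_time > self.update_interval:
            self.update_files()
        return path in self._files or os.path.exists(path)
```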

@mvdbeek mvdbeek modified the milestones: 23.1, 23.2 Jul 21, 2023
@jdavcs jdavcs modified the milestones: 23.2, 24.0 Jan 10, 2024
@jdavcs jdavcs removed this from the 24.0 milestone Mar 1, 2024
@bernt-matthias
Contributor

Looked at this a bit and here are my comments.

I also think that we need to attack the ToolDataPathFiles.update_files function, which locates all the loc files and is executed whenever calls to ToolDataPathFiles.exists are more than 1 sec apart. exists is called multiple times (I think independent of absolute/relative path names, see here and here) during the initialization of Galaxy. I guess even for fast filesystems this will lead to multiple calls to the update function during startup.

My ideas are the following at the moment:

  • Disable the update function during startup (e.g. make the first update interval really long), or make the update interval configurable.
  • Do not walk the whole tool data folder. .loc files are either located in the root or in the trees named like the used tool sheds, so it should be sufficient to walk those.
  • Separate the tool_data directory (i.e. the dir containing all the loc files) from the dir containing the actual data. I don't think that is currently possible (it could be done manually, but I mean via config variables).

@bernt-matthias
Contributor

I was wrong about the last part. It can be controlled with galaxy_data_manager_data_path.

@bernt-matthias
Contributor

@natefoo does your tool_data_path contain anything other than tool data tables (.loc and .loc.sample files) and tool data table config files?

@bernt-matthias
Contributor

I have a branch ready that would implement a configurable update interval.

But each tool data table should be referred to by exactly one tool data table config. For each of those there seem to be at most 3 file existence checks, depending on whether the file name is given as absolute or relative (here and the following lines).
So one question might be whether those 3 file existence checks can really be less efficient than iterating over a directory containing all the files whose existence is checked (actually it contains more, e.g. unused tool data table config files for each of the shed-installed tools, and a lot of tool data in the worst case).

It might also be an option to just update cache entries when the exists function is called (and just drop the loop).
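
A minimal sketch of that last idea (hypothetical, not the branch mentioned above): drop the walk entirely and memoize individual lookups as they happen.

```python
import os


class LazyPathCache:
    """Hypothetical alternative: no directory walk, just cache per-path checks."""

    def __init__(self):
        self._known = {}

    def exists(self, path):
        if path not in self._known:
            self._known[path] = os.path.exists(path)
        return self._known[path]
```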

For me, storing the tool data elsewhere (using galaxy_data_manager_data_path) solved the problem. But I'm happy to open a PR based on the branch, or a modification of it, if someone thinks I should. Anyway, I'm happy to discuss.
