-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data table directories are walked at startup #15307
Comments
Might it be that this is not for the latest version? Stack fits to galaxy/lib/galaxy/tools/data/__init__.py Line 60 in ca77e18
Regardless, if I interpret the code correctly, the If I remember correctly, loc files are actually only in the root of the |
Yes, apologies, I put 23.0 in the report since it's still present, but you're right, this is on a 22.05 instance. Do we need to locate .loc files at all? They should be directly referenced by tool data table config files. Do we support .loc files that aren't referenced by a tool data table? |
Ok, taking a minute to actually look at the stack it looks like the walk is performed when:
Kind of funny we need to do this when Galaxy created the location file and should know where it is... |
We do also allow for bare loc files in tools, but we could break that and see who screams (probably @blankenberg) ? |
If nothing else we should make it a switch that we eventually default to off, but I guess that is unrelated to this issue. |
I guess I would do a small scream :) I manage my data_tables for which no data manager exists by creating a xyz.loc file in the
for which I never understood what they are good for anyway. Their content is merged anyway on startup, or? Would be really cool to understand this. The reason to do so is that I wanted to manage the data tables automatically with ansible .. and refering the "versioned" data tables seemed to have no additional value. |
Yeah, they are merged. Honestly this feature is really pointless since the loc files installed with toolshed tools are basically never populated but they all have to be read on startup. If you do end up populating one by hand you will probably manually add it to |
IMO, the problem with Galaxy startup in general is that it optimizes for the dev use case and not so much the admin use case. In a dev use case, sure, parsing all tool xmls, rebuilding the navigator from scratch etc., may be useful. But in admin use cases, and possibly even in dev use cases, I think it's more likely to be a waste of time. A restart is more likely to be due to some minor config change, to deal with a failure or similar, and less frequently because the navigator menu changed, or the loc files changed. Yet, the startup penalty must be paid by admins each time regardless. This is especially noticeable on larger systems, where it can take forever to startup with a fully loaded navigator. It would be much nicer to have some management commands like:
or similar, and maybe a dev mode that would do all these things automatically for those who prefer. The tool cache for example, dropped startup time drastically (from 3 mins in a cvmfs setup to < 30 seconds), but was unfortunately dropped due to the extra burden as I recall, of maintaining a central cache. Rather, I think it would be better to explicitly hand over the task of building the required caches as needed to the admin. It would be even nicer if we could go a step further and barely touch the filesystem on restarts. |
That may be true now, maybe! But back when I was using data managers through the UI they would write to the loc file in their tool directory. Proceed with a giant warning here.
I don't think that's the right way to look at it. We optimize for correctness, and dealing with a cache is, with the current architecture and availability of solutions (sqlite on NFS, looking at you), really complicated. That is my takeaway from having added different flavors of broken caches over the years to address this. |
This is still the case on my system. Still the question is what versioned data tables are good for? One thing that comes up to my mind is that it is immediately clear which data manager version produced data. I guess this is already sufficient reason. Having all the loc files etc in the corresponding dirs of the tools consuming the data tables is probably just a technicality. Do you think it would be enough (i.e. correct) to traverse only
I guess this would exclude all the data and yield some speedup, or? |
Yes, if carefully done that could work. |
The DM loc file in the tool dir is used, yes, but all of the empty loc files for all the tools themselves that reference a data table are also read and will almost surely never contain anything unless an admin decides to populate one by hand for some reason. |
Just add this issue with a veeeery long startup.
Looking at https://github.com/galaxyproject/galaxy/blob/ca77e1841115991ee56c85dcf065d77dc23b1962/lib/galaxy/tools/data/__init__.py#LL505C37-L505C37, walk is performed everytime right? There's a cache but with a 1 second lifetime, not sure if it's intended? |
Looked at this a bit and here are my comments. I also think that we need to attack at the My ideas are the following at the moment:
|
I was wrong about the last part. It can be controlled with |
@natefoo does your |
I have a branch ready that would implement a configurable update interval. But, each tool data table should be referred to by exactly one tool data table config. For each of those there seem to be at most 3 file existence checks -- depending on if the file name is given absolute or relative (here and following lines). It might also be an option to just update cache entries when the For me storing the tool data elsewhere (by using |
Describe the bug
On Galaxy load, paths referenced in location files referenced by data tables are fully walked. This can delay startup by minutes to hours if the reference data are on a slow filesystem and contain a huge number of files or directories. Walking just a subset of this data takes 94 minutes on the instance in question.
Does anyone know why we do this? Dynamic options? If it's necessary, can we put some kind of limitation on it or allow tool data tables to define that they should be excluded from this discovery process?
Galaxy Version and/or server at which you observed the bug
Galaxy Version: 23.0.dev
Commit: latest
To Reproduce
Steps to reproduce the behavior:
data_manager_gtdbtk_database_installer
Expected behavior
Data table load only inspects files referenced in the location file or limits its traversal in some way.
Screenshots
Here's the stack during walking:
Additional context
It looks like this goes way back. The process was improved in #3909 but a full walk is still performed.
The text was updated successfully, but these errors were encountered: