Sub-cycling? #2241

Open
hjoliver opened this issue Apr 11, 2017 · 6 comments

@hjoliver
Member

hjoliver commented Apr 11, 2017

The Problem

This arises from recent discussions of how best to process the large numbers of files generated sequentially by each model run in a cycling suite. For example, consider running 10-day-long forecasts once daily, with 10 output files generated for every hour of forecast time. That is 10 × 24 × 10 = 2400 files generated from each forecast run.

There are several obvious ways to do this currently, at each daily cycle point:

  • use integer parameters to statically generate a task for every output file at each cycle point, chained sequentially with foo<p-1> => foo<p>
  • use a single task (or a few) to process all files, internalizing the logic for waiting on each file, repeatedly running the processing script(s), and recording progress to allow an incremental re-run if the task is retriggered (rose_bunch with sequential execution can help with this)
  • at each model cycle point, run a sub-suite (a suite inside a task) that cycles over forecast hour to process all the output files. This is pretty easy to do, and super-efficient for Cylc (a small number of tasks per cycle), but there are monitoring and housekeeping issues due to the moderately large number of separate suites (plus the question of how to do full-system restarts, etc.)

The first option is easiest to code and to understand, but it can easily result in so many tasks (2400 per cycle, above) that they dominate the suite and affect Cylc performance, which is unfortunate given their rather trivial purpose within the wider workflow.
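To put rough numbers on that first option, here is a small sketch of the arithmetic for the example above. The constant and function names are purely illustrative and are not Cylc settings.

```python
# Rough task-count arithmetic for the forecast example (illustrative names only).
FORECAST_DAYS = 10
HOURS_PER_DAY = 24
FILES_PER_FORECAST_HOUR = 10


def tasks_per_cycle(days=FORECAST_DAYS, files_per_hour=FILES_PER_FORECAST_HOUR):
    """Number of tasks if every output file gets its own task."""
    return days * HOURS_PER_DAY * files_per_hour


def tasks_for_cycles(n_cycles):
    """Total file-processing tasks across n_cycles daily cycle points."""
    return n_cycles * tasks_per_cycle()


print(tasks_per_cycle())      # 2400
print(tasks_for_cycles(365))  # 876000
```

So a year of daily cycles would mean hundreds of thousands of trivial file-processing tasks under the one-task-per-file approach.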

Proposed Solution

Conceivably, this could be done very nicely using an hourly cycle (out to 240 hours) starting from each daily cycle point, managed just as in our normal cycling mode. This "sub-cycle" can extend past the next main cycle point, so the sub-cycling tasks need to be associated with their parent main cycle point to avoid ambiguity (e.g. in the example above, at any given date-time there can be 10 different sub-cycling tasks, one belonging to each of the 10 most recent main cycle points). I think this could be handled easily by including the main cycle point in the names of the sub-cycling tasks, e.g. proc-20200101T00.20200101T01, proc-20200101T00.20200101T02, etc., and by ensuring these sub-cycling tasks can have a final cycle point relative to their main cycle point.
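A minimal sketch of that naming scheme, just to make the idea concrete. This is not Cylc code: the function name, the point format string, and the choice to start the sub-cycle one hour after the main cycle point are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Assumed cycle point format, matching the example IDs above.
POINT_FMT = "%Y%m%dT%H"


def subcycle_task_ids(name, main_point, final_offset_hours, step_hours=1):
    """Yield IDs of the form NAME-MAINPOINT.SUBPOINT for one main cycle point.

    The final sub-cycle point is relative to the main cycle point:
    main_point + final_offset_hours.
    """
    start = datetime.strptime(main_point, POINT_FMT)
    for hours in range(step_hours, final_offset_hours + 1, step_hours):
        sub_point = start + timedelta(hours=hours)
        yield f"{name}-{main_point}.{sub_point.strftime(POINT_FMT)}"


ids = list(subcycle_task_ids("proc", "20200101T00", final_offset_hours=240))
print(len(ids))  # 240
print(ids[0])    # proc-20200101T00.20200101T01
print(ids[-1])   # proc-20200101T00.20200111T00
```

Because the main cycle point is part of the ID, the sub-cycle from 20200101T00 and the one from 20200102T00 can both have a task at sub-point 20200102T01 without clashing.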

[NOTE: it's possible that this would not be worth the effort to implement, but having thought about it a bit I just want to get the issue up for the record...].

@hjoliver hjoliver added this to the some-day milestone Apr 11, 2017
@hjoliver hjoliver changed the title Dynamic sub-cycling? Sub-cycling? Jul 18, 2017
@hjoliver
Member Author

(Also from previous discussions.) There was a feeling that this is "not a problem that Cylc should solve" because of current technical limitations of filesystems etc. with regard to generating many log files from very large numbers of short-running tasks, so it is better to group many processing jobs of this type into a small number of Cylc tasks. However, I think this is something we ought to be able to do at the workflow level, even if the feature needs to be used carefully because of external factors (which may eventually cease to be a problem).

@hjoliver
Member Author

The only ways to achieve this currently (without bunching many jobs into fewer tasks) are (a) a static workflow for sub-cycling, which means huge numbers of task proxies for Cylc to manage; or (b) sub-suites, e.g. a task in the daily cycling main suite launches a 10-day-long hourly cycling suite in each daily cycle. The latter method works fine, but brings complications in terms of monitoring (sub-suites are separate suites) and housekeeping (each sub-suite run needs a new top-level suite run directory if you don't want it to overwrite the run directory of the previous sub-suite run).

@matthewrmshin
Contributor

matthewrmshin commented Nov 23, 2017

#1307 is somewhat related.

It would be good to have a way to group together a number of related small tasks to run as a single batch job, but still have the suite manage the tasks as separate entities.

(We'll need the ability to decouple the one-to-one(+retries) relationship between a task and its jobs, and the ability to tell a group of tasks that they will be sharing batch jobs.)
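As a rough illustration of what that decoupling might look like, here is a sketch in which several tasks share one batch job while each task keeps its own state. Everything here (the SharedJob class, assign_shared_jobs, the task names) is hypothetical and is not part of Cylc.

```python
from dataclasses import dataclass, field


@dataclass
class SharedJob:
    """Hypothetical batch job that runs several tasks."""
    job_id: int
    task_ids: list = field(default_factory=list)


def assign_shared_jobs(task_ids, tasks_per_job):
    """Pack tasks into shared jobs; return (jobs, task -> job_id mapping)."""
    if tasks_per_job < 1:
        raise ValueError("tasks_per_job must be at least 1")
    jobs = []
    task_to_job = {}
    for start in range(0, len(task_ids), tasks_per_job):
        members = list(task_ids[start:start + tasks_per_job])
        job = SharedJob(job_id=len(jobs), task_ids=members)
        jobs.append(job)
        for task_id in members:
            task_to_job[task_id] = job.job_id
    return jobs, task_to_job


tasks = [f"proc_{i:03d}" for i in range(5)]
jobs, task_to_job = assign_shared_jobs(tasks, tasks_per_job=2)

# State is still tracked per task, not per job.
task_states = {task_id: "waiting" for task_id in tasks}
task_states["proc_000"] = "succeeded"
```

The point of the sketch is only that the task-to-job mapping becomes many-to-one, while scheduling state stays attached to individual tasks.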

@matthewrmshin matthewrmshin added the efficiency For notable efficiency improvements label Nov 23, 2017
@matthewrmshin
Contributor

(As an aside, I would also like to raise the possibility of supporting multi-dimensional cycling (with support for date-time recurrences, integers and lists of strings) in the future. With a task pool that can do prioritised spawn-on-demand (#987), this could become extremely powerful for handling very large suites with the requirement raised by this issue.)

@matthewrmshin
Contributor

Cylc Google Groups thread: splitting a year (365 days) into 3 chunks of 120, 120 and 125 days is probably another problem that could be solved by multi-dimensional cycling or sub-cycling?
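For reference, the chunking itself is simple arithmetic. A minimal sketch, assuming the remainder is folded into the last chunk (the function name is illustrative only):

```python
def split_into_chunks(total_days, chunk_days):
    """Split total_days into chunks of chunk_days, adding any remainder
    to the final chunk."""
    if total_days < 1 or chunk_days < 1:
        raise ValueError("total_days and chunk_days must be positive")
    n_chunks = total_days // chunk_days
    if n_chunks == 0:
        # The period is shorter than one chunk: return it whole.
        return [total_days]
    sizes = [chunk_days] * n_chunks
    sizes[-1] += total_days % chunk_days
    return sizes


print(split_into_chunks(365, 120))  # [120, 120, 125]
```

The harder part is expressing the uneven final chunk as a cycling sequence in the suite, which is where sub-cycling or multi-dimensional cycling would come in.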

@matthewrmshin
Contributor

matthewrmshin commented Mar 22, 2018

A different way of looking at sub-cycling is to allow a single suite to connect graphs of different cycling modes, dimensions and scopes.
