Improved support for HTCondor #3705

Open
JosephLalli opened this issue Mar 1, 2023 · 6 comments

JosephLalli commented Mar 1, 2023

I've been tasked with improving Nextflow/HTCondor interoperability by the Powers That Be at UWisc (and driven by my own pipeline's needs).

Initially, I thought the major issue would be the lack of a shared file system in HTCondor (#3697). However, I've encountered other issues that suggest a proper feature request might be in order.

  1. Getting access to a node with submit capabilities. This is an HTCondor-side issue, and I'm working it out with them. It should be doable. (It's possible the solution lies in HTCondor's grid setup, but I don't know enough to say for sure.)
  2. Shared file system. Our setup has a shared 'staging' drive, which will serve as a shared file system while I get the other issues worked out. Eventually, I think Wave/Fusion should solve this problem.
  3. Launching containerized jobs from Nextflow. Currently all HTCondor jobs are submitted with the line "universe = vanilla", which forces the job to run as a "Vanilla Universe" job, i.e. directly on the execute node without a container.
    • HTCondor has a separate "Docker" universe to run jobs in Docker containers. One simply needs to specify universe = docker and docker_image = my/dockerimage:version in the submit description.
    • HTCondor also has a "Container" universe, which sets up jobs to run in Docker or Singularity containers. If a universe isn't specified, HTCondor will attempt to infer the universe being requested: jobs with no image specified run as vanilla/local jobs, jobs with a Docker image URL run in Docker containers, and jobs given a ".sif" file run in that Singularity container.
    • I think most users would benefit from the ability to run containerized jobs. Maybe if docker=true or singularity=true is set, the appropriate universe could be specified along with the image (a sketch of what that might look like follows this list)?
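
For concreteness, here's a rough sketch of what I'm imagining on the Nextflow side. This is purely hypothetical: Nextflow does not do this translation today, and the image names are placeholders.

    // hypothetical nextflow.config -- sketch of the proposal in point 3
    docker.enabled = true
    process {
        executor  = 'condor'
        container = 'ubuntu:22.04'   // placeholder image
    }
    // Desired effect on the generated submit description:
    //   universe     = docker
    //   docker_image = ubuntu:22.04
    // Or, with singularity.enabled = true and a .sif path as the container:
    //   universe        = container
    //   container_image = /path/to/image.sif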

I will add to this post as more issues are encountered. I'm writing something similar to the HTCondor people.

bentsherman (Member) commented:

Hi Joseph, thanks for sharing. It looks like the HTCondor executor in Nextflow hasn't changed much since it was first added. Since you have an HTCondor cluster, it could be a good opportunity to bring everything up to date. The key thing is to make sure that Nextflow supports HTCondor in general, and not just the specific nuances of your cluster.

  1. Getting access to a node with submit capabilities.

This problem is common in many HPC environments, because the head node is usually the only node that can submit jobs, but it's also locked down such that Nextflow can't run there. Some typical workarounds include:

  1. allow compute nodes to submit jobs, then launch Nextflow itself as a job,
  2. create dedicated "workflow" nodes that have high walltime limits and are reserved only for workflow jobs like Nextflow,
  3. allow users to provision their own head node.

You'll have to talk to your sysadmins about finding a solution that works for your cluster. Hopefully you aren't the only person trying to run workflows there, and your sysadmins are already aware of the need.

  2. Shared file system.

Aside from Fusion, you can set up your environment such that your pipeline code and input/output data reside in permanent storage (e.g. home or lab directory) while the work directory resides in temporary storage. Your sysadmins could then set up a shared filesystem that periodically deletes old files to keep storage under control. Just wanted to mention that in case it helps your discussions with the admins.
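
As a concrete sketch of that layout, with all paths being placeholders for whatever your admins set up:

    // nextflow.config -- storage layout sketch only; paths are placeholders
    workDir       = '/staging/some_user/work'   // shared scratch area, old files purged periodically
    params.outdir = '/home/some_user/results'   // permanent storage for published outputs
    process.publishDir = [ path: params.outdir, mode: 'copy' ]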

  3. Launching containerized jobs from Nextflow.

I have no idea if anyone has tried this yet. I'm guessing no, because currently Nextflow always sets universe = vanilla. I recommend that you just try it and use the clusterOptions directive to set whatever options you need, that way we can figure out what works best. I would like to know which universe we should use, or if we can just specify the image like you said.
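
Something along these lines, purely as an untested sketch; the image is a placeholder, and whether these lines conflict with the hardcoded universe = vanilla is exactly what we'd be finding out:

    // nextflow.config -- untested sketch for experimenting with clusterOptions
    process {
        executor = 'condor'
        // extra lines intended for the generated submit description (behaviour to be confirmed)
        clusterOptions = 'universe = docker\ndocker_image = ubuntu:22.04'
    }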

JosephLalli (Author) commented Mar 2, 2023

  1. Agreed, this is a cluster-specific issue. It's my most immediate barrier to testing new setups, but getting head node access is a me-and-UW problem, not a Nextflow problem.
  2. Our sysadmins are very hesitant about this option. Some groups have done this, but the number of labs that want to run Nextflow pipelines on the cluster is high enough that relying on the common shared filesystem (which is set up as you propose, and is called the 'staging' drive) would overwhelm the server(s) it is hosted on. They are excited about using S3 storage as an option, which scales much more easily (see the S3 sketch after this list).
  3. Will clusterOptions override the 'universe = vanilla' line? It looks hardcoded when I read the code. I will try adding clusterOptions="universe=docker" to the test Nextflow config file once I have access to a head node.
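
On the S3 idea in point 2, my understanding is that it would go hand in hand with Wave/Fusion, since a plain grid executor still expects a POSIX work directory. Purely as a sketch of what I'd eventually like to test (bucket name and region are placeholders, and HTCondor + Fusion is exactly the unproven part):

    // nextflow.config -- hypothetical S3 work directory via Wave/Fusion; names are placeholders
    workDir        = 's3://my-lab-bucket/nf-work'
    wave.enabled   = true
    fusion.enabled = true
    docker.enabled = true           // Fusion runs inside the task container
    aws.region     = 'us-east-2'    // credentials via the usual AWS mechanisms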

bentsherman (Member) commented:

Regarding clusterOptions, I don't know whether it will override the hardcoded line or cause an error. Let's just try it once you are able to run a pipeline and see if it works as is with clusterOptions. Ultimately we will probably change that line to depend on whether a container image is defined.
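
Roughly the kind of conditional I have in mind for the submit-file builder, as illustrative pseudocode only (this is not the actual Nextflow source, and the names are made up):

    // illustrative Groovy sketch -- not the real executor code
    List<String> universeDirectives(task) {
        def image = task.config.container                 // hypothetical accessor
        if( image )
            return [ 'universe = docker', "docker_image = ${image}".toString() ]
        return [ 'universe = vanilla' ]
    }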

Thom38 commented Jun 13, 2023

Hi,
I have the same need/issue regarding the use of Nextflow on an HTCondor cluster when we want to use the Docker universe.
I have tried clusterOptions="universe=docker" as mentioned above, but the vanilla universe looks hardcoded.
Could you make a change to support the Docker universe in the condor executor?

bentsherman (Member) commented:

Hi @Thom38, I just drafted a PR with Docker support for HTCondor. Just use the container directive and docker.enabled = true as usual. Can you test it in your environment? Comment on the PR if you have any issues.
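
In other words, something like this in your config (the image is just a placeholder):

    // nextflow.config -- example usage with the drafted change; image is a placeholder
    process.executor  = 'condor'
    process.container = 'ubuntu:22.04'
    docker.enabled    = true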

stale bot commented Dec 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
