Improved support for HTCondor #3705

Open
JosephLalli opened this issue Mar 1, 2023 · 6 comments

JosephLalli commented Mar 1, 2023

I've been tasked with improving Nextflow/HTCondor interoperability by the Powers That Be at UWisc (and driven by my own pipeline's needs).

Initially, I thought the major issue would be the lack of a shared file system in HTCondor (#3697). However, I've encountered other issues that suggest a proper feature request might be in order.

  1. Getting access to a node with submit capabilities. This is an HTCondor-side issue, and I'm working it out with them. It should be doable. (It's possible the solution lies in HTCondor's grid setup, but I don't know enough to say for sure.)
  2. Shared file system. Our setup has a shared 'staging' drive, which will serve as a shared file system while I get the other issues worked out. Eventually, I think Wave/Fusion should solve this problem.
  3. Launching containerized jobs from Nextflow. Currently all HTCondor jobs are submitted with the line "universe = vanilla", which forces the job to run as a "Vanilla Universe" job, i.e. directly on the execute node without a container.
    • HTCondor has a separate "Docker" universe to run jobs in Docker containers. One simply needs to specify universe = docker and docker_image = my/dockerimage:version in the submit description.
    • HTCondor also has a "Container" universe, which sets up jobs to run in Docker or Singularity containers. If a universe isn't specified, HTCondor will attempt to infer the universe being requested: jobs with no image specified run as vanilla/local jobs, jobs with a Docker image URL run in Docker containers, and jobs given a ".sif" file run in that Singularity container.
    • I think most users would benefit from the ability to run containerized jobs. Maybe if docker=true or singularity=true is set, the appropriate universe could be specified along with the image (a sketch of what that might look like follows this list)?
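
For concreteness, here's a rough sketch of what I'm imagining on the Nextflow side. This is purely hypothetical: Nextflow does not do this translation today, and the image names are placeholders.

    // hypothetical nextflow.config -- sketch of the proposal in point 3
    docker.enabled = true
    process {
        executor  = 'condor'
        container = 'ubuntu:22.04'   // placeholder image
    }
    // Desired effect on the generated submit description:
    //   universe     = docker
    //   docker_image = ubuntu:22.04
    // Or, with singularity.enabled = true and a .sif path as the container:
    //   universe        = container
    //   container_image = /path/to/image.sif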

I will add to this post as more issues are encountered. I'm writing something similar to the HTCondor people.

bentsherman (Member) commented:

Hi Joseph, thanks for sharing. It looks like the HTCondor executor in Nextflow hasn't changed much since it was first added. Since you have an HTCondor cluster, it could be a good opportunity to bring everything up to date. The key thing is to make sure that Nextflow supports HTCondor in general, and not just the specific nuances of your cluster.

  1. Getting access to a node with submit capabilities.

This problem is common in many HPC environments, because the head node is usually the only node that can submit jobs, but it's also locked down such that Nextflow can't run there. Some typical workarounds include:

  1. allow compute nodes to submit jobs, then launch Nextflow itself as a job,
  2. create dedicated "workflow" nodes that have high walltime limits and are reserved only for workflow jobs like Nextflow,
  3. allow users to provision their own head node.

You'll have to talk to your sysadmins about finding a solution that works for your cluster. Hopefully you aren't the only person trying to run workflows there, and your sysadmins are already aware of the need.

  2. Shared file system.

Aside from Fusion, you can set up your environment such that your pipeline code and input/output data reside in permanent storage (e.g. home or lab directory) while the work directory resides in temporary storage. Your sysadmins could then set up a shared filesystem that periodically deletes old files to keep storage under control. Just wanted to mention that in case it helps your discussions with the admins.
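
As a concrete sketch of that layout, with all paths being placeholders for whatever your admins set up:

    // nextflow.config -- storage layout sketch only; paths are placeholders
    workDir       = '/staging/some_user/work'   // shared scratch area, old files purged periodically
    params.outdir = '/home/some_user/results'   // permanent storage for published outputs
    process.publishDir = [ path: params.outdir, mode: 'copy' ]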

  3. Launching containerized jobs from Nextflow.

I have no idea if anyone has tried this yet. I'm guessing no, because currently Nextflow always sets universe = vanilla. I recommend that you just try it and use the clusterOptions directive to set whatever options you need, that way we can figure out what works best. I would like to know which universe we should use, or if we can just specify the image like you said.
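
Something along these lines, purely as an untested sketch; the image is a placeholder, and whether these lines conflict with the hardcoded universe = vanilla is exactly what we'd be finding out:

    // nextflow.config -- untested sketch for experimenting with clusterOptions
    process {
        executor = 'condor'
        // extra lines intended for the generated submit description (behaviour to be confirmed)
        clusterOptions = 'universe = docker\ndocker_image = ubuntu:22.04'
    }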

JosephLalli (Author) commented Mar 2, 2023

  1. Agreed, this is a cluster-specific issue. It's my most immediate barrier to testing new setups, but getting head node access is a me-and-UW problem, not a Nextflow problem.
  2. Our sysadmins are very hesitant about this option. Some groups have done this, but the number of labs that want to run Nextflow pipelines on the cluster is high enough that relying on the common shared filesystem (which is set up as you propose, and is called the 'staging' drive) would overwhelm the server(s) it is hosted on. They are excited about using S3 storage as an option, which scales much more easily (see the S3 sketch after this list).
  3. Will clusterOptions override the 'universe = vanilla' line? It looks hardcoded when I read the code. I will try adding clusterOptions="universe=docker" to the test Nextflow config file once I have access to a head node.
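
On the S3 idea in point 2, my understanding is that it would go hand in hand with Wave/Fusion, since a plain grid executor still expects a POSIX work directory. Purely as a sketch of what I'd eventually like to test (bucket name and region are placeholders, and HTCondor + Fusion is exactly the unproven part):

    // nextflow.config -- hypothetical S3 work directory via Wave/Fusion; names are placeholders
    workDir        = 's3://my-lab-bucket/nf-work'
    wave.enabled   = true
    fusion.enabled = true
    docker.enabled = true           // Fusion runs inside the task container
    aws.region     = 'us-east-2'    // credentials via the usual AWS mechanisms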

bentsherman (Member) commented:

Regarding clusterOptions, I don't know whether it will override the hardcoded line or cause an error. Let's just try it once you are able to run a pipeline and see if it works as is with clusterOptions. Ultimately we will probably change that line to depend on whether a container image is defined.
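
Roughly the kind of conditional I have in mind for the submit-file builder, as illustrative pseudocode only (this is not the actual Nextflow source, and the names are made up):

    // illustrative Groovy sketch -- not the real executor code
    List<String> universeDirectives(task) {
        def image = task.config.container                 // hypothetical accessor
        if( image )
            return [ 'universe = docker', "docker_image = ${image}".toString() ]
        return [ 'universe = vanilla' ]
    }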

Thom38 commented Jun 13, 2023

Hi,
I have the same need/issue regarding the use of Nextflow on an HTCondor cluster when we want to use the Docker universe.
I have tried clusterOptions="universe=docker" as mentioned above, but the vanilla universe looks hardcoded.
Could you make a change to support the Docker universe in the condor executor?

bentsherman (Member) commented:

Hi @Thom38, I just drafted a PR with Docker support for HTCondor. Just use the container directive and docker.enabled = true as usual. Can you test it in your environment? Comment on the PR if you have any issues.
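
In other words, something like this in your config (the image is just a placeholder):

    // nextflow.config -- example usage with the drafted change; image is a placeholder
    process.executor  = 'condor'
    process.container = 'ubuntu:22.04'
    docker.enabled    = true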

stale bot commented Dec 15, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
