
deadlock and first nf-test #156

Merged: 32 commits into main from nf-test_pipeline, Jul 15, 2024
Conversation

alessiovignoli
Contributor

The changes revolve around creating the first nf-test for the data handling workflow.

All the other improvements aim at preventing deadlocks. When a deadlock happens, Ray Tune gets stuck in either PENDING or RUNNING state without raising any error. To prevent that, a number of changes were necessary:

  1. In the case of Nextflow, make sure that the resources allocated for the process are passed to Ray to set the maximum for (CPU, GPU, memory).
  2. Set a defined amount of resources (specified in the configs) per trial/worker/actor, i.e. a given model with a given set of hyperparameters.

If Ray is initialized through tuner.fit() (which calls ray.init()) on the cluster, it reads the available resources completely wrong: it sees far more than what is actually allocated for it. That's why Ray is initialized explicitly with a given set of values.
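The two changes above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the variable names (`task_cpus`, `task_memory_bytes`, ...) and all concrete numbers are assumptions standing in for whatever Nextflow actually allocates to the process.

```python
# Illustrative values standing in for the Nextflow process allocation.
task_cpus = 4
task_gpus = 0
task_memory_bytes = 8 * 1024**3  # e.g. 8 GiB allocated by Nextflow

# Roughly 30% of the allocation goes to Ray overhead (object store),
# 70% is left for the tuning trials (see the memory split discussed below).
overhead_bytes = int(task_memory_bytes * 0.3)
tuning_bytes = int(task_memory_bytes * 0.7)

try:
    import ray

    # Initialize Ray explicitly instead of letting tuner.fit() call
    # ray.init() and auto-detect the (wrong) cluster-wide resources.
    ray.init(
        num_cpus=task_cpus,
        num_gpus=task_gpus,
        object_store_memory=overhead_bytes,
        _memory=tuning_bytes,
        ignore_reinit_error=True,
    )
    # Each trial (one model with one hyperparameter set) is then capped,
    # e.g. via ray.tune.with_resources(trainable, {"cpu": 2, "gpu": 0}),
    # so trials never over-subscribe and hang in PENDING/RUNNING.
except ImportError:
    pass  # ray not installed here; the sketch still illustrates the call
```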

…hecking if model learns, made dnafloat into default test
…ult be inside each process subdir instead of home dir
… what nextflow allocates for the single process
@alessiovignoli alessiovignoli added bug Something isn't working enhancement New feature or request labels Jun 14, 2024
Member

@JoseEspinosa JoseEspinosa left a comment


Some comments, still not finished

(Several review comments on bin/launch_check_model.py, since resolved.)
```python
# shuffle the data
csv_obj.shuffle_labels()
```
changed to:
```python
# shuffle the data with a default seed. TODO get the seed from the config if and when it is set there.
csv_obj.shuffle_labels(seed=42)
```

Why not add it directly to

```python
def main(data_csv, config_json, out_path)
```

with a default value?
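The suggestion could look like the sketch below. This is a hedged illustration of the reviewer's idea, not the merged code; the body is elided and only the signature change matters.

```python
# Sketch: take the seed as a keyword argument of main() with a default,
# instead of hard-coding it at the shuffle_labels() call site.
def main(data_csv, config_json, out_path, seed=42):
    # ... load data_csv / config_json into csv_obj here ...
    # csv_obj.shuffle_labels(seed=seed)  # shuffle reproducibly
    return seed  # returned only so the sketch is checkable
```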

(Several review comments on bin/launch_tuning.py, since resolved.)
alessiovignoli and others added 7 commits June 18, 2024 14:42
Co-authored-by: Jose Espinosa-Carrasco <[email protected]>
```python
def memory_split_for_ray_init(memory_str: Union[str, None]) -> Tuple[float, float]:
    """
    compute the memory requirements for ray init.
    """
```

Process the input memory value into the right unit and allocate 30% for overhead and 70% for tuning.
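Given that docstring, one possible implementation could look like this. Only the 30%/70% split comes from the discussion; the unit parsing and the handling of `None` are assumptions, not the merged code.

```python
from typing import Tuple, Union


def memory_split_for_ray_init(memory_str: Union[str, None]) -> Tuple[float, float]:
    """Split the allocated memory into (overhead, tuning) in bytes:
    ~30% for Ray overhead and ~70% for the tuning trials.

    Sketch based on the PR discussion; unit parsing is an assumption.
    """
    if memory_str is None:
        # No explicit allocation given; let the caller / Ray decide.
        return (0.0, 0.0)
    units = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
    value, unit = memory_str.split()  # e.g. "10 GB" as Nextflow formats it
    total = float(value) * units[unit.upper()]
    return 0.3 * total, 0.7 * total
```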

(One review comment on bin/launch_utils.py, since resolved.)
Co-authored-by: mathysgrapotte <[email protected]>
@alessiovignoli alessiovignoli merged commit 884df50 into main Jul 15, 2024
4 checks passed
@alessiovignoli alessiovignoli deleted the nf-test_pipeline branch July 16, 2024 14:54