
deadlock and first nf-test #156

Merged: 32 commits into main from nf-test_pipeline, Jul 15, 2024
Conversation

alessiovignoli
Contributor

The changes revolve around creating the first nf-test for the data handling workflow.

All the other improvements aim at preventing deadlocks. When a deadlock happens, Ray Tune gets stuck in either PENDING or RUNNING state without raising any error. To prevent that, a number of changes were necessary:

  1. In the case of Nextflow, make sure that the resources allocated for the process are passed to Ray to set the maximum for (CPU, GPU, memory).
  2. Set a defined amount of resources (specified in the configs) per trial/worker/actor, i.e. a given model with a given set of hyperparameters.

If Ray is initialized through tuner.fit() (which calls ray.init()) on the cluster, it reads the available resources completely wrong: it sees far more than what is actually allocated for it. That's why Ray is initialized explicitly with a given set of values.
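The two changes above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the variable names (`task_cpus`, `task_memory_bytes`, ...) and all concrete numbers are assumptions standing in for whatever Nextflow actually allocates to the process.

```python
# Illustrative values standing in for the Nextflow process allocation.
task_cpus = 4
task_gpus = 0
task_memory_bytes = 8 * 1024**3  # e.g. 8 GiB allocated by Nextflow

# Roughly 30% of the allocation goes to Ray overhead (object store),
# 70% is left for the tuning trials (see the memory split discussed below).
overhead_bytes = int(task_memory_bytes * 0.3)
tuning_bytes = int(task_memory_bytes * 0.7)

try:
    import ray

    # Initialize Ray explicitly instead of letting tuner.fit() call
    # ray.init() and auto-detect the (wrong) cluster-wide resources.
    ray.init(
        num_cpus=task_cpus,
        num_gpus=task_gpus,
        object_store_memory=overhead_bytes,
        _memory=tuning_bytes,
        ignore_reinit_error=True,
    )
    # Each trial (one model with one hyperparameter set) is then capped,
    # e.g. via ray.tune.with_resources(trainable, {"cpu": 2, "gpu": 0}),
    # so trials never over-subscribe and hang in PENDING/RUNNING.
except ImportError:
    pass  # ray not installed here; the sketch still illustrates the call
```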

…hecking if model learns, made dnafloat into default test
…ult be inside each process subdir instead of home dir
… what nextflow allocates for the single process
@alessiovignoli alessiovignoli added bug Something isn't working enhancement New feature or request labels Jun 14, 2024
Member

@JoseEspinosa JoseEspinosa left a comment


Some comments, still not finished

(Several review comments on bin/launch_check_model.py, since resolved.)
```python
# shuffle the data
csv_obj.shuffle_labels()
```
changed to:
```python
# shuffle the data with a default seed. TODO get the seed from the config if and when it is set there.
csv_obj.shuffle_labels(seed=42)
```

Why not add it directly to

```python
def main(data_csv, config_json, out_path)
```

with a default value?
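The suggestion could look like the sketch below. This is a hedged illustration of the reviewer's idea, not the merged code; the body is elided and only the signature change matters.

```python
# Sketch: take the seed as a keyword argument of main() with a default,
# instead of hard-coding it at the shuffle_labels() call site.
def main(data_csv, config_json, out_path, seed=42):
    # ... load data_csv / config_json into csv_obj here ...
    # csv_obj.shuffle_labels(seed=seed)  # shuffle reproducibly
    return seed  # returned only so the sketch is checkable
```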

(Several review comments on bin/launch_tuning.py, since resolved.)
alessiovignoli and others added 7 commits June 18, 2024 14:42
Co-authored-by: Jose Espinosa-Carrasco <[email protected]>
```python
def memory_split_for_ray_init(memory_str: Union[str, None]) -> Tuple[float, float]:
    """
    compute the memory requirements for ray init.
    """
```

Process the input memory value into the right unit and allocate 30% for overhead and 70% for tuning.
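Given that docstring, one possible implementation could look like this. Only the 30%/70% split comes from the discussion; the unit parsing and the handling of `None` are assumptions, not the merged code.

```python
from typing import Tuple, Union


def memory_split_for_ray_init(memory_str: Union[str, None]) -> Tuple[float, float]:
    """Split the allocated memory into (overhead, tuning) in bytes:
    ~30% for Ray overhead and ~70% for the tuning trials.

    Sketch based on the PR discussion; unit parsing is an assumption.
    """
    if memory_str is None:
        # No explicit allocation given; let the caller / Ray decide.
        return (0.0, 0.0)
    units = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3}
    value, unit = memory_str.split()  # e.g. "10 GB" as Nextflow formats it
    total = float(value) * units[unit.upper()]
    return 0.3 * total, 0.7 * total
```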

(One review comment on bin/launch_utils.py, since resolved.)
Co-authored-by: mathysgrapotte <[email protected]>
@alessiovignoli alessiovignoli merged commit 884df50 into main Jul 15, 2024
4 checks passed
@alessiovignoli alessiovignoli deleted the nf-test_pipeline branch July 16, 2024 14:54