Teething issue with EESSI module file #694

Open
ocaisa opened this issue Sep 6, 2024 · 23 comments

Comments

@ocaisa
Member

ocaisa commented Sep 6, 2024

So, it seems that Lmod can't search inside EESSI until the module is loaded:

alanc@~$ module purge
alanc@~$ module av

------------------------------------------------------------------ /cvmfs/software.eessi.io/init/modules -------------------------------------------------------------------
   EESSI/2023.06

-------------------------------------------------------------------- /home/alanc/EasyBuild_Git/EB_Devel --------------------------------------------------------------------
   Devel (S)    test2

--------------------------------------------------------------------- /usr/share/lmod/lmod/modulefiles ---------------------------------------------------------------------
   Core/lmod    Core/settarg (D)

  Where:
   S:  Module is Sticky, requires --force to unload or purge
   D:  Default Module

If the avail list is too long consider trying:

"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".


alanc@~$ module spider GROMACS
Lmod has detected the following error:  Unable to find: "GROMACS".


alanc@~$ module load EESSI
EESSI/2023.06 loaded successfully

alanc@~$ module spider GROMACS

------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  GROMACS: GROMACS/2024.1-foss-2023b
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Description:
...
@ocaisa
Member Author

ocaisa commented Sep 6, 2024

That's maybe as expected, but it makes finding a particular version of a particular package a bit more complicated
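As a stopgap for the search problem described above, the load can be confined to a subshell so that the search sees the EESSI tree without changing the calling session. This is only a sketch: `eessi_spider` is a hypothetical helper name, and it assumes that Lmod's `module` shell function is already defined and that the EESSI module is visible on MODULEPATH.

```shell
# Hypothetical helper: search the EESSI stack without altering the current shell.
# Assumes Lmod's `module` function is defined and EESSI/<version> can be loaded.
eessi_spider() {
    # The parentheses start a subshell, so the load does not leak out of it.
    (
        module load "EESSI/${EESSI_VERSION:-2023.06}" >/dev/null 2>&1 || exit 1
        module spider "$@"
    )
}
```

Calling `eessi_spider GROMACS` would then behave like the second `module spider GROMACS` in the session above, while `module list` in the calling shell stays unchanged.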

@bedroge
Collaborator

bedroge commented Sep 6, 2024

I think that's the consequence of https://github.com/EESSI/software-layer/blob/2023.06-software.eessi.io/init/modules/EESSI/2023.06.lua#L62 (but I agree that it's annoying...).

@ocaisa
Member Author

ocaisa commented Sep 6, 2024

Hmm, this is tricky to solve. We want dynamic cache support but we don't want Lmod to try to update the existing cache. Is there a way to do this with the time limits on cache files?

@ocaisa
Member Author

ocaisa commented Sep 6, 2024

Maybe dynamic cache support "just works", I'll give it a try

@ocaisa
Member Author

ocaisa commented Sep 6, 2024

No, I tried haveDynamicMPATH() and it didn't seem to do what I wanted.

@boegel
Contributor

boegel commented Sep 6, 2024

What if we create a spider cache for all the EESSI/* modules?

But then Lmod has to be configured to know where to find it, I guess...

@ocaisa
Member Author

ocaisa commented Sep 6, 2024

I'm not sure that is what we want: wouldn't Lmod then report the possibilities for every architecture? We may have to craft an overall solution with the help of the Lmod BDFL.

@boegel
Contributor

boegel commented Sep 6, 2024

Let's tag him then: @rtmclay

@rtmclay

rtmclay commented Sep 6, 2024

With our modulefiles sitting on SSD, we are moving away from having system spider cache files. I don't know how well the caching works with CVMFS. So maybe you don't need system spider caches at all. This is something you guys need to check.

If you do need spider cache files then I would recommend having a spider cache file for each arch.

You might also be able to use the ideas in https://lmod.readthedocs.io/en/latest/350_community.html to handle the various architectures.

@ocaisa
Member Author

ocaisa commented Sep 9, 2024

We do indeed have a spider cache per architecture, and we protect these paths with

if ( mode() ~= "spider" ) then
    prepend_path("MODULEPATH", eessi_module_path)
end
-- add our spider cache
prepend_path("LMOD_RC", pathJoin(eessi_software_path, "/.lmod/lmodrc.lua"))

The problem is that if you do a spider search for a software package, Lmod can only see the packages under the new module path after the module is loaded. I guess the issue is not really the protected MODULEPATH, but that LMOD_RC is only updated after the module is loaded.

EDIT
No, that doesn't seem to be it:

ocaisa@LAPTOP-O6HF2IKC:~$ module purge
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS
Lmod has detected the following error:  Unable to find: "GROMACS".

ocaisa@LAPTOP-O6HF2IKC:~$ export LMOD_RC=/cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/skylake_avx512/.lmod/lmodrc.lua
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS
Lmod has detected the following error:  Unable to find: "GROMACS".

ocaisa@LAPTOP-O6HF2IKC:~$ module load EESSI
EESSI/2023.06 loaded successfully
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS

------------------------------------ 
 GROMACS: GROMACS/2024.1-foss-2023b
------------------------------------
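When repeating the manual `LMOD_RC` experiment above, it can help to rule out a mistyped path first. The sketch below builds the rc-file path from a version and a CPU subdirectory and checks that it is readable. The helper names are hypothetical, and the directory layout is inferred from the single path shown in this thread, so treat it as an assumption.

```shell
# Hypothetical helper: print the per-architecture lmodrc.lua path.
# $1 = EESSI version (e.g. 2023.06), $2 = CPU subdirectory
# (e.g. x86_64/intel/skylake_avx512). Layout inferred from the path above.
eessi_lmodrc_path() {
    printf '%s\n' "/cvmfs/software.eessi.io/versions/$1/software/linux/$2/.lmod/lmodrc.lua"
}

# Return 0 only if that rc file exists and is readable.
eessi_lmodrc_readable() {
    [ -r "$(eessi_lmodrc_path "$1" "$2")" ]
}
```

For example, `eessi_lmodrc_readable 2023.06 x86_64/intel/skylake_avx512 && export LMOD_RC="$(eessi_lmodrc_path 2023.06 x86_64/intel/skylake_avx512)"` only exports the variable when the file is actually there.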
 

@ocaisa
Member Author

ocaisa commented Sep 9, 2024

Is it that, in our setup with a variable MODULEPATH, Lmod doesn't know how the MODULEPATH gets prepended to, so if we define LMOD_RC and use haveDynamicMPATH() we might get something that works?

EDIT: That didn't work for me either

@boegel boegel mentioned this issue Sep 12, 2024
@rtmclay

rtmclay commented Sep 12, 2024

Lmod only reports modules that are in the modulepath or can be found by walking the tree. This code:

if ( mode() ~= "spider" ) then
    prepend_path("MODULEPATH", eessi_module_path)
end
-- add our spider cache
prepend_path("LMOD_RC", pathJoin(eessi_software_path, "/.lmod/lmodrc.lua"))

prevents Lmod from knowing about eessi_module_path when spidering.

It seems to me that you either hide modules or you show them. Can you explain exactly what you want with a simple module tree? If you are going to compute spider caches, why not provide them all the time?

@ocaisa
Member Author

ocaisa commented Sep 13, 2024

Ok, I think I see the issue now. In our case the spider cache and the module path are architecture dependent (so both could be considered to need haveDynamicMPATH()). The main problem is that we are informing Lmod about that cache in the same module file in which we extend the module path, and that seems to be too late for Lmod to actually be aware of that cache.

To test this, I split the module file in two: the first does everything except add to the MODULEPATH, and the second loads the first and only extends the MODULEPATH (and is not protected). Both of these use haveDynamicMPATH() and they do seem to give the behaviour we want. The use of haveDynamicMPATH() does introduce a reliance on Lmod 8.7.4+ (which is relatively recent, but we can advise people how to work around it).
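A script could check for that minimum version before relying on haveDynamicMPATH(). The following is a minimal sketch: it assumes that Lmod's init script has exported `LMOD_VERSION` (not confirmed in this thread) and that `sort -V` is available, and both function names are hypothetical.

```shell
# Return 0 if version $1 is greater than or equal to version $2.
# Uses `sort -V` (version sort) to find the smaller of the two.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical check: is the running Lmod at least 8.7.4?
# Assumption: LMOD_VERSION was exported when Lmod was initialised.
lmod_supports_dynamic_mpath() {
    [ -n "${LMOD_VERSION:-}" ] && version_ge "${LMOD_VERSION}" "8.7.4"
}
```

An init script could then print advice on a workaround when `lmod_supports_dynamic_mpath` fails, instead of loading a module file that an older Lmod cannot handle.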

We could make a big fat spider cache which would remove the need for the base module to be dynamic, but that probably has its own downsides.

EDIT:

This is not entirely correct as my session was messed up a little by me tweaking the module file on the fly. The EESSI module file is now:

-- Load all the EESSI environment settings from the matching base
-- module file (which is hidden), including identifying the
-- architecture and setting the appropriate Lmod spider cache to use
always_load(pathJoin('base', '.' .. myModuleVersion()))
-- Add the modulepaths we want
prepend_path("MODULEPATH", os.getenv("EESSI_MODULEPATH"))
prepend_path("MODULEPATH", os.getenv("EESSI_SITE_MODULEPATH"))
haveDynamicMPATH()
if mode() == "load" then
    LmodMessage("EESSI/" .. myModuleVersion() .. " loaded successfully")
end

with the general setup being

ocaisa@LAPTOP-O6HF2IKC:~$ module --show-hidden avail

--------------- /home/ocaisa/software-layer/init/modules ---------------
   base/.2023.06 (H)    EESSI/2023.06

  Where:
   H:  Hidden Module

Trying to search within this context I get:

ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS
Lmod has detected the following error:  Unable to find: "GROMACS".


ocaisa@LAPTOP-O6HF2IKC:~$ module load base/.2023.06
ocaisa@LAPTOP-O6HF2IKC:~$ module spider GROMACS

--------------------------------------------------------------------
  GROMACS: GROMACS/2024.1-foss-2023b
--------------------------------------------------------------------
    Description:
...

so it seems the cache file must be visible to Lmod when the module spider command is invoked; it cannot be added in the dynamic way we want.

@ocaisa
Member Author

ocaisa commented Sep 24, 2024

I was on a system where TCL modules were available, but I needed Lmod, so I tried to initialise it. This led to issues when running scripts (as their module tool was initialising itself on top of Lmod in the subshell), so I tried to create a bash function that can be called within scripts to check that EESSI and Lmod are available. When testing this on my local machine (which has Lmod), I saw:

alanc@~$ type module
module is a function
module () 
{ 
    local __lmod_my_status;
    local __lmod_sh_dbg;
    if [ -z "${LMOD_SH_DBG_ON+x}" ]; then
        case "$-" in 
            *v*x*)
                __lmod_sh_dbg='vx'
            ;;
            *v*)
                __lmod_sh_dbg='v'
            ;;
            *x*)
                __lmod_sh_dbg='x'
            ;;
        esac;
    fi;
    if [ -n "${__lmod_sh_dbg:-}" ]; then
        set +$__lmod_sh_dbg;
        echo "Shell debugging temporarily silenced: export LMOD_SH_DBG_ON=1 for Lmod's output" 1>&2;
    fi;
    eval "$($LMOD_CMD bash "$@")" && eval $(${LMOD_SETTARG_CMD:-:} -s sh);
    __lmod_my_status=$?;
    if [ -n "${__lmod_sh_dbg:-}" ]; then
        echo "Shell debugging restarted" 1>&2;
        set -$__lmod_sh_dbg;
    fi;
    return $__lmod_my_status
}
alanc@~$ echo $LMOD_CMD 
/usr/share/lmod/lmod/libexec/lmod
alanc@~$ source /cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash 
Lmod has detected the following error:  The following module(s) are unknown: "EESSI/2023.06"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "EESSI/2023.06"

Also make sure that all modulefiles written in TCL start with the string #%Module



alanc@~$ module av

---------------------------------------------------------- /cvmfs/software.eessi.io/versions/2023.06/init/modules ----------------------------------------------------------
   EESSI/2023.06

If the avail list is too long consider trying:

"module --default avail" or "ml -d av" to just list the default modules.
"module overview" or "ml ov" to display the number of modules for each name.

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".

and doing a module reset gives a hint as to what goes wrong:

alanc@~$ module reset
Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: /cvmfs/software.eessi.io/versions/2023.06/init/modules
Lmod has detected the following error:  The following module(s) are unknown: "EESSI/2023.06"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore_cache load "EESSI/2023.06"

Also make sure that all modulefiles written in TCL start with the string #%Module

but I don't understand why this is happening. We must be missing something in the configuration of Lmod? @MaKaNu ?
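For reference, a check of the kind mentioned at the start of this comment (are Lmod and EESSI both usable from a script?) might look roughly like the sketch below. It is not the function that was actually tested here: `eessi_lmod_ready` and the `EESSI_INIT_MODULES` override are hypothetical names, and the default path is taken from the first `module av` output in this issue.

```shell
# Hypothetical pre-flight check for scripts.
# Returns 0 if a module command backed by Lmod and the EESSI init modules
# are both available; 1, 2 or 3 indicates which check failed.
eessi_lmod_ready() {
    # 1: some `module` command (function, alias or executable) must exist.
    command -v module >/dev/null 2>&1 || return 1
    # 2: Lmod's module function runs $LMOD_CMD, so it must be executable.
    { [ -n "${LMOD_CMD:-}" ] && [ -x "${LMOD_CMD}" ]; } || return 2
    # 3: the top-level EESSI module tree must be reachable.
    # EESSI_INIT_MODULES is an illustrative override, not an EESSI variable.
    [ -d "${EESSI_INIT_MODULES:-/cvmfs/software.eessi.io/init/modules}/EESSI" ] || return 3
    return 0
}
```

This cannot detect another module tool that redefines `module` while `LMOD_CMD` is still set in the environment. In bash, `type module | grep -q LMOD_CMD` would be a stricter test, since the function body printed above references `$LMOD_CMD`.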

@MaKaNu
Contributor

MaKaNu commented Sep 24, 2024

My local test VM does not have Lmod installed locally, but I observed the following behavior:

$ source /cvmfs/software.eessi.io/versions/2023.06/init/lmod/bash
EESSI/2023.06 loaded successfully

Okay, so far so good. ml av now shows, as expected, the EESSI arch modules and also the init modules:

------------------------------------------------------------ /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/intel/haswell/modules/all -------------------------------------------------------------
   Abseil/20230125.2-GCCcore-12.2.0                   HMMER/3.4-gompi-2023a                               OpenEXR/3.2.0-GCCcore-13.2.0                     (D)
   Abseil/20230125.3-GCCcore-12.3.0            (D)    HPL/2.3-foss-2023b                                  OpenFOAM/v2312-foss-2023a
.
.
.
------------------------------------------------------------------------------ /cvmfs/software.eessi.io/versions/2023.06/init/modules ------------------------------------------------------------------------------
   EESSI/2023.06 (L)

If I now try to reset:

$ ml reset
Running "module reset". Resetting modules to system default. The following $MODULEPATH directories have been removed: None
EESSI/2023.06 loaded successfully

So here it seems that our EESSI module loaded again and is not removed from $MODULEPATH. I am not sure if this is the behavior we had intended. On the other hand ml unload EESSI/2023.06 works as expected.

@ocaisa Is it enough to install lmod locally or do I also need TCL modules to reproduce your behavior?

@ocaisa
Member Author

ocaisa commented Sep 24, 2024

I believe a local Lmod installation is enough to test this. I do have something in my .bashrc:

. /usr/share/lmod/lmod/init/bash
module use /home/alanc/EasyBuild_Git/EB_Devel

but I don't see how that would get triggered

@ocaisa
Member Author

ocaisa commented Sep 24, 2024

I think the key is "Resetting modules to system default"... but how do we see what the system defaults are?
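One way to inspect this is to print the `LMOD_SYSTEM_DEFAULT_MODULES` environment variable, which is the variable Lmod names in its own `module reset` message. The helper below is a sketch with a hypothetical name, and it assumes the variable holds a colon-separated list.

```shell
# Hypothetical helper: show what `module reset` would restore, one per line.
# Assumption: LMOD_SYSTEM_DEFAULT_MODULES is a colon-separated list.
show_system_defaults() {
    if [ -z "${LMOD_SYSTEM_DEFAULT_MODULES:-}" ]; then
        echo "(no system default modules set)"
        return 0
    fi
    printf '%s\n' "${LMOD_SYSTEM_DEFAULT_MODULES}" | tr ':' '\n'
}
```

Running it before and after sourcing the EESSI init script would show whether that script changes the defaults that `module reset` falls back to.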

@MaKaNu
Contributor

MaKaNu commented Sep 24, 2024

but I don't see how that would get triggered

This might be the reason why module EESSI/2023.06 is already available? In my scenario I just sourced the Lmod init, just as you did in your .bashrc, but without loading any module. In that case ml av doesn't work since MODULEPATH is not set.

So I checked what happens if I now load the EESSI module and repeat your steps: same result as before.

I come to the same conclusion, 'Resetting modules to system default' is the difficult part.

What might be a solution:

  • We detect in the scripts whether $LMOD_CMD is set and save it as LMOD_CMD_DEFAULT.
  • We expand our init module with a behavior for unload and reset
    • we restore $LMOD_CMD
    • we restore $MODULEPATH
    • we restore $LMOD_SYSTEM_DEFAULT_MODULES
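
The save-and-restore idea in this list could be sketched as follows. This is an illustration only: the `EESSI_SAVED_*` variable names and both function names are hypothetical (the list above suggests `LMOD_CMD_DEFAULT` for one of them), and how the restore step would be triggered on unload or reset is left open.

```shell
# Hypothetical helpers; the EESSI_SAVED_* names are illustrative only.
# ${VAR+x} expands to "x" only when VAR is set, even if it is empty,
# so unset variables are restored as unset rather than as empty strings.
eessi_save_env() {
    if [ -n "${LMOD_CMD+x}" ]; then EESSI_SAVED_LMOD_CMD="$LMOD_CMD"; else unset EESSI_SAVED_LMOD_CMD; fi
    if [ -n "${MODULEPATH+x}" ]; then EESSI_SAVED_MODULEPATH="$MODULEPATH"; else unset EESSI_SAVED_MODULEPATH; fi
    if [ -n "${LMOD_SYSTEM_DEFAULT_MODULES+x}" ]; then
        EESSI_SAVED_LMOD_SYSTEM_DEFAULT_MODULES="$LMOD_SYSTEM_DEFAULT_MODULES"
    else
        unset EESSI_SAVED_LMOD_SYSTEM_DEFAULT_MODULES
    fi
}

eessi_restore_env() {
    if [ -n "${EESSI_SAVED_LMOD_CMD+x}" ]; then export LMOD_CMD="$EESSI_SAVED_LMOD_CMD"; else unset LMOD_CMD; fi
    if [ -n "${EESSI_SAVED_MODULEPATH+x}" ]; then export MODULEPATH="$EESSI_SAVED_MODULEPATH"; else unset MODULEPATH; fi
    if [ -n "${EESSI_SAVED_LMOD_SYSTEM_DEFAULT_MODULES+x}" ]; then
        export LMOD_SYSTEM_DEFAULT_MODULES="$EESSI_SAVED_LMOD_SYSTEM_DEFAULT_MODULES"
    else
        unset LMOD_SYSTEM_DEFAULT_MODULES
    fi
    # Drop the saved copies once they have been restored.
    unset EESSI_SAVED_LMOD_CMD EESSI_SAVED_MODULEPATH EESSI_SAVED_LMOD_SYSTEM_DEFAULT_MODULES
}
```

The init script would call `eessi_save_env` before touching the environment, and `eessi_restore_env` would run when the user leaves the EESSI environment.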

While writing I realized we set our MODULEPATH and do append. Maybe this is also an issue

@ocaisa
Member Author

ocaisa commented Sep 24, 2024

Do we really need a default set? I wonder whether it is going to be problematic. In the initialisation scripts we can just explicitly load the module.

@MaKaNu
Contributor

MaKaNu commented Sep 24, 2024

So you mean instead we should skip the LMOD init? Yes, I think so.

@ocaisa
Member Author

ocaisa commented Sep 25, 2024

I mocked something up for bash:

# Purge any modules before we start
if type module &> /dev/null; then
    module purge
fi

# Choose an EESSI version
EESSI_VERSION="${EESSI_VERSION:-2023.06}"

# Initialise Lmod for the shell
. /cvmfs/software.eessi.io/versions/"$EESSI_VERSION"/compat/linux/"$(uname -m)"/usr/share/Lmod/init/bash

# If an environment exists, let's not mix
if [ ! -z "${MODULEPATH}" ]; then
    module unuse "${MODULEPATH}"
fi

# Path to top-level module tree
module use /cvmfs/software.eessi.io/versions/"$EESSI_VERSION"/init/modules
module load EESSI/$EESSI_VERSION
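
The mock-up above passes the whole of `$MODULEPATH` to a single `module unuse` call. How a colon-separated argument is handled there has not been verified in this thread, so a more conservative variant removes the entries one at a time. `unuse_all_modulepaths` is a hypothetical helper name, and the sketch assumes only that `module unuse <dir>` works for a single directory.

```shell
# Hypothetical helper: remove every directory currently on MODULEPATH.
unuse_all_modulepaths() {
    local paths entry
    # Work on a copy, since each `module unuse` changes MODULEPATH itself.
    paths="${MODULEPATH:-}"
    while [ -n "$paths" ]; do
        entry="${paths%%:*}"               # first entry of the list
        case "$paths" in
            *:*) paths="${paths#*:}" ;;    # drop it and the separator
            *)   paths="" ;;               # that was the last entry
        esac
        # Skip empty entries produced by "::" or a leading colon.
        [ -z "$entry" ] || module unuse "$entry"
    done
}
```

It would replace the `if [ ! -z "${MODULEPATH}" ]` block, since an empty or unset `MODULEPATH` simply results in no `module unuse` calls.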

@MaKaNu
Contributor

MaKaNu commented Sep 25, 2024

# Purge any modules before we start
if type module &> /dev/null; then
    module purge
fi

Seems fine for unloading all existing loaded modules.

# If an environment exists, let's not mix
if [ ! -z "${MODULEPATH}" ]; then
    module unuse "${MODULEPATH}"
fi

Not mixing sounds good, but this does not allow for resetting to whatever the user had active before.

If this is fine for the moment, we could address it first and find a solution for resetting to the default in a later step.
Since not being able to reset is no regression, we could go ahead with this.

Further, we might want to extend our test cases.
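
One test case along these lines could assert the "let's not mix" property: after the init script has run, every `MODULEPATH` entry should lie under the EESSI repository. The sketch below is an assumption about what such a test might check, and `modulepath_only_under` is a hypothetical name.

```shell
# Hypothetical test helper: return 0 if every MODULEPATH entry is the given
# prefix or lies below it, and 1 as soon as a foreign entry is found.
modulepath_only_under() {
    local prefix rest entry
    prefix="$1"
    rest="${MODULEPATH:-}"
    while [ -n "$rest" ]; do
        entry="${rest%%:*}"                # first entry of the list
        case "$rest" in
            *:*) rest="${rest#*:}" ;;      # drop it and the separator
            *)   rest="" ;;                # that was the last entry
        esac
        case "$entry" in
            ""|"$prefix"|"$prefix"/*) ;;   # empty entry or inside the prefix
            *) return 1 ;;
        esac
    done
    return 0
}
```

A test would source the init script and then run `modulepath_only_under /cvmfs/software.eessi.io`, failing if a path such as a user's own module directory is still present.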

@MaKaNu
Contributor

MaKaNu commented Sep 25, 2024

Also, this is somehow logical, but it surprised me:

$ module reset
The system default contains no modules
  (env var: LMOD_SYSTEM_DEFAULT_MODULES is empty)
  No changes in loaded modules

This also means EESSI/2023.06 will not be unloaded.
