[Bug]: Behaviour of here.gpus[x] when x is out of bounds #26027

Open
Guillaume-Helbecque opened this issue Oct 2, 2024 · 1 comment

Comments

@Guillaume-Helbecque
Contributor

Description

While playing with the built-in array here.gpus, I discovered that the behaviour of here.gpus[x] is not clearly defined when x is out of bounds. For example, I ran the following program on a system with 8 GPUs, only one of which was enabled (CHPL_RT_NUM_GPUS_PER_LOCALE=1):

config const gpuID = 0;

proc main() {
  writeln(here.gpus);
  writeln(here.gpus.domain);
  writeln(here.gpus[gpuID], "\n");

  var A: [1..10] int;
  on here.gpus[gpuID] {
    var B: [1..10] int;
    @assertOnGpu
    foreach i in B.domain {
      B[i] = i;
    }

    A = B;
  }

  writeln(A);
}

By default (--gpuID 0), this program prints the expected output:

LOCALE0-GPU0
{0..0}
LOCALE0-GPU0

1 2 3 4 5 6 7 8 9 10

But for any --gpuID value greater than 1, we get:

LOCALE0-GPU0
{0..0}
nil

1 2 3 4 5 6 7 8 9 10

While nil is probably expected because only one GPU is enabled, it seems that @assertOnGpu is not triggered and the loop still produces the correct result. Is an on statement targeting nil possible?
I also extended the experiment to negative indices, and the results seem unpredictable: I encountered at least four different outputs:

  • For -1, here.gpus[-1] returns here.gpus[0] and @assertOnGpu is not triggered:
LOCALE0-GPU0
{0..0}
LOCALE0-GPU0

1 2 3 4 5 6 7 8 9 10
  • For -2 and -3, here.gpus[x] returns here (the CPU locale, printed as LOCALE0) and @assertOnGpu is triggered:
LOCALE0-GPU0
{0..0}
LOCALE0

sandbox.chpl:12: error: assertOnGpu() failed
  • For -4, accessing here.gpus[-4] causes a segmentation fault:
LOCALE0-GPU0
{0..0}
Segmentation fault
  • For -5, here.gpus[-5] returns nil and @assertOnGpu is triggered:
LOCALE0-GPU0
{0..0}
nil

sandbox.chpl:12: error: assertOnGpu() failed

etc.

Of course, here.gpus is not meant to be used this way, and these experiments are a bit sadistic. Still, I'd like to report this in case it is not a known behaviour, and I wonder whether there is an interesting explanation behind it.

@bradcray
Member

bradcray commented Oct 2, 2024

@Guillaume-Helbecque : I don't think this behavior is intentional and strongly suspect that it's one of the impacts of the following warning printed when doing GPU compilations:

warning: The prototype GPU support implies --no-checks. This may impact debuggability. To suppress this warning, compile with --no-checks explicitly

Specifically, the gpus array is a normal array in Chapel, and accesses to it would normally be bounds-checked; but since GPU compilations use --no-checks, that bounds-checking is disabled. If I compile similar programs for the flat (non-GPU) locale model, I get out-of-bounds errors as expected.

In saying this, I'm only providing a likely explanation, not saying that this is as we'd like things to be. Coming up with a way to enable checks for GPU compilations is definitely something that we'd consider to be important to Chapel's long-term productivity for GPU programming. Sorry for any hassle in the meantime.
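
Since the explanation above is that GPU compilations imply --no-checks, a program can check the index by hand before touching here.gpus. The following is only a minimal sketch of that idea, not an official workaround: checkedGpu is a hypothetical helper name that is not part of Chapel, and the rest of the program is the reproducer from the issue description.

config const gpuID = 0;

// Hypothetical helper (not part of Chapel's API): validates the index
// manually, because the built-in bounds check is disabled under --no-checks.
proc checkedGpu(idx: int): locale {
  if !here.gpus.domain.contains(idx) then
    halt("GPU index ", idx, " is outside ", here.gpus.domain);
  return here.gpus[idx];
}

proc main() {
  var A: [1..10] int;
  on checkedGpu(gpuID) {
    var B: [1..10] int;
    @assertOnGpu
    foreach i in B.domain {
      B[i] = i;
    }

    A = B;
  }

  writeln(A);
}

With a guard like this, an out-of-range --gpuID should stop the program with an explicit message instead of reading past the end of the array, because the halt call is ordinary user code rather than a compiler-inserted check.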
