
Compile-time analysis to determine smallest task sizes #9984

Open
LouisJenkinsCS opened this issue Jun 22, 2018 · 25 comments

Comments

@LouisJenkinsCS
Member

LouisJenkinsCS commented Jun 22, 2018

Currently, the size of each task's stack is static and, from what @mppf mentioned in IRC, is as much as 8 MB, although this can be controlled via the environment variable CHPL_RT_CALL_STACK_SIZE. Now 8 MB can be too small, too large, or just right, but I don't think that one size fits all kinds of tasks. For example, imagine this task which allocates a large tuple; since tuples are actually just C structs after Chapel is compiled down to C, we'd easily exceed the maximum stack size. 8 MB clearly isn't enough, and in fact the compiler should be able to determine in a single pass that this task requires a lot more.

var tpl : 1000000 * int;
writeln(tpl);

TIO

Second, imagine this example, where we have very small tasks that do essentially nothing: each is an empty task that just waits on a barrier.

use Barriers;
const iterations = 1024 * 1024;
var b = new Barrier(iterations);
coforall 1 .. iterations { 
	b.barrier();  
	b.wait();
}
writeln("Done");

TIO

The above runs out of memory because each task's stack is 8 MB, even though these stacks could get away with being much smaller, say 1 KB.

This is a non-trivial issue: many of the out-of-memory problems I have come across stem from it. You can try to reduce task stack sizes via CHPL_RT_CALL_STACK_SIZE, but that applies across your entire application, where other libraries may require an even larger stack. I believe the compiler should be able to determine the minimum size that is safe for a task, falling back to a larger default when it cannot safely determine one.

For reference, here is how Go handles 'splitting' the stack on-demand for their goroutines.

@ronawho
Contributor

ronawho commented Jun 23, 2018

I don't think something like Go's stack copying is practical without garbage collection (or really the pointer information that you need for garbage collection.)

Here's a relevant paper about some prototype stack size estimation in GCC: https://www.adacore.com/uploads/technical-papers/Stack_Analysis.pdf

So far as I know that never made it into GCC, and I don't know of any other compilers that actually do compile-time estimation of stack sizes.

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Jun 23, 2018

I don't think something like Go's stack copying is practical without garbage collection (or really the pointer information that you need for garbage collection.)

I would suggest that Chapel implement the segmented stack model instead, then. While it has some glaring issues in terms of shrinking the stack, it at least provides the ability to grow the stack very quickly, and it does so in a way that doesn't require garbage collection or pointer information. With the aid of the compiler, a relatively naïve estimate can be used in the ~~rare~~ cases where recursion (not bounded by a compile-time constant) is used, or where calls are made to C functions (other than well-known ones, like Chapel's runtime). The segmented stack model lets the stack grow fast and shrink slowly, a significant improvement over the current model.

Here's a relevant paper about some prototype stack size estimation in GCC: https://www.adacore.com/uploads/technical-papers/Stack_Analysis.pdf

I've printed the paper to read later; seems like an interesting read.

So far as I know that's never been made it into GCC, and I don't know of any other compilers that actually do compile-time estimation of stack sizes.

Then Chapel will be the first.

@LouisJenkinsCS
Member Author

Perhaps another short-term solution would be to allow the user to attach pragmas to arbitrary statements, letting them specify a stack size directly...

use Barriers;
const iterations = 1024 * 1024;
var b = new Barrier(iterations);
pragma "stack size = 1KB"
coforall 1 .. iterations { 
	b.barrier();  
	b.wait();
}
writeln("Done");

and...

pragma "stack size = 10MB"
proc main() {
   var tpl : 1000000 * int;
   writeln(tpl);
}

This way, more advanced users can control how large their stacks should be. Again, this is significantly better than running out of stack space and triggering the stack-overflow detector (which only exists for tasking layers that have one, i.e., not fifo or massivethreads).

@gbtitus
Member

gbtitus commented Jun 25, 2018

I've worked on systems that used segmented stacks, notably the Cray XMT, and they do have some nice characteristics with either HW or SW threading at large scale. I don't recall any particular problems with shrinking the XMT stacks. In fact shrinking was faster than growing, just as it is on other systems, though not by as much. Can you say a little more about the problem(s) you see with shrinking? (On XMT when we grew onto a new stack segment we just interposed a synthetic frame for the shrinker routine, so that when the program returned past that point in the call chain the shrinker would be called to free that segment.)

I would say that by far the most glaring issue with segmented stacks, at least on x86_64, is that they're not ABI compliant! Basically this means we'd have to do something fancy on inter-language calls between Chapel and ABI-compliant languages like C/C++/Fortran/etc. This would be manageable if we knew we were coming from or going to an inter-language call, but wouldn't be if we didn't. And it would add (time) overhead, of course.

@gbtitus
Member

gbtitus commented Jun 25, 2018

It's easy to construct circumstances in which the compiler cannot accurately gauge the execution-time stack requirements at compile time, or even emit code that will accurately do so at execution time. So the solution has to include pragmas or other decoration for users to specify the size. Unfortunately it's also easy to imagine circumstances in which even the programmer doesn't know at compile time how much stack space will be required at execution time. This isn't to say we couldn't do better, but the same failures that can occur now could occur with compiler analysis and pragmas.

@LouisJenkinsCS
Member Author

Can you say a little more about the problem(s) you see with shrinking?

The main issue seems to be the issue of 'stack-thrashing' or 'hot splitting', where, and I quote: "if the stack is almost full, a call will force a new stack chunk to be allocated. When that call returns, the new stack chunk is freed. If the same call happens repeatedly in a tight loop, the overhead of the alloc/free causes significant overhead.", and "when out of stack a function call will force an allocation of a new segment, which is later freed upon return. This is expensive even when the new stack segmented is cached. The result is unpredictable and severe drops in performance whenever a stack boundary happens to fall inside a tight loop."

I don't know enough about it to determine whether the issue would be present with the way you handled it on Cray XMT, but regardless of these issues I am still in favor of implementing the segmented stack.

I would say that by far the most glaring issue with segmented stacks, at least on x86_64, is that they're not ABI compliant! Basically this means we'd have to do something fancy on inter-language calls between Chapel and ABI-compliant languages like C/C++/Fortran/etc. This would be manageable if we knew we were coming from or going to an inter-language call, but wouldn't be if we didn't. And it would add (time) overhead, of course.

Adding overhead is an issue, yeah. Chapel knows which functions are foreign since they are declared with extern, and it could be possible to hard-code, or attach pragmas to, extern'd procedures that refer to runtime functions so that they aren't affected by whatever 'fancy' solution we come up with.

pragma "runtimeFn"
extern proc someRuntimeFunction(args) : retType;

If the user is calling out to their own foreign functions, then it will lack that pragma...

extern proc someUserFunction(args) : retType;

Calls to the foreign function can switch to a dedicated stack, and switch back once the call finishes. Does it have overhead? Yes, but it should be possible to disable this at compile time if it's an issue. I think for 90% of users the benefit of segmented stacks will be worth it, and the overhead of foreign calls will be negligible or even unnoticeable.

@gbtitus
Member

gbtitus commented Jun 25, 2018

Oh, yeah, that could totally happen on XMT. It wasn't seen often, but it did happen, and there was no particularly pleasant way to deal with it, either. Increasing the stack segment size can make it less likely but won't eliminate the possibility entirely. (XMT stack segments were just 4 KB.)

On the other topic, I'm not sure the "runtimeFn" pragma helps. Runtime functions, at least the ones in the Chapel runtime proper which is coded in C, are just C code and we would have to switch from Chapel-style segmented stacks to C-style ABI stacks just like we would for any other C code. I don't know what the impact(s) on performance might be.

Calls from other languages into Chapel would need special handling also, of course. The other language calls would have to go to little interface shim functions generated by the Chapel compiler, that would switch from the ABI stack to a Chapel-style segmented stack and then call the desired "real" Chapel function.

@mppf
Member

mppf commented Jun 25, 2018

@gbtitus - two questions. First, can you explain a little bit more about the ABI issue? I think I'm not seeing what the problem is. Second, are pinned memory & huge pages an issue with this approach?

@gbtitus
Member

gbtitus commented Jun 25, 2018

@mppf - In a nutshell, the difference is that the ABI calling sequence implements an implicit stack extension model and segmented stacks require an explicit one, and the two are very different in terms of the function entry (and possibly exit) sequence.

In the x86* ABIs, a stack is a single chunk of contiguous addresses up to some size limit and is either extended automatically by the kernel up to that limit (for the original process stack) or (for a pthread stack) the stack pointer runs past the limit into a guard page and segfaults, or runs past the limit into some other memory and causes a wild store and possibly corruption. Either way, though, the way the stack is extended is the same: the function prologue just updates the stack pointer (SP) to reflect the new stack frame and off you go, doing stores to the frame. In particular, there is no stack limit check in the user code. Any limit check is done (if it's done at all) in the kernel code that initially handles the segfault when the stack grows onto a new page. At that point either one or more pages are added to the stack or (if the limit has been reached) the process is sent the SIGSEGV.

For a segmented stack approach, instead the function prologue contains an explicit check against the stack limit and if the new stack frame cannot fit in the current stack segment, special processing is done to allocate a new segment, connect it somehow to the current one, and then build the new frame in the new segment. In addition, somehow one also has to arrange to release that new frame when the call to the new frame (that didn't fit) returns. As I mentioned, on the XMT systems this was done by interposing a synthetic frame belonging to the segment-return function, below that for the new frame that didn't fit. (Possibly excessive detail: the limit check was just enough over-zealous that both a frame for the segment-return function and the maximum stack space needed for interrupt handling would fit in the suffix of a stack segment that was otherwise fully occupied with frames. So there was always space to put everything that might need to be stored in the current segment.) All of this is in user code.

I don't think pinned memory or hugepages are an issue with the segmented approach, in fact the explicit limit check in the segmented stack model arguably works better with hugepages, because when you're using hugepages it's hard to support the guard pages desirable for automatic stack overflow detection.

@mppf
Member

mppf commented Jun 25, 2018

I found this:
https://llvm.org/docs/SegmentedStacks.html

I suspect that it'd be possible to support segmented stacks in Chapel code & in the runtime for --llvm by adjusting how clang builds the runtime.

@mppf
Member

mppf commented Jun 25, 2018

@LouisJenkinsCS
Member Author

Cool, so apparently libgcc has support for it too, according to @mppf's link.

@gbtitus
Member

gbtitus commented Jun 25, 2018

Right, libgcc has it, but that's only one out of all the libs we link with, and actually something of a minor one (it's a library of little support functions for the emitted code, not to be confused with glibc, the C library). We'd have to do the same switching between ABI stacks and segmented stacks, and probably the only practical place to do that is at the Chapel/non-Chapel boundary.

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Jun 25, 2018

On the other topic, I'm not sure the "runtimeFn" pragma helps. Runtime functions, at least the ones in the Chapel runtime proper which is coded in C, are just C code and we would have to switch from Chapel-style segmented stacks to C-style ABI stacks just like we would for any other C code. I don't know what the impact(s) on performance might be.

To revisit this, how far off are we from getting Chapel's runtime written in Chapel? That would actually solve the issue. IMO, Chapel has to be written in Chapel before I'd say it's even close to 'Chapel 2.0' standards, but I guess the actual criteria would be up to the big man, @bradcray. If Chapel were built in Chapel, and if it became as efficient as C (if not more so, given how much we can exploit task and data parallelism), then Chapel would be almost complete, IMO.

Edit: Perhaps Chapel's compiler can be as well.

@gbtitus
Member

gbtitus commented Jun 25, 2018

To revisit this, how far off are we from getting Chapel's runtime written in C? ..

I'm not following. Chapel's runtime is written in C. Did you mean "written in Chapel"? That wouldn't change things significantly, since all of the other stuff we need to make Chapel work, from [g]libc to Qthreads to linear algebra packages, is written in ABI-compliant languages. We still have to be able to interoperate with all that stuff.

@LouisJenkinsCS
Member Author

Is it possible to perform some kind of mapping of the runtime C code into whatever 'fancy' ABI we come up with? Since the runtime is available to the compiler, if that were possible we could at least avoid switching to an x86-compliant contiguous stack for calls into the runtime, and perhaps provide a tool for users to do the same for their external C code. Just a thought.

The x86 ABI is well-known; what kind of changes would we need to make during codegen so that our generated C code complies with the segmented stack model? Like, how do you even change the calling convention of a function if the C code gets fed to gcc to be compiled anyway?

@mppf
Member

mppf commented Jun 26, 2018

Is it possible to perform some kind of mapping of the runtime C code into whatever 'fancy' ABI we come up with?

I already said that I'm pretty sure the answer is yes:

I suspect that it'd be possible to support segmented stacks in Chapel code & in the runtime for --llvm by adjusting how clang builds the runtime.

I'm more concerned that the Rust history might indicate that segmented stacks are not a great approach.

@LouisJenkinsCS
Member Author

The issues that led both Rust and Go to abandon segmented stacks (Go switched to stack copying instead) were mentioned in my comment above:
#9984 (comment)

Hot splitting or stack thrashing is significantly better than stack overflow, and segmented stacks are the only known working approach to growable stacks for non-GC languages.

@mppf
Member

mppf commented Jun 26, 2018

@LouisJenkinsCS - from the linked Go reference, it looks like they use stack copying instead of segmented stacks but this requires some garbage-collection infrastructure. But if Rust decided not to do segmented stacks, what did they do instead? Fixed-size stacks? (Since they don't have GC, right?)

@gbtitus
Member

gbtitus commented Jun 26, 2018

Is it possible to perform some kind of mapping of the runtime C code into whatever 'fancy' ABI we come up with? Since the runtime is available to the compiler, if it were possible then we can at least prevent switching to an x86-compliant contiguous stacks for calls into the runtime, and perhaps provide a tool to the user to do the same for their external C code. ...

Yes, at least if you're using clang as your target compiler and if it can emit code that uses segmented stacks instead of the system's native ABI, that could work. There's still the problem of linking with ABI-compliant system and other libraries, though. Eventually you have to call from code using your 'fancy' calling sequence to code using the ABI calling sequence.

The x86 ABI is well-known, what kind of changes would we need to do during codegen to make our generated C code compliant with the segmented stack model? Like, how do you even change the ABI calling convention of a function if the C code gets fed to gcc to be compiled anyway?

I don't think changes to the emitted C code would be needed. You'd either emit the same C code and compile it with clang, or emit code using LLVM, and in either case tell clang/LLVM that you wanted segmented stacks instead of the system-native ABI.

@LouisJenkinsCS
Member Author

@mppf

Looking around, it seems that Rust just has fixed-size stacks... but Rust uses actual OS threads, not tasks multiplexed on top of threads (in fact, they even have access to thread-local storage, which is something I sorely miss), and their common concurrency abstraction seems to be thread pooling.

We can't just follow Rust here, because Chapel is supposed to allow spawning a large number of tasks; even qthreads, the default tasking layer, boldly states that it is "designed to make using large numbers of threads convenient and easy", yet it fails to do so because of how large Chapel demands the stacks to be (for example, 8 MB stacks are too large for 32-bit Linux). Qthreads seems to be designed for much smaller stack sizes, like 32 KB. If we had segmented stacks, we could reduce the default stack size to a more reasonable 32 KB and expand on demand; at the very least this should be optional. What is task parallelism without lightweight tasks? 😄

@mppf
Member

mppf commented Jun 26, 2018

Yes, at least if you're using clang as your target compiler and if it can emit code that uses segmented stacks instead of the system's native ABI, that could work. There's still the problem of linking with ABI-compliant system and other libraries, though. Eventually you have to call from code using your 'fancy' calling sequence to code using the ABI calling sequence.

About that, isn't the thing we'd do at an ABI boundary just to enforce that, instead of enough stack space for the function we're just starting, there is enough stack space for a "reasonable" function, whatever that means? I.e., we'd just make sure that there is a big enough stack chunk for most C functions, which might only be, say, 8 KB? Could we have a segmented stack approach where the typical (average-case) function call has that much space in general? Say we allocated stack segments in 32 KB chunks. Then on average, wouldn't we expect to have 16 KB free? (Or something; I'm sure there's some math there I'm not doing quite right.) But were that the case, on average we wouldn't have to allocate anything to call into C, since more than 8 KB would be available on the current chunk, right?

@gbtitus
Member

gbtitus commented Jun 26, 2018

... isn't the thing we'd do at an ABI boundary just to enforce that, instead of enough stack space for the function we're just starting, there is enough stack space for a "reasonable" function, whatever that means? I.e., we'd just make sure that there is a big enough stack chunk for most C functions, which might only be, say, 8 KB? ...

No. We wouldn't need merely to provide enough for the called function itself; we would need to provide enough for the deepest call chain reached during that call. We don't have any idea how deep that might be. If we guess too small and get overflow, the result is a segfault on the guard page if we're lucky, and wild stores and unpredictable results, possibly including silent wrong answers, if we're unlucky. One way to look at the seemingly large 8 MB default for the stack is that it's just big enough to make the risk of overflow acceptably small.

@mppf
Member

mppf commented Jun 26, 2018

No. We wouldn't need merely to provide enough for the called function itself, we would need to provide enough for the deepest call chain reached during that call.

Yeah, I know. What I was saying is that in typical use (say, with our runtime) we might know that this never exceeds 8 KB (or some other constant).

Besides, if I have 8 MB stacks, isn't that just putting off the problem? I don't know how much stack space a function I call will take; what if I've already used up 7 MB? Or 8 MB minus 8 KB (which would put us in the particular case I was describing)?

In other words, I'm saying:

  • existing C functions have an ad-hoc stack requirement
  • we might be able to know that (for some set of C functions) the deepest call stack in practice is relatively small
  • both this 32 KB stacks notion and 8 MB fixed stacks still have the same problem with overflow; it's just that 8 MB stacks put off the issue if you typically don't use that much of it.

One way to look at the seemingly large 8 MB default for the stack is that it's just big enough to make the risk of overflow acceptably small.

Yep, but I bet we could do better by knowing more about particular C functions & their call chains & possibly with some knowledge of what most C programs do in practice.

@LouisJenkinsCS
Member Author

LouisJenkinsCS commented Jun 27, 2018

Btw, can this get a runtime tag? And a compiler tag as well.
