GlobalAtomicObject - Scalable Atomic Operations using Descriptor Tables #6663
This statement strikes me as being a naive oversimplification. Is an atomic distributed array a necessity for any language that supports distributed arrays? I don't think so, and am not even sure what that would mean.
The emphasis would be on productivity: "What can I do if X were atomic compared to if it were not?" If X is a distributed array, what do we gain? Each element in a distributed array can already be updated atomically as is, so perhaps not much. The distributed array could be made atomic by allowing concurrent access while resizing, which is a huge plus and would further enable productivity, sure. Would I say that this is a necessity for allowing the user to be more productive? Arguably. However, that's not the case here. Since X is a class instance, the amount of productivity gained by making it atomic is limited only by the imagination.
Is the above a necessity? Absolutely. However, unlike atomic distributed arrays, I have presented an actual solution which works and scales.

Edit: For more clarity: atomics for class instances are a necessity, but let me expand on the universal construct bit for a moment. If you have any type of data, you can make it a field in a class instance. That's nothing special on its own, but what it allows is this: any type of data can then be updated atomically through the instance that wraps it...

```chapel
var globalData : atomic Obj;

// Read...
var current = globalData.read();
// Copy...
var modified = clone(current);
modified.field = data;
// Update...
globalData.compareAndSwap(current, modified);
```

You have potential RCU. Of course, the next actual difficulty would be memory reclamation, but the point is that it would have been the same whether the algorithm was distributed or not. Furthermore, if it sounded like I was bashing Chapel, I'm not; I wish to see it improve, and as such I am offering suggestions for its problems.
I think you're misinterpreting my comment. I don't have any objections to supporting atomic classes in Chapel, and have long believed that we should do so. My objection was with the opening blanket statement, which I read as "any new language that has type X must necessarily have atomic type X." And I simply don't believe that to be true (and used "distributed array" as my example X). As a result of my reaction to the statement (+ time constraints), I rolled my eyes and stopped reading there rather than being intrigued and continuing on. Then eventually I came back to comment on my reaction. To me, a more believable, less absolutist, and less objectionable statement would be something like "Any budding language, especially one that strives to be a productive concurrent/parallel language, should consider for each supported type T whether there would also be value in supporting an atomic variant of T." This statement seems far harder to argue with to me.
Understood; it makes sense that it read as a broad generalization, but I didn't expect it to have the kind of effect that would cause people to skip over it from the first sentence. I revised it so that it doesn't generalize to any type, but to a specific type of data.
Thanks! To make sure I'm understanding the following: are you saying that the original statement was intended to convey "Supporting atomic classes in a language effectively adds support for arbitrary atomic types as well, because given any type T, you can wrap a T in a class and update it atomically through the class instance"?
Technically, it can, yes. The semantics would change (it requires RCU or similar updates) and performance may drop significantly, but it is possible to emulate an atomic version of any type this way.
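A minimal sketch of that emulation, assuming a hypothetical `GlobalAtomicObject`-style API with `read`/`compareExchange` (the `Box` wrapper and all names here are illustrative, not the actual API):

```chapel
// Emulate 'atomic T' for an arbitrary value type T by boxing it in a
// class instance and atomically swapping the instance.
record Vec3 { var x, y, z : real; }

class Box { var val : Vec3; }

var cell = new GlobalAtomicObject(Box);

proc atomicSet(newVal : Vec3) {
  var next = new Box(newVal);
  var cur : Box;
  do {
    cur = cell.read();
    // On failure, retry against the freshly observed instance. Safely
    // reclaiming the replaced 'cur' is the RCU-style hard part.
  } while !cell.compareExchange(cur, next);
}
```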
I had another idea in terms of how it could be implemented into the language itself. What if you had one global table (not coupled to a GlobalAtomicObject, but initialized first thing at runtime) that was managed across all nodes? Management of the objects (adding them, propagating changes across nodes, removing them when the object is disposed of with `delete`) would be handled by the runtime. Boom, you've got your global descriptor table. The low-level details would be beyond me though.
@LouisJenkinsCS - I'm not sure what to interpret from your other idea other than that we could come up with a completely different solution from what you created, but I already knew that. But I'm probably missing something. What would be the advantage of having one global descriptor table?
@LouisJenkinsCS: For some reason I'm feeling like one of the edits to the original proposal removed a lot of the context. Basically the need is for AMOs on global pointers which conceptually are larger than 64 bits (and we don't have 128-bit network AMOs), and the specific operations we need are just Load, Store, Swap, and Compare-and-Swap? Is that correct?
Allow me to contrast having one single global descriptor table with having one coupled to each `GlobalAtomicObject` instance.

Edit: To back up the claim that this is more than enough: as you stated, an object must take up at least 32 bytes, so a 64-bit descriptor space covers 2^64 descriptors x 32 bytes ≈ 590.29581 exabytes worth of data before space becomes an issue.
@gbtitus Yes, originally I had stated that in the removed introduction, and yes, we only need those operations.
@LouisJenkinsCS - I think using a single global table is interesting as an improvement. Instead of asking us to solve all the problems, can you describe what operations would need to happen, and when, to support such a thing? What would you need the compiler to do? What does the runtime do? I would expect that as we integrate more with the compiler/runtime, you'll need help to do so. However, I think we should view it as an improvement/optimization and try to get a 1st version merged. I.e., I think we should discuss the global version later and in a different issue. For the purposes of this issue, I think it would be nice to simply say why it is harder (what are the challenges?).
I feel like I must be missing something obvious. One solution would seem to be to implement those AMOs by simply doing an on-stmt to the node where the AMO needs to be done and then doing it as a 128-bit processor AMO (either directly if there is CPU support, or indirectly using the x86 128-bit compare/exchange instruction CMPXCHG16B). In the current implementation this would take 2 network round trips per AMO, all things considered. What are the advantages of the proposed solutions over doing that?
With the current version, we can do the following...
Currently, that's all that would be needed to get a nice, efficient implementation going.
CMPXCHG16B is a 16-byte atomic compare-exchange. The LOCK prefix is needed in multicore situations. It's true that this operation is not necessarily present on all architectures -- it's not implemented on early AMD x86_64, and there may well be other processor architectures that don't have it at all. On those one would need to do something else. However, I'd argue that the "something else" should not do any network ops. In other words, no matter what actual mechanism we use for the pointer updates, it should only involve local operations, because network communication is very costly. I think this rules out a descriptor-table-based solution, because in the general case that has to do remote references.

That still leaves a lot of maneuvering room for implementation, though. For one thing, we might be able to go a bit further with straight pointer compression. If the pointers in question are aligned to more than a single-byte boundary, then there are low-order address bits which must be zero and therefore do not have to be stored in the compressed form of a pointer. This would give us more node bits in the compressed form.

The bluntest possible tool might be something like maintaining, on every node, a local array of sync vars indexed by node, and protecting all updates to global pointers with a given locale value by locking the corresponding local sync. (I did say it was a blunt tool!)

Speaking of which, in your comparison with sync var protection at the top, where (on which node) were the sync vars involved, and was there one per global pointer, or just one for everything, or something else? Would it be possible for you to post your scaling comparison code(s)?
I also got confused by trying to read the evolving document -- started last week, couldn't find my place this week.
Is there a similar atomic operation to perform reads, writes, and exchanges on 16 bytes? If it is anything similar to a Double-Word Compare-And-Swap, then wouldn't its purpose be to update two arbitrary 64-bit locations? While each location may be updated atomically, if you had to read each 64-bit value separately, then couldn't another atomic operation slip in between the two reads?
Local to what? If the data refers to something remote and another remote node can atomically set it, then network operations are unavoidable. Communications across nodes are costly, but they do scale. If bounded, they are a constant overhead that can be managed.
Not necessarily. Again, by caching the descriptor, the only remote operation occurs the first time the object is used. Therefore, this remote operation merely occurs once. The other operation is using network atomics to set the 64-bit atomic field, which is also unavoidable. A network atomic operation would beat using an `on` statement each time.
That's an interesting approach, definitely one that can be used in the meantime, but relatively soon you would run out of bits.
I'm going to respond to the parts of this separately, since I suspect some will be dealt with and disappear rapidly while others lead on to further conversation ...
No, it's an octoword compare-exchange; that is, it operates on a single aligned sequence of 16 bytes, not on two separate 8-byte quadwords.
@gbtitus - I think it'd be reasonable to do an `on` statement plus a processor atomic in the cases where a 128-bit network AMO isn't available. A separate concern is that some algorithms might want to use a generation number in the other 8 bytes, so we shouldn't assume both halves are pointer bits. @LouisJenkinsCS - I'm pretty sure the remaining operations (read, write, exchange) can be built from a 128-bit compare-exchange.
Here's one more little bit of possibly useful information. I've been estimating that remotely initiated processor atomics probably can't perform much better than 1/4 the speed of network atomics. It turns out this is really easy to test. On a Cray XC with comm=ugni, we can toggle the ra-atomics test between doing network AMOs and doing remotely initiated processor AMOs just by whether or not we have a hugepage module loaded. In order to use network AMOs we have to register memory with the NIC, and in order to do that we have to put the heap on hugepages. No hugepages, no network AMOs. And when we do use remotely initiated processor AMOs it's through a fairly lightweight runtime-internal mechanism, not a Chapel on-stmt. So this is a pretty good analogy to the situations we've been discussing. (In this case the AMO is an atomic XOR.) The result: the network AMO version ran roughly 30x faster than the remotely initiated processor AMO version.
Interesting... while it still doesn't leave me with much room for doubt that a remote AMO is better for a contention-heavy CAS-retry loop, it does give me hope for descriptor tables. I wonder, though: does that 30x increase or decrease as you add more nodes?
Yes, definitely any CAS-retry spin loop should be done on the node where the target datum resides rather than over the network. What's tricky is that if you have an appropriate network AMO and don't need to spin, then you should definitely use the network AMO, because remote execution is so much slower. Thus knowing whether contention will occur is really important.

The 30x increases as I add nodes: roughly it's 19x on 2 nodes, 30x on 4 nodes, 34x on 8, and 37x on 16 in a set of runs I just did. I think this just reflects that as the node count rises, a greater proportion of network AMOs have to traverse the network. ra-atomics has a block-distributed target array that spans all the nodes, and parallel tasks on all cores and nodes do atomic XOR updates to randomly selected elements of that array as fast as they can. On XC all these updates are done via network AMOs through the NIC. If the update happens to hit an array element local to the node then there is no network traversal, but we nevertheless do have to use a NIC AMO instead of a processor one. This is because the Aries AMO cache is not coherent with the processor cache(s), and thus, for a given memory location that is an AMO target, all AMOs to that location must either be done with the processor (even if the location is remote) or done with the NIC (even if the location is local).
On such short notice, I'm afraid I can't do too much right now, and I'll have to devote the rest of my time to catching up on my project (I'll be falling behind at this rate, to be honest). The below shows three things; 'GDTWrites' is a simple write using the descriptor tables (no cache, meaning the worst case when using local data), and the 128-bit comparisons are built on the following:

```c
#include <stdint.h>

/* Two 64-bit halves of a 16-byte value; CMPXCHG16B requires the target
   to be aligned on a 16-byte boundary. */
typedef struct __attribute__((aligned(16))) {
  uint64_t lo;
  uint64_t hi;
} uint128_t;

static inline void read128bit(void *srcvp, void *dstvp) {
  uint128_t *src = srcvp;
  uint128_t with_val = *src;     /* non-atomic guess at the current value */
  uint128_t cmp_val = with_val;  /* expect the guess; on a match we store
                                    the same value back, i.e. a no-op */
  char result;
  /* CMPXCHG16B loads the current 16-byte value into rdx:rax whether or
     not the compare succeeds, so cmp_val always ends up holding an
     atomic snapshot of *src. */
  __asm__ __volatile__ ("lock; cmpxchg16b %[src]; setz %[result];"
                        : "=a" (cmp_val.lo),
                          "=d" (cmp_val.hi),
                          [result] "=qm" (result),
                          [src] "+m" (*src)
                        : "0" (cmp_val.lo),
                          "1" (cmp_val.hi),
                          "b" (with_val.lo),
                          "c" (with_val.hi)
                        : "cc", "memory");
  (void) result;                 /* the success flag is unused for a read */
  *(uint128_t *)dstvp = cmp_val;
}
```

It is as optimized as you can get: regardless of whether the CAS fails, it returns what it read at that location in `dstvp`. The results of the benchmark are surprising. The descriptor table won outright, by far, and as I mentioned before, it scaled. Remember, this benchmark is in terms of raw time. As well, the benchmark consisted of many rapid operations in a 'for-each-task-on-each-locale' kind of way. However, @mppf, I promised to deliver more benchmarks, and I'll try to do so at the earliest time possible; potentially over the weekend, but if not, then hopefully sometime within next week. Again, I need to catch up on my project first and foremost.
My thoughts on why the Descriptor Table won (so far) are the following...

I'll add more if I think of anything else to add here.
Okay, I am going to be doing this incrementally, as that is the easiest way to do it. I also restricted it to one operation at a time. Revisited 'read' again, this time also comparing to pointer compression. GlobalAtomicObject is right in the middle and scales well enough.
This one shows atomic 'write' operations, which vary a lot more. As can be seen, the overhead of writing for descriptor tables is massive when the descriptor is not cached, but it's more of a constant. It shows that the overhead, while large, stays roughly flat as nodes are added.
Would like to note one crucial flaw in the descriptor table... this can't be solved by caching alone, as it's unavoidable. The flaw is that once the table begins filling up, adding elements to it becomes significantly slower. I'm assuming the issue is with my skip list implementation (in my defense, it was my first time ever implementing one, and it was based on the 'Skip List Cookbook'). In reality, once a descriptor is cached on first use, there's no need to have the skip list. In the current implementation, performance is so abysmal it couldn't even finish at 1 node within 5 minutes (it starts very fast, then slows to a crawl). Huge oversight, but still plenty of potential here. I'll try to compose a benchmark reusing memory for the time being. Edit: I'm also wondering whether there's another potential bottleneck I haven't identified yet.
Okay, after a very long hiatus, I believe I can finally make some more progress on this again. One change I'm going to be making is stripping out the skip list, as searching it tends to be the limiting factor, and it becomes irrelevant later once caching is implemented. So instead, I'm going to regress to the original design of having to 'register' each object in the table, which then returns a tuple of both the object and the index as the descriptor. So...

```chapel
extern type wide_ptr_t;

type descriptor_t = (wide_ptr_t, int(64));

proc register(obj : objType) : descriptor_t { }
```

In the original document for the idea, here, the performance for it was phenomenal, but that was because it did not have to worry about hashing objects to find their descriptors. This can still become a reality if caching performs well enough. I'd like to explore this when I have the time. Note: the final version likely will not require the user to 'register' anything and will just require proper tracking of the descriptor by the runtime. As I'm short on time, this Chapel-side version is the best I can do at the moment.
As well, I plan to privatize the tables across all nodes in the cluster and use my RCU (trademark aside) abstraction to allow concurrent updates. I'll find a decent way to propagate such updates, but overall it will favor reusing memory over rapid creation and discarding of new memory to be registered.
Okay, I think I have a way in which I can handle managing the descriptor table. There will be only one descriptor table, wherein each cell is capable of holding any wide pointer.

First and foremost, I'll need to clarify that the RCU-managed table and the descriptors handed out from it are kept decoupled, so that resizing the table does not invalidate outstanding descriptors. After we have that, lookups become simple indexing operations.

Problem: How do we recycle descriptors? After the data is no longer used, what happens to the slot that is no longer in use?

Potential Solution: Maintaining a global resizable 'free list' of available slots (a rough sketch follows).

@gbtitus Any thoughts on this?
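To illustrate, here is a minimal sketch of what such a free list might look like, using a coarse `sync` lock; the names and structure are assumptions for illustration, not the actual implementation:

```chapel
// A locale-local free list of descriptor slots, guarded by a sync var.
class DescriptorFreeList {
  var lock$ : sync bool = true;    // full = unlocked
  const capacity = 1024;
  var slots : [0..#capacity] int;  // stack of recycled slot indices
  var top = 0;                     // number of recycled indices held

  // Pop a recycled slot index, if one exists.
  proc tryAcquire(out idx : int) : bool {
    lock$.readFE();                // acquire lock
    const ok = top > 0;
    if ok {
      top -= 1;
      idx = slots[top];
    }
    lock$.writeEF(true);           // release lock
    return ok;
  }

  // Return a slot index for reuse once its object is deleted.
  proc release(idx : int) {
    lock$.readFE();
    slots[top] = idx;
    top += 1;
    lock$.writeEF(true);
  }
}
```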
Update: I have implemented the core of this design. I'll try to get back to this after I graduate (December 16th), so await some promising results!
Fleshed out the idea more here: LouisJenkinsCS/Chapel-Atomic-Objects#1 (comment)
I can't tell what the units on the chart are. Is higher better? Can you explain why RCU array is between Sync Array and Array?
I have the early parts describing the RCU algorithm, but it isn't complete yet, keep that in mind. I'll continue updating it and leave another notification when I'm done.
@LouisJenkinsCS - I haven't finished reading that yet, but it reminded me of a problem we're facing. We have associative domains and arrays, but if you add to the domain, it invalidates array references. That has the side effect that parallel operations on the array can't work. This comes up with, say, making a histogram in parallel when you don't know the keys beforehand. I.e., I'd like to be able to write:

```chapel
var D : domain(int);
var Count : [D] int;

forall key in input {
  D += key;
  Count[key] += 1;
}
```

but in fact this program suffers from a race condition, because appending to the domain could cause the array to resize and then invalidate the reference to Count[key] used by another task in the forall.
Interesting example, and looking at it, RCU would be able to solve the problem (although I'd hate to make it out to be a kind of solve-all panacea). In the case of allowing indexing to be reads and appending to be a write, it would work, albeit extremely slowly. Each iteration would be a full-blown write, so in the end you're better off doing this serially, or better yet doing some kind of map-reduce operation where each task builds its own map and you combine them all at the end (see the sketch below). RCU is for situations where we have 99.9% reads.

Edit: Actually, originally I planned to allow insertions into the data structure in a way that allowed writes to be relatively cheap (only synchronization on a single sync var needed) so long as we allocated enough memory ahead of time. I removed it because it wouldn't suit my purposes and it slowed down the reads by 3x. Still, it is possible to achieve reasonable performance that would best coarse-grained synchronization, but it would still reduce overall performance.
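For what it's worth, a rough sketch of that per-task map-reduce alternative (assuming `input` is a 0-based array of int keys; the strided chunking and the coarse merge lock are illustrative, not a tuned implementation):

```chapel
var D : domain(int);
var Count : [D] int;
var mergeLock$ : sync bool = true;   // full = unlocked

coforall tid in 0..#here.maxTaskPar with (ref D, ref Count) {
  // Each task accumulates into a private histogram...
  var myD : domain(int);
  var myCount : [myD] int;
  for i in tid..input.size-1 by here.maxTaskPar {
    const key = input[i];
    myD += key;
    myCount[key] += 1;
  }
  // ...then merges it under a coarse lock, so the shared domain and
  // array are only ever modified by one task at a time.
  mergeLock$.readFE();
  for key in myD {
    D += key;
    Count[key] += myCount[key];
  }
  mergeLock$.writeEF(true);
}
```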
Speaking of mappings... I actually had an idea for a distributed hash table based on my own published work for the Go programming language. It can actually be remodeled in a way that fits the distributed aspect perfectly (a bit too perfectly, truth be told), and it would allow not only concurrent updates but would also scale so well that I'm sure it'd put the default sparse distributions to shame (a specific data structure definitely should outperform a general data structure at the specific task it is designed for, after all).
It is done (for now). I'm not sure if I'll go back and add more visual demonstrations (likely I will), but the textual parts are mostly done (likely with many typos).
Also, one more update: after removing 'sync array', as it causes runtime crashes (so coarse-grained synchronization won't work at all, which is even more interesting), I got it to run with 32 nodes...

```
[Array]: nLocales=1, N=16777216, Op/Sec=8.69157e+06
```

DistVector is ahead of the array, so it wins by quite a bit 🥇
Also, I see what you meant when you said it can be applied to chpl-privatization.c, as yeah, the whole idea of allocating in chunks and performing an RCU update that reuses the same memory (similar to my own array) should work marvelously. I'd do this myself, but I'm not 100% certain how to test and verify the results after I finish.
For chpl-privatization.c - I think that having an RCU capability in our standard modules would be a big deal, and if you can get the RCU in by fixing leaks in chpl-privatization, that'd be a good intermediate point (for PRs). I think we can help (myself and Elliot, probably) with verifying any update to chpl-privatization.c.
After making corrections to the benchmark, I see that there were extra communications I didn't account for before, but the good news is that the distributed vector is still scaling extremely well (it's consistently 1/3 the performance of the unsynchronized distributed Chapel array), which is extremely good... like, 300M Op/Sec vs 900M Op/Sec good! Now, after seeing that QSBR-based RCU yields literally zero overhead, I realize that if I could implement that Chapel-side, it would likely bring the data structure ahead of (or at least neck-and-neck with) the normal Chapel array, while offering concurrent resizing & indexing.
So I was thinking about the true applicability and generality of the descriptor table... why stop at wide pointers? In fact, it's funny, but the table could be used to allow atomic 'instances' for structures as well. Essentially the descriptor acts as a layer of indirection, similar to how a pointer does, so it would be possible to store at each slot a copy of the structure itself and swap descriptors to publish new values atomically.
Global Atomic Object
(PR available: #6717)
Details
Below, I go over some details that are present in the current implementation.
Pointer Compression
This is the common case. It is certainly not a novel concept, in and of itself or to Chapel, but it is sufficient for most cases, where the number of locales fits within 16 bits. It takes advantage of the fact that all known architectures and operating systems only use 48 bits of the virtual address space, meaning we are free to embed the 16 bits of the locale id in the upper 16 bits. This means that for up to 65K locales, an atomic operation on a class instance is a single 64-bit atomic operation, making these atomics extremely fast.
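A minimal sketch of the compression itself (the names are illustrative, not the implementation's):

```chapel
// Pack a 16-bit locale id and a 48-bit virtual address into one 64-bit
// word, so a plain 64-bit atomic can operate on the result.
const addrMask = (1:uint(64) << 48) - 1;

proc compress(localeId : uint(64), addr : uint(64)) : uint(64) {
  return (localeId << 48) | (addr & addrMask);
}

proc decompress(word : uint(64)) : (uint(64), uint(64)) {
  return (word >> 48, word & addrMask);
}
```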
Descriptor Tables
This could be a new design pattern, really; the idea, while novel in its application, is simple in theory. We maintain a table of objects, one per locale, and we use the index and locale id together as a descriptor (a pseudo-pointer). When we want to atomically operate using a class instance, we first hash the class instance (its 64-bit address) and perform an `insertIfNotPresent` operation on the target locale, and then embed the locale id in the upper 32 bits and the index into the descriptor table in the lower 32 bits. Compression is the most expensive part, as it involves doing the above every time (although see improvements for a huge improvement), but decompression is cheap in that it can directly index into the descriptor table.
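Again as a rough sketch, with `table`, `tables`, and `insertIfNotPresent` standing in for the per-locale structures described above (these are not the implementation's actual names):

```chapel
proc compress(obj : object) : uint(64) {
  const owner = obj.locale.id : uint(64);
  var idx : uint(64);
  on Locales[owner:int] do
    // hash the instance's address; insert it only if it's new
    idx = table.insertIfNotPresent(obj);
  return (owner << 32) | idx;
}

proc decompress(descr : uint(64)) : object {
  const owner = descr >> 32;
  const idx = descr & 0xFFFFFFFF:uint(64);
  // a direct index into the owning locale's table; no hashing needed
  return tables[owner:int][idx];
}
```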
Performance
I compare `GlobalAtomicObject` (using descriptor tables) against 'Sync Var', the almighty `sync` variable, which currently gets a handicap by just acquiring the lock, performing some arbitrary cheap operation, then immediately releasing it. This is the 'baseline' which we must beat or else the implementation is useless.

The next is 'Atomic' on a 64-bit value, which is the 'ceiling', and can be seen as how much performance we can get from normal pointer compression versus keeping tables like this. It has two benchmarks: 'Single', which is a single atomic operation, and 'LLSC', which is a pseudo Load-Linked Store-Conditional operation. LLSC looks like such...
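Presumably it is along these lines, with `x`, `old`, and `newVal` as illustrative names:

```chapel
// One read followed by one compareExchangeStrong; no retry loop.
var old = x.read();
x.compareExchangeStrong(old, newVal);
```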
There is no retrying, just a one-and-done atomic operation (which I count as one).
In 'Global Atomic', we use atomic operations similar to those 'Atomic' uses, although a few things need to be considered. 'Single' performs a `write` (because a `read` is not interesting, since it is a basic lookup in a locale's table) on data in local memory, which shows the ideal. 'LLSC', on the other hand, can perform operations on data that is remote, meaning that each time, it is possible that during compression of the data we may need to jump to other locales. It shows a more average case, but not the worst case, in which both operands to `compareExchangeStrong` are remote. Currently, 'LLSC' for `GlobalAtomicObject` demonstrates 3 atomic operations: the `read`, which is a simple lookup, and the compression of both variables passed to `compareExchangeStrong`, of which one is local and the other remote.
It should be noted again that with pointer compression and a smaller number of nodes, performance is excellent. However, when one has more than the maximum allowed, it is still manageable and better than a `sync` variable. It should also be emphasized that the runtime of `GlobalAtomicObject` with pointer compression is equivalent to `Atomic`.

I present two graphs: one with respect to runtime, and another in terms of raw operations per second.
Time
This shows the runtime of the benchmark. As can be seen, raw atomic operations grossly beat everything else, although in terms of atomic LL/SC, it does lag enough to demonstrate the power of global atomics. A single global atomic, while magnitudes slower than a single 64-bit atomic, is only 2x slower than the LLSC.

Imagine if we had attempted to implement some MCAS operation to update both locations using descriptors. At best we would succeed within 4 LL/SC operations without obstruction (2 to set descriptors, 2 to update), which a single global atomic variable would beat. Of course, in reality, the operation would require multiple retries, in the hundreds to thousands with real workloads that placed contention on it. This shows that, in terms of global atomics, this is the best option currently, as a global atomic is guaranteed to succeed in one operation.
Operations Per Second
This shows the raw operations per second. Notice that I removed 64-bit atomics entirely, as they shadowed everything else to the point that the performance boost wouldn't be noticeable. As can be seen, however, both a single global atomic operation and even the LL/SC combo operation with a potential remote reference (meaning an added `on` statement) beat a normal `sync` variable. In every case, `GlobalAtomicObject` wins outright, but performance penalties from remote references are very apparent and should be avoided at all costs.
Usage
Sample usage of `GlobalAtomicObject` can be seen below (and yes, it does work). `GlobalAtomicObject(C)` can easily become `atomic C`. If the language were to be modified, it wouldn't be too surprising to see the above become the below (which of course lacks the excessive comments), which is an exemplary ideal that @mppf insisted on:
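A minimal sketch of both forms, assuming hypothetical `read`/`write`/`compareExchange` method names (the actual API lives in the PR):

```chapel
class C { var x : int; }

var a = new C(1);
var b = new C(2);

// Today: an explicit wrapper object.
var gao = new GlobalAtomicObject(C);
gao.write(a);
var curr = gao.read();
gao.compareExchange(a, b);

// The language-integrated ideal of the same thing ('atomic C' is the
// proposed syntax, not valid today).
var cell : atomic C;
cell.write(a);
curr = cell.read();
cell.compareExchange(a, b);

delete a;
delete b;
```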
Where `delete` can call some hook which also removes it from the table if it has not yet been reclaimed. I believe this can become a reality with little effort.
Improvements
Below, I go over some potential improvements that can be made by others who are up to it.
Pointer Compression + Descriptor Tables
It may be possible to combine normal pointer compression (for the lower locales that fit within 16 bits) with descriptor tables. Combining both allows better performance on average, even in cases with more than 16 bits' worth of locales: any locale with an id less than 2^16 gets the power of atomics, while the ones greater than or equal to 2^16 get to use the much-slower-but-still-faster-than-sync-variables descriptors. See the next idea for how to differentiate.
Zone Compression
A theoretical model where we use a more dynamic type of compression, which attempts to use the lowest 32 bits for the virtual address and divides the upper 32 bits into zone bits and locale bits. A 'zone' is a base 64-bit offset identified by a zone id, which is allocated when a new unique offset is found, with optimizations such as caching the zone mappings for other locales (i.e., the first time a unique upper 32 bits is found, check if we have it cached; if not, jump to the owning locale, add it if it isn't already there, and update what we have cached, etc.). The benefit here is that we get to keep using 64-bit pointer compression, and the number of unique zones that can be kept track of depends on the number of bits needed to keep track of the locale id.

"What happens when we run out of zones?" - We use the descriptor tables again. We can just search the unique zones for the target locale first; if the zone isn't found and we can't create a new one, we just use descriptors.

"How do we differentiate between Zone Compression, Pointer Compression, and Descriptor Tables?" - It's possible to use the lowest 2 bits. This means lowering the potential maximum segments for descriptor tables, but it allows a much more dynamic compression strategy.
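A small sketch of the 2-bit tagging (the tag values are arbitrary assumptions):

```chapel
// The lowest 2 bits of a compressed word select the strategy used to
// encode the remaining 62 bits.
param TAG_POINTER    = 0b00;  // straight pointer compression
param TAG_ZONE       = 0b01;  // zone compression
param TAG_DESCRIPTOR = 0b10;  // descriptor table

proc tagOf(word : uint(64)) : int {
  return (word & 0b11) : int;
}
```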
Caching Descriptors into Objects
This is a change that isn't theoretical, but it requires changes to the language itself. The majority of the overhead of descriptor tables is hashing the object, which requires jumping to the owning locale; but if the descriptor were cached in the object, then repeated accesses would be just about as fast as atomics. If it were cached, the benefits of recycling memory to be used in atomics would be tenfold (or hundredfold, as the benchmarks show).
Concurrent-Safe Implementation
The table utilizes a simplistic skip list implementation on a segmented memory pool. The memory pool allows indexing via an easy translation from a 32-bit index into a segment and a segment-data index. It utilizes a simple bitmap (`BigInteger`). The implementation of the memory pool allows wait-free reads (hence indexing is fine) but requires synchronization to recycle/reuse memory. The skip list used to manage memory also requires synchronization, and a lock-free/wait-free implementation requires proper memory management such as hazard pointers, which requires thread-local or task-local storage. Hence, synchronization is needed, and we use a simple local `sync` variable. If someone managed to make everything concurrent, it could improve performance further by 5 - 10x at least.
Edit: Removed the introduction to avoid further derailment of the discussion.