Getting closer to hardware vector instructions directly from Chapel - a Vector Class #15812
I'm pretty sure a vector type would need to be a value type, i.e. a record vs a class. I'm also curious to what extent existing types could handle the data storage portion. Tuples? GCC has for a long time supported intrinsics for vector operations. However, I have personally found these to be not very useful because of the limited number of operations available. That leads me to think that, in order to be useful, such a mechanism needs to support lots of operations. But, to be portable, it needs to support a small number of operations. There is a tension there that I'm not sure it's possible to resolve. Would operators like `+` be supported?

There is a strategy for this that one can do now. One can use C interoperability features, such as extern blocks, to write inline functions that include intrinsics or assembly for a particular architecture, and then call these from Chapel code. While this strategy is not particularly easy, it is more certain to work, because any hardware-supported vector instruction/intrinsic is likely to be available in this way (and the Chapel project does not need to do anything at all in order to make it available).
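To make that extern-block strategy concrete, here is a minimal sketch, assuming an x86-64 target with SSE2 and a Chapel compiler built with clang support for extern blocks; the C wrapper name addPairs is made up for illustration:

use CTypes; // provides c_ptrTo on recent Chapel; older versions had it available by default

extern {
  #include <emmintrin.h>
  /* add two pairs of doubles with a single 128-bit SSE2 vector add */
  static inline void addPairs(double *x, double *y, double *z) {
    _mm_storeu_pd(z, _mm_add_pd(_mm_loadu_pd(x), _mm_loadu_pd(y)));
  }
}

var a : [0..1] real = [1.0, 2.0];
var b : [0..1] real = [3.0, 4.0];
var c : [0..1] real;
addPairs(c_ptrTo(a), c_ptrTo(b), c_ptrTo(c));
writeln(c); // 4.0 6.0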
You are probably right. My brain is in C-mode currently, which means it has pushed knowledge of Chapel and C++ to the back of the neural cell. I had this project on my back burner. But when your GSoC person wanted to call the new specialized list a vector, I figured I needed to come out and claim the name 'vector' before it got hijacked (as it had been in C++ with the STL).
No idea at this stage. I had not gotten into details. And I was hoping for some help at the detail level anyway, as I figured others' input was going to be invaluable. By definition, my, or anybody's, HPC experience with vectors is limited. I am sure my knowledge and manual coding of SIMD is similarly limited.
Yes. In my next big posting, or should it be a file I attach, there are references, two of which are: [1] Fog, Agner, https://www.agner.org/optimize/vectorclass.pdf
Probably enough for testing concepts. We can always write our own inline asm with C.
Yessss! Therein lies the dilemma. This is early days. I suggest we err on the side of small. My thought was to first write an implementation in Chapel and then see what assembler is created by the backend. Mind you, that assumes we have a nice way to read the assembler for individual routines without a whole lot of other assembler, i.e. like what is produced by a C compiler's -S flag.
I have not figured out how to do that in Chapel cleanly yet. And please do not say that we should be disassembling the executable. Then rework the 'vectorclass' as noted above to reflect the Chapel for X86-64. And then:

for cpu in [ X86-64, SomethingElse like an ARM with SVE ] do
{
for tries in 1..4 do
{
sit back;
think a lot;
get feedback;
tweak both Chapel and C++ based on feedback and hard thinking;
if no-more feedback then break;
}
write report (because that exposes brain fade);
ponder a bit more;
exploit it in something, e.g. some piece of Linear Algebra Code;
}
The former.
This is what I am trying to avoid. Think 'elegant'. Think 'Chapel'.
Yes to 'not particularly easy'. Yes to 'more certain to work'. But then you have to write one for X86-64, one for A64FX, and so on. And it means that you are not leveraging the benefits of the LLVM backend to the Chapel compiler, or Chapel itself. So I would really like to try to avoid this approach if I can.
The following is arguably incomplete but more than enough work for now. Hopefully it covers all the areas where problems will occur and is sufficient to engender all the required discussion. It is really being released too early, but it needs to be released into the wild now for reasons beyond my control.

Chapel needs at least a type that can fit into a hardware vector (SIMD) register, or a small set thereof. This type could have the form within a declaration like:

var v : vector(real(w), n);

where n is a small param integer and is limited to being an integral power of 2 at this stage. So we could have:

var x : vector(real(w), 2);   // max using real(64) with SSE2
var x : vector(real(w), 4);   // max using real(32) with SSE2
var x : vector(real(w), 8);   // max using real(64) and AVX512
var x : vector(real(w), 16);  // max using real(32) and AVX512
var x : vector(real(w), 32);  // max using real(64) and ARMv8 + SVE
var x : vector(real(w), 64);  // max using real(32) and ARMv8 + SVE ?
var x : vector(real(w), 128); // a bit of an orphan
var x : vector(real(w), 256); // max using real(64) NEC-SX for hardware
var x : vector(real(w), 512); // max using real(32) NEC-SX for hardware

What are the implications of this for RISC-V's Vector Extensions or even ARMv8's Scalable Vector Extensions (SVE)? There are some arguments to limit n to not exceed 256. Does the choice of n need to be more limited in size to make it more manageable? At this stage, integer vectors are being ignored to keep discussions short. But is handling integer vectors more important than floating-point ones?

Defining the Algol 68 niceties:

proc lwb(ref x[?D] : T) : int return D.dim(1).low;
proc upb(ref x[?D] : T) : int return D.dim(1).high;

Or does this need to be dim(0) in a 0-based world? Grumble, grumble!

So, given 'm' and 'n' are small integers, we could have:

var v : vector(real(w), n);
var u : vector(real(w), n);
var w : vector(real(w), n);
var x : [1..n] real(w);
var y : [1..m] real(w);
var t : real(w);
assert(m < n);

The three most basic methods to be defined for a vector are:

size: v.size returns the number of elements of v, n
core: v.core returns the type of any v[i], real(w)
wipe: v.wipe(), v[0:n-1] = 0:v.core

The assignment implied above is not real. It is likely to be an XOR.

There needs to be a way to get data into and out of these data structures:

pack: v.pack(x), v = x assuming nx == n
data: v.data(x), x = v assuming nx == n
pack: v.pack(x, keymap), [i in 0:n-1] v[i] = x[lwb(x) + keymap[i]]
data: v.data(x, keymap), [i in 0:n-1] x[lwb(x) + keymap[i]] = v[i]

The last two are gather/scatter operations. We may need something that handles bitmap indexing into an array. Or is that overkill?

trim: v.trim(k), v[k:n-1] = 0:v.core
load: v.load(y), v[0:m-1] = y[1..m], v.trim(m)

We may need to pull out slices or a single element, or inject a slice or an element:

get: v.get(i, m), returns v[i;m]
get: v.get(i), returns v[i]
put: v.put(i, m, x), performs v[i;m] = x[lwb(x);m]
put: v.put(i, t), performs v[i] = t

This allows us to have:

head: v.head is the same as v.get(0, v.size() >> 1)
tail: v.tail is the same as v.get(v.size() >> 1, v.size() >> 1)

There is a need for a broadcast operation:

send: v.send(t), v[0;n] = t
send: v.send(t, i), v[i;n] = t
send: v.send(t, i, m), v[i;m] = t

We need the orphan operation (which is probably highly inefficient):

swap: v.swap(i, j) effectively swaps v[i] and v[j]

The following operations are defined between 2 vector operands:

+ add
- subtract
* multiply
/ divide
+= add
-= subtract
*= multiply
/= divide
==
!=
<=
>=
<
>

There are 7 operations which return a vector:

abs: v.abs() returns the element by element absolute value
neg: v.neg() returns the element by element negated value
sqrt: v.sqrt() returns the element by element SQuare RooT
muie: v.muie() returns the element by element MUltiplicative InversE
min: v.min(u) returns the element by element minimum of 'u' with 'v'
max: v.max(u) returns the element by element maximum of 'u' with 'v'
fma: v.fma(u, w) returns the element by element FMA, v + u * w

Reductions, but implemented as a method, not a Chapel reduction:

maxval: maximum over 'v'
minval: minimum over 'v'
maxloc: location of the maximum over 'v'
minloc: location of the minimum over 'v'
sum: v.sum computes the naive sum of all v[i], i in [0..v.size-1]
dot: v.dot(u) computes the naive sum of all v[i]*u[i], i in [0..v.size-1]

and their equivalents that consider absolute values:

amaxval: maximum absolute value over 'v'
aminval: minimum absolute value over 'v'
amaxloc: location of the maximum absolute value over 'v'
aminloc: location of the minimum absolute value over 'v'

Note that some instruction set architectures (ISAs) have totally broken FMIN/FMAX instructions that will stab us in the back here. The culprits are Intel and RISC-V. Other ISAs may also do this.

There is a whole load of grief in permutations, blends, and friends:

[rl]shift
c[rl]shift
permute
blend
lookup
gather
scatter

They should be addressed in the scope of this project but are left as placeholders here.
You said
Yes. You are correct. Brain-fade on my part. You asked
Me too. Not sure how to evaluate the various options. And also
I really have not gone into any detail there. I was more focused on the interface and whether it was complete enough to be useful to others and clean enough to be implementable.

Also, what is the benefit of a c_array(real(32), 4) over a

var x : [0..3] real(32);

This array 'x' already has the built-in method x.eltType to return the core type, and can be vectorized already or assisted with vectorizeOnly. This gives me some hope that at least a prototype could be done in pure Chapel. By the time we got to such a prototype, say by the time of 1.23, would the Chapel compiler be further advanced in its ability to vectorize with reduced overhead? Even then, if you took this reference and replaced bits with compiler intrinsics where needed, would this hybrid get closer to the desired performance? At least if you could then contemplate this hybrid definition, you would have something that might give you pointers on what is needed in the compiler to allow the module that implements this vector type to be done in pure Chapel. Resorting to C for the actual details is not what I was thinking about for the long term, but rather tweaking how you interact with LLVM. Remember, I know close to zero about that back end, so I could be rambling incoherently at this point. A lot of my ideas need to be better informed by how Chapel currently handles vectorization. And I do not have that knowledge yet, and I figure it will only materialize with experimentation or assistance from those far more enlightened/experienced on the topic than I. I am throwing up ideas at this stage. Enough from me. Others can comment for a while.
Right, but actually I think that problem specifically could be solved by writing "assembly" in LLVM instead of specific assembly for each target. (I'm not saying that would be elegant.) For more information about LLVM IR you might look at https://llvm.org/docs/LangRef.html#add-instruction (and imagine you are reading an instruction set reference).
From https://developer.arm.com/docs/101726/0210/learn-about-the-scalable-vector-extension-sve -
So I am not quite sure that with SVE the programmer should be saying what the vector length is at all. As a result, I think that the effective way to use SVE looks more like saying "this loop is vectorizable" rather than using vector types. However, there are SVE C extensions that allow one to write vector intrinsics, and those intrinsics work with types that have unknown size at compile time.
I'd probably provide the ability to go between tuples and vectors rather than arrays and vectors. (But there are issues if the tuple size is not known at compile time, which might not go so well with SVE.)
I agree with you that we probably have to write "assembly" in LLVM. But when I read the LLVM docs, my eyes glaze over, and I feel so far out of my comfort zone that I was wanting somebody to agree with me. Also, I was unsure whether it could be done purely at the LLVM level or whether I was going to be condemning some poor bunny to even more work on the compiler. I always feel guilty when I suggest the latter. That is why I suggested that we do an implementation of the interface purely in Chapel, even if it is very suboptimal from a vectorization perspective. Get it right, tested, and used, and then we attack the back end. What do you think? Is this the sort of thing @ben-albrecht is into?

All I know about ARMv8's SVE is what I have read or heard on YouTube. It will be a challenge reconciling the ARM approach (and that of RISC-V) with other vendors' SIMD ISAs, maintaining transparency, and achieving optimality. There will be issues with supporting the various X86-64 SIMD instructions depending on whether a chipset supports SSE2, AVX2, or AVX-512. And do you want the penalty of the clock-speed drop when you use AVX-512? I think we only have to worry about 128-bit SIMD ISAs on other architectures. Do we need to expose some sort of base/minimal "vector length" to Chapel?

I notice you keep mentioning tuples. What is the magic there? There must be more than meets my eye. Anyway, in my brevity, I did not provide an overloaded pack that copies a tuple into a vector, even if it is mainly used at declaration time of the vector. For example:

var f0 : real(32) = ....
.....
var v = vector(f0, f1, f2, f3);

would create a 128-bit aligned vector of four 32-bit floats that would map to a vector register on just about any machine with SVE or SIMD. For simplicity at this stage of the process, I was going to stick with the vector length being a param, at least on the Chapel side. On the back end, using the more generic vector may reduce complexity. Note that from my perspective, I would always be using these vectors as slices of larger 1D arrays or rows or columns of 2D arrays, which some people respectively call vectors and matrices, but then we end up being caught in a semantic war. So they will generally align with vectors rather than tuples.

As a selfish aside, and because I use slices so much, I was wondering why Chapel could not use "start:finish:stride" notation instead of "start:finish by stride". It always looks messy. Just curious.
I need to go back to fundamentals as I have been going around in circles. How is a tuple stored and manipulated internally? How do you guarantee it will get vectorized? I looked at c_array and my head spun. I have no idea what it does internally. I cannot even see where the data is defined in the implementation. See below.

proc main
{
    var t : c_array(real(64), 4);
    var s : c_array(real(64), 4);
    var x : [1..4] real(64) = [ 1.0, 2.0, 3.0, 4.5 ];
    var y : [1..4] real(64);

    for i in 0..3 do t[i] = x[i + 1];
    for i in 0..3 do s[i] = sin(x[i + 1]);
    writeln(t);
    for i in vectorizeOnly ( 0..3 ) do s[i] += t[i];
    writeln(s);
    for i in 0..3 do y[i + 1] = s[i]; // note: y is 1-based, so the index must be shifted
    writeln(y);
}
The reason I was bringing up tuples and c_array is that an array (at least today) is necessarily heap allocated, and that doesn't make so much sense for relatively small things (like 4 reals).
Tuples are allocated on the stack and the compiler can choose the memory layout. There's currently no particular guarantee of vectorization (but the later parts of compilation are pretty good at pattern-matching sequences of related operations, as with, say, adding two tuples element by element; see the example below).
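A small illustration of that tuple pattern, since + on tuples is already element-by-element in Chapel:

var a = (1.0, 2.0, 3.0, 4.0);
var b = (0.5, 0.5, 0.5, 0.5);
var c = a + b; // element-by-element add: (1.5, 2.5, 3.5, 4.5)
writeln(c);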
c_array is supported by the compiler. It is akin to declaring, say, a fixed-size C array such as double x[4], directly on the stack.
(This part is off topic, so if you want to delve into it further we need to move it to another issue)
you mean "start..finish by stride"
My feeling is that we need something more like c_array than a tuple. I cannot yet understand enough of the internals of this record to be sure that it is totally adequate, although some things might be overkill. As I said, I cannot figure out how the space is created or even how the indexing operator finds that space. Or is that some magic in the compiler itself? But, at this stage, it is not crucial unless it hits a brick wall.

I am curious as to the implications that c_array has in terms of any redundant temporary vectors that may get created in intermediate expressions. Remember that what we are trying to do here is to have code that helps the back-end keep things in vector registers. I think I will proceed from here using c_array, although knowing more details about how it works would make me feel more comfortable. Does it need any compiler support, and if so, what is it?

Knowing that a Chapel array is always allocated off the heap is interesting and useful to know, but is off-topic. I would never have guessed that and cannot remember reading it anywhere. Or at least it never jumped out at me if it was in what I was reading. Note that in the CPtr.chpl source file, the comments prior to the definition refer to c_array as a class.
Yes, that's what I was trying to say.
Are you talking about the C compiler? (It definitely needs chpl compiler support but should be portable across C compilers and the LLVM backend.)
How do I write a (operator) proc that takes 2 c_arrays, both of type real(w) and length n, adds them element-wise, and returns the result in a temporary of the same type? I have tried several variations and I keep going nowhere. What I am trying to achieve is:

proc +(u : [1..4] ?T, v : [1..4] T) : [1..4] T
{
    var t : [1..4] T;
    for i in vectorizeOnly(1..4) do t[i] = u[i] + v[i];
    return t;
}

That said, why does Chapel want to use this operator proc for anything other than a 1-D array having a domain of [1..4]? Or conversely, how do I force Chapel to use the above operator proc for the said 1-D array type?
Here's a + overload that adds c_arrays (TIO):

proc +(x: c_array(real, ?n), y: c_array(real, n)) {
  var z: c_array(real, n);
  forall i in 0..#n do
    z[i] = x[i] + y[i];
  return z;
}

config param size = 10;

var a, b: c_array(real, size);
for i in 0..#size {
  a[i] = 3.0;
  b[i] = i:real;
}

var c = a + b;
writeln(c);
I'm not certain what else you're finding it gets applied to, but guessing (let me know if I'm wrong): Chapel currently only resolves array arguments based on static type information such as (1) the array's rank / dimensionality, and (2) its element type. Thus, any 1D arrays will dispatch to this overload regardless of index set, though any that don't have the index set 1..4 will result in an execution-time error. The reason for this is that though the formal arguments' indices are specified with 1..4, Chapel doesn't have param ranges and domains, so it doesn't attempt to reason about index sets at compile time, only at execution time.

As examples, here are a bunch of cases that don't work, with explanations as to why:

proc +(u : [1..4] ?T, v : [1..4] T) : [1..4] T
{
var t : [1..4] T;
writeln("In Damian's + overload");
for i in vectorizeOnly(1..4) do t[i] = u[i] + v[i];
return t;
}
var A, B: [1..4] real;
writeln(A+B);
var C, D: [0..3] real;
// Matches Damian's overload, but generates an error due to domain
// mismatch. This is because Chapel doesn't have param ranges and
// domains, so doesn't attempt to reason about index sets at
// compile-time, only execution time.
//
// writeln(C+D);
var E, F: [1..5] real;
// Same as above: execution-time index set mismatch
//
// writeln(E+F);
var G, H: c_array(real, 4);
// + on c_arrays not supported by default
//
// writeln(G+H);
var I, J: 4*real;
// OK, but doesn't use Damian's overload (different types)
//
writeln(I+J);

And here's a more general version that works with any index set:

proc +(u : [?D] ?T, v : [D] T) : [D] T
{
var t : [D] T;
writeln("In Damian's + overload");
for i in vectorizeOnly(D) do t[i] = u[i] + v[i];
return t;
}
var A, B: [1..4] real;
writeln(A+B);
var C, D: [0..3] real;
writeln(C+D);
var E, F: [1..5] real;
writeln(E+F);
var G, H: c_array(real, 4);
// + on c_arrays not supported by default
//
// writeln(G+H);
var I, J: 4*real;
// OK, but doesn't use Damian's overload (different types)
//
writeln(I+J);

This more flexible approach could do things like take one code path if the array size was 4 and another code path if it wasn't, but currently that would result in an execution-time conditional for similar reasons: because the compiler doesn't try to reason about the size / membership of index sets at present.
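For instance, a sketch of the execution-time dispatch being described (the size-4 test happens at run time, not compile time, and the "short array" path is hypothetical):

proc +(u : [?D] ?T, v : [D] T) : [D] T
{
  var t : [D] T;
  if D.size == 4 then
    // "short array" path: could in principle become a hand-tuned kernel
    for i in vectorizeOnly(D) do t[i] = u[i] + v[i];
  else
    // general path
    forall i in D do t[i] = u[i] + v[i];
  return t;
}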
Thanks heaps for this.

proc +(x: c_array(real, ?n), y: c_array(real, n)) {
  var z: c_array(real, n);
  forall i in 0..#n do
    z[i] = x[i] + y[i];
  return z;
}

I had actually tried a generic version of this, i.e. with real(?w) and with just a for loop with vectorizeOnly(0..n), and it gave me a compiler error. I am using 1.20 so I figured I was doing something stupid and never reported it.

Several questions for my guidance. Why did you not specify a return type, or can Chapel always assume that from what you are doing? If vectorized code is desired from the loop, is the forall a better choice, especially considering n is small? Is the length of a c_array always known at compile time? Is using the count operator # likely to help or hinder vectorization, or neither? If Chapel was not a compiler but an introspective AI tool and it saw real(w, n), what would it think was being requested?

I personally think that the c_array approach as originally suggested by Michael is the way to go for now, although real(w, n) and matching int(w, n) types might be the way to go in the long term, especially if you want the compiler to be smarter at how it looks at vectorization.

Thanks for the thoughts on using arrays with vectorization. I was trying to see if there was a way to force the compiler to use one operator + for short arrays and another for longer arrays. I had actually proven your approach worked for all arrays, but I did not want to mess with the default Chapel operator for addition for 1D arrays, just for the short ones. I will pursue that a bit more.

If I use the Chapel compiler and save the intermediate C files, is there a way to then manually compile these with a "-S" flag to produce the assembler? An earlier suggestion to disassemble what is nearly a 10MB executable and go fishing through the result sounds like a sledgehammer to crack a nut to me. And less than productive. Surely you look at the assembler that Chapel generates yourselves?
We could make a compilation flag that just does this (but it would actually disassemble the generated binary), and if that has significant value to you I recommend creating a feature request issue.

Assuming you are using something compatible with GCC, one thing you can do is save the generated C code (with --savec) and then run the backend C compiler over the saved file with -S. That will create a .s file containing the assembly.

However you can use the experimental --incremental flag. For me, this created output that was much easier to search through.
On why I did not specify a return type: I tend to rely a lot on Chapel's type inference (where, in this case, it will look at the type of the expression being returned and infer the routine's return type from it).

On whether a forall is a better choice: I haven't studied vectorization much, and wasn't thinking about it in sketching out this routine (I was focusing more on getting the procedure's pattern correct). Specifically, I don't personally have much experience with what will or will not produce good vectorized code with our compiler today. For small n, the overhead of a forall may also be a consideration.

On whether the length of a c_array is always known at compile time: Yeah, it is. Chapel's c_array takes a param size, so its length is always known at compile time.

On whether the count operator # helps or hinders vectorization: I don't believe it should have any effect, particularly in this case where the count is known at compile time.

On what real(w, n) would be taken to mean: I'm not understanding this question. Are you saying it might be interpreted as some sort of "is w real" logic predicate?

On the c_array approach being the way to go for now: Works for me.

On whether we look at the assembler that Chapel generates ourselves: Michael already answered this, but I'll say that, personally, I rarely do. Other developers have, however, and I believe they have taken the approach of having the back-end compiler save the assembly using a flag like -S.
Thanks @mppf. I think that either running the backend compiler with -S over the saved C files, or even the experimental --incremental flag, will achieve what I need.
Thanks for the insightful feedback @bradcray. I will take your point on vectorizeOnly.

As to my poorly written point on a real(w) or int(w) declaration for a conventional single-data (SD) base type: it looks like such a declaration was almost carefully designed, pre-ordained, or even begging, to be extended to declare a multiple-data (MD) near-base type real(w, n) of n real(w) elements that can be used in SIMD instructions. Such an extension appears to be a far clearer and more compact way of specifying such MD near-base types than the more verbose GCC style

typedef int v4si __attribute__ ((vector_size (16)));

or the Sierra LLVM extension

int varying(4) v = { 2, 4, 6, 8 };

or even LLVM's own vector instructions (of which I have no experience). That suggested Chapel extension (or something like it) is just a thought for future compiler enhancements (which go beyond my skill set). While it only adds one extra argument to the type declaration, this extension needs serious compiler support and in that respect is quite different to the c_array approach being proposed.

For now, a c_array approach should provide a basis to find out what vectorizations are feasible with the current Chapel compiler. Also, I will investigate whether the latest compiler handles a generic type in the c_array parameter to the + operator proc. Or maybe that internal error is still there.
Does any other language have a high-level concept that allows direct access to objects which can map directly to a SIMD or vector register? I am talking of something that hides things at a higher level than gcc's

typedef double v4f64 __attribute__ ((vector_size(32)));

or other compilers' equivalents. These would need to scale from systems which only have SIMD for 32-bit (or smaller) quantities through to the NEC Tsubasa, which handles a vector of 256 real(64) quantities. And I am curious about what we need so that such data structures help us exploit GPUs. Do these need us to put our brains back into the mindset of old 1970s-style vectors, which worked directly on data in memory (rather than registers)?
Not a lot of action/feedback happening here. The idea is to treat short, fixed-length arrays as special. With apologies for concentrating on real(w) arrays, this would ultimately allow something like a dot product to be written generically as:

proc dot(r : range, const ref u : [] ?Real, const ref v : [] Real)
{
param VL = vector(Real,0).optimalSize;
type vReal = vector(Real, VL);
const size = r.size;
const f = r.first;
const tailSize = size & (VL - 1);
const bodySize = size - tailSize;
const body = f .. #bodySize by VL;
const tail = f + bodySize .. r.last;
var s : vReal;
s.wipe(); // clear the array
for i in body do
{
const ref _u = u[i..#VL]:vReal, _v = v[i..#VL]:vReal;
s += _u * _v;
}
var t = s.vsum(); // compute horizontal sum through the entire vector
for i in tail do
{
t += u[i] * v[i];
}
return t;
}

I do not pretend that this algorithm is optimal. But if the compiler knows that there is effectively a 1D array of a fixed length (that fits exactly into a vector register), then it should be able to optimize things without having to do a lot of work. Note that the vector type can also return a minimumSize and a maximumSize, which may indeed be the same (as with, say, an IBM Power ISA) but are all unique with an AVX-512 ISA. I would propose that

type vReal = vector(Real, 0);

defines a type which has the optimal length anyway, from which one can deduce

param VL = vReal.size;

Am I right that the concept of length has disappeared in favour of size? Then again, vectors have always had a length. But then again, an array with a base type of T has (in Chapel) an eltType of T. We probably should be aiming for consistency rather than using names that map directly to reality.
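For contrast, here is a rough shape of the same strip-mining that compiles today, without the hypothetical vector type, using a homogeneous tuple as the accumulator; this is a sketch only, where VL is an ordinary param rather than a queried hardware size, and tuple indexing is 0-based as in recent Chapel:

proc dotToday(param VL : int, const ref u : [] real, const ref v : [] real) {
  const r = u.domain.dim(0);  // dim(0) on recent Chapel; dim(1) on older releases
  const tailSize = r.size % VL;
  const bodySize = r.size - tailSize;
  var s : VL * real;          // tuple of partial sums, zero-initialized
  for outer in r.first .. #bodySize by VL {
    for param i in 0..VL-1 do
      s[i] += u[outer + i] * v[outer + i];
  }
  var t = 0.0;
  for param i in 0..VL-1 do t += s[i];  // horizontal sum
  for i in r.first + bodySize .. r.last do
    t += u[i] * v[i];        // scalar cleanup of the tail
  return t;
}

var A, B : [1..10] real;
A = 1.0; B = 2.0;
writeln(dotToday(4, A, B)); // 20.0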
This seems like a question that you (or anybody) could answer. I am not able to investigate this question right now because it would be too distracting from my other goals.
I already mentioned my concerns that having programmers write code specific to one vector length won't work for the newer ARM vector extensions. It seems to me that it's better if we can avoid code specific to vector length. The way that I know how to do that is to focus on vectorizing loops (and trying to be clear about when that is possible) rather than on creating a vector type. However it might be possible to solve by creating a vector type.
I am currently trying to figure out how to both vectorize and target GPUs in a way that is understandable to Chapel programmers. I do not feel I have all the answers there yet but I am pretty sure that it needs to look more like creating vectorizeable/GPU-kernel-izeable loops rather than data types with a fixed vector width. What I would (personally) need to be convinced that we need vector types is:
I was hoping for some input from somebody else. As you note, your goals are (rightly) elsewhere.

To answer my own question about high-level concepts: my limited experience is that these do not exist, unless you count std::experimental::simd, which is the SIMD extension proposed for C++, but that seems to be moving very slowly. The initial ideas on that, which came from Boost, got rejected about 6 years ago. It seems a non-trivial problem, or at least getting agreement certainly is. There is Kretz's thesis, although that is trying to be way more complete than what I am suggesting.

Due to being starved of time today, I will address the rest of your reply over the coming week, except for your last question. I notice the overlap with the recently raised #16403 issue, although I question my ability to make any meaningful input to that conversation. I did look at #16404, to which I can at least add my 2c and feel like I have made a tiny contribution.

For your last question, but only in the context of my example quoted above, what you want can be achieved by changing that param to const. Or have I misunderstood your question?
Ah. In that event it's probably going to work better to post something on a mailing list (or Gitter or the new Discourse site), because only the people already involved in this issue are likely to be notified when a new post shows up.
This seems to have died. Or is it just a case of lots of other, more pressing things?
The latter.
Some modern instruction set architectures (ISAs) such as X86-64, Power9, MIPS, and SPARCv9 support SIMD instructions. Others like ARMv8 and RISC-V support far more complete vector extensions (VE), while the NEC SX-Aurora Tsubasa is a full vector ISA. All of these, with a good compiler, allow varying degrees of vectorization.
While compiler optimizations exploit these instructions transparently to the user, many languages provide mechanisms such as compiler vector intrinsics to allow users to get closer to, or even directly use, these instructions. Compiler intrinsics are NOT generic.
It is proposed to provide a vector class for this purpose within Chapel, although this should be designed to allow Chapel way more transparency than compiler intrinsics. This class will help take some of the workload off the compiler and result in faster code in a broad range of scenarios.
This is totally different from what C++ calls a vector class, which is a specialized list.
It may have some aspects in common with what C++ calls a valarray class.
Has anybody already (in Chapel) drafted the design of such a class, designed such a class, or even implemented such a class?
For performance reasons, the major driver for this proposal, this class should be able to map very closely to a hardware vector register, in some sense of the word 'map'.
Some ideas for discussion appear below, but these need suggestions, insight, and augmentation from others. It is deliberately incomplete and has some deliberate/known flaws to trigger discussion, so feel free to comment.
This design needs to be considered in a somewhat portable context. So this means looking at SIMD in ISAs like x86-64, Power9, MIPS, and SPARC, the potentially more powerful/compact 'scalable' vectors in ARMv8 SVE (like the new A64FX) and RISC-V VE, or the fully blown vector ISA (VISA) of a NEC SX-Aurora Tsubasa. It should definitely be considered from the perspective of Fortran users.
Is the Tsubasa too novel to work? Do we stick to SIMD? Do we design it so that extension to using predicates is easy? Do we ditch SPARC because it is dying, or MIPS because their licensing model is not yet fixed? And these questions are just the start.
Is it a two-level design, i.e. one that maps directly to hardware registers and one that builds on top of that and is able to address not only SIMD but also the true VEs and even a VISA?
Introducing GPUs into this discussion is likely to be highly counterproductive.
Simple details of the class to follow. This needs keywords attached. And maybe a change of subject.