Talk:Vector processor
A request has been made for this article to be peer reviewed to receive a broader perspective on how it may be improved. Please make any edits you see fit to improve the quality of this article.
This article is within the scope of WikiProject Computing and is rated Start-class.
Confusing SIMD categorisation
In most of the computer architecture books that I have read, SIMD is categorized as a type of multiprocessing, not as a type of vectorization. My understanding of the meaning of vectorization is an architecture which streams data into an execution unit. That is, it achieves high performance through high temporal utilization of a single functional unit. SIMD achieves high performance along a different axis, that of replication of functional units. For that reason, I believe this article is confusing SIMD with vectorization. Dyl 23:34, 27 December 2005 (UTC)
- it's not the article itself per se, it's that some vendors miscategorised their ISAs by using the word "Vector" without actually providing features *of* Vector processors. For example some ISAs took traditional Gather-Scatter operations or permute operations from "true" variable-length Vector ISAs, slammed them into fixed-width SIMD instructions, then claimed that they'd made a Vector Extension. in other words, just because there are *features* lifted from pure Vector processors and jammed into SIMD does not make SIMD itself a Vector Processor. it just massively confuses things. joy. Lkcl (talk) 20:35, 6 June 2021 (UTC)
8086 and family
If it's allowable to use multiple cycles in data processing then do the x86 family, with things like the string operations, fit into this category? --ToobMug 15:47, 26 May 2007 (UTC)
This is one of the best articles on microcomputer architecture I've ever read. Its descriptions are simple enough for a layman like me to understand, and yet lead the casual reader into a wealth of information. I'm sure other technical articles on Wikipedia could do with emulating this style. Fantastic work!
GPUs?
Isn't a shader in a typical ATI or Nvidia GPU a vector processor? They process pixels and color data as vectors. 76.205.122.29 (talk) 18:48, 26 May 2010 (UTC)
- yes, although typically the pixel colour data is processed / categorised as "sub-vectors": vec2, vec3, vec4. VEC2 would be XY, VEC3 would be RGB or YUV or XYZ, and VEC4 would be ARGB or XYZW. it's made additionally complicated by these sub-vectors sometimes being treated as independent elements within vectors. RVV has this capability, as does SVP64, the Draft Extension to PowerISA i am developing Lkcl (talk) 20:19, 6 June 2021 (UTC)
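A minimal C sketch of the sub-vector idea (the vec4 type and scale_rgb function are illustrative names only, not from any real graphics ISA or API): each vec4 is one logical element, so a sub-vector-capable Vector ISA could treat an array of pixels as a variable-length vector of vec4 elements rather than as a flat array of floats.

#include <stddef.h>

/* one logical "sub-vector" element: a VEC4 pixel (ARGB) */
typedef struct { float a, r, g, b; } vec4;

/* scale the RGB channels of every pixel, leaving alpha untouched.
   on sub-vector-capable hardware this whole loop could be a single
   element-wise vector operation over n vec4 elements. */
void scale_rgb(vec4 *pix, size_t n, float s) {
    for (size_t i = 0; i < n; i++) {
        pix[i].r *= s;
        pix[i].g *= s;
        pix[i].b *= s;
    }
}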
Difference between Array and Vector processors
Array processors and vector processors are different, aren't they? I think the redirect from Array processor should be disabled and a separate section for Array processor should be made —Preceding unsigned comment added by 129.217.129.131 (talk) 20:47, 5 January 2011 (UTC)
Tanenbaum, A.S. 1999. Structured Computer Organization. Prentice Hall. makes a difference between array machines and vector machines (I don't have the book here right now, I might remember incorrectly). I just looked into the new edition via Amazon and there Tanenbaum makes a difference between "SIMD processor" and "vector processor". The former have multiple PEs (processing elements) which have local memory, and are controlled by a single instruction stream (example ILLIAC IV). Vector processors on the other hand have vector registers and a single functional unit to operate on all entries in such a register. Tanenbaum cites the Cray-1 as an example. Other examples are SSE, AVX, AltiVec, NEON. It seems hard to find a consistent differentiation in naming the different SIMD hardware. I do find it important, though, to be clear about the differences there are. Mkretz (talk) 11:17, 29 June 2013 (UTC)
- there is an actual processor which called itself an Array String Processor, by Aspex Microelectronics. I worked for them back in... mmmm... 2003? i think. that was an insanely incredible architecture: 4096 2-bit ALUs with left-right connections, so you could do 8192-bit addition or bit-shift, or you could break it down to do thousands of smaller (8-bit, 4-bit) computations. In some specialist algorithms it was a hundred or even a THOUSAND times faster than processors of its era. I added cross-references to academic peer-reviewed journal papers by its key architects, and to archive.org. Fascinating anecdotal tidbit: the ASP was an actual serious contender for the Reagan-era "Star Wars" Programme! Only the lasers were the bit that let them down :) Yes, Array Processors have actually been manufactured and sold: they are *not* the same thing as "bare" (non-predicate-capable) SIMD processors, they are more like "true" Vector Processors. Lkcl (talk) 13:43, 6 June 2021 (UTC)
Not the real history
It is a distortion of historical events to characterize on-chip simd operations as vector instructions. The simd concept originated with the early work on parallel computers which was both separate from and earlier than the big-iron vector machines. Jfgrcar (talk) 03:23, 29 January 2011 (UTC)
- Yes, that is not the right characterization, but the article has bigger problems anyway. History2007 (talk) 21:17, 8 July 2011 (UTC)
Quality?
This page needs real clean up. A simple diagram would do a lot, and there are zero refs now. Unless there are objections I will remove the x86 architecture code that has no place in an encyclopedia. I will have to find a nice image to explain the concept. Does anyone have a nice diagram for this? History2007 (talk) 21:17, 8 July 2011 (UTC)
- A 4-element SIMD extension like SSE isn't a vector processor anyway, so it's irrelevant to this page. A vector processor would be something like an NEC SX-6 or a Cray-2 or some DSPs. 69.54.60.34 (talk) 03:43, 8 September 2011 (UTC)
- SIMD examples are extremely useful to illustrate starkly and bluntly how truly and horrifically awful SIMD really is, compared to good Vector ISAs. that is not overstated. the incumbent computing giants have done the world a massive disservice by believing and propagating the SIMD seduction for 30 years. this is best illustrated in more neutral understated language in the sigarch "SIMD Considered harmful" citation now added to the page, which, on careful reading, is observed to provide stunning statistics such as a 10:1 reduction in the number of instructions executed, and 50% *or greater* savings in the number of instructions needed. this is a big damn deal that Intel and AMD have a hell of a lot to answer for, and ARM is only just waking up to with the introduction of SVE2. Lkcl (talk) 23:51, 5 June 2021 (UTC)
- to illustrate how stark this really is i tried compiling the ultra-simple 2-line iaxpy example with x86 gcc 10.3 on godbolt.org with the options "-O3 -march=knl" to allow optimised AVX512. the results? an astounding TWO HUNDRED AND SEVENTY assembler instructions. i mean wtf?!? i won't list them here, but you should be able to use this link https://godbolt.org/z/55Kax4j9f Lkcl (talk) 03:28, 8 June 2021 (UTC)
- tried the same thing with ARM SVE2 https://godbolt.org/z/nd1aE1vY4 the options given are from the ARM SVE tutorial which are armv8 clang 11 -O3 -march=armv8-a+sve and it's not bad: only 45 instructions. this is however still *double* that of the equivalent RVV number of instructions which can be seen in the sigarch "SIMD considered harmful" link. whoops. Lkcl (talk) 04:16, 8 June 2021 (UTC)
Two Kinds of Vectors
The article notes that such things as AltiVec and SSE are examples of vector processing, and so it's common on current chips.
But if that is the case, then vector processing goes back long before the STAR-100.
Intel's MMX split up a 64-bit word into multiple 32-bit or 16-bit integers.
With a 36-bit word, the Lincoln Laboratories TX-2 was doing the same thing, as was the AN/FSQ-31 and 32 with a 48-bit word. And those two were derived from IBM's SAGE system, which operated on vectors of two 16-bit numbers at once.
The kind of vector processing that a Cray-1 did, on the other hand, isn't nearly as common; right now, the only current system of that general kind is the SX-ACE from NEC. — Preceding unsigned comment added by Quadibloc (talk • contribs) 22:33, 7 August 2016 (UTC)
- Altivec is miscategorised. just because "VSX" has the word "Vector" in it does not make it Vector Processing: VSX and Altivec are pure fixed-length SIMD, with zero predication, and cause programmers to write the most horrendous general-purpose (variable length) assembler. Actual Vector Processing involves having either a VL (Vector Length) register or at the bare minimum some Vector Predicates which allows mask-out of element operations. NEON, MMX, SSE, Altivec, VSX, these are SIMD. AVX, AVX512, ARM SVE2, these are predicated SIMD. Cray, RISCV RVV, LibreSOC's SVP64, SX-ACE (which I had not heard of before, thank you for that one i will look it up), these are all Cray-style Variable Length. Lkcl (talk) 23:58, 5 June 2021 (UTC)
Why isn't the word SIMD in this article?
Aren't SIMD and vector processors largely synonymous? Isn't SIMD usually vector? Isn't vector processing usually SIMD? WorldQuestioneer (talk) 20:10, 13 July 2020 (UTC)
- @WorldQuestioneer: There seems to be a somewhat arbitrary distinction between "traditional" vector machines and SIMD machines described in the SIMD article:
Vector-processing architectures are now considered separate from SIMD computers, based on the fact that vector computers processed the vectors one word at a time through pipelined processors (though still based on a single instruction), whereas modern SIMD computers process all elements of the vector simultaneously. [some 1998 ref here]
- I am no expert on this issue, but this seems... dumb. Like, there's only so much you can do with pipelining and looping on a single ALU, so modern stuff sold as "vector processors" like the NEC SX-Aurora TSUBASA uses a bunch of SIMD units too. Some phrasing needs to be added to accommodate this sort of thing. --Artoria2e5 🌉 01:42, 15 July 2020 (UTC)
- it seems dumb because it's plain wrong. Cray Vector Engines had so many registers and could do so many elements in parallel in a single clock cycle that they had to have external ultra-expensive multi-ported SRAM instead of internal register files. they got away with that because the speed of processing matched the speed of memory at the time (a trick that won't work today). Modern Vector Processors actually have predicated SIMD ALU back-ends (called "Lanes"), you just don't get to use them directly because the ISA hides them from you. the Issue Phase is what chucks variable-length Element operations at the SIMD backends *on your behalf* so you as the programmer don't have to piss about with god-awful stripmining and teardown. Lkcl (talk) 00:07, 6 June 2021 (UTC)
- comparing SIMD and Cray-style Vector Processing: Vectors are so light-years ahead of SIMD in terms of efficiency and effectiveness it's not even funny. see other comments above. Lkcl (talk) 00:00, 6 June 2021 (UTC)
- a more direct answer (now in the article) is that modern Vector Processors tend to use SIMD back-ends, fronted by a proper Vector ISA. you as the programmer absolutely do not need to know about this: you use the *Vector* ISA, not a SIMD ISA. those SIMD back-ends have built-in predication masks, which the micro-architecture can use to finish loops by going, "oh, huh, we only have 3 items left to do, and the SIMD units are 8 wide, um, let me do some math, here... that means i have to calculate a predicate mask of 0b00000111 and chuck it at the SIMD ALUs for you". if the Vector operation is itself predicated, that 0b00000111 is simply ANDed with the relevant mask bits. bottom line is that there is absolutely no excuse whatsoever for Intel, AMD and ARM, in the year 2021, 50+ years after Cray Vectors were invented, to be peddling SIMD as if it was doing us a favour. Lkcl (talk) 03:44, 8 June 2021 (UTC)
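A minimal C sketch of that tail-mask calculation, assuming a hypothetical 8-wide predicated SIMD back-end (the function names are illustrative, not from any real ISA):

#include <stddef.h>
#include <stdint.h>

/* how a vector front-end might drive an 8-wide predicated SIMD ALU:
   enable only the lanes that still have valid elements left. */
uint8_t tail_mask(size_t remaining) {
    if (remaining >= 8)
        return 0xFF;                          /* all 8 lanes active */
    return (uint8_t)((1u << remaining) - 1);  /* 3 left -> 0b00000111 */
}

/* if the vector operation is itself predicated, the hardware simply
   ANDs the user's predicate with the tail mask. */
uint8_t effective_mask(uint8_t user_pred, size_t remaining) {
    return user_pred & tail_mask(remaining);
}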
Recommendation that importance be set "Top"
the page currently does not have importance set. i recommend it be changed to "top" after a review.
- Cray-1 supercomputer. says it all.
- Vector processing as a concept saves so much power, needs so many fewer instructions, and is so much less absolute hell for programmers it's not funny. LITERALLY an order of magnitude saving on program size.
- Vector processing is the basis of every GPU on the planet. every smartphone, 99% of supercomputers, every Gaming Graphics Card.
- there are *477* links to this page!
basically the page has near zero recognition of the strategic importance of how Vector processing has influenced our lives, in computing. this is actually a cause for some concern, from a sociological and historic perspective. however i am not comfortable setting it myself, would prefer a review. Lkcl (talk) 16:53, 10 June 2021 (UTC)
Discernable features
Lkcl has been quite insistent that vector processors are distinguishable from SIMD and offers a two-point test at the end of the lead for identifying a vector processor. I don't see support for this test in any cited sources. The citations provided are for WP:PRIMARY technical details of individual architectures which is a recipe for WP:SYNTHESIS. We need to reference these assertions to a WP:SECONDARY source like a textbook on processor architecture. ~Kvng (talk) 14:26, 13 June 2021 (UTC)
- With Vector Processing having been completely forgotten about for nearly 50 years, with only extreme high-end secretive systems actually properly implementing true Vector ISAs, and with even Sony, IBM, Intel *and ARM* completely misleading absolutely everyone including Academics about this, you're simply not going to find anything. You can clearly see the claim since 2003 by Sony / IBM when the Cell Processor first came out that VSX and Altivec are Vectors: Alti-VEC, and VSX, the Vector Scalar Extension. with such large companies making such false and misleading claims, and those false claims being propagated throughout literature for decades, it might seem difficult to say otherwise. However the fact remains that a basic first-principles analysis, as well as a comprehensive detailed review of available ISAs, both SIMD (VSX, NEON, SSE), Predicated SIMD (AVX512, SVE2), and "True" Vector ISAs (Cray, SX-Aurora, RVV), clearly shows the difference, namely that the SIMD ISAs merely borrowed features from Vector ISAs. I have spent nearly a week going over this, creating examples in considerable detail, with citations in each example, showing in each example where the features exist. Thus, although there are no books which "state" this fact, it is a fact that may be logically deduced and concluded. Whether the Marketing Departments of the billion-dollar companies who came up with the false and misleading statements conflating Vectors with SIMD like having that pointed out remains to be seen. Lkcl (talk) 16:09, 13 June 2021 (UTC)
- ok, so what is a way forward, here. surely there must be other wikipedia pages where the "Marketing" of large billion dollar companies has been ongoing for such a long time (2 decades in this case) that it's permeated pretty much all literature on the subject. if the correct logically-deducible facts are removed from this page it misleads readers and continues to allow billion dollar companies to propagate false marketing. there must be a process or wording by which this can be clearly communicated. what might that be? Lkcl (talk) 16:34, 13 June 2021 (UTC)
- something which might help with the process of logical reasoning and factual deduction here: look up the definition of a Vector. actually, that's harder than it looks, but you get the idea https://en.wikipedia.org/wiki/Vector_(mathematics_and_physics). question: where in the definition of vector does it say the number of elements is a hard-fixed quantity? SIMD by definition makes the number of elements a hard inalienable unchangeable quantity. this is a fact. it is part of the definition of SIMD. by total contrast, Seymour Cray and other designers of Vector ISAs specifically designed Vector ISAs to be variable length. this is also a fact. therefore, it is blindingly obvious by definition - fact - that SIMD != Vector. is that clear enough and simple enough? if so, how is it best worded? Lkcl (talk) 17:01, 13 June 2021 (UTC)
- ah - got another one for you. look closely at this article: https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/coding-for-neon---part-2-dealing-with-leftovers - it starts with this: "In this post, we deal with an often encountered problem: input data that is not a multiple of the length of the vectors you want to process. You need to handle the leftover elements at the start or end of the array - what is the best way to do this on Neon?". note the wording. input data is not a multiple of the length of the vectors you want to process. it then uses explicit fixed-length NEON SIMD instructions vld1.8. further down, a statement Neon provides loads and stores that can operate on single elements in a vector. Using these, you can load a partial vector containing one element, then in the code fragment below talks about vld1.8 {d0}, [r0]! @ load eight elements from the array.
- in other words they're conflating the data itself, which is arrays (Vectors) with the capability of the hardware (which is fixed-length SIMD) and thus giving the reader the completely false and misleading impression by implication and by accidental word-association that the hardware itself is "Vector-capable".
- now, this makes the article itself no less clear: it's a brilliant well-written article that does its best in the face of the god-awful seductive horror that fixed-length non-predicated SIMD actually is: it describes very clearly and succinctly the work-around techniques called "Fixups" which have to be deployed when the data vector length does not match the fixed hardware length (a plain-C sketch of that pattern is given at the end of this thread)... all the while not mentioning at all that if NEON was *actually* a Vector Processor, none of those god-awful dog's dinner techniques would even be needed. ARM can't exactly go shooting its own contributors, can it?
- fortunately for ARM (and thank god for the programming community), SVE2 fixed all of that by providing element-level predication as a fundamental part of the SVE2 ISA.
- by complete contrast to NEON this makes ARM SVE2 capable of properly processing variable-length Vector data. and that's really what this is about. NEON != Vector. SVE2 ~= Vector. Cray == Vector. Lkcl (talk) 20:27, 13 June 2021 (UTC)
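To make the "Fixups" pattern concrete, here is a portable plain-C sketch of the strip-mining that fixed-length non-predicated SIMD forces on the programmer (no real NEON intrinsics are used; each main-loop iteration stands in for one 4-wide SIMD instruction):

#include <stddef.h>

/* add src into dst, n elements. the main loop models 4-wide SIMD;
   the second loop is the scalar "fixup" for the 0-3 leftovers.
   a true Vector ISA with VL or predication needs no fixup loop. */
void add_arrays(int *dst, const int *src, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {  /* one iteration = one SIMD add */
        dst[i+0] += src[i+0];
        dst[i+1] += src[i+1];
        dst[i+2] += src[i+2];
        dst[i+3] += src[i+3];
    }
    for (; i < n; i++)            /* fixup: leftover elements */
        dst[i] += src[i];
}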
- and another one. the article is about processors that were designed from the ground up to be "processors that handle large vectors". general purpose computers which had SIMD added as an afterthought do not qualify as vector processors. there is already a page on SIMD, and people interested on SIMD should go there and read about it. if SIMD === Vector Processors, then why on earth does this page exist at all? the answer is: because it's *about* Vector Processors, not about SIMD. that alone tells you that there's a definite difference. Lkcl (talk) 22:57, 13 June 2021 (UTC)
deeper problems with all associated articles
https://en.wikipedia.org/wiki/Talk:SIMD#Page_quality_is_awful_(in_the_summary)
there are fundamental problems with the three pages, Vector Processing, SIMD, and SIMT. from the link above it can be seen that there is MASSIVE confusion even from academic coursework and academic literature on this topic.
it also does not help that neither Flynn's nor Duncan's taxonomy covers SIMT! Even i was not aware in 2004 when working for Aspex that it was a *SIMT* processor not a *SIMD* one, because NVIDIA had not coined the phrase, only introducing it in what... 2012? 2016? something like that.
it also does not help that a pure SIMD-only processor with zero scalar capability and no scalar registers is ANOTHER class of processor that at the hardware level is virtually indistinguishable from SIMT.
some diagrams are urgently needed here which illustrate these things properly.
given that SIMD is literally the top world hit on google search engines, this is a pretty damn high priority task. how can this be properly given attention and resources? Lkcl (talk) 14:21, 15 June 2021 (UTC)
- Lkcl, when there is disagreement in sources about something like this, the approach we generally take is to report the different sides of the argument (with citations). Wikipedia is not the decider. ~Kvng (talk) 21:20, 15 June 2021 (UTC)
- hiya Kvng, i don't have a problem with the citations being used, they are good. the problem is, they're not being read / understood properly because some of them are quite old (1977) i.e. use different terminology from modern computing. combine that with the complexity of the subject, combine it with the secrecy that NVIDIA, AMD, Intel and ARM engage in where people *cannot find out* what is inside, and combine it with the "circular citation problem" of wikipedia (a misreport gets cited in academia which then is published and is cited by wikipedia....) and we have the situation where several inter-related very important computing topics are badly misrepresenting the fundamentals of computing architecture that is the cornerstone of our modern way of life. now, i can point out the problem, from the expertise that i have, but i have a hell of a lot to get done. i am going to need help finding citations. i can hand-draw diagrams very quickly, but someone else with more time will need to do them in SVG. basically a conversation and collaboration is needed. Lkcl (talk) 01:41, 16 June 2021 (UTC)
- guy harris kindly found a ref, Flynn's 1972 paper: Flynn followed up his initial paper and sub-categorised SIMD. one of those sub-categories is SIMT! Lkcl (talk)
Array processors needs to redirect to Flynn's taxonomy
arg, after guy kindly found the 1972 flynn paper, strictly speaking the redirect Array Processors should instead be to the (new) subsection in Flynn's taxonomy, because Array Processor is a subclass of SIMD. whoops. Lkcl (talk) 05:43, 18 June 2021 (UTC)
Vector reduction example
In this example we start with an algorithm which involves reduction. Just as with the previous example, we first show it in scalar instructions, then SIMD, and finally Vector instructions. We start in C:
#include <stddef.h>

int sum(size_t n, const int x[]) {
    int y = 0;
    for (size_t i = 0; i < n; i++)
        y += x[i];
    return y;
}
Here, an accumulator (y) is used to sum up all the values in the array, x.
Scalar Assembler
Our scalar version of this would load each element of x, add it to y, and loop:
  set y, 0       ; y initialised to zero, once, before the loop
loop:
  load32 r1, x   ; load one 32bit data
  add32 y, y, r1 ; y := y + r1
  addl x, x, $4  ; x := x + 4
  subl n, n, $1  ; n := n - 1
  jgz n, loop    ; loop back if n > 0
out:
  ret y          ; returns result, y
This is very straightforward: "y" starts at zero, 32-bit integers are loaded one at a time into r1, added to y, and the address of the array "x" advanced to the next element in the array.
SIMD reduction
This is where the problems start. SIMD by design is incapable of operating "inter-element". Element 0 of one SIMD register may be added to Element 0 of another register, but Element 0 may not be added to anything other than Element 0. This places some severe limitations on potential implementations. Let us assume for simplicity that n is exactly 8:
  addl r3, x, $16    ; r3 := address of 2nd group of 4 of x
  load32x4 v1, x     ; first 4 of x
  load32x4 v2, r3    ; 2nd 4 of x
  add32x4 v1, v2, v1 ; add the 2 groups
At this point we have performed four adds: x[0]+x[4], x[1]+x[5], x[2]+x[6] and x[3]+x[7], but from there it is downhill, just as it is with the general case of using SIMD for general-purpose loops. To sum our four partial results, two-wide SIMD can be used, followed by a single scalar add, to finally produce the answer, but, frequently, the data must be transferred out of dedicated SIMD registers before the last scalar computation can be performed.
Even with a general loop (n not fixed), the only way to use 4-wide SIMD is to assume four separate "streams", each offset by four elements. Finally, the four partial results have to be summed. Other techniques involve shuffle: examples online can be found for AVX-512 of how to do "Horizontal Sum"[1][2]
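A plain-C sketch of this four-stream strategy (no real SIMD intrinsics are used; the four accumulators model the four lanes of one SIMD register, and the final regrouping is the "Horizontal Sum"):

#include <stddef.h>

int sum_4stream(const int x[], size_t n) {
    int p0 = 0, p1 = 0, p2 = 0, p3 = 0;  /* four partial sums = 4 lanes */
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {         /* one iteration = one SIMD add */
        p0 += x[i+0];
        p1 += x[i+1];
        p2 += x[i+2];
        p3 += x[i+3];
    }
    int y = (p0 + p1) + (p2 + p3);       /* the horizontal-sum fixup */
    for (; i < n; i++)                   /* scalar leftovers */
        y += x[i];
    return y;
}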
Aside from the size of the program and the complexity, an additional potential problem arises if floating-point computation is involved: the fact that the values are not being summed in strict order (four partial results) could result in rounding errors.
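A tiny self-contained C demonstration of why summation order matters in floating point: at magnitude 1e8 the spacing between adjacent float values is 8, so a lone 4.0f is rounded away, but two 4.0fs grouped together survive.

#include <stdio.h>

int main(void) {
    float big = 1e8f, small = 4.0f;
    float strict_order = (big + small) + small;  /* 100000000.0 */
    float regrouped    = big + (small + small);  /* 100000008.0 */
    printf("%.1f vs %.1f\n", strict_order, regrouped);
    return 0;
}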
Vector ISA reduction
If we may assume that n is less than or equal to the Maximum Vector Length, only three instructions are required:
setvl t0, n # VL=t0=min(MVL, n)
vld32 v0, x # load vector x
vredadd32 y, v0 # reduce-add into y
The reduction when n is larger than the Maximum Vector Length is not that much more complex, and gives an algorithm very similar to our first example ("IAXPY").
  set y, 0           # initialise accumulator once, outside the loop
vloop:
  setvl t0, n        # VL=t0=min(MVL, n)
  vld32 v0, x        # load vector x
  vredadd32 y, y, v0 # add all x into y
  add x, t0*4        # advance x by VL*4
  sub n, t0          # n -= VL (t0)
  bnez n, vloop      # repeat if n != 0
The simplicity of the algorithm is stark in comparison to SIMD. Again, just as with the IAXPY example, the algorithm is length-agnostic (even on Embedded implementations where Maximum Vector Length could be only one).
Implementations in hardware may, if they are certain that the right answer will be produced, perform the reduction in parallel. Some Vector ISAs offer a parallel reduction mode as an explicit option, for when the programmer knows that any rounding errors do not matter.[3]
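A scalar C sketch of the pairwise (tree) ordering such a parallel reduction might use, shown for an assumed width of 8 lanes: the result is produced in log2(8) = 3 combining steps rather than 7 sequential adds, which is exactly why the summation order, and hence the rounding, can differ.

/* combine 8 lanes pairwise: stride 4, then 2, then 1 */
float tree_reduce8(float lanes[8]) {
    for (int stride = 4; stride > 0; stride /= 2)
        for (int i = 0; i < stride; i++)
            lanes[i] += lanes[i + stride];
    return lanes[0];
}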
This example again highlights a key fundamental difference between "True" Vector Processors and those SIMD processors, including most commercial GPUs, which are merely "inspired" by features of Vector Processors.