
Talk:Vectorization (parallel computing)


Pipeline, run-time and other issues

I gave this article a good refactoring, and there are still some rough edges.

  • old vector machines are still in use and some simpler (embedded) processors are not as powerful as cutting-edge versions. This is why I said "some times". Feel free to re-phrase that, of course.
  • some processors can detect repetition in the load/store pattern and automatically switch from normal instructions to vector instructions, and the dynamic vectorization in the paper you mention is really interesting, but I was trying to cover "software" vectorization. In the text I explain that it's not "vectorizing at run-time" but actually "choosing the vectorized version at run-time" (a sketch of what I mean follows below). Maybe a better title for the topic is needed, but I can't think of one that is as "catchy" as that... ;)
  • I'll add memory reference optimizations (induction, reduction, etc.) too, but I've been busy with loop dependence analysis and normalized loop.

rengolin (talk) 21:55, 4 December 2009 (UTC)[reply]
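
To make "choosing the vectorized version at run-time" concrete, here is a minimal C sketch of the idea, assuming the compiler emits both a scalar and a vectorized loop plus a cheap run-time overlap check that picks between them. The function names and the check itself are illustrative, not taken from the article:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar body */
static void add_scalar(int *c, const int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Stand-in for the vectorized body: 4 elements per step, scalar tail */
static void add_vector(int *c, const int *a, const int *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
    }
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Entry point: one cheap run-time check chooses the version to run */
void add(int *c, const int *a, const int *b, size_t n) {
    uintptr_t cp = (uintptr_t)c, ap = (uintptr_t)a, bp = (uintptr_t)b;
    size_t bytes = n * sizeof(int);
    int no_overlap = (cp + bytes <= ap || ap + bytes <= cp)
                  && (cp + bytes <= bp || bp + bytes <= cp);
    if (no_overlap)
        add_vector(c, a, b, n);  /* safe to use the vectorized version */
    else
        add_scalar(c, a, b, n);  /* conservative fallback */
}

The point is that the vectorization itself still happens at compile time; only the choice between the two pre-built versions is deferred to run time.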

Some issues with the descriptions added on Dec. 1, 2009

  • "pipeline synchronization" in the last line of the Background section: Pipeline synchronization is not a factor for slow vector code for modern SIMD architectures. It used to be the case for old vector machines but not anymore.
  • The description of runtime vectorization in the "Run-time vs. Compile-time vectorization" section: the current description is misleading. It describes compile-time generation of check code for runtime values, which is part of 'code versioning'. I've seen some people use the word this way, but to me, runtime vectorization refers to techniques that convert scalar instructions to their vector counterparts at runtime, such as the hardware technique described in the following paper.

http://pages.cs.wisc.edu/~sriram/vajapeyam99dynamic.pdf

So, the whole issue seems to stem from the fact that two different SIMD architectures (one the multimedia-extension type and the other the old pipelined vector machines) are described together in one place. Another example of this: in the 'Building the dependency graph' section you say 'a[i] = a[i+1];' cannot be vectorized, but below, in the 'Detecting idioms' section, you say 'a[i] = a[i] + a[i+1];' can be vectorized. As you know, 'a[i] = a[i] + a[i+1];' cannot be vectorized on modern SIMD architectures in the normal sense. So, my hope is that someone could make this distinction when referring to the features that are supported by only one of the two types of SIMD architectures. —Preceding unsigned comment added by Vermin8tr (talkcontribs) 17:07, 6 December 2009 (UTC)[reply]
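
For reference, here are the two loops being compared, written out in C with the loop-carried dependence each one contains spelled out in comments. The function names are mine; this is only a restatement of the examples above and makes no claim about which of the two SIMD architecture types can handle them:

#include <stddef.h>

void shift(int *a, size_t n) {
    /* iteration i reads a[i+1]; iteration i+1 overwrites a[i+1]:
       a loop-carried anti-dependence (read before write) of distance 1 */
    for (size_t i = 0; i + 1 < n; i++)
        a[i] = a[i + 1];
}

void accumulate_neighbour(int *a, size_t n) {
    /* the same anti-dependence on a[i+1], plus a read and a write of a[i]
       within the same iteration */
    for (size_t i = 0; i + 1 < n; i++)
        a[i] = a[i] + a[i + 1];
}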

Needs an example

  • scalar program: for (i=0;i<10;i++) c[i]=a[i]+b[i];

and the equivalent vector version and how the vector version would run faster.

x86's MMX instruction set provides facilities for adding multiple numbers together. MMX registers are 64 bits wide and can operate on 1 quadword, 2 doublewords, 4 words or 8 bytes. In your example, if we are allowed to assume c/a/b are all bytes (8 bits), then we can use the PADDB (packed add bytes) instruction to add 8 numbers at the same time. For simplicity, I'll change your example to loop to i<8 instead. So the equivalent code, using ASM syntax:
MOVQ mm0, [a]          ; Load a[0] to a[7] into the mm0 register
MOVQ mm1, [b]          ; Load b[0] to b[7] into mm1
PADDB mm0, mm1         ; Add the bytes of mm0 and mm1 together, storing result in mm0
MOVQ [c], mm0          ; Store the result into c[0] to c[7]
Each of those instructions takes roughly one cycle. There are complications involving memory/cache read/writes but we'll ignore them for now.
Therefore, the scalar version requires:
  • 1 initialisation (i = 0)
  • 8 checks (i < 8)
  • 8 increments (i++)
  • 16 loads (get a[i] and b[i])
  • 8 adds (a[i] + b[i])
  • 8 stores (c[i] = result)
= 49 cycles*
The MMX version requires:
  • 2 loads (MOVQ mm0/mm1, [a]/[b])
  • 1 add (PADDB)
  • 1 store (MOVQ [c], mm0)
= 4 cycles
The number of cycles is reduced to 4/49 ≈ 8%. The code is 12.25 times faster (a C-intrinsics version of the same operation is sketched below).
  • note that you can save 17 instructions (~35%) by unrolling the loop (copy/pasting the body 8 times), which is what compilers might do, but that is still much slower than the MMX version. C xong (talk) 23:54, 16 March 2010 (UTC)[reply]
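
For readers who prefer C over assembly, here is the same operation sketched with SSE2 intrinsics; SSE registers are 128 bits wide, so this handles 16 bytes per instruction rather than MMX's 8. The function names and the fixed 16-element size are illustrative, not taken from the discussion above:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>

/* Vector version: one load per input, one packed add, one store */
void add_bytes_16(uint8_t *c, const uint8_t *a, const uint8_t *b) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load a[0..15] */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);  /* load b[0..15] */
    __m128i vc = _mm_add_epi8(va, vb);                 /* 16 byte adds at once */
    _mm_storeu_si128((__m128i *)c, vc);                /* store c[0..15] */
}

/* Scalar equivalent, for comparison */
void add_bytes_16_scalar(uint8_t *c, const uint8_t *a, const uint8_t *b) {
    for (int i = 0; i < 16; i++)
        c[i] = a[i] + b[i];
}

As with PADDB, _mm_add_epi8 wraps around on overflow, so the scalar and vector versions produce the same results for unsigned bytes.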

deleted a sentence that does not reflect modern SIMD architectures

I deleted the following sentence, because it describes a feature of conventional vector machines. Such a feature is better described in vector processor. Modern SIMD instructions, for example those of AltiVec and SSE, do not differ much from scalar instructions in terms of instruction latency. As with scalar instructions, ILP can be exploited for SIMD instructions as well.

 since while there may be some overhead to starting up a vector operation, once it starts each individual  
 operation is faster (in part because it avoids the need for instruction decoding).

deleted unrelated part

I deleted the following paragraph, because it is unrelated to the topic. Maybe it belongs in a "vectorization (computer graphics)" article or something:

Raster-to-vector conversion is an important operation for computer graphics software. Raster images are the natural output of scanners and digital photography, they are the mainstay of TV and digital printing, and they can be created (with varying levels of skill) by users of bitmap-handling software such as Adobe PhotoShop or Corel PhotoPaint. However, raster images do not scale well, and for many purposes vector graphics are to be preferred. Raster-to-vector conversion software needs to: decode raster file formats, detect colour boundaries in images, simplify boundaries into smaller numbers of vectors (typically lines, arcs and Bezier curves), and write out vector files in suitable formats. Well-known raster-to-vector converters include Corel Trace, Adobe Streamline, and most sign-making programs.

Disambiguation

I just created a page Vectorization (mathematics) which discusses a different use of the term. I put some disambiguation at the beginning of this article for now. Would it be OK with those who watch this article if it were moved to something like Vectorization (computer science) so that the unmodified Vectorization could be used as a disambiguation page? I don't know how this will affect the proposed merge. Michael Kinyon 17:10, 24 September 2006 (UTC)[reply]

My thoughts exactly. I do not think either use is significantly more prominent than the other. --CyHawk (talk) 00:02, 6 March 2008 (UTC)[reply]

Done Oicumayberight (talk) 03:30, 5 October 2008 (UTC)[reply]

Unconventional old fashion computers?

This sentence is a mess:

Vector processing is a major feature of both conventional and modern supercomputers.

Can't modern computers be conventional? What exactly is a conventional computer? —Preceding unsigned comment added by 62.73.248.37 (talk) 18:40, 15 April 2009 (UTC)[reply]

Good point. Be bold and feel free to rephrase it. —Ben FrantzDale (talk) 00:55, 16 April 2009 (UTC)[reply]
Changed to

Vector processing is a major feature of both conventional computers and modern supercomputers.

(computers these days have SIMD through the MMX/SSE extensions) C xong (talk) 00:00, 17 March 2010 (UTC)[reply]