
Parallel Thread Execution

From Wikipedia, the free encyclopedia
This is an old revision of this page, last edited on 14 November 2009 at 01:29 by en>Fivemack (Wrote page). It may differ significantly from the current revision.

Parallel Thread Execution (PTX) is a pseudo-assembly language used in Nvidia's CUDA programming environment. The 'nvcc' compiler translates code written in CUDA, a C-like language, into PTX, and the graphics driver contains a compiler that translates the PTX into code that can run on the processing cores.

PTX uses an arbitrarily large register set; the output from the compiler is almost pure single-assignment form, with consecutive lines generally referring to consecutive registers. Programs start with declarations of the form .reg .u32 %r<335>; which indicate that up to 335 registers of type unsigned 32-bit integer (named %r0 through %r334) are used.

It is a three-argument assembly language, and almost all instructions explicitly list the data type (in terms of sign and width) on which they operate. Register names are preceded by a % character and constants are written literally, e.g. shr.u64 %rd14, %rd12, 32; or cvt.u64.u32 %rd142, %r112;
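To make the two example instructions concrete, the following C sketch models what they compute. The function names ptx_shr_u64 and ptx_cvt_u64_u32 are hypothetical, chosen only for this illustration. The clamping of oversized shift amounts reflects the PTX shift semantics as commonly documented and should be read as an assumption here.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `shr.u64 d, a, b`: logical right shift of an unsigned 64-bit value.
   Assumption: shift amounts of 64 or more yield 0 for unsigned operands.
   C leaves such shifts undefined, so the case is handled explicitly. */
uint64_t ptx_shr_u64(uint64_t a, uint32_t b) {
    return b >= 64 ? 0 : a >> b;
}

/* Model of `cvt.u64.u32 d, a`: widen an unsigned 32-bit value to 64 bits
   by zero extension. */
uint64_t ptx_cvt_u64_u32(uint32_t a) {
    return (uint64_t)a;
}

/* Small self-check mirroring the operands used in the text. */
void ptx_arith_demo(void) {
    uint64_t rd12 = 0x123456789abcdef0ULL;
    uint64_t rd14 = ptx_shr_u64(rd12, 32);   /* shr.u64 %rd14, %rd12, 32; */
    assert(rd14 == 0x12345678ULL);

    uint32_t r112 = 0xffffffffu;
    uint64_t rd142 = ptx_cvt_u64_u32(r112);  /* cvt.u64.u32 %rd142, %r112; */
    assert(rd142 == 0x00000000ffffffffULL);
}
```

The type suffixes in the PTX mnemonics correspond to the fixed-width C types used in the signatures: .u64 to uint64_t and .u32 to uint32_t.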

There are predicate registers, but compiled code in shader model 1.0 uses these only in conjunction with branch commands; the conditional branch is '@%p14 bra $label'. The 'setp.cc.type' instruction sets a predicate register to the result of comparing two registers of appropriate type. There is also a 'set' instruction, which writes the result of the comparison to an ordinary register: "set.le.u32.u64 %r101, %rd12, %rd28" sets the 32-bit register r101 to 0xffffffff if the 64-bit register rd12 is less than or equal to the 64-bit register rd28, or to 0x00000000 otherwise.
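The difference between 'setp' and 'set' can be sketched in C as follows. The function names are hypothetical and exist only for this illustration; a C bool stands in for a predicate register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of `setp.le.u64 p, a, b`: the comparison result goes to a
   predicate register, represented here by a bool. */
bool ptx_setp_le_u64(uint64_t a, uint64_t b) {
    return a <= b;
}

/* Model of `set.le.u32.u64 d, a, b`: the comparison is done on 64-bit
   unsigned operands, and the result is written to a 32-bit register as
   an all-ones or all-zeros mask. */
uint32_t ptx_set_le_u32_u64(uint64_t a, uint64_t b) {
    return a <= b ? 0xffffffffu : 0x00000000u;
}

/* Model of a predicated branch `@p bra $label`: the branch is taken
   only when the predicate is true. Returns the next "program counter",
   with both targets supplied by the caller (hypothetical encoding). */
int ptx_branch_if(bool p, int label_pc, int fallthrough_pc) {
    return p ? label_pc : fallthrough_pc;
}
```

An all-ones mask is convenient because it can be combined with other values using bitwise AND, which lets a compiler select between values without branching.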

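Pseudoregisters %tid, %ntid and %ctaid contain thread indices: %tid is the index of the thread within its thread block, %ntid is the number of threads per block, and %ctaid is the index of the block within the grid.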
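A common use of these pseudoregisters is computing a unique index for each thread across the whole launch. The C sketch below assumes the usual meanings (%tid as the thread's index within its block, %ntid as the block size, %ctaid as the block's index) and models each register as a three-component value. The type ptx_dim3 and the function name are hypothetical.

```c
#include <stdint.h>

/* Hypothetical stand-in for a three-component pseudoregister
   such as %tid, with fields .x, .y and .z. */
typedef struct {
    uint32_t x, y, z;
} ptx_dim3;

/* Global index along x: block index times block size plus the thread's
   index within the block. In CUDA source this is usually written as
   blockIdx.x * blockDim.x + threadIdx.x. */
uint32_t global_thread_index_x(ptx_dim3 ctaid, ptx_dim3 ntid, ptx_dim3 tid) {
    return ctaid.x * ntid.x + tid.x;
}
```

For example, thread 5 of block 2 in a launch with 256 threads per block gets index 2 * 256 + 5 = 517.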

Load and store commands refer to several distinct memory spaces: ld.param, ld.global and ld.shared. The .param memory contains the parameters passed to a kernel, .global is the main memory on the card, and .shared is memory shared between the threads in a 'thread block'. Shared memory is declared in the PTX file via lines at the start, of the form

.shared .align 8 .b8 pbatch_cache[15744];

(which defines 15744 bytes, aligned to an 8-byte boundary) and is generally addressed via a kernel-global pointer set up at the start of the kernel by

mov.u64 %rd9, pbatch_cache;