Parallel Thread Execution
Parallel Thread Execution (PTX) is a pseudo-assembly language used in Nvidia's CUDA programming environment. The nvcc compiler translates code written in CUDA, a C-like language, into PTX, and the graphics driver contains a compiler that translates the PTX into binary code that runs on the processing cores.
PTX uses an arbitrarily large register set; the output from the compiler is almost pure single-assignment form, with consecutive lines generally referring to consecutive registers. Programs start with declarations of the form .reg .u32 %r<335>; which declares 335 registers of type unsigned 32-bit integer, named %r0 through %r334.
PTX is a three-argument assembly language, and almost all instructions explicitly list the data type (in terms of sign and width) on which they operate. Register names are preceded by a % character and constants are written as literals, e.g. shr.u64 %rd14, %rd12, 32; or cvt.u64.u32 %rd142, %r112;
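As a rough illustration of what the two example instructions compute, the following C sketch models them with fixed-width integer types. The function names ptx_shr_u64 and ptx_cvt_u64_u32 are hypothetical and exist only for this example; the clamping of oversized shift amounts follows the behaviour described in the PTX ISA documentation and should be read as an assumption here.

```c
#include <stdint.h>

/* Models "shr.u64 d, a, b": logical right shift of an unsigned 64-bit value.
   Assumption: shift amounts of 64 or more are clamped, so the result is 0
   (a plain C shift by >= 64 would be undefined behaviour). */
uint64_t ptx_shr_u64(uint64_t a, uint32_t b)
{
    return b >= 64 ? 0 : a >> b;
}

/* Models "cvt.u64.u32 d, a": zero-extends an unsigned 32-bit value
   into a 64-bit destination. */
uint64_t ptx_cvt_u64_u32(uint32_t a)
{
    return (uint64_t)a;
}
```

In this model, shr.u64 %rd14, %rd12, 32; corresponds to rd14 = ptx_shr_u64(rd12, 32), which moves the upper half of rd12 into the lower half of rd14.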
There are predicate registers, but compiled code in shader model 1.0 uses these only in conjunction with branch commands; the conditional branch is '@%p14 bra $label'. The 'setp.cc.type' instruction sets a predicate register to the result of comparing two registers of appropriate type. There is also a 'set' instruction, which writes the result to an ordinary register instead: 'set.le.u32.u64 %r101, %rd12, %rd28' sets the 32-bit register r101 to 0xffffffff if the 64-bit register rd12 is less than or equal to the 64-bit register rd28, or to 0x00000000 otherwise.
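The difference between 'set' and 'setp', and the way a predicate guards a branch, can be sketched in C as follows. This is only a model of the semantics described above, not generated code; the function names are hypothetical, and the loop in sum_below is an invented example whose PTX equivalent is shown in its comment.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models "set.le.u32.u64 d, a, b": compares two unsigned 64-bit values and
   writes an all-ones or all-zeros mask to a 32-bit register. */
uint32_t ptx_set_le_u32_u64(uint64_t a, uint64_t b)
{
    return a <= b ? 0xffffffffu : 0x00000000u;
}

/* Models "setp.le.u64 p, a, b": the same comparison, but the result is a
   one-bit predicate rather than a mask. */
bool ptx_setp_le_u64(uint64_t a, uint64_t b)
{
    return a <= b;
}

/* Models a loop whose back edge is a predicated branch:
       $loop: add.u32     %r2, %r2, %r1;
              add.u32     %r1, %r1, 1;
              setp.lt.u32 %p1, %r1, %r3;
              @%p1 bra    $loop;
   Here r1 is the counter, r2 the running sum and r3 the bound n. */
uint32_t sum_below(uint32_t n)
{
    uint32_t r1 = 0, r2 = 0;
    bool p1;
    do {
        r2 += r1;
        r1 += 1;
        p1 = r1 < n;   /* setp.lt.u32 %p1, %r1, %r3 */
    } while (p1);      /* @%p1 bra $loop */
    return r2;
}
```

The mask form produced by 'set' is convenient for bitwise selection, while the predicate form produced by 'setp' is what a guarded instruction such as '@%p1 bra' consumes.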
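The pseudoregisters %tid, %ntid and %ctaid identify the running thread: %tid holds the thread's index within its thread block, %ntid the dimensions of the block, and %ctaid the index of the block within the grid.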
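A common use of these pseudoregisters is to derive a unique global index for each thread. The C sketch below models that calculation for the x component only; the struct thread_regs and the function global_index_x are hypothetical names introduced for this example, with %tid.x taken as the thread's index within its block, %ntid.x as the number of threads per block, and %ctaid.x as the block's index within the grid.

```c
#include <stdint.h>

/* Hypothetical stand-in for the x components of the PTX pseudoregisters. */
struct thread_regs {
    uint32_t tid_x;    /* %tid.x:   thread index within its block   */
    uint32_t ntid_x;   /* %ntid.x:  number of threads per block     */
    uint32_t ctaid_x;  /* %ctaid.x: index of the block in the grid  */
};

/* Global thread index along x: block index times block size, plus the
   thread's offset inside the block. In PTX this is typically a single
   multiply-add on values copied out of the pseudoregisters. */
uint32_t global_index_x(struct thread_regs r)
{
    return r.ctaid_x * r.ntid_x + r.tid_x;
}
```

For example, thread 5 of block 3 with 256 threads per block gets global index 3 * 256 + 5 = 773.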
Load and store commands name one of several distinct memory spaces, as in ld.param, ld.global and ld.shared. The .param space contains the parameters passed to a kernel, .global is the main memory on the card, and .shared is memory shared between the threads in a 'thread block'; it is declared in the PTX file via lines at the start of the form
.shared .align 8 .b8 pbatch_cache[15744];
(which defines 15744 bytes, aligned to an 8-byte boundary) and generally addressed via a kernel-global pointer set up at the start of the kernel by
mov.u64 %rd9, pbatch_cache;