Thread block (CUDA programming)
CUDA is a parallel computing platform and programming model that higher-level languages can use to exploit parallelism. In CUDA, a kernel, which is a small program or function, is executed by threads. A thread is an abstract entity that represents one execution of the kernel. Multithreaded applications run many such threads simultaneously to organize parallel computation. Every thread has an index, which is used for calculating memory address locations and for making control decisions.
For better process and data mapping, threads are grouped into thread blocks. A thread block is a programming abstraction that represents a group of threads that can be executed serially or in parallel. The number of threads in a block varies with the available shared memory, and the architecture limits a thread block to a total of 512 threads.[1] All threads in the same thread block run on the same streaming multiprocessor. Threads in the same block can communicate with each other via shared memory, barrier synchronization, or other synchronization primitives such as atomic operations.
Multiple blocks are combined to form a grid. All blocks in the same grid contain the same number of threads. Since the number of threads in a block is limited to 512, grids can be used for computations that require a large number of thread blocks operating in parallel.
Thread block dimensions
CUDA operates on a heterogeneous programming model in which an application runs on both a host and a device; its execution model is similar to that of OpenCL. In this model, execution of an application begins on the host, which is usually a CPU. The device is a throughput-oriented processor, i.e., a GPU, which performs the parallel computations. Kernel functions are used to carry out these parallel executions. Once the kernel functions have executed, control passes back to the host, which resumes serial execution.
As many parallel applications involve multidimensional data, it is convenient to organize thread blocks into 1D, 2D, or 3D arrays of threads. The blocks in a grid must be able to execute independently, as communication or cooperation between blocks in a grid is not possible. When a kernel is launched, the number of threads per thread block and the number of thread blocks are specified; this, in turn, defines the total number of CUDA threads launched.[2] The maximum x, y, and z dimensions of a block are 512, 512, and 64, and a block should be allocated such that x × y × z ≤ 512, which is the maximum number of threads per block.[2] Blocks can be organized into one- or two-dimensional grids of up to 65,535 blocks in each dimension.[2] The limit on the number of threads per block is imposed because the number of registers that can be allocated across all threads is limited.[2]
Thread block indexing
Every thread in CUDA is associated with a particular index so that it can calculate and access memory locations in an array.
Consider an example in which an array of 512 elements must be processed. One possible organization is a grid with a single block of 512 threads. Let C be an array of 512 elements formed by the element-wise multiplication of two arrays A and B, each of 512 elements. Every thread has an index i; it multiplies the i-th element of A by the i-th element of B and stores the result in the i-th element of C. The index i is calculated using blockId (0 in this case, as there is only one block), blockDim (512 in this case, as the block has 512 threads), and threadId, which varies from 0 to 511 within the block.

The index i is calculated by the following formula:
i = blockIdx.x × blockDim.x + threadIdx.x
where:
blockIdx.x is the x-dimension block identifier
blockDim.x is the x-dimension of the block (the number of threads per block)
threadIdx.x is the x-dimension thread identifier within the block
Thus 'i' will have values ranging from 0 to 511, covering the entire array.
If we want to handle an array larger than 512 elements, we can use multiple blocks of 512 threads each. Consider an example with 1024 array elements. In this case there are 2 thread blocks with 512 threads each, so the thread identifier varies from 0 to 511, the blockId varies from 0 to 1, and the blockDim is 512. Thus the first block gets index values from 0 to 511 and the second gets index values from 512 to 1023.
In the same way, in more complex grids, the blockId and the threadId must be calculated by each thread depending on the geometry of the grid. Consider a 2-dimensional grid with 2-dimensional blocks. The blockId and the threadId are calculated by the following formulae:
blockId = blockIdx.x + blockIdx.y × gridDim.x
threadId = blockId × (blockDim.x × blockDim.y) + (threadIdx.y × blockDim.x) + threadIdx.x
See also
References
- ^ "CUDA Overview". cuda.ce.rit.edu. Retrieved 2016-09-21.
- ^ a b c d "CUDA Thread Model". www.olcf.ornl.gov. Retrieved 2016-09-21.
- ^ "Thread Hierarchy in CUDA Programming". Retrieved 2016-09-21.
- ^ "Thread Indexing Cheatsheet" (PDF).