OpenCL

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Dithermaster (talk | contribs) at 23:34, 14 October 2014 (Add Intel OpenCL 2.0 driver to Timeline of vendor implementations). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.
OpenCL
Paradigm: Imperative (procedural), structured
Designed by: Apple Inc.
Developer: Khronos Group
First appeared: August 28, 2009
Stable release: 2.0 / November 18, 2013
Typing discipline: Static, weak, manifest, nominal
OS: Cross-platform (multi-platform)
Filename extensions: .cl
Website: www.khronos.org/opencl, www.khronos.org/webcl
Major implementations: AMD, Apple, freeocl, Gallium Compute, IBM, Intel Beignet, Intel SDK, Nvidia
Influenced by: C99, CUDA

Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors. OpenCL includes a language (based on C99) for programming these devices, and application programming interfaces (APIs) to control the platform and execute programs on the compute devices. OpenCL provides parallel computing using task-based and data-based parallelism. OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group. It has been adopted by Apple, Intel, Qualcomm, Advanced Micro Devices (AMD), Nvidia, Altera, Samsung, Vivante, Imagination Technologies and ARM Holdings.

For example, OpenCL can be used to give an application access to a graphics processing unit for non-graphical computing (see general-purpose computing on graphics processing units). Academic researchers have investigated automatically compiling OpenCL programs into application-specific processors running on FPGAs,[1] and commercial FPGA vendors are developing tools to translate OpenCL to run on their FPGA devices.[2][3] OpenCL can also be used as an intermediate language for directives-based programming such as OpenACC.[4][5][6]

Overview

OpenCL views a computing system as consisting of a number of compute devices, which might be central processing units (CPUs) or "accelerators" such as graphics processing units (GPUs), attached to a host processor (a CPU). It defines a C-like language for writing programs, called kernels, that execute on the compute devices. A single compute device typically consists of many individual processing elements (PEs) and a single kernel execution can run on all or many of the PEs in parallel.
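The fan-out of a kernel over processing elements follows simple index arithmetic: in a one-dimensional index space, a work-item's global ID is its work-group's ID times the work-group size plus its local ID within the group. A minimal plain-C model of that arithmetic (the `sim_` names are illustrative, not part of the OpenCL API):

```c
#include <assert.h>
#include <stddef.h>

/* Models OpenCL's 1-D NDRange index arithmetic: every work-item can
   recover its position in the global index space from its group ID,
   the work-group size, and its local ID (cf. get_global_id(0)). */
static size_t sim_global_id(size_t group_id, size_t local_size, size_t local_id)
{
    return group_id * local_size + local_id;
}

/* Total work-items launched: number of work-groups times group size. */
static size_t sim_global_size(size_t num_groups, size_t local_size)
{
    return num_groups * local_size;
}
```

For example, with work-groups of 64 work-items, the work-item with local ID 5 in group 2 has global ID 2 × 64 + 5 = 133; a kernel body typically uses this global ID to select the data element it processes.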

In addition, OpenCL defines an application programming interface (API) that allows programs running on the host to launch kernels on the compute devices and manage device memory, which is (at least conceptually) separate from host memory. Programs in the OpenCL language are intended to be compiled at run-time, so that OpenCL-using applications are portable between implementations for various host devices.[7] The OpenCL standard defines host APIs for C and C++; third-party APIs exist for other programming languages such as Python,[8] Julia,[9] or Java. An implementation of the OpenCL standard consists of a library that implements the API for C and C++, and an OpenCL C compiler for the compute device(s) targeted.

Memory hierarchy

OpenCL defines a four-level memory hierarchy for the compute device:[7]

  • global memory: shared by all compute devices, but has high access latency;
  • read-only memory: smaller, low latency, writable by the host CPU but not the compute devices;
  • local memory: shared by multiple processing elements within one device;
  • private memory: per-element memory (registers).

Not every device needs to implement each level of this hierarchy in hardware. Consistency between the various levels in the hierarchy is relaxed, and only enforced by explicit synchronization constructs, notably barriers.
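The effect of a barrier can be pictured in plain C: no work-item in a group may read local memory until every work-item has finished its writes. The sequential sketch below simulates one work-group reversing a block of data through local memory; the gap between the two loops plays the role of OpenCL C's barrier(CLK_LOCAL_MEM_FENCE). The names are illustrative, not OpenCL API:

```c
#include <assert.h>

#define SIM_GROUP_SIZE 8

/* Simulates one work-group using local memory to reverse a block of data.
   Running phase 1 to completion before phase 2 models the barrier: all
   writes to `local_mem` are visible before any work-item reads it. */
static void sim_workgroup_reverse(int *data)
{
    int local_mem[SIM_GROUP_SIZE];  /* stand-in for __local memory */

    for (int lid = 0; lid < SIM_GROUP_SIZE; ++lid)  /* phase 1: all write */
        local_mem[lid] = data[lid];

    /* --- barrier: no work-item proceeds until every write above is done --- */

    for (int lid = 0; lid < SIM_GROUP_SIZE; ++lid)  /* phase 2: all read */
        data[lid] = local_mem[SIM_GROUP_SIZE - 1 - lid];
}
```

Without the barrier between the phases, a real device could interleave the loops across processing elements and a work-item might read a slot its neighbor had not yet written.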

Devices may or may not share memory with the host CPU.[7] The host API provides handles on device memory buffers and functions to transfer data back and forth between host and devices.

History

OpenCL was initially developed by Apple Inc., which holds trademark rights, and refined into an initial proposal in collaboration with technical teams at AMD, IBM, Qualcomm, Intel, and Nvidia. Apple submitted this initial proposal to the Khronos Group. On June 16, 2008, the Khronos Compute Working Group was formed[10] with representatives from CPU, GPU, embedded-processor, and software companies. This group worked for five months to finish the technical details of the specification for OpenCL 1.0 by November 18, 2008.[11] This technical specification was reviewed by the Khronos members and approved for public release on December 8, 2008.[12]

OpenCL 1.0

OpenCL 1.0 was released with Mac OS X Snow Leopard on August 28, 2009. According to an Apple press release:[13]

Snow Leopard further extends support for modern hardware with Open Computing Language (OpenCL), which lets any application tap into the vast gigaflops of GPU computing power previously available only to graphics applications. OpenCL is based on the C programming language and has been proposed as an open standard.

AMD decided to support OpenCL instead of the now deprecated Close to Metal in its Stream framework.[14][15] RapidMind announced their adoption of OpenCL underneath their development platform to support GPUs from multiple vendors with one interface.[16] On December 9, 2008, Nvidia announced its intention to add full support for the OpenCL 1.0 specification to its GPU Computing Toolkit.[17] On October 30, 2009, IBM released its first OpenCL implementation as a part of the XL compilers.[18]

OpenCL 1.1

OpenCL 1.1 was ratified by the Khronos Group on June 14, 2010[19] and adds significant functionality for enhanced parallel programming flexibility and performance, including:

  • New data types including 3-component vectors and additional image formats;
  • Handling commands from multiple host threads and processing buffers across multiple devices;
  • Operations on regions of a buffer including read, write and copy of 1D, 2D, or 3D rectangular regions;
  • Enhanced use of events to drive and control command execution;
  • Additional OpenCL built-in C functions such as integer clamp, shuffle, and asynchronous strided copies;
  • Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.

OpenCL 1.2

On November 15, 2011, the Khronos Group announced the OpenCL 1.2 specification,[20] which added significant functionality over the previous versions in terms of performance and features for parallel programming. Most notable features include:

  • Device partitioning: the ability to partition a device into sub-devices so that work assignments can be allocated to individual compute units. This is useful for reserving areas of the device to reduce latency for time-critical tasks.
  • Separate compilation and linking of objects: the functionality to compile OpenCL into external libraries for inclusion into other programs.
  • Enhanced image support: 1.2 adds support for 1D images and 1D/2D image arrays. Furthermore, the OpenGL sharing extensions now allow for OpenGL 1D textures and 1D/2D texture arrays to be used to create OpenCL images.
  • Built-in kernels: custom devices that contain specific unique functionality are now integrated more closely into the OpenCL framework. Kernels can be called to use specialised or non-programmable aspects of underlying hardware. Examples include video encoding/decoding and digital signal processors.
  • DirectX functionality: DX9 media surface sharing allows for efficient sharing between OpenCL and DX9 or DXVA media surfaces. Equally, for DX11, seamless sharing between OpenCL and DX11 surfaces is enabled.

OpenCL 2.0

On November 18, 2013, the Khronos Group announced the ratification and public release of the finalized OpenCL 2.0 specification.[21] Updates and additions to OpenCL 2.0 include:

  • Shared virtual memory
  • Nested parallelism
  • Generic address space
  • Images
  • C11 atomics
  • Pipes
  • Android installable client driver extension

Implementation

OpenCL consists of a set of headers and a shared object that is loaded at runtime. An installable client driver (ICD) loader must be present on the platform for every vendor class that the runtime needs to support. For example, to support Nvidia devices on a Linux platform, the Nvidia ICD must be installed so that the OpenCL runtime can locate the vendor's ICD and redirect the calls appropriately. The consumer application uses the standard OpenCL header; calls to each function are then proxied by the OpenCL runtime to the appropriate driver through the ICD. Each vendor must implement every OpenCL call in its driver.[22]
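The dispatch mechanism can be pictured as a table of function pointers: the loader keeps one entry per vendor driver and forwards each API call through it. The struct and function names below are invented for illustration; the real interface is defined in the Khronos ICD specification:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a vendor driver's dispatch table;
   a real ICD exposes the full table of OpenCL entry points. */
typedef struct {
    const char *vendor_name;
    int (*get_device_count)(void);
} sim_icd_dispatch;

/* Two fake vendor drivers, as a loader might discover via .icd files. */
static int sim_vendor_a_devices(void) { return 2; }
static int sim_vendor_b_devices(void) { return 1; }

static const sim_icd_dispatch sim_drivers[] = {
    { "VendorA", sim_vendor_a_devices },
    { "VendorB", sim_vendor_b_devices },
};

/* The "loader": forwards the query to every registered driver, the way
   the ICD loader lets clGetPlatformIDs enumerate all vendors' platforms. */
static int sim_total_devices(void)
{
    int total = 0;
    for (size_t i = 0; i < sizeof sim_drivers / sizeof sim_drivers[0]; ++i)
        total += sim_drivers[i].get_device_count();
    return total;
}
```

Because the application only ever calls through the loader's table, vendor drivers can be installed or removed without relinking the application.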

A number of open source implementations of the OpenCL ICD exist, including freeocl[23] and ocl-icd.[24] An implementation of OpenCL for a number of platforms is maintained as part of the Gallium Compute Project,[25] which builds on the work of the Mesa project to support multiple platforms. An implementation by Intel for its Ivy Bridge hardware was released in 2013.[26] This software, called "Beignet", is not based on Mesa/Gallium, which has attracted criticism from developers at AMD and Red Hat,[27] as well as Michael Larabel of Phoronix.[28]

Timeline of vendor implementations

  • December 10, 2008: AMD and Nvidia held the first public OpenCL demonstration, a 75-minute presentation at Siggraph Asia 2008. AMD showed a CPU-accelerated OpenCL demo explaining the scalability of OpenCL on one or more cores while Nvidia showed a GPU-accelerated demo.[29][30]
  • March 16, 2009: at the 4th Multicore Expo, Imagination Technologies announced the PowerVR SGX543MP, the first GPU of this company to feature OpenCL support.[31]
  • March 26, 2009: at GDC 2009, AMD and Havok demonstrated the first working implementation for OpenCL accelerating Havok Cloth on AMD Radeon HD 4000 series GPU.[32]
  • April 20, 2009: Nvidia announced the release of its OpenCL driver and SDK to developers participating in its OpenCL Early Access Program.[33]
  • August 5, 2009: AMD unveiled the first development tools for its OpenCL platform as part of its ATI Stream SDK v2.0 Beta Program.[34]
  • August 28, 2009: Apple released Mac OS X Snow Leopard, which contains a full implementation of OpenCL.[35]
OpenCL in Snow Leopard is supported on the Nvidia GeForce 320M, GeForce GT 330M, GeForce 9400M, GeForce 9600M GT, GeForce 8600M GT, GeForce GT 120, GeForce GT 130, GeForce GTX 285, GeForce 8800 GT, GeForce 8800 GS, Quadro FX 4800, Quadro FX5600, ATI Radeon HD 4670, ATI Radeon HD 4850, Radeon HD 4870, ATI Radeon HD 5670, ATI Radeon HD 5750, ATI Radeon HD 5770 and ATI Radeon HD 5870.[36]
  • September 28, 2009: Nvidia released its own OpenCL drivers and SDK implementation.
  • October 13, 2009: AMD released the fourth beta of the ATI Stream SDK 2.0, which provides a complete OpenCL implementation on both R700/R800 GPUs and SSE3 capable CPUs. The SDK is available for both Linux and Windows.[37]
  • November 26, 2009: Nvidia released drivers for OpenCL 1.0 (rev 48).
The Apple,[38] Nvidia,[39] RapidMind[40] and Gallium3D[41] implementations of OpenCL are all based on LLVM compiler technology and use the Clang compiler as their frontend.
  • October 27, 2009: S3 released their first product supporting native OpenCL 1.0 - the Chrome 5400E embedded graphics processor.[42]
  • December 10, 2009: VIA released their first product supporting OpenCL 1.0 - ChromotionHD 2.0 video processor included in VN1000 chipset.[43]
  • December 21, 2009: AMD released the production version of the ATI Stream SDK 2.0,[44] which provides OpenCL 1.0 support for R800 GPUs and beta support for R700 GPUs.
  • June 1, 2010: ZiiLABS released details of their first OpenCL implementation for the ZMS processor for handheld, embedded and digital home products.[45]
  • June 30, 2010: IBM released a fully conformant version of OpenCL 1.0.[46]
  • September 13, 2010: Intel released details of their first OpenCL implementation for the Sandy Bridge chip architecture. Sandy Bridge will integrate Intel's newest graphics chip technology directly onto the central processing unit.[47]
  • November 15, 2010: Wolfram Research released Mathematica 8 with OpenCLLink package.
  • March 3, 2011: Khronos Group announces the formation of the WebCL working group to explore defining a JavaScript binding to OpenCL. This creates the potential to harness GPU and multi-core CPU parallel processing from a Web browser.[48][49]
  • March 31, 2011: IBM released a fully conformant version of OpenCL 1.1.[46][50]
  • April 25, 2011: IBM released OpenCL Common Runtime v0.1 for Linux on x86 Architecture.[51]
  • May 4, 2011: Nokia Research releases an open source WebCL extension for the Firefox web browser, providing a JavaScript binding to OpenCL.[52]
  • July 1, 2011: Samsung Electronics releases an open source prototype implementation of WebCL for WebKit, providing a JavaScript binding to OpenCL.[53]
  • August 8, 2011: AMD released the OpenCL-driven AMD Accelerated Parallel Processing (APP) Software Development Kit (SDK) v2.5, replacing the ATI Stream SDK as technology and concept.[54]
  • December 12, 2011, AMD released AMD APP SDK v2.6[55] which contains a preview of OpenCL 1.2.
  • February 27, 2012: The Portland Group released the PGI OpenCL compiler for multi-core ARM CPUs.[56]
  • April 17, 2012: Khronos released a WebCL working draft.[57]
  • May 6, 2013: Altera released the Altera SDK for OpenCL, version 13.0.[58] It is conformant to OpenCL 1.0.[59]
  • November 18, 2013: Khronos announced that the specification for OpenCL 2.0 had been finalised.[60]
  • March 19, 2014: Khronos releases the WebCL 1.0 specification[61][62]
  • August 29, 2014: Intel releases HD Graphics 5300 driver that supports OpenCL 2.0.[63]
  • September 25, 2014: AMD releases Catalyst 14.41 RC1, which includes an OpenCL 2.0 driver.[64]

OpenCL C language

The programming language used to write computation kernels is called OpenCL C. It is based on C99,[65] but adapted to fit the device model in OpenCL: the memory region qualifiers __global, __local, __constant, and __private on pointers determine where in the memory hierarchy the buffers they point to reside, and functions can be marked __kernel to signal that they are entry points into the program to be called from the host program. OpenCL C omits function pointers, bit fields and variable-length arrays, and forbids recursion.[66] The C standard library is replaced by a custom set of standard functions geared toward math programming.
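A classic OpenCL C kernel illustrating these qualifiers is elementwise vector addition: each work-item adds one pair of elements. So that the snippet below also compiles as plain C for illustration, the OpenCL C qualifiers are defined away and get_global_id is stubbed with a sequential driver loop; in real OpenCL C these are provided by the device compiler, and the host launches the kernel over an NDRange instead:

```c
#include <assert.h>
#include <stddef.h>

/* Stubs so the kernel below compiles as plain C; in OpenCL C the
   qualifiers and get_global_id() come from the device compiler. */
#define __kernel
#define __global
static size_t sim_gid;
static size_t get_global_id(int dim) { (void)dim; return sim_gid; }

/* One work-item adds one element pair, selected by its global ID. */
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *result)
{
    size_t i = get_global_id(0);
    result[i] = a[i] + b[i];
}

/* Host-less driver: emulate an NDRange of n work-items sequentially. */
static void sim_run_vector_add(const float *a, const float *b,
                               float *r, size_t n)
{
    for (sim_gid = 0; sim_gid < n; ++sim_gid)
        vector_add(a, b, r);
}
```

On a device, the iterations of this emulation loop correspond to work-items that may all run in parallel, which is why the kernel body contains no loop over elements.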

OpenCL C is extended to facilitate use of parallelism with vector types and operations, synchronization, and functions to work with work-items and work-groups.[67] In particular, besides scalar types such as float and double, which behave similarly to the corresponding types in C, OpenCL provides fixed-length vector types such as float4 (4-vector of single-precision floats); such vector types are available in lengths two, three, four, eight and sixteen for various base types.[65]: §6.1.2  Vectorized operations on these types are intended to map onto SIMD instruction sets, e.g. SSE or VMX, when running OpenCL programs on CPUs.[7]
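A plain-C approximation of how a float4 behaves componentwise (OpenCL C provides float4 and its operators natively; the struct and helper below are only a model):

```c
#include <assert.h>

/* Model of OpenCL C's float4: four lanes operated on componentwise.
   On a CPU a conformant implementation may map this onto SSE/VMX. */
typedef struct { float x, y, z, w; } sim_float4;

static sim_float4 sim_float4_add(sim_float4 a, sim_float4 b)
{
    /* In OpenCL C this whole function is simply the expression a + b. */
    sim_float4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}
```

The value of the built-in vector types is that the compiler sees the four lanes as one unit, so a single vector add can become a single SIMD instruction rather than four scalar ones.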

Example: computing the FFT

This example will load a fast Fourier transform (FFT) implementation and execute it. The implementation is shown below.[68]

  // declarations; real code would also declare and fill srcA (the input
  // data), num_entries, and fft1D_1024_kernel_src (the kernel source
  // string), and check every returned error code
  cl_context context;
  cl_command_queue queue;
  cl_program program;
  cl_kernel kernel;
  cl_mem memobjs[2];
  cl_device_id device_id;
  size_t global_work_size[1];
  size_t local_work_size[1];

  // create a compute context with GPU device
  context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

  // create a command queue on the first GPU device
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
  queue = clCreateCommandQueue(context, device_id, 0, NULL);

  // allocate the buffer memory objects
  memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(float)*2*num_entries, srcA, NULL);
  memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float)*2*num_entries, NULL, NULL);

  // create the compute program from the kernel source
  program = clCreateProgramWithSource(context, 1, &fft1D_1024_kernel_src, NULL, NULL);

  // build the compute program executable
  clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

  // create the compute kernel
  kernel = clCreateKernel(program, "fft1D_1024", NULL);

  // choose the work sizes before they are used to size the local scratch args
  global_work_size[0] = num_entries;
  local_work_size[0] = 64; // Nvidia: 192 or 256

  // set the kernel arguments: the two buffers and two local-memory scratch areas
  clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
  clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);
  clSetKernelArg(kernel, 3, sizeof(float)*(local_work_size[0]+1)*16, NULL);

  // enqueue the kernel over a 1-D N-D range
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size, local_work_size, 0, NULL, NULL);

The actual calculation (based on Fitting FFT onto the G80 Architecture):[69]

  // This kernel computes FFT of length 1024. The 1024 length FFT is decomposed into
  // calls to a radix 16 function, another radix 16 function and then a radix 4 function

  __kernel void fft1D_1024 (__global float2 *in, __global float2 *out,
                          __local float *sMemx, __local float *sMemy) {
    int tid = get_local_id(0);
    int blockIdx = get_group_id(0) * 1024 + tid;
    float2 data[16];

    // starting index of data to/from global memory
    in = in + blockIdx;  out = out + blockIdx;

    globalLoads(data, in, 64); // coalesced global reads
    fftRadix16Pass(data);      // in-place radix-16 pass
    twiddleFactorMul(data, tid, 1024, 0);

    // local shuffle using local memory
    localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
    fftRadix16Pass(data);               // in-place radix-16 pass
    twiddleFactorMul(data, tid, 64, 4); // twiddle factor multiplication

    localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

    // four radix-4 function calls
    fftRadix4Pass(data);      // radix-4 function number 1
    fftRadix4Pass(data + 4);  // radix-4 function number 2
    fftRadix4Pass(data + 8);  // radix-4 function number 3
    fftRadix4Pass(data + 12); // radix-4 function number 4

    // coalesced global writes
    globalStores(data, out, 64);
  }

A full, open source implementation of an OpenCL FFT can be found on Apple's website.[70]

OpenCL-conformant products

The Khronos Group maintains an extended list of OpenCL-conformant products.[71]

Synopsis of OpenCL conformant products[72]

AMD APP SDK (supports OpenCL CPU and accelerated processing unit devices)
  • CPUs: X86 + SSE2 (or higher) compatible CPUs, 64-bit & 32-bit[73]
  • OS: Linux 2.6 PC, Windows Vista/7 PC
  • Devices: AMD Fusion E-350, E-240, C-50, C-30 with HD 6310/HD 6250; AMD Radeon/Mobility HD 6800, HD 5x00 series GPU, iGPU HD 6310/HD 6250; ATI FirePro Vx800 series GPU

Intel SDK for OpenCL Applications 2013[74] (supports Intel Core processors and Intel HD Graphics 4000/2500)
  • CPUs: Intel CPUs with SSE 4.1, SSE 4.2 or AVX support[75][76]
  • OS: Microsoft Windows, Linux
  • Devices: Intel Core i7, i5, i3; 2nd Generation Intel Core i7/5/3; 3rd Generation Intel Core processors with Intel HD Graphics 4000/2500; Intel Core 2 Solo, Duo, Quad, Extreme; Intel Xeon 7x00, 5x00, 3x00 (Core based)

IBM Servers with OpenCL Development Kit for Linux on Power, running on Power VSX[77][78]
  • Devices: IBM Power 755 (PERCS), 750; IBM BladeCenter PS70x Express; IBM BladeCenter JS2x, JS43; IBM BladeCenter QS22

IBM OpenCL Common Runtime (OCR)[79]
  • CPUs: X86 + SSE2 (or higher) compatible CPUs, 64-bit & 32-bit[80]
  • OS: Linux 2.6 PC
  • Devices: AMD Fusion, Nvidia Ion and Intel Core i7, i5, i3; 2nd Generation Intel Core i7/5/3; AMD Radeon, Nvidia GeForce and Intel Core 2 Solo, Duo, Quad, Extreme; ATI FirePro, Nvidia Quadro and Intel Xeon 7x00, 5x00, 3x00 (Core based)

Nvidia OpenCL Driver and Tools[81]
  • Devices: Nvidia Tesla C/D/S; Nvidia GeForce GTS/GT/GTX; Nvidia Ion; Nvidia Quadro FX/NVX/Plex

Extensions

Some vendors provide functionality beyond the standard OpenCL specification by means of extensions. These are still specified by Khronos but shipped by vendors within their SDKs. They often contain features that are later folded into the standard; for example, device fission was originally an extension but is now part of the 1.2 specification.

Extensions provided in the 1.2 specification include:

  • Writing to 3D image memory objects
  • Half-precision floating-point format
  • Sharing memory objects with OpenGL
  • Creating event objects from GL sync objects
  • Sharing memory objects with Direct3D 10
  • DX9 media Surface Sharing
  • Sharing Memory Objects with Direct3D 11

Device fission

Device fission, introduced fully into the OpenCL standard with version 1.2, allows individual command queues to be used for specific areas of a device. For example, within the Intel SDK, a command queue can be created that maps directly to an individual core. AMD also provides device fission functionality, likewise originally as an extension. Device fission can be used where compute capacity must be reliably available, such as in a latency-sensitive environment, since it effectively reserves areas of the device for computation.

Portability, performance and alternatives

A key feature of OpenCL is portability, via its abstracted memory and execution model: the programmer cannot directly use hardware-specific technologies such as inline Parallel Thread Execution (PTX) for Nvidia GPUs without giving up portability to other platforms. Any OpenCL kernel can be run on any conformant implementation.

However, performance of the kernel is not necessarily portable across platforms. Existing implementations have been shown to be competitive when kernel code is properly tuned, though, and auto-tuning has been suggested as a solution to the performance portability problem,[82] yielding "acceptable levels of performance" in experimental linear algebra kernels.[83] Portability of an entire application containing multiple kernels with differing behaviors has also been studied, showing that portability required only limited tradeoffs.[84]
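The auto-tuning idea is simple to sketch: run the same kernel under several candidate work-group sizes and keep the fastest. A toy plain-C version, where a deterministic cost function stands in for actually timing a kernel launch (a real tuner would measure clEnqueueNDRangeKernel; all names here are illustrative):

```c
#include <assert.h>
#include <stddef.h>

/* Toy cost model standing in for "run the kernel at this work-group size
   and measure it"; this fake device is fastest at 64 work-items per group. */
static double sim_run_cost(size_t local_size)
{
    double diff = (double)local_size - 64.0;
    return diff * diff + 1.0;
}

/* Auto-tuner skeleton: evaluate every candidate, keep the cheapest. */
static size_t sim_autotune(const size_t *candidates, size_t n)
{
    size_t best = candidates[0];
    double best_cost = sim_run_cost(best);
    for (size_t i = 1; i < n; ++i) {
        double c = sim_run_cost(candidates[i]);
        if (c < best_cost) {
            best_cost = c;
            best = candidates[i];
        }
    }
    return best;
}
```

The same search loop transfers unchanged between devices; only the measured costs differ, which is what makes the approach attractive for performance portability.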

Furthermore, in studies of straightforward translation of CUDA programs to OpenCL C programs, CUDA has been found to outperform OpenCL;[82][85] but the performance differences can mostly be attributed to differences in the programming model (esp. the memory model) and in the optimizations that OpenCL C compilers performed as compared to those in the CUDA compiler.[82]

References

  1. ^ Jääskeläinen, Pekka O.; de La Lama, Carlos S.; Huerta, Pablo; Takala, Jarmo H. (July 2010). "OpenCL-based design methodology for application-specific processors". 2010 International Conference on Embedded Computer Systems (SAMOS). IEEE: 223–230. doi:10.1109/ICSAMOS.2010.5642061. ISBN 978-1-4244-7936-8. Retrieved February 17, 2011.
  2. ^ Altera OpenCL
  3. ^ "Jobs at Altera". Archived from the original on July 21, 2011.
  4. ^ "Caps Raises The Case For Hybrid Multicore Parallel Programming". Dr. Dobb's. June 17, 2012. Retrieved January 17, 2014.
  5. ^ "Does the OpenACC API run on top of OpenCL?". OpenACC.org. Retrieved January 17, 2014.
  6. ^ Reyes, Ruymán; López-Rodríguez, Iván; Fumero, Juan J.; de Sande, Francisco (August 27–31, 2012). accULL: An OpenACC Implementation with CUDA and OpenCL Support. EURO-PAR 2012 International European Conference on Parallel and Distributed Computing. doi:10.1007/978-3-642-32820-6_86. Retrieved January 17, 2014.
  7. ^ a b c d Stone, John E.; Gohara, David; Shi, Guochin (2010). "OpenCL: a parallel programming standard for heterogeneous computing systems". Computing in Science & Engineering. doi:10.1109/MCSE.2010.69.
  8. ^ doi:10.1016/j.parco.2011.09.001
  9. ^ https://github.com/JuliaGPU/OpenCL.jl
  10. ^ "Khronos Launches Heterogeneous Computing Initiative" (Press release). Khronos Group. June 16, 2008. Retrieved June 18, 2008.
  11. ^ "OpenCL gets touted in Texas". MacWorld. November 20, 2008. Retrieved June 12, 2009.
  12. ^ "The Khronos Group Releases OpenCL 1.0 Specification" (Press release). Khronos Group. December 8, 2008. Retrieved June 12, 2009.
  13. ^ "Apple Previews Mac OS X Snow Leopard to Developers" (Press release). Apple Inc. June 9, 2008. Retrieved June 9, 2008.
  14. ^ "AMD Drives Adoption of Industry Standards in GPGPU Software Development" (Press release). AMD. August 6, 2008. Retrieved August 14, 2008.
  15. ^ "AMD Backs OpenCL, Microsoft DirectX 11". eWeek. August 6, 2008. Retrieved August 14, 2008.
  16. ^ "HPCWire: RapidMind Embraces Open Source and Standards Projects". HPCWire. November 10, 2008. Retrieved November 11, 2008.
  17. ^ "Nvidia Adds OpenCL To Its Industry Leading GPU Computing Toolkit" (Press release). Nvidia. December 9, 2008. Retrieved December 10, 2008.
  18. ^ "OpenCL Development Kit for Linux on Power". alphaWorks. October 30, 2009. Retrieved October 30, 2009.
  19. ^ Khronos Drives Momentum of Parallel Computing Standard with Release of OpenCL 1.1 Specification
  20. ^ Khronos Releases OpenCL 1.2 Specification
  21. ^ "Khronos Finalizes OpenCL 2.0 Specification for Heterogeneous Computing". Khronos Group. November 18, 2013. Retrieved February 10, 2014.
  22. ^ OpenCL ICD Specification
  23. ^ freeocl
  24. ^ ocl-icd
  25. ^ GalliumCompute
  26. ^ Michael Larabel (January 10, 2013). "Beignet: OpenCL/GPGPU Comes For Ivy Bridge On Linux". Phoronix.
  27. ^ Michael Larabel (April 16, 2013). "More Criticism Comes Towards Intel's Beignet OpenCL". Phoronix.
  28. ^ Michael Larabel (December 24, 2013). "Intel's Beignet OpenCL Is Still Slowly Baking". Phoronix.
  29. ^ "OpenCL Demo, AMD CPU". December 10, 2008. Retrieved March 28, 2009.
  30. ^ "OpenCL Demo, Nvidia GPU". December 10, 2008. Retrieved March 28, 2009.
  31. ^ "Imagination Technologies launches advanced, highly-efficient POWERVR SGX543MP multi-processor graphics IP family". Imagination Technologies. March 19, 2009. Retrieved January 30, 2011.
  32. ^ "AMD and Havok demo OpenCL accelerated physics". PC Perspective. March 26, 2009. Retrieved March 28, 2009.
  33. ^ "Nvidia Releases OpenCL Driver To Developers". Nvidia. April 20, 2009. Retrieved April 27, 2009.
  34. ^ "AMD does reverse GPGPU, announces OpenCL SDK for x86". Ars Technica. August 5, 2009. Retrieved August 6, 2009.
  35. ^ Dan Moren; Jason Snell (June 8, 2009). "Live Update: WWDC 2009 Keynote". macworld.com. MacWorld. Retrieved June 12, 2009.
  36. ^ "Mac OS X Snow Leopard – Technical specifications and system requirements". Apple Inc. March 23, 2011. Retrieved March 23, 2011.
  37. ^ "ATI Stream Software Development Kit (SDK) v2.0 Beta Program". Retrieved October 14, 2009.[dead link]
  38. ^ "Apple entry on LLVM Users page". Retrieved August 29, 2009.
  39. ^ "Nvidia entry on LLVM Users page". Retrieved August 6, 2009.
  40. ^ "Rapidmind entry on LLVM Users page". Retrieved October 1, 2009.
  41. ^ "Zack Rusin's blog post about the Gallium3D OpenCL implementation". Retrieved October 1, 2009.
  42. ^ "S3 Graphics launched the Chrome 5400E embedded graphics processor". Retrieved October 27, 2009.
  43. ^ "VIA Brings Enhanced VN1000 Graphics Processor". Retrieved December 10, 2009.
  44. ^ "ATI Stream SDK v2.0 with OpenCL 1.0 Support". Retrieved October 23, 2009.
  45. ^ http://www.ziilabs.com/opencl
  46. ^ a b "Khronos Group Conformant Products".
  47. ^ "Intel discloses new Sandy Bridge technical details". Retrieved September 13, 2010.
  48. ^ WebCL related stories
  49. ^ Khronos Releases Final WebGL 1.0 Specification
  50. ^ "OpenCL Development Kit for Linux on Power".
  51. ^ "About the OpenCL Common Runtime for Linux on x86 Architecture".
  52. ^ Nokia Research releases WebCL prototype
  53. ^ Samsung's WebCL Prototype for WebKit
  54. ^ "AMD Opens the Throttle on APU Performance with Updated OpenCL Software Development ". Amd.com. August 8, 2011. Retrieved June 16, 2013.
  55. ^ AMD APP SDK v2.6
  56. ^ "The Portland Group Announces OpenCL Compiler for ST-Ericsson ARM-Based NovaThor SoCs". Retrieved May 4, 2012.
  57. ^ WebCL Latest Spec
  58. ^ "Altera Opens the World of FPGAs to Software Programmers with Broad Availability of SDK and Off-the-Shelf Boards for OpenCL". Altera.com. Retrieved January 9, 2014.
  59. ^ "Altera SDK for OpenCL is First in Industry to Achieve Khronos Conformance for FPGAs". Altera.com. Retrieved January 9, 2014.
  60. ^ Khronos Finalizes OpenCL 2.0 Specification for Heterogeneous Computing
  61. ^ WebCL 1.0 Press Release
  62. ^ WebCL 1.0 Specification
  63. ^ Intel OpenCL 2.0 Driver
  64. ^ AMD OpenCL 2.0 Driver
  65. ^ a b Aaftab Munshi, ed. (2014). "The OpenCL C Specification, Version 2.0" (PDF). Retrieved June 24, 2014.
  66. ^ AMD. Introduction to OpenCL Programming 201005, page 89-90
  67. ^ AMD. Introduction to OpenCL Programming 201005, page 89-90
  68. ^ "OpenCL" (PDF). SIGGRAPH2008. August 14, 2008. Retrieved August 14, 2008.
  69. ^ "Fitting FFT onto G80 Architecture" (PDF). Vasily Volkov and Brian Kazian, UC Berkeley CS258 project report. May 2008. Retrieved November 14, 2008.
  70. ^ "OpenCL on FFT". Apple. November 16, 2009. Retrieved December 7, 2009.
  71. ^ OpenCL Conformant Products
  72. ^ "Conformant Products". Retrieved August 11, 2011.
  73. ^ "OpenCL and the AMD APP SDK". AMD Developer Central. developer.amd.com. Retrieved August 11, 2011.
  74. ^ "About Intel OpenCL SDK 1.1". software.intel.com. intel.com. Retrieved August 11, 2011.
  75. ^ "Product Support". Retrieved August 11, 2011.
  76. ^ "Intel OpenCL SDK - Release Notes". Retrieved August 11, 2011.
  77. ^ "Announcing OpenCL Development Kit for Linux on Power v0.3". Retrieved August 11, 2011.
  78. ^ "IBM releases OpenCL Development Kit for Linux on Power v0.3 - OpenCL 1.1 conformant release available". OpenCL Lounge. ibm.com. Retrieved August 11, 2011.
  79. ^ "IBM releases OpenCL Common Runtime for Linux on x86 Architecture". Retrieved September 10, 2011.
  80. ^ "OpenCL and the AMD APP SDK". AMD Developer Central. developer.amd.com. Retrieved September 10, 2011.
  81. ^ "Nvidia Releases OpenCL Driver". Retrieved August 11, 2011.
  82. ^ a b c Fang, Jianbin; Varbanescu, Ana Lucia; Sips, Henk (2011). 2011 International Conference on Parallel Processing: 216. doi:10.1109/ICPP.2011.45. ISBN 978-1-4577-1336-1. Retrieved January 12, 2012.
  83. ^ doi:10.1016/j.parco.2011.10.002
  84. ^ Dolbeau, Romain; Bodin, François; Colin de Verdière, Guillaume (September 7, 2013). "One OpenCL to rule them all?". Retrieved January 14, 2014.
  85. ^ Karimi, Kamran; Dickson, Neil G.; Hamze, Firas (May 16, 2011). "A Performance Comparison of CUDA and OpenCL" (PDF). arXiv:1005.2581v3. Retrieved January 12, 2012.