Multidimensional DSP with GPU acceleration

Digital signal processing (DSP) is a ubiquitous methodology in scientific and engineering computations. However, practically, DSP problems are often not only 1-D. For instance, image data are 2-D signals and radar signals are 3-D signals. While the number of dimension increases, the time and/or storage complexity of processing digital signals grow dramatically. Therefore, solving DSP problems in real-time is extremely difficult in reality.

Modern general purpose graphics processing units (GPGPUs) are considered as having an excellent throughput on vector operations and numeric manipulations through a high degree of parallel computations. While processing digital signals, particularly multidimensional signals, often involves a series of vector operations on massive amount of independent data samples, GPGPUs are now widely employed to accelerate multidimensional DSP, such as image processing, video codecs, radar signal analysis, sonar signal processing, and ultrasound scanning. Conceptually, using GPGPU devices to perform multidimensional DSP is able to dramatically reduce the computation complexity compared with central processing units (CPUs), digital signal processors (DSPs), or other FPGA accelerators.

Motivation

Processing multidimensional signals is a common problem in scientific researches and/or engineering computations. Typically, a DSP problem's computation complexity grows exponentially while the number of dimension increases. Notwithstanding, with a high degree of time and storage complexity, it is extremely difficult to process multidimensional signals in real-time. Although there are many fast algorithms (e.g. FFT) have been proposed in 1-D DSP problems, they are still not efficient enough to be adapted in high dimensional DSP problems. Therefore, it is still hard to obtain the computation results with digital signal processors (DSPs). Hence, a better solution of software algorithm or hardware architecture to accelerate multidimensional DSP computations is strongly required.

Existing Approaches

Practically, to accelerate multidimensional DSP, some common approaches have been proposed and developed in the past decades.

Lower Sampling Rate

A makeshift to achieve real-time requirement in multidimensional DSP applications is using a lower sampling rate can efficiently reduce the number of samples to be processed at one time and thereby decreasing the total processing time. However, this can lead to the aliasing problem in the sampling theorem and make a poor quality of outputs. In some applications, such as military radars and medical images, we are eager to have highly precise and accurate results. In such cases, using a lower sampling rate to reduce amount of computation in multidimensional DSP domain is not always allowable.

Digital Signal Processors (DSPs)

Digital signal processors are designed specifically to process vector operations. They have been widely used in DSP computations for decades. However, most digital signal processors are only capable of manipulating couple operations in parallel. This kind of designs is sufficient to accelerate audio processing (1-D signals) and image processing (2-D signals). However, with a large amount of data samples in multidimensional signals, this is still not powerful enough to retrieve computation results in real-time.

Supercomputer Assistance

In order to accelerate multidimensional DSP computations, using dedicated supercomputers or cluster computers is required in some circumstances, e.g., weather forecasting and military radars. Nevertheless, using supercomputers designated to simply perform DSP operations takes considerable money cost and energy consumption. It is not practical and suitable for all multidimensional DSP applications.

GPU Acceleration

GPUs are originally devised to accelerate image processing and video stream rendering. Moreover, since modern GPUs' have good ability to perform numeric computations in parallel with a relatively low cost and better energy efficiency, GPUs are becoming a popular alternative to replace supercomputers performing multidimensional DSP.^[1]

GPGPU Computations

Modern GPU designs are mainly based on SIMD computation paradigm.^[2]^[3] This type of GPU devices is so-called general-purpose GPUs (GPGPUs).

GPGPUs are able to perform an operation on multiple independent data concurrently with their vector or SIMD functional units. A modern GPGPU can spawn thousands of concurrent threads and process all threads in a batch manner. With this nature, GPGPUs can be employed as DSP accelerators easily while many DSP problems can be solved by divide-and-conquer algorithms. A large scale and complex DSP problem can be divided into bunch of small numeric problems and be processed altogether at one time so that the overall time complexity can be reduced significantly. For example, multiplying two $M \times M$ matrices can be processed by $M \times M$ concurrent threads on a GPGPU device without any output data dependency. Therefore, theoretically, by means of GPGPU acceleration, we can gain up to $M \times M$ speedup compared with a traditional CPU or digital signal processor.

GPU Programming Languages

Currently, there are several existing programming languages or interfaces which support GPGPU programming.

CUDA

CUDA is the standard interface to program NVIDIA GPUs. NVIDIA also provides many CUDA libraries to support DSP acceleration on NVIDIA GPU devices.^[4]

OpenCL

OpenCL is an industrial standard which was originally proposed by Apple Inc. and is maintained and developed by Khronos Group now.^[5] OpenCL provides C++ like APIs for programming different devices universally, including GPGPUs.

C++ Amp

C++ Amp is a programming model proposed by Microsoft. C++ Amp is a C++ based library designed for programming SIMD processors^[6]

OpenAcc

OpenAcc is a programming standard for parallel computing developed by Cray, CAPS, NVIDIA and PGI.^[7] OpenAcc targets to program for CPU and GPU heterogeneous systems with C, C++, and Fortran extensions.

Examples of GPU Programming for Multidimensional DSP

$m \times m$ Matrix Multiplication

Suppose $A$ and $B$ are two $m \times m$ matrices and we would like to compute $C = A \times B$ .

$\mathbf {A} ={\begin{pmatrix}A_{11}&A_{12}&\cdots &A_{1m}\\A_{21}&A_{22}&\cdots &A_{2m}\\\vdots &\vdots &\ddots &\vdots \\A_{m1}&A_{m2}&\cdots &A_{mm}\\\end{pmatrix}},\quad \mathbf {B} ={\begin{pmatrix}B_{11}&B_{12}&\cdots &B_{1m}\\B_{21}&B_{22}&\cdots &B_{2m}\\\vdots &\vdots &\ddots &\vdots \\B_{m1}&B_{m2}&\cdots &B_{mm}\\\end{pmatrix}}$

$\mathbf {C} =\mathbf {A} \times \mathbf {B} ={\begin{pmatrix}C_{11}&C_{12}&\cdots &C_{1m}\\C_{21}&C_{22}&\cdots &C_{2m}\\\vdots &\vdots &\ddots &\vdots \\C_{m1}&C_{m2}&\cdots &C_{mm}\\\end{pmatrix}},\quad C_{ij}=\sum _{k=1}^{m}A_{ik}B_{kj}$

To compute each element in $C$ takes $m$ multiplications and $(m - 1)$ additions. Therefore, with a CPU implementation, the time complexity to achieve this computation is Θ(n³) in the following C example. However, we have known that elements in $C$ are independent to each other. Hence, the computation can be fully parallelized by SIMD processors, such as GPGPU devices. With a GPGPU implementation, the time complexity significantly reduces to Θ(n) by unrolling the for-loop showing in the following OpenCL example.

// MxM matrix multiplication in C
void matrixMul(
    float *A,  // input matrix A
    float *B,  // input matrix B
    float *C,  // output matrix C
    int size)  // size of the matrices
{
    // N x N x N iterations
    for (int row = 0; row < size; row++) {
        for (int col = 0; col < size; col++) {
            int id = row * size + col;
            flot sum = 0.0;

            for (int m = 0; m < size; m++) {
                sum += (A[row * size + m] * B[m * size + col]);
            }

            C[id] = sum;
        }
    }
}

// MxM matrix multiplication in OpenCL
__kernel void matrixMul(
    __global float *A,  // input matrix A
    __global float *B,  // input matrix B
    __global float *C,  // output matrix C
    __global int size)  // size of the matrices
{
    size_t id = get_global_id(0);   // each thread works on an element
    size_t row = id / size;
    size_t col = id % size;
    float sum = 0.0;

    // N iterations
    for (int m = 0; m < size; m++) {
        sum += (A[row * size + m] * B[m * size + col]);
    }

    C[id] = sum;
}

Multidimensional Convolution (M-D Convolution)

Convolution is a frequently used operation in DSP. To compute 2-D convolution of two m × m signals, it requires $m 2$ multiplications and $m \times (m - 1)$ additions for an output element. That is, the overall time complexity is Θ(n⁴) for the entire output signal. As the following OpenCL example shows, with GPGPU acceleration, the total computation time effectively decreases to Θ(n²) since all output elements are data independent.

$y(n_{1},n_{2})=x(n_{1},n_{2})**h(n_{1},n_{2})=\sum _{k_{1}=0}^{m-1}\sum _{k_{2}=0}^{m-1}x(k_{1},k_{2})h(n_{1}-k_{1},n_{2}-k_{2})$

// 2-D convolution implementation in OpenCL
__kernel void convolution(
    __global float *x,  // input signal x
    __global float *h,  // filter h
    __global float *y,  // output signal y
    __global int size)  // size of ROS of the input signal and filter
{
    size_t id = get_global_id(0);   // each thread works on an element
    size_t row = size + size - 1;   // number of rows of the output signal
    size_t col = size + size - 1;   // number of columns of the output signal
    size_t n1 = id / row;
    size_t n2 = id % col;
    float sum = 0.0;

    // N x N iterations
    for (int k1 = 0; k1 < size; k1++) {
        for (int k2 = 0; k2 < size; k2++) {
            sum += (x[k1 * row + k2] * h[(n1 * row - k1) + (n2 - k2)]);
        }
    }

    C[id] = sum;
}

Note that, although the example demonstrated above is a 2-D convolution, a similar approach can be adopted for a higher dimension system. Overall, for a s-D convolution, a GPGPU implementation has time complexity Θ(n^s), whereas a CPU implementation has time complexity Θ(n^2s).

$y(n_{1},n_{2},...,n_{s})=x(n_{1},n_{2},...,n_{s})**h(n_{1},n_{2},...,n_{s})=\sum _{k_{1}=0}^{m_{1}-1}\sum _{k_{2}=0}^{m_{2}-1}...\sum _{k_{s}=0}^{m_{s}-1}x(k_{1},k_{2},...,k_{s})h(n_{1}-k_{1},n_{2}-k_{2},...,n_{s}-k_{s})$

Multidimensional Discrete Time Fourier Transform (M-D DTFT)

In addition to convolution, discrete time Fourier transform (DTFT) is another technique which is often used in system analysis.

$X(\Omega _{1},\Omega _{2},...,\Omega _{s})=\sum _{n_{1}=0}^{m_{1}-1}\sum _{n_{2}=0}^{m_{2}-1}...\sum _{n_{s}=0}^{m_{s}-1}x(n_{1},n_{2},...,n_{s})e^{-j(\Omega _{1}n_{1}+\Omega _{1}n_{1}+...+\Omega _{s}n_{s})}$

Practically, to implement M-D DTFT, if assuming the system is separable, we can perform M times 1-D DFTF and matrix transpose with respect to each dimension. With a 1-D FFT operation, GPGPU can conceptually reduce the complexity from Θ(n²) to Θ(n) by the following example of OpenCL implementation. That is, to perform M-D DTFT, the complexity of GPGPU can be achieved by Θ(n²). While some GPGPUs' are also equipped hardware FFT accelerator internally, this implementation might be also optimized by invoking the FFT APIs or libraries provided by GPU manufactures.^[8]

// DTFT in OpenCL
__kernel void convolution(
    __global float *x_re,
    __global float *x_im,
    __global float *X_re,
    __global float *X_im,
    __global int size)
{
    size_t id = get_global_id(0);   // each thread works on an element
    X_re[id] = 0.0;
    X_im[id] = 0.0;

    for (int i = 0; i < size; i++) {
        X_re += (x_re[id] * cos(2 * 3.1415 * id / size) - x_im[id] + sin(2 * 3.1415 * id / size));
        X_im += (x_re[id] * sin(2 * 3.1415 * id / size) + x_im[id] + cos(2 * 3.1415 * id / size));
    }
}

Real Applications

Digital Filter Design

Designing a digital filter in multidimensional area is a big challenge, especially IIR filters. Typically it relies on computers to solve difference equations and obtain a set of approximated solutions. While GPGPU computation is becoming popular, several adaptive algorithms have been proposed to design multidimensional FIR and/or IIR filters by means of GPGPUs.^[9]^[10]^[11]

Radar Signal Reconstruction and Analysis

Radar systems usually require to reconstruct a numerous amount of 3-D or 4-D data samples in real-time. Traditionally, particularly in military, this needs supercomputers' support. Nowadays, GPGPUs are also employed to replace supercomputers to process radar signals. For example, to process synthetic aperture radar (SAR) signals, it usually involves multidimensional FFT computations.^[12]^[13]^[14] GPGPUs can be used to rapidly perform FFT and/or iFFT in this kind of applications.

Self-Driving Car

Many self-driving cars applies 3-D image recognition techniques to auto control the vehicles. Clearly, to accommodate the fast changing exterior environment, the recognition and decision processes must be done in real-time. GPGPUs are excellent devices to achieve the goal.^[15]

Medical Image Processing

In order to have accurate diagnosis, 2-D or 3-D medical signals, such as ultrasound, X-ray, MRI, and CT, often requires very high sampling rate and image resolutions to reconstruct images. By applying GPGPUs' superior computation power, it has been proved that we can acquire better quality on medical images^[16]^[17]

References

^ Chu, Slo-Li; Hsiao, Chih-Chieh (2010-09-01). "OpenCL: Make Ubiquitous Supercomputing Possible". 2010 12th IEEE International Conference on High Performance Computing and Communications (HPCC): 556–561. doi:10.1109/HPCC.2010.56.
^ Lindholm, E.; Nickolls, J.; Oberman, S.; Montrym, J. (2008-03-01). "NVIDIA Tesla: A Unified Graphics and Computing Architecture". IEEE Micro. 28 (2): 39–55. doi:10.1109/MM.2008.31. ISSN 0272-1732.
^ Kim, Hyesoon; Vuduc, Richard; Baghsorkhi, Sara; Choi, Jee; Hwu, Wen-Mei W. (2012). Hill, Mark D. (ed.). Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU). Morgan & Claypool Publishers. doi:10.2200/S00451ED1V01Y201209CAC020. ISBN 978-1-60845-954-4.
^ "Parallel Programming and Computing Platform | CUDA | NVIDIA | NVIDIA". www.nvidia.com. https://plus.google.com/104889184472622775891. Retrieved 2015-11-05. {{cite web}}: External link in |publisher= (help)
^ "OpenCL – The open standard for parallel programming of heterogeneous systems". www.khronos.org. Retrieved 2015-11-05.
^ "C++ AMP (C++ Accelerated Massive Parallelism)". msdn.microsoft.com. Retrieved 2015-11-05.
^ "OpenACC Home | www.openacc.org". www.openacc.org. Retrieved 2015-11-05.
^ "OpenCL™ Optimization Case Study Fast Fourier Transform – Part II – AMD". AMD. Retrieved 2015-11-05.
^ Nehab, Diego; Maximo, André; Lima, Rodolfo S.; Hoppe, Hugues (2011-01-01). "GPU-efficient Recursive Filtering and Summed-area Tables". Proceedings of the 2011 SIGGRAPH Asia Conference. SA '11. New York, NY, USA: ACM: 176:1–176:12. doi:10.1145/2024156.2024210. ISBN 978-1-4503-0807-6.
^ Pharr, Matt; Fernando, Randima (2005). GPU Gems 2: Programming Techniques For High-Performance Graphics And General-Purpose Computation. Pearson Addison Wesley. ISBN 0-321-33559-7.
^ Hwu, Wen-mei W. (2011). GPU Computing Gems Emerald Edition. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. ISBN 0-12-385963-8.
^ Clemente, C.; Di Bisceglie, M.; Di Santo, M.; Ranaldo, N.; Spinelli, M. (2009-10-01). "Processing of synthetic Aperture Radar data with GPGPU". IEEE Workshop on Signal Processing Systems, 2009. SiPS 2009: 309–314. doi:10.1109/SIPS.2009.5336272.
^ Liu, Bin; Wang, Kaizhi; Liu, Xingzhao; Yu, Wenxian (2009-10-01). "An Efficient SAR Processor Based on GPU via CUDA". 2nd International Congress on Image and Signal Processing, 2009. CISP '09: 1–5. doi:10.1109/CISP.2009.5304418.
^ Monsurro, P.; Trifiletti, A.; Lannutti, F. (2014-06-01). "Implementing radar algorithms on CUDA hardware". Mixed Design of Integrated Circuits Systems (MIXDES), 2014 Proceedings of the 21st International Conference: 455–458. doi:10.1109/MIXDES.2014.6872240.
^ Fang, Jianbin; Varbanescu, A.L.; Shen, Jie; Sips, H.; Saygili, G.; van der Maaten, L. (2012-12-01). "Accelerating Cost Aggregation for Real-Time Stereo Matching". 2012 IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS): 472–481. doi:10.1109/ICPADS.2012.71.
^ "Medical Imaging|NVIDIA". www.nvidia.com. https://plus.google.com/104889184472622775891. Retrieved 2015-11-07. {{cite web}}: External link in |publisher= (help)
^ Heng, Yang; Gu, Lixu (2005-01-01). "GPU-based Volume Rendering for Medical Image Visualization". Engineering in Medicine and Biology Society, 2005. IEEE-EMBS 2005. 27th Annual International Conference of the: 5145–5148. doi:10.1109/IEMBS.2005.1615635.

[1] Chu, Slo-Li; Hsiao, Chih-Chieh (2010-09-01). "OpenCL: Make Ubiquitous Supercomputing Possible". 2010 12th IEEE International Conference on High Performance Computing and Communications (HPCC): 556–561. doi:10.1109/HPCC.2010.56.

[2] Lindholm, E.; Nickolls, J.; Oberman, S.; Montrym, J. (2008-03-01). "NVIDIA Tesla: A Unified Graphics and Computing Architecture". IEEE Micro. 28 (2): 39–55. doi:10.1109/MM.2008.31. ISSN 0272-1732.

[3] Kim, Hyesoon; Vuduc, Richard; Baghsorkhi, Sara; Choi, Jee; Hwu, Wen-Mei W. (2012). Hill, Mark D. (ed.). Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU). Morgan & Claypool Publishers. doi:10.2200/S00451ED1V01Y201209CAC020. ISBN 978-1-60845-954-4.

[4] "Parallel Programming and Computing Platform | CUDA | NVIDIA | NVIDIA". www.nvidia.com. https://plus.google.com/104889184472622775891. Retrieved 2015-11-05. {{cite web}}: External link in |publisher= (help)

[5] "OpenCL – The open standard for parallel programming of heterogeneous systems". www.khronos.org. Retrieved 2015-11-05.

[6] "C++ AMP (C++ Accelerated Massive Parallelism)". msdn.microsoft.com. Retrieved 2015-11-05.

[7] "OpenACC Home | www.openacc.org". www.openacc.org. Retrieved 2015-11-05.

[8] "OpenCL™ Optimization Case Study Fast Fourier Transform – Part II – AMD". AMD. Retrieved 2015-11-05.

[9] Nehab, Diego; Maximo, André; Lima, Rodolfo S.; Hoppe, Hugues (2011-01-01). "GPU-efficient Recursive Filtering and Summed-area Tables". Proceedings of the 2011 SIGGRAPH Asia Conference. SA '11. New York, NY, USA: ACM: 176:1–176:12. doi:10.1145/2024156.2024210. ISBN 978-1-4503-0807-6.

[10] Pharr, Matt; Fernando, Randima (2005). GPU Gems 2: Programming Techniques For High-Performance Graphics And General-Purpose Computation. Pearson Addison Wesley. ISBN 0-321-33559-7.

[11] Hwu, Wen-mei W. (2011). GPU Computing Gems Emerald Edition. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. ISBN 0-12-385963-8.

[12] Clemente, C.; Di Bisceglie, M.; Di Santo, M.; Ranaldo, N.; Spinelli, M. (2009-10-01). "Processing of synthetic Aperture Radar data with GPGPU". IEEE Workshop on Signal Processing Systems, 2009. SiPS 2009: 309–314. doi:10.1109/SIPS.2009.5336272.

[13] Liu, Bin; Wang, Kaizhi; Liu, Xingzhao; Yu, Wenxian (2009-10-01). "An Efficient SAR Processor Based on GPU via CUDA". 2nd International Congress on Image and Signal Processing, 2009. CISP '09: 1–5. doi:10.1109/CISP.2009.5304418.

[14] Monsurro, P.; Trifiletti, A.; Lannutti, F. (2014-06-01). "Implementing radar algorithms on CUDA hardware". Mixed Design of Integrated Circuits Systems (MIXDES), 2014 Proceedings of the 21st International Conference: 455–458. doi:10.1109/MIXDES.2014.6872240.

[15] Fang, Jianbin; Varbanescu, A.L.; Shen, Jie; Sips, H.; Saygili, G.; van der Maaten, L. (2012-12-01). "Accelerating Cost Aggregation for Real-Time Stereo Matching". 2012 IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS): 472–481. doi:10.1109/ICPADS.2012.71.

[16] "Medical Imaging|NVIDIA". www.nvidia.com. https://plus.google.com/104889184472622775891. Retrieved 2015-11-07. {{cite web}}: External link in |publisher= (help)

[17] Heng, Yang; Gu, Lixu (2005-01-01). "GPU-based Volume Rendering for Medical Image Visualization". Engineering in Medicine and Biology Society, 2005. IEEE-EMBS 2005. 27th Annual International Conference of the: 5145–5148. doi:10.1109/IEMBS.2005.1615635.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

v t e Digital signal processing
Theory	Detection theory Discrete signal Estimation theory Nyquist–Shannon sampling theorem
Sub-fields	Audio signal processing Digital image processing Speech processing Statistical signal processing
Techniques	Z-transform Advanced z-transform Matched Z-transform method Bilinear transform Constant-Q transform Discrete cosine transform (DCT) Discrete Fourier transform (DFT) Discrete-time Fourier transform (DTFT) Impulse invariance Integral transform Laplace transform Post's inversion formula Starred transform Zak transform
Sampling	Aliasing Anti-aliasing filter Downsampling Nyquist rate / frequency Oversampling Quantization Sampling rate Undersampling Upsampling

v t e Programming languages
Comparison Timeline History
Ada ALGOL Simula APL Assembly BASIC Visual Basic classic .NET C C++ C# COBOL Erlang Elixir Forth Fortran Go Haskell Java JavaScript Julia Kotlin Lisp Lua MATLAB ML Caml OCaml Pascal Object Pascal Perl Raku PHP Prolog Python R Ruby Rust SAS SQL Scratch Shell Smalltalk Swift more...
Lists: Alphabetical Categorical Generational Non-English-based Category

v t e Parallel computing
General	Distributed computing Parallel computing Parallel algorithm Massively parallel Cloud computing High-performance computing Multiprocessing Manycore processor GPGPU Computer network Systolic array
Levels	Bit Instruction Thread Task Data Memory Loop Pipeline
Multithreading	Temporal Simultaneous (SMT) Simultaneous and heterogenous Speculative (SpMT) Preemptive Cooperative Clustered multi-thread (CMT) Hardware scout
Theory	PRAM model PEM model Analysis of parallel algorithms Amdahl's law Gustafson's law Cost efficiency Karp–Flatt metric Slowdown Speedup
Elements	Process Thread Fiber Instruction window Array
Coordination	Multiprocessing Memory coherence Cache coherence Cache invalidation Barrier Synchronization Application checkpointing
Programming	Stream processing Dataflow programming Models Implicit parallelism Explicit parallelism Concurrency Non-blocking algorithm
Hardware	Flynn's taxonomy SISD SIMD Array processing (SIMT) Pipelined processing Associative processing MISD MIMD Dataflow architecture Pipelined processor Superscalar processor Vector processor Multiprocessor symmetric asymmetric Memory shared distributed distributed shared UMA NUMA COMA Massively parallel computer Computer cluster Beowulf cluster Grid computer Hardware acceleration
APIs	Ateji PX Boost Chapel HPX Charm++ Cilk Coarray Fortran CUDA Dryad C++ AMP Global Arrays GPUOpen MPI OpenMP OpenCL OpenHMPP OpenACC Parallel Extensions PVM pthreads RaftLib ROCm UPC TBB ZPL
Problems	Automatic parallelization Deadlock Deterministic algorithm Embarrassingly parallel Parallel slowdown Race condition Software lockout Scalability Starvation
Category: Parallel computing