
Multidimensional digital signal processing (mD DSP) is fundamental to many application areas such as digital image and video[1] processing, medical imaging, geophysical signal analysis, sonar, radar, lidar, array processing, computer vision, computational photography, and augmented and virtual reality. However, as the number of dimensions of a signal increases, the computational complexity of operating on the signal increases rapidly. This relationship between the number of dimensions and the complexity, in both time and space, as studied in the field of algorithm analysis, is analogous to the concept of the curse of dimensionality[2]. This large complexity generally results in a very long execution run-time for a given mD DSP application, rendering it impractical for many uses, especially in real-time applications[3].

Motivation, Problem Statement, Basic Concepts, and Applications


Due to the end of frequency scaling of processors[4][5][6], which is largely attributed to the breakdown of Dennard scaling[7] around the year 2005, a common trend among processor manufacturers has been to continue to exploit Moore's law[8] by increasing the number of processing cores on a single chip, thereby creating what are termed manycore processors, as opposed to uniprocessors[9].

The challenging aspect of implementing mD DSP algorithms is that the large complexity described in the previous section makes implementation difficult, primarily with respect to run-time and power consumption. This article primarily addresses basic parallel concepts used to reduce the run-time of these algorithms. The concept of parallel computing can be applied to mD DSP applications: if a problem can be expressed in a parallel algorithmic form, then the methodologies of parallel programming and multiprocessing can be used to increase the computational throughput of the mD DSP procedure on a given hardware platform. An increase in computational throughput can result in a decreased run-time, i.e. a speedup, of a specific mD DSP algorithm.

Increasing throughput can be beneficial to the strong scaling[10][11] of a given mD DSP algorithm.

In addition to increasing computational throughput, an equally important goal is to maximally utilize the memory bandwidth of a given computing memory architecture. The combination of computational throughput and memory bandwidth usage can be quantified through the concept of operational intensity, which is summarized in what is termed the roofline model[12]. Operational intensity and the roofline model have recently become popular methods of quantifying the performance of mD DSP algorithms[13].
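As a point of reference, the roofline model bounds the attainable performance P (in FLOP/s) by the smaller of the processor's peak compute throughput and the product of peak memory bandwidth with operational intensity; in the formulation of Williams et al.[12],

P = \min\left(\pi,\; \beta \times I\right),

where \pi is the peak floating-point performance, \beta is the peak memory bandwidth (bytes/s), and I is the operational intensity (FLOPs per byte of memory traffic).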

Another possible benefit of increasing operational intensity is improved weak scaling, which allows the mD DSP procedure to operate on larger data sets. This is important for application areas such as data mining and the training of deep neural networks using big data.

It should be noted that the goal of parallelizing an algorithm is not always to decrease the complexity of the algorithm in the traditional sense, because the term complexity as used in this context typically refers to the RAM abstract machine model, which is by definition serial. Parallel abstract machine models such as the PRAM have been proposed to describe the complexity of parallel algorithms such as mD signal processing algorithms[14].

Another factor that is important to the performance of mD DSP algorithm implementations is energy consumption[15].

Existing Approaches


Parallel Implementations of Multidimensional Discrete Fourier Transforms


As a simple example of an mD DSP algorithm that is commonly decomposed into a parallel form, consider the parallelization of the discrete Fourier transform (DFT), which is generally implemented using a form of the fast Fourier transform (FFT). There are hundreds of available software libraries that offer optimized FFT algorithms[16], many of which offer parallelized versions of mD FFT algorithms[17][18][19], including parallel versions of FFTW[20].

The most straightforward method of parallelizing the DFT is to utilize the row-column decomposition method. The following derivation is a close paraphrasing from the classical text Multidimensional Digital Signal Processing[21]. Row-column decomposition can be applied to an arbitrary number of dimensions, but for illustrative purposes the 2D row-column DFT is described first. The 2D DFT is defined as

X(k_1, k_2) = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x(n_1, n_2)\, W_{N_1}^{n_1 k_1} W_{N_2}^{n_2 k_2},

where the term W_N = e^{-j 2\pi / N} is commonly referred to as the twiddle factor of the DFT in the signal processing literature.

Next, we rewrite the DFT equation in the form

X(k_1, k_2) = \sum_{n_2=0}^{N_2-1} \left[ \sum_{n_1=0}^{N_1-1} x(n_1, n_2)\, W_{N_1}^{n_1 k_1} \right] W_{N_2}^{n_2 k_2},

where the quantity inside the brackets is a 2D sequence which we will denote as G(k_1, n_2).

We can then express the above equation as the pair of relations

G(k_1, n_2) = \sum_{n_1=0}^{N_1-1} x(n_1, n_2)\, W_{N_1}^{n_1 k_1},
X(k_1, k_2) = \sum_{n_2=0}^{N_2-1} G(k_1, n_2)\, W_{N_2}^{n_2 k_2}.

Each column of G is the 1D DFT of the corresponding column of x. Each row of X is the 1D DFT of the corresponding row of G.

Expressing the 2D DFT in the above form allows us to see that a 2D DFT can be computed by decomposing it into row and column DFTs. The DFT of each column of x is computed first, with the results placed into an intermediate array; the DFT of each row of the intermediate array is then computed.

This row-column decomposition process can easily be extended to compute an M-dimensional DFT. First, the 1D DFT is computed with respect to one of the independent variables, say n_1, for each value of the remaining (M-1) variables. Next, 1D DFTs are computed with respect to the variable n_2 for all values of the (M-1)-tuple (k_1, n_3, n_4, \ldots, n_M). We continue in this fashion until 1D DFTs have been evaluated with respect to all M spatial variables[21].

The row-column decomposition of the DFT is parallelized in its most simplistic manner by noting that the row and column computations are independent of each other and can therefore be performed on separate processors in parallel. The parallel 1D DFT computation on each processor can then utilize the FFT algorithm for further optimization. One large advantage of this specific method of parallelizing an mD DFT is that each of the 1D FFTs being performed in parallel on separate processors can itself be performed in a concurrent fashion on shared-memory multithreaded SIMD processors[22][9].
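To make the structure concrete, below is a minimal C sketch of the row-column method. It assumes OpenMP for the parallel loops and row-major storage, and uses a naive O(N^2) 1D DFT standing in for an optimized FFT; the names dft_1d and dft_2d are illustrative, not from any particular library.

#include <complex.h>

/* Naive O(N^2) 1D DFT of length n with element stride `stride`.
   A production implementation would call an optimized 1D FFT here. */
static void dft_1d(const double complex *in, double complex *out,
                   int n, int stride) {
    const double two_pi = 6.28318530717958647692;
    for (int k = 0; k < n; ++k) {
        double complex acc = 0;
        for (int m = 0; m < n; ++m)  /* twiddle factor W_n^{mk} */
            acc += in[m * stride] * cexp(-I * two_pi * m * k / n);
        out[k * stride] = acc;
    }
}

/* Row-column 2D DFT of an N1 x N2 row-major array x into X, using G as
   the intermediate array: column DFTs first (independent across columns),
   then row DFTs (independent across rows), so each loop parallelizes. */
void dft_2d(const double complex *x, double complex *G,
            double complex *X, int N1, int N2) {
    #pragma omp parallel for              /* each column on its own thread */
    for (int n2 = 0; n2 < N2; ++n2)
        dft_1d(&x[n2], &G[n2], N1, N2);   /* stride N2 walks down a column */
    #pragma omp parallel for              /* each row on its own thread */
    for (int k1 = 0; k1 < N1; ++k1)
        dft_1d(&G[k1 * N2], &X[k1 * N2], N2, 1);
}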

A specific hardware platform to which this simultaneous parallel and concurrent DFT implementation technique is highly amenable is the GPU, because common GPUs contain both a set of separate multithreaded SIMD processors (referred to as "streaming multiprocessors" in the CUDA programming model and as "compute units" in the OpenCL language) and individual SIMD lanes (commonly referred to loosely as a "core", or more specifically as a CUDA "thread processor" or an OpenCL "processing element") within each multithreaded SIMD processor.

A disadvantage of this technique of applying a separate FFT on each shared-memory multiprocessor is the required interleaving of the data in shared memory. One of the most popular libraries that utilizes this basic form of concurrent FFT computation is the shared-memory version of the FFTW library[23].

Implementations of Multidimensional Discrete Convolution via Shift Registers on an FPGA


Convolution of mD signals lends itself well to pipelining because each single-output convolution operation is independent of every other one. Due to this data independence between the individual convolution operations between the filter's impulse response and the signal, a new set of calculations may begin the instant the previous convolution operation finishes. A common method of performing mD convolution in a raster-scan fashion (including for dimensions greater than 2) on a traditional general-purpose CPU, or even a GPU, is to cache the set of output data from each scan line of each independent dimension in the local cache. By utilizing the unique custom re-configurable architecture of a field-programmable gate array (FPGA), this procedure can be optimized dramatically by customizing the cache structure[24].

In the illustrative example from which this description is derived[24], the discussion is restricted to two-dimensional signals. In the example, a set of convolution operations is performed between a general 2D signal and a 3x3 filter kernel. As the sequence of convolution operations proceeds along each raster line, the filter kernel is slid across one dimension of the input signal and the data read from memory is cached. The first pass loads three new lines of data into the cache. OpenCL-style code for this procedure is shown below.

// Using a cache to hide poor memory access patterns on a traditional
// general-purpose processor; input_signal is the 2D input and
// cache_temp is assumed zero-initialized.
for (int y = 1; y < yDim-1; ++y) {
  for (int x = 1; x < xDim-1; ++x) {
    // 3x3 filter kernel centered on pixel (y, x)
    for (int y2 = -1; y2 <= 1; ++y2) {
      for (int x2 = -1; x2 <= 1; ++x2) {
        cache_temp[y][x] += input_signal[y+y2][x+x2]
                          * kernel[y2+1][x2+1];
      }
    }
  }
}

This caching technique is used to hide the poor efficiency of the memory access pattern in terms of coalescing. However, with each successive loop iteration only a single cache line is updated. If we make the reasonable assumption of a one-pixel-per-cycle performance point, applying this caching technique to an FPGA results in a cache requirement of 9 reads and 1 write per cycle. Utilizing this caching technique on an FPGA therefore results in inefficient performance in terms of both power consumption and a larger memory footprint than is required, because a great deal of the reads into the cache are redundant. With an FPGA we can customize the cache structure to obtain a much more efficient result.

The method proposed in the corresponding literature[24] to alleviate this poor performance in an FPGA implementation is to customize the cache architecture through utilization of the re-configurable hardware resources of the FPGA. The important attribute to note here is that the FPGA creates hardware based on the code that we write, as opposed to writing code to run on a fixed architecture with a fixed instruction set.

A description of how to modify the implementation to optimize the cache architecture on an FPGA will now be discussed. Again, begin with an initial 2D signal and assume each scan line is W pixels wide (the line width W appears in the code below). We can then remove all of the lines that are not in the neighborhood of the 3x3 window. We next linearize this restricted segment of the 2D signal via a simple row-major data layout. After linearizing the 2D signal into a 1D array, under the assumption that we are not concerned with the boundary conditions of the convolution, we can discard any pixels that contribute only to the boundary computations, which is acceptable in many practical applications. Now, when we slide the window over each array element to perform the next computation, we have effectively created a shift register. The corresponding code for the shift-register implementation of this 2D filtering can be seen below.

// Shift register in OpenCL; W is the compile-time width of one scan line
pixel_t sr[2*W+3];
while (keep_going) {
  // Shift data in: move each element up one slot so the newest pixel
  // enters at sr[0] (the loop must run from the top down)
  #pragma unroll
  for (int i = 2*W+2; i > 0; --i)
    sr[i] = sr[i-1];
  sr[0] = data_in;

  // Tap the nine window positions: three pixels from each of three lines
  pixel_t data_out[9] = {sr[  0], sr[    1], sr[    2],
                         sr[  W], sr[  W+1], sr[  W+2],
                         sr[2*W], sr[2*W+1], sr[2*W+2]};
  //...
}

By performing the aforementioned steps, the data movement is managed to match the FPGA's architectural strengths in order to achieve the highest performance. These architectural strengths allow a custom implementation based on the work being done, as opposed to leveraging the fixed operations, fixed data paths, and fixed widths of a general-purpose processor.

Parallel Implementations of Multidimensional FIR Filter Structures


This section describes a method of implementing an mD digital finite impulse response (FIR) filter in a completely parallel realization. The proposed completely parallel realization of a general FIR filter is achieved through the use of a combination of parallel sections, each consisting of cascaded 1D digital filters[25].

Consider the general desired ideal finite-extent mD FIR digital filter in the complex z-domain, given as

H(z_1, \ldots, z_M) = \sum_{n_1=0}^{N_1-1} \cdots \sum_{n_M=0}^{N_M-1} h(n_1, \ldots, n_M)\, z_1^{-n_1} \cdots z_M^{-n_M}.

Placing the coefficients of h(n_1, \ldots, n_M) into an array and performing some algebraic manipulation as described in[25], we arrive at an expression that allows us to decompose the filter into a filterbank, given as

H(z_1, \ldots, z_M) \approx \sum_{i=1}^{r} H_i(z_1, \ldots, z_M),

where each parallel section is separable, i.e. a cascade of 1D filters,

H_i(z_1, \ldots, z_M) = H_i^{(1)}(z_1)\, H_i^{(2)}(z_2) \cdots H_i^{(M)}(z_M).

Therefore, the original mD digital filter is approximately decomposed into a parallel filterbank realization composed of a set of separable parallel filters H_i, such that H \approx \sum_{i=1}^{r} H_i. This proposed parallel FIR filter realization is represented by the block diagram seen in Figure 1.
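For example, in two dimensions (M = 2) each parallel branch reduces to a cascade of two 1D filters, one in each variable:

H(z_1, z_2) \approx \sum_{i=1}^{r} H_i^{(1)}(z_1)\, H_i^{(2)}(z_2),

so branch i can be realized by filtering each column of the input with H_i^{(1)} and each row of that result with H_i^{(2)}.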

Figure 1. A completely parallel realization of an M-dimensional FIR filter.

The completely parallel realization seen in the Figure 1 block diagram can be implemented in hardware by noting that block diagrams, and their corresponding signal-flow graphs (SFGs), are a common method of graphically representing any DSP algorithm that can be expressed as a linear constant-coefficient difference equation. SFGs allow an easy transition from a difference equation to a hardware implementation by allowing one to visualize the difference equation in terms of digital logic components such as shift registers and basic ALU circuit elements such as adders and multipliers. For this specific parallel realization, one could place each parallel section on a separate processor so that each section is implemented in a completely task-parallel fashion.

Using the fork–join model of task parallelism, a 'fork' may be applied at the first pickoff point in the figure, and the summing junction can be implemented during synchronization with a 'join' operation. Implementing an mD FIR filter in this fashion also lends itself well to the MapReduce general programming model[26].
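A minimal C sketch of this fork-join structure follows, using OpenMP for the task-parallel fork; the helper apply_separable_section is hypothetical and stands in for applying section i's cascade of 1D FIR filters H_i^(1) ... H_i^(M).

#include <stddef.h>

#define NUM_SECTIONS 4   /* r, the number of parallel sections (assumed) */
#define MAX_LEN 4096     /* maximum signal length (assumed) */

/* Hypothetical helper: runs section i's cascade of 1D FIR filters on `in`. */
extern void apply_separable_section(int i, const float *in, float *out, size_t n);

void parallel_fir_filterbank(const float *in, float *out, size_t n) {
    static float partial[NUM_SECTIONS][MAX_LEN];  /* scratch; n <= MAX_LEN assumed */
    /* Fork: the pickoff point hands the same input to every section. */
    #pragma omp parallel for
    for (int i = 0; i < NUM_SECTIONS; ++i)
        apply_separable_section(i, in, partial[i], n);
    /* Join: the summing junction accumulates the section outputs. */
    for (size_t j = 0; j < n; ++j) {
        float acc = 0.0f;
        for (int i = 0; i < NUM_SECTIONS; ++i)
            acc += partial[i][j];
        out[j] = acc;
    }
}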

Notes

  1. ^ Keimel, Christian, Martin Rothbucher, Hao Shen, and Klaus Diepold. "Video is a cube." IEEE Signal Processing Magazine 28, no. 6 (2011): 41-49.
  2. ^ Hanka, Rudolf, and Thomas P. Harte. "Curse of dimensionality: classifying large multi-dimensional images with neural networks." In Computer Intensive Methods in Control and Signal Processing, pp. 249-260. Birkhäuser Boston, 1997.
  3. ^ Saponara, Sergio, Antonio Plaza, Marco Diani, Matthias F. Carlsohn, and Giovanni Corsini. "Special issue on algorithms and architectures for real-time multi-dimensional image processing." J. Real-Time Image Processing 9, no. 3 (2014): 393-396.
  4. ^ Mims, Christopher. "Why CPUs Aren't Getting Any Faster." MIT Technology Review. October 22, 2012. Accessed December 09, 2016. https://www.technologyreview.com/s/421186/why-cpus-arent-getting-any-faster/.
  5. ^ "Introduction to Parallel Programming With CUDA | Udacity." Accessed December 07, 2016. https://www.youtube.com/watch?v=ET9wxEuqp_Y.
  6. ^ Sutter, Herb. "The free lunch is over: A fundamental turn toward concurrency in software." Dr. Dobb’s journal 30, no. 3 (2005): 202-210.
  7. ^ Bohr, M., 2007. A 30 year retrospective on Dennard's MOSFET scaling paper. IEEE Solid-State Circuits Society Newsletter, 12(1), pp.11-13.
  8. ^ Moore, Gordon. "Cramming More Components onto Integrated Circuits," Electronics Magazine Vol. 38, No. 8 (April 19, 1965)
  9. ^ a b Hennessy, John L., and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.
  10. ^ "Introduction to Parallel Programming With CUDA | Udacity." Accessed December 07, 2016. https://www.youtube.com/watch?v=539pseJ3EUc.
  11. ^ Humphries, Benjamin, Hansen Zhang, Jiayi Sheng, Raphael Landaverde, and Martin C. Herbordt. "3D FFTs on a Single FPGA." In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on, pp. 68-71. IEEE, 2014.
  12. ^ Williams, S., Waterman, A. and Patterson, D., 2009. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 52(4), pp.65-76.
  13. ^ Bernabé, Gregorio, Javier Cuenca, Luis Pedro García, and Domingo Giménez. "Improving an autotuning engine for 3D Fast Wavelet Transform on manycore systems." The Journal of Supercomputing 70, no. 2 (2014): 830-844.
  14. ^ Saybasili, A. Beliz, Alexandros Tzannes, Bernard R. Brooks, and Uzi Vishkin. "Highly Parallel Multi-Dimensional Fast Fourier Transform on Fine-and Coarse-Grained Many-Core Approaches." In Proceedings of the 21st IASTED International Conference, vol. 668, no. 018, p. 107. 2009.
  15. ^ Ukidave, Y., Ziabari, A.K., Mistry, P., Schirner, G. and Kaeli, D., 2013, April. Quantifying the energy efficiency of FFT on heterogeneous platforms. In Performance Analysis of Systems and Software (ISPASS), 2013 IEEE International Symposium on (pp. 235-244). IEEE.
  16. ^ "2DECOMP&FFT - Review on Fast Fourier Transform Software." Accessed December 07, 2016. http://www.2decomp.org/fft.html.
  17. ^ Pekurovsky, Dmitry. "P3DFFT: A Framework for Parallel Computations of Fourier Transforms in Three Dimensions." SIAM Journal on Scientific Computing 34, no. 4 (2012): C192-C209.
  18. ^ Sandia National Laboratories: Exceptional Service in the National Interest. Accessed December 07, 2016. http://www.sandia.gov/~sjplimp/docs/fft/README.html.
  19. ^ "FFTE: A Fast Fourier Transform Package." Accessed December 07, 2016. http://www.ffte.jp/.
  20. ^ "Parallel FFTW." Accessed December 07, 2016. http://www.fftw.org/parallel/parallel-fftw.html.
  21. ^ a b Dudgeon, Dan E., and Russell M. Mersereau. Multidimensional Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1984.
  22. ^ Patterson, David A., and John L. Hennessy. Computer organization and design: the hardware/software interface. Newnes, 2013.
  23. ^ "Parallel Versions of FFTW." Parallel FFTW. Accessed December 07, 2016. http://www.fftw.org/parallel/parallel-fftw.html.
  24. ^ a b c Singh, Deshanand. "Efficient Implementation of Convolutional Neural Networks Using OpenCL on FPGAs." Lecture, Embedded Vision Summit, Cali, Santa Clara, California, May 12, 2015. Accessed November 3, 2016. http://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/downloads/pages/may-2015-summit-proceedings.
  25. ^ a b Deng, Tian-Bo. "Completely Parallel Realizations for Multidimensional Digital Filters." Digital Signal Processing 7, no. 3 (1997): 188-198.
  26. ^ Wang, Lizhe, Dan Chen, Rajiv Ranjan, Samee U. Khan, Joanna Kołodziej, and Jun Wang. "Parallel processing of massive EEG data with MapReduce." In Parallel and Distributed Systems (ICPADS), 2012 IEEE 18th International Conference on, pp. 164-171. IEEE, 2012.