Analysis of parallel algorithms
This article discusses the analysis of parallel algorithms running on parallel random-access machines (PRAMs): models of parallel computation where multiple processors have shared access to a random-access memory (RAM).
Overview
Suppose computations are executed on a PRAM using p processors. Let Tp denote the time that elapses between the start of the computation and its end. Analysis of the computation's running time focuses on the following notions (a short worked example follows the list):
- The work of a computation executed by p processors is the total number of primitive operations that the processors perform.[1] Ignoring communication overhead from synchronizing the processors, this is equal to the time used to run the computation on a single processor, denoted T1.
- The span is the length of the longest series of operations that have to be performed sequentially due to data dependencies (the critical path). Minimizing the span is an important activity in designing parallel algorithms, because the span determines the shortest possible execution time.[2] Alternatively, the span can be defined as the time T∞ spent computing using an idealized machine with an infinite number of processors.[3]
- The cost of the computation is the quantity pTp. This expresses the total time spent, by all processors, in both computing and waiting.[1]
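For example, summing n numbers with a balanced binary tree of pairwise additions has work T1 = n − 1 (every input is merged exactly once) but span T∞ = ⌈log2 n⌉, the depth of the tree. The following Python sketch (an illustration, not taken from the cited sources) simply evaluates these two formulas:

```python
import math

def reduction_work_and_span(n: int) -> tuple[int, int]:
    """Work and span of a balanced binary-tree sum over n inputs.

    A minimal sketch: operations are counted abstractly, nothing is
    actually executed in parallel.
    """
    work = n - 1                    # T1: each of the n inputs is merged once
    span = math.ceil(math.log2(n))  # T_inf: longest chain of dependent additions
    return work, span

for n in (8, 1_000, 1 << 20):
    w, s = reduction_work_and_span(n)
    print(f"n={n}: work T1={w}, span T_inf={s}")
```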
Several useful results follow from the definitions of work, span and cost:
- Work law. The cost is always at least the work: pTp ≥ T1. This follows from the fact that p processors can perform at most p operations in parallel.[3][1]
- Span law. A finite number p of processors cannot outperform an infinite number, so that Tp ≥ T∞.[3]
Using these definitions and laws, the following measures of performance can be given (they are illustrated in the sketch after this list):
- Speedup is the gain in speed made by parallel execution compared to sequential execution: Sp = T1 ∕ Tp. When the speedup is Ω(p) for p processors (using big O notation), the speedup is linear, which is optimal in the PRAM model because the work law implies that T1 ∕ Tp ≤ p (super-linear speedup can occur in practice due to memory hierarchy effects). The situation T1 ∕ Tp = p is called perfect linear speedup.[3] An algorithm that exhibits linear speedup is said to be scalable.[1]
- Efficiency is the speedup per processor, Sp ∕ p.[1]
- Parallelism is the ratio T1 ∕ T∞. It represents the maximum possible speedup on any number of processors. By the span law, the parallelism bounds the speedup: if p > T1 ∕ T∞, then T1 ∕ Tp ≤ T1 ∕ T∞ < p.[3]
- The slackness is T1 ∕ (pT∞). A slackness less than one implies (by the span law) that perfect linear speedup is impossible on p processors.[3]
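As an illustration, the Python sketch below (with made-up timings, not drawn from the cited sources) computes these measures from given values of T1, Tp, T∞ and p, checking the work and span laws along the way:

```python
def performance_measures(t1: float, tp: float, t_inf: float, p: int) -> dict:
    """Speedup, efficiency, parallelism and slackness for p processors.

    A minimal sketch assuming t1, tp and t_inf are already known
    (e.g. from counting operations or from measurements).
    """
    assert p * tp >= t1, "work law violated: pTp >= T1 must hold"
    assert tp >= t_inf, "span law violated: Tp >= T_inf must hold"
    return {
        "speedup": t1 / tp,            # Sp = T1 / Tp
        "efficiency": t1 / (p * tp),   # Sp / p
        "parallelism": t1 / t_inf,     # upper bound on any speedup
        "slackness": t1 / (p * t_inf), # < 1 rules out perfect linear speedup
    }

# Hypothetical example: T1 = 1000, T_inf = 10, and Tp = 140 measured on p = 8.
print(performance_measures(t1=1000, tp=140, t_inf=10, p=8))
```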
Execution on a limited number of processors
Any computation that can run in parallel on N processors can be executed on p < N processors by letting each processor execute multiple units of work. A result called Brent's law states that one can perform such a "simulation" in time[4]
Tp ≤ TN + (T1 − TN) ∕ p
or, less precisely,[1]
Tp ≤ T1 ∕ p + T∞.
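The bound can be checked on a small example. The Python sketch below (hypothetical numbers, not from the cited sources) describes a computation by the number of operations at each level of its dependency structure, schedules at most p operations per time step, and compares the resulting time with Brent's bound; N is taken large enough (at least the widest level) that TN equals the span T∞.

```python
import math

def simulate_on_p_processors(level_sizes: list[int], p: int) -> None:
    """Greedy level-by-level scheduling behind Brent's law.

    A minimal sketch: level i holds the operations whose longest chain of
    dependencies has length i, so a machine with enough processors (N at
    least the widest level) finishes each level in one step and TN = T_inf.
    """
    t1 = sum(level_sizes)     # work: total number of operations
    t_inf = len(level_sizes)  # span: number of levels
    # On p processors, level i takes ceil(m_i / p) steps.
    tp = sum(math.ceil(m / p) for m in level_sizes)
    bound = (t1 - t_inf) / p + t_inf
    print(f"T1={t1}, T_inf={t_inf}, p={p}: Tp={tp} <= bound {bound:.2f}")
    assert tp <= bound

# Hypothetical computation: a reduction tree over 16 inputs
# (8 + 4 + 2 + 1 operations across 4 levels), simulated on 3 processors.
simulate_on_p_processors([8, 4, 2, 1], p=3)
```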
References
1. Casanova, Henri; Legrand, Arnaud; Robert, Yves (2008). Parallel Algorithms. CRC Press. p. 10.
2. McCool, Michael; Reinders, James; Robison, Arch (2013). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier. pp. 4–5.
3. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. pp. 779–784. ISBN 0-262-03384-4.
4. Gustafson, John L. (2011). "Brent's Theorem". Encyclopedia of Parallel Computing. pp. 182–185.