Analysis of parallel algorithms
This article discusses the analysis of parallel algorithms running on parallel random-access machines (PRAMs): models of parallel computation where multiple processors have shared access to a random-access memory (RAM).
Overview
Suppose computations are executed on a PRAM using p processors. Let Tp denote the time that elapses between the start of the computation and its end. Analysis of the computation's running time focuses on the following notions (a short worked example follows the list):
- The work of a computation executed by p processors is the total number of primitive operations that the processors perform.[1] Ignoring communication overhead from synchronizing the processors, this is equal to the time used to run the computation on a single processor, denoted T1.
- The span is the length of the longest series of operations that have to be performed sequentially due to data dependencies (the critical path). Minimizing the span is an important activity in designing parallel algorithms, because the span determines the shortest possible execution time.[2] Alternatively, the span can be defined as the time T∞ spent computing using an idealized machine with an infinite number of processors.[3]
- The cost of the computation is the quantity pTp. This expresses the total time spent, by all processors, in both computing and waiting.[1]
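For example, summing n numbers with a balanced binary tree of pairwise additions has work T1 = n − 1 (every input is merged exactly once) but span T∞ = ⌈log2 n⌉, the depth of the tree. The following Python sketch (an illustration, not taken from the cited sources) simply evaluates these two formulas:

```python
import math

def reduction_work_and_span(n: int) -> tuple[int, int]:
    """Work and span of a balanced binary-tree sum over n inputs.

    A minimal sketch: operations are counted abstractly, nothing is
    actually executed in parallel.
    """
    work = n - 1                    # T1: each of the n inputs is merged once
    span = math.ceil(math.log2(n))  # T_inf: longest chain of dependent additions
    return work, span

for n in (8, 1_000, 1 << 20):
    w, s = reduction_work_and_span(n)
    print(f"n={n}: work T1={w}, span T_inf={s}")
```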
Several useful results follow from the definitions of work, span and cost:
- Work law. The cost is always at least the work: pTp ≥ T1. This follows from the fact that p processors can perform at most p operations in parallel.[3][1]
- Span law. A finite number p of processors cannot outperform an infinite number, so that Tp ≥ T∞.[3]
Using these definitions and laws, the following measures of performance can be given (they are illustrated in the sketch after this list):
- Speedup is the gain in speed made by parallel execution compared to sequential execution: Sp = T1 ∕ Tp. When the speedup is Ω(p) for p processors (using big O notation), the speedup is linear, which is optimal in the PRAM model because the work law implies that T1 ∕ Tp ≤ p (super-linear speedup can occur in practice due to memory hierarchy effects). The situation T1 ∕ Tp = p is called perfect linear speedup.[3] An algorithm that exhibits linear speedup is said to be scalable.[1]
- Efficiency is the speedup per processor, Sp ∕ p.[1]
- Parallelism is the ratio T1 ∕ T∞. It represents the maximum possible speedup on any number of processors. By the span law, the parallelism bounds the speedup: if p > T1 ∕ T∞, then T1 ∕ Tp ≤ T1 ∕ T∞ < p.[3]
- The slackness is T1 ∕ (pT∞). A slackness less than one implies (by the span law) that perfect linear speedup is impossible on p processors.[3]
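As an illustration, the Python sketch below (with made-up timings, not drawn from the cited sources) computes these measures from given values of T1, Tp, T∞ and p, checking the work and span laws along the way:

```python
def performance_measures(t1: float, tp: float, t_inf: float, p: int) -> dict:
    """Speedup, efficiency, parallelism and slackness for p processors.

    A minimal sketch assuming t1, tp and t_inf are already known
    (e.g. from counting operations or from measurements).
    """
    assert p * tp >= t1, "work law violated: pTp >= T1 must hold"
    assert tp >= t_inf, "span law violated: Tp >= T_inf must hold"
    return {
        "speedup": t1 / tp,            # Sp = T1 / Tp
        "efficiency": t1 / (p * tp),   # Sp / p
        "parallelism": t1 / t_inf,     # upper bound on any speedup
        "slackness": t1 / (p * t_inf), # < 1 rules out perfect linear speedup
    }

# Hypothetical example: T1 = 1000, T_inf = 10, and Tp = 140 measured on p = 8.
print(performance_measures(t1=1000, tp=140, t_inf=10, p=8))
```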
Execution on a limited number of processors
Any computation that can run in parallel on N processors can be executed on p < N processors by letting each processor execute multiple units of work. A result called Brent's law states that one can perform such a "simulation" in time[4]
Tp ≤ TN + (T1 − TN) ∕ p
or, less precisely,[1]
Tp ≤ T1 ∕ p + T∞.
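The bound can be checked on a small example. The Python sketch below (hypothetical numbers, not from the cited sources) describes a computation by the number of operations at each level of its dependency structure, schedules at most p operations per time step, and compares the resulting time with Brent's bound; N is taken large enough (at least the widest level) that TN equals the span T∞.

```python
import math

def simulate_on_p_processors(level_sizes: list[int], p: int) -> None:
    """Greedy level-by-level scheduling behind Brent's law.

    A minimal sketch: level i holds the operations whose longest chain of
    dependencies has length i, so a machine with enough processors (N at
    least the widest level) finishes each level in one step and TN = T_inf.
    """
    t1 = sum(level_sizes)     # work: total number of operations
    t_inf = len(level_sizes)  # span: number of levels
    # On p processors, level i takes ceil(m_i / p) steps.
    tp = sum(math.ceil(m / p) for m in level_sizes)
    bound = (t1 - t_inf) / p + t_inf
    print(f"T1={t1}, T_inf={t_inf}, p={p}: Tp={tp} <= bound {bound:.2f}")
    assert tp <= bound

# Hypothetical computation: a reduction tree over 16 inputs
# (8 + 4 + 2 + 1 operations across 4 levels), simulated on 3 processors.
simulate_on_p_processors([8, 4, 2, 1], p=3)
```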
References
1. Casanova, Henri; Legrand, Arnaud; Robert, Yves (2008). Parallel Algorithms. CRC Press. p. 10.
2. McCool, Michael; Reinders, James; Robison, Arch (2013). Structured Parallel Programming: Patterns for Efficient Computation. Elsevier. pp. 4–5.
3. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. pp. 779–784. ISBN 0-262-03384-4.
4. Gustafson, John L. (2011). "Brent's Theorem". Encyclopedia of Parallel Computing. pp. 182–185.