Parallel external memory (model)

In computer science, the parallel external memory (PEM) model is a cache-aware, external-memory abstract machine.^[1] It is the parallel-computing analogue of the single-processor external memory (EM) model and, similarly, the cache-aware analogue of the parallel random-access machine (PRAM). The PEM model consists of a number of processors, each with its own private cache, and a shared main memory.

Model

Definition

The PEM model^[1] combines the external memory (EM) model with the parallel random-access machine (PRAM) model. It is a model of computation consisting of P processors and a two-level memory hierarchy: a large external memory (main memory) of size N and P small internal memories (caches). The main memory is shared by all processors, whereas each cache is private to a single processor and cannot be accessed by any other. Each cache has size M and is partitioned into blocks of size B. Processors can operate only on data held in their own cache, and data is transferred between main memory and a cache in blocks of size B.

I/O complexity

The complexity measure of the PEM model is the I/O complexity,^[1] defined as the number of parallel block transfers between main memory and the caches. During one parallel block transfer, each processor can transfer one block. So if P processors each load a block of size B from main memory into their caches in parallel, this counts as $O(1)$ I/Os, not $O(P)$ . A program in the PEM model should minimize data transfer between main memory and the caches and operate as much as possible on data already in the caches.
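As a rough illustration of this counting rule, the following Python sketch counts the parallel block transfers needed for P processors to read N contiguous items once. The function name `parallel_scan_ios` and the even distribution of blocks across processors are assumptions made for the example, not part of the model definition.

```python
import math

def parallel_scan_ios(n: int, p: int, b: int) -> int:
    """Parallel block transfers needed to read n contiguous items once.

    Assumes the n items occupy ceil(n / b) blocks that are dealt out
    evenly to the p processors. Each parallel step moves at most one
    block per processor and counts as a single I/O.
    """
    blocks = math.ceil(n / b)
    return math.ceil(blocks / p)

# Eight processors loading one block each is one parallel I/O, not eight.
print(parallel_scan_ios(n=8 * 64, p=8, b=64))     # 1
print(parallel_scan_ios(n=1_000_000, p=8, b=64))  # 1954
```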

Read / Write conflicts

In the PEM model there is no direct communication network between the P processors; they must communicate indirectly through main memory. If multiple processors try to access the same block in main memory concurrently, read/write conflicts^[1] occur. As in the PRAM model, three variations of this problem are considered:

Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.

The following two algorithms^[1] handle the case in which P ≤ B processors want to write to the same block simultaneously under the CREW and EREW variants. The first approach serializes the writes: the processors write to the block one after another, which takes a total of P parallel block transfers. The second approach needs $O(\log(P))$ parallel block transfers and one additional block per processor. The idea is to schedule the writes in a binary-tree fashion and gradually combine the data into a single block: in the first round, P processors combine their blocks into P/2 blocks; then P/2 processors combine those into P/4 blocks; and so on until all the data is combined in one block.
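The difference between the two approaches can be sketched with a small sequential simulation in Python. The names `serialized_write_rounds` and `tree_combine`, and the representation of each processor's pending writes as a dictionary from positions in the block to values, are hypothetical choices for this example. Each round of the loop stands for one parallel step in which every pair of blocks is merged by a different processor.

```python
def serialized_write_rounds(p: int) -> int:
    """One processor writes per round, so p writers need p rounds."""
    return p

def tree_combine(partials: list[dict[int, int]]) -> tuple[dict[int, int], int]:
    """Merge per-processor partial blocks pairwise, round by round.

    partials[i] maps positions inside the target block to the values
    processor i wants to write. The list is assumed to be non-empty and
    the positions are assumed to be disjoint across processors.
    Returns the combined block contents and the number of rounds used.
    """
    level = [dict(part) for part in partials]
    rounds = 0
    while len(level) > 1:
        merged = []
        # All pairs of one level are combined in the same parallel step.
        for i in range(0, len(level), 2):
            combined = dict(level[i])
            if i + 1 < len(level):
                combined.update(level[i + 1])
            merged.append(combined)
        level = merged
        rounds += 1
    return level[0], rounds

block, rounds = tree_combine([{i: 10 * i} for i in range(8)])
print(rounds, serialized_write_rounds(8))  # 3 8
```

For eight writers the tree schedule finishes in three rounds, compared with eight rounds when the writes are serialized.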

Examples

Prefix sum

Let A be an ordered set of N elements. The prefix sum of A is an ordered set B of N elements, with ${\textstyle B[i]=\sum _{j=0}^{i}A[j]}$ and ${\textstyle 0\leq i<N}$ . If the input set A is stored in contiguous main memory, the prefix sum of A can be calculated in the PEM model with the optimal $O({\frac {N}{PB}}+\log(P))$ I/O complexity.^[1] This bound can be achieved by simulating an optimal PRAM prefix sum algorithm in the PEM model.^[1]
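One common way to organize such a computation is in three phases: each processor sums its own contiguous segment, a prefix sum is taken over the P segment totals, and each processor then rewrites its segment shifted by the total of all earlier segments. The Python sketch below simulates this sequentially. It is an illustration under these assumptions rather than the exact procedure of the cited paper, and the name `blocked_prefix_sum` is hypothetical. In a PEM setting, phases 1 and 3 are scans costing about N/(PB) I/Os, and phase 2 is where the log(P) term would arise.

```python
from itertools import accumulate

def blocked_prefix_sum(a: list[int], p: int) -> list[int]:
    """Inclusive prefix sums of a, computed segment by segment (p >= 1)."""
    n = len(a)
    seg = -(-n // p)  # ceil(n / p) items per simulated processor
    segments = [a[i * seg:(i + 1) * seg] for i in range(p)]

    # Phase 1: every processor scans its own segment and records the total.
    totals = [sum(s) for s in segments]

    # Phase 2: exclusive prefix sum over the p totals (done sequentially
    # here; this is the step a PRAM-style algorithm would parallelize).
    offsets = [0] + list(accumulate(totals))[:-1]

    # Phase 3: every processor rescans its segment and adds its offset.
    out: list[int] = []
    for s, off in zip(segments, offsets):
        out.extend(off + x for x in accumulate(s))
    return out

print(blocked_prefix_sum([3, 1, 4, 1, 5, 9, 2, 6], p=3))
# [3, 4, 8, 9, 14, 23, 25, 31]
```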

Multiway partitioning

Let $M=\{m_{1},...,m_{d-1}\}$ be a vector of d-1 pivots sorted in increasing order. Let $A$ be an unordered set of N elements. A d-way partition^[1] of $A$ is a set $\Pi =\{A_{1},...,A_{d}\}$ , where $\cup _{i=1}^{d}A_{i}=A$ and $A_{i}\cap A_{j}=\emptyset$ for $1\leq i<j\leq d$ . $A_{i}$ is called the i-th bucket. Every element in $A_{i}$ is greater than $m_{i-1}$ and smaller than $m_{i}$ . In the following algorithm^[1] the input is partitioned into N/P-sized contiguous segments $S_{1},...,S_{P}$ in main memory. Processor i primarily works on segment $S_{i}$ .

for each processor i in parallel do
 Read the vector of pivots  $M$  into the cache.
 Partition  $S_{i}$  into d buckets and let vector  $M_{i}=\{j_{1}^{i},...,j_{d}^{i}\}$  be the number of items in each bucket.
end for
Run PEM prefix sum on the set of vectors  $\{M_{1},...,M_{P}\}$  simultaneously.
for each processor i in parallel do
 Write elements  $S_{i}$  into memory locations offset appropriately by  $M_{i-1}$  and  $M_{i}$ .
end for
Using the prefix sums stored in  $M_{P}$  the last processor P calculates the vector  $B$  of bucket sizes and returns it.

If the vector M of $d=O({\frac {M}{B}})$ pivots and the input set A are located in contiguous memory, then the d-way partitioning problem can be solved in the PEM model with $O({\frac {N}{PB}}+\lceil {\frac {d}{B}}\rceil \log(P)+d\log(B))$ I/O complexity. The contents of the final buckets must be located in contiguous memory.
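The three steps of the pseudocode (local bucketing, prefix sums over the count vectors, offset writes) can be sketched as a sequential Python simulation. The function name `multiway_partition` is hypothetical, and the tie rule is an assumption: an element equal to a pivot is placed in the lower bucket because `bisect_left` is used.

```python
from bisect import bisect_left

def multiway_partition(a: list[int], pivots: list[int], p: int):
    """Partition a around sorted pivots; return (output array, bucket sizes)."""
    n, d = len(a), len(pivots) + 1
    seg = -(-n // p)  # ceil(n / p) items per simulated processor
    segments = [a[i * seg:(i + 1) * seg] for i in range(p)]

    # Step 1: every processor splits its segment into d local buckets.
    local = []
    for s in segments:
        buckets = [[] for _ in range(d)]
        for x in s:
            buckets[bisect_left(pivots, x)].append(x)
        local.append(buckets)
    counts = [[len(b) for b in buckets] for buckets in local]

    # Step 2: prefix sums over the count vectors give the bucket sizes
    # and the position at which every bucket starts in the output.
    sizes = [sum(c[j] for c in counts) for j in range(d)]
    start = [0] * d
    for j in range(1, d):
        start[j] = start[j - 1] + sizes[j - 1]

    # Step 3: every processor writes its local buckets at its offsets.
    out = [None] * n
    for j in range(d):
        pos = start[j]
        for i in range(p):
            out[pos:pos + counts[i][j]] = local[i][j]
            pos += counts[i][j]
    return out, sizes

out, sizes = multiway_partition([7, 2, 9, 4, 1, 8, 3, 6, 5], [3, 6], p=3)
print(out, sizes)  # [2, 1, 3, 4, 6, 5, 7, 9, 8] [3, 3, 3]
```

Each bucket ends up contiguous in the output, with the contributions of the processors appearing in processor order inside the bucket.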

Selection

The selection problem asks for the k-th smallest item in an unordered list $A$ of size $N$ . The following code^[1] makes use of PRAMSORT, a PRAM-optimal sorting algorithm that runs in $O(\log N)$ time, and SELECT, a cache-optimal single-processor selection algorithm.

if  $N\leq P$  then 
   ${\texttt {PRAMSORT}}(A,P)$ 
  return  $A[k]$ 
end if 

for each processor  $i$  in parallel do 
  //Find median of each  $S_{i}$ 
   $m_{i}={\texttt {SELECT}}(S_{i},{\frac {N}{2P}})$ 
end for 

// Sort medians
 ${\texttt {PRAMSORT}}(\lbrace m_{1},\dots ,m_{P}\rbrace ,P)$ 

// Partition around median of medians
 $t={\texttt {PEMPARTITION}}(A,m_{P/2},P)$ 

if  $k\leq t$  then 
  return  ${\texttt {PEMSELECT}}(A[1:t],P,k)$ 
else 
  return  ${\texttt {PEMSELECT}}(A[t+1:N],P,k-t)$ 
end if

Under the assumption that the input is stored in contiguous memory, PEMSELECT has an I/O complexity of:

$O\left({\frac {N}{PB}}+\log(PB)\cdot \log \left({\frac {N}{P}}\right)\right)$
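The control flow of PEMSELECT can be traced with a sequential Python sketch. Several simplifications are assumptions of this example and not part of the cited algorithm: `sorted` stands in for both PRAMSORT and SELECT, the segments are processed one after another, and the partition step is three-way (smaller, equal, larger) so that the recursion also terminates when the input contains duplicates. The name `pem_select` is hypothetical.

```python
def pem_select(a: list[int], p: int, k: int) -> int:
    """Return the k-th smallest item of a (k is 1-based, 1 <= k <= len(a))."""
    n = len(a)
    if n <= p:
        return sorted(a)[k - 1]  # stands in for PRAMSORT

    # Every simulated processor finds the median of its segment.
    seg = -(-n // p)  # ceil(n / p)
    segments = [a[i:i + seg] for i in range(0, n, seg)]
    medians = sorted(sorted(s)[(len(s) - 1) // 2] for s in segments)

    # Partition around the median of the medians.
    pivot = medians[(len(medians) - 1) // 2]
    smaller = [x for x in a if x < pivot]
    equal = [x for x in a if x == pivot]
    larger = [x for x in a if x > pivot]

    if k <= len(smaller):
        return pem_select(smaller, p, k)
    if k <= len(smaller) + len(equal):
        return pivot
    return pem_select(larger, p, k - len(smaller) - len(equal))

data = [9, 1, 8, 2, 7, 3, 6, 4, 5, 5]
print(pem_select(data, p=3, k=4))  # 4
```

Because the pivot itself always lands in the middle group, both recursive calls receive strictly fewer elements than the current call.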

Distribution sort

Distribution sort partitions an input list $A$ of size $N$ into $d$ disjoint buckets of similar size. Every bucket is then sorted recursively and the results are combined into a fully sorted list.

If $P=1$ the task is delegated to a cache-optimal single-processor sorting algorithm.

Otherwise the following algorithm^[1] is used:

// Sample  ${\tfrac {4N}{\sqrt {d}}}$  elements from  $A$ 
for each processor  $i$  in parallel do
  if  $M<|S_{i}|$  then
     $d=M/B$ 
    Load  $S_{i}$  in  $M$ -sized pages and sort pages individually
  else
     $d=|S_{i}|$ 
    Load and sort  $S_{i}$  as single page
  end if
  Pick every  ${\sqrt {d}}/4$ 'th element from each sorted memory page into contiguous vector  $R^{i}$  of samples
end for 

in parallel do
  Combine vectors  $R^{1}\dots R^{P}$  into a single contiguous vector  ${\mathcal {R}}$ 
  Make  ${\sqrt {d}}$  copies of  ${\mathcal {R}}$ :  ${\mathcal {R}}_{1}\dots {\mathcal {R}}_{\sqrt {d}}$ 
end do

// Find  ${\sqrt {d}}$  pivots  ${\mathcal {M}}[j]$ 
for  $j=1$  to  ${\sqrt {d}}$  in parallel do
   ${\mathcal {M}}[j]={\texttt {PEMSELECT}}({\mathcal {R}}_{j},{\tfrac {P}{\sqrt {d}}},{\tfrac {j\cdot 4N}{d}})$ 
end for

Pack pivots in contiguous array  ${\mathcal {M}}$ 

// Partition  $A$ around pivots into buckets  ${\mathcal {B}}$ 
 ${\mathcal {B}}={\texttt {PEMMULTIPARTITION}}(A[1:N],{\mathcal {M}},{\sqrt {d}},P)$ 

// Recursively sort buckets
for  $j=1$  to  ${\sqrt {d}}+1$  in parallel do
  recursively call  ${\texttt {PEMDISTSORT}}$  on bucket  $j$ of size  ${\mathcal {B}}[j]$ 
  using  $O\left(\left\lceil {\tfrac {{\mathcal {B}}[j]}{N/P}}\right\rceil \right)$  processors responsible for elements in bucket  $j$ 
end for

The I/O complexity of PEMDISTSORT is:

$O\left(\left\lceil {\frac {N}{PB}}\right\rceil \left(\log _{d}P+\log _{M/B}{\frac {N}{PB}}\right)+f(N,P,d)\cdot \log _{d}P\right)$

where

$f(N,P,d)=O\left(\log {\frac {PB}{\sqrt {d}}}\log {\frac {N}{P}}+\left\lceil {\frac {\sqrt {d}}{B}}\log P+{\sqrt {d}}\log B\right\rceil \right)$

If the number of processors is chosen such that $f(N,P,d)=O\left(\left\lceil {\tfrac {N}{PB}}\right\rceil \right)$ and $M<B^{O(1)}$ , the I/O complexity becomes:

$O\left({\frac {N}{PB}}\log _{M/B}{\frac {N}{B}}\right)$
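The overall shape of the algorithm (sample from sorted pages, choose about $\sqrt{d}$ pivots, partition, recurse on the buckets) can be illustrated with a heavily simplified sequential Python sketch. It is not the PEM algorithm itself: there is no parallelism, the pivots are taken at evenly spaced ranks of the sorted sample instead of being found with PEMSELECT, and the parameters `base` and `page` as well as the guard against buckets that do not shrink are assumptions of this example.

```python
import math
from bisect import bisect_left

def dist_sort(a: list[int], d: int = 64, base: int = 16, page: int = 16) -> list[int]:
    """Sort a by sampling, splitting into sqrt(d) + 1 buckets, and recursing."""
    n = len(a)
    if n <= base:
        return sorted(a)  # stands in for the single-processor base case

    r = math.isqrt(d)        # number of pivots, about sqrt(d)
    stride = max(1, r // 4)  # keep every (sqrt(d) / 4)-th item of a page

    # Sort fixed-size pages individually and sample each of them.
    pages = [sorted(a[i:i + page]) for i in range(0, n, page)]
    sample = sorted(x for pg in pages for x in pg[::stride])

    # Take r pivots at evenly spaced ranks of the sample.
    pivots = [sample[j * len(sample) // (r + 1)] for j in range(1, r + 1)]

    # Partition the input around the pivots.
    buckets = [[] for _ in range(r + 1)]
    for x in a:
        buckets[bisect_left(pivots, x)].append(x)

    out: list[int] = []
    for b in buckets:
        # Guard: a bucket that did not shrink (many equal keys) is sorted
        # directly instead of recursing forever.
        out.extend(sorted(b) if len(b) == n else dist_sort(b, d, base, page))
    return out

data = [(i * 37) % 101 for i in range(200)]
print(dist_sort(data) == sorted(data))  # True
```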

Other PEM algorithms


PEM Algorithm	I/O complexity	Constraints
Mergesort^[1]	$O\left({\frac {N}{PB}}\log _{\frac {M}{B}}{\frac {N}{B}}\right)={\textrm {sort}}_{P}(N)$	$P\leq {\frac {N}{B^{2}}},M=B^{O(1)}$
List ranking^[2]	$O\left({\textrm {sort}}_{P}(N)\right)$	$P\leq {\frac {N/B^{2}}{\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$
Euler tour^[2]	$O\left({\textrm {sort}}_{P}(N)\right)$	$P\leq {\frac {N}{B^{2}}},M=B^{O(1)}$
Expression tree evaluation^[2]	$O\left({\textrm {sort}}_{P}(N)\right)$	$P\leq {\frac {N}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$
Finding a MST^[2]	$O\left({\textrm {sort}}_{P}(|V|)+{\textrm {sort}}_{P}(|E|)\log {\tfrac {|V|}{PB}}\right)$	$P\leq {\frac {|V|+|E|}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$

Here ${\textrm {sort}}_{P}(N)$ denotes the I/O complexity of sorting $N$ items with $P$ processors in the PEM model.
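To get a feeling for the magnitudes involved, the sorting bound from the Mergesort row can be evaluated numerically. The helper names `sort_bound` and `satisfies_mergesort_constraint` below are hypothetical, the result ignores the constant factor hidden by the O-notation, and the sample parameter values are arbitrary.

```python
import math

def sort_bound(n: int, p: int, m: int, b: int) -> float:
    """Evaluate (N / (P * B)) * log_{M/B}(N / B), without constant factors."""
    return n / (p * b) * math.log(n / b, m / b)

def satisfies_mergesort_constraint(n: int, p: int, b: int) -> bool:
    """Check the processor constraint P <= N / B^2 from the Mergesort row."""
    return p <= n / b**2

n, p, m, b = 2**20, 4, 2**10, 2**4
print(satisfies_mergesort_constraint(n, p, b))  # True
print(round(sort_bound(n, p, m, b)))
```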

References

[1] Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures - SPAA '08. New York, New York, USA: ACM Press. doi:10.1145/1378533.1378573. ISBN 9781595939739.
[2] Arge, Lars; Goodrich, Michael T.; Sitchinava, Nodari (2010). "Parallel external memory graph algorithms". 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS). IEEE. doi:10.1109/ipdps.2010.5470440. ISBN 9781424464425.
