Parallel external memory (model)

In computer science, the parallel external memory (PEM) model is a cache-aware, external-memory abstract machine.^[1] It is the parallel-computing analogue of the single-processor external memory (EM) model and, similarly, the cache-aware analogue of the parallel random-access machine (PRAM). The PEM model consists of a number of processors, each with its own private cache, and a shared main memory.

Model

Definition

The PEM model^[1] combines the EM model and the PRAM model. It is a computation model consisting of $P$ processors and a two-level memory hierarchy: a large external memory (main memory) of size $N$ and $P$ small internal memories (caches). The processors share the main memory. Each cache is exclusive to a single processor; a processor cannot access another processor's cache. Each cache has size $M$ and is partitioned into blocks of size $B$. Processors can only operate on data that is in their own cache. Data is transferred between the main memory and a cache in blocks of size $B$.

I/O complexity

The complexity measure of the PEM model is the I/O complexity,^[1] which counts the number of parallel block transfers between the main memory and the caches. During a parallel block transfer, each processor can transfer one block. So if $P$ processors each load a data block of size $B$ from the main memory into their caches in parallel, this counts as an I/O complexity of $O(1)$, not $O(P)$. A program in the PEM model should minimize data transfer between main memory and caches and operate as much as possible on data already in the caches.
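As a minimal sketch of how parallel block transfers are counted, the following Python function computes the number of parallel I/Os needed to read each of $N$ elements once when the input is split evenly across $P$ processors. The function name `parallel_scan_ios`, the even split, and the block-aligned segments are assumptions made for illustration and are not taken from the cited paper.

```python
def parallel_scan_ios(n: int, p: int, b: int) -> int:
    """Count parallel block transfers needed to read n elements once.

    Assumes (hypothetically) that each of the p processors scans one
    contiguous, block-aligned segment of at most ceil(n / p) elements,
    and that blocks hold b elements.
    """
    if n < 0 or p <= 0 or b <= 0:
        raise ValueError("n must be non-negative; p and b must be positive")
    segment = -(-n // p)      # ceil(n / p) using integer arithmetic
    # In one parallel I/O every processor moves one block, so the count
    # is the number of blocks in the largest segment.
    return -(-segment // b)   # ceil(segment / b)
```

With `p = 4` and `b = 16`, reading `4 * 16` elements takes a single parallel I/O under these assumptions, since every processor loads exactly one block in the same step.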

Read / Write conflicts

In the PEM model, there is no direct communication network between the $P$ processors. The processors have to communicate indirectly through the main memory. If multiple processors try to access the same block in main memory concurrently, read/write conflicts^[1] occur. As in the PRAM model, three variations of this problem are considered:

Concurrent Read Concurrent Write (CRCW): The same block in main memory can be read and written by multiple processors concurrently.
Concurrent Read Exclusive Write (CREW): The same block in main memory can be read by multiple processors concurrently. Only one processor can write to a block at a time.
Exclusive Read Exclusive Write (EREW): The same block in main memory cannot be read or written by multiple processors concurrently. Only one processor can access a block at a time.
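The three variants can be illustrated with a small checker that decides whether one round of block accesses is permitted. This is only a sketch: the function name `round_allowed` and the `(processor, block, op)` tuple format are hypothetical, and under CREW it checks only the number of writers per block, since the description above does not say how a simultaneous read and write of the same block is treated.

```python
from collections import Counter

def round_allowed(accesses, variant):
    """Check one parallel I/O round against a conflict variant.

    accesses: iterable of (processor_id, block_id, op), op is "r" or "w".
    variant:  "CRCW", "CREW" or "EREW".
    """
    if variant not in ("CRCW", "CREW", "EREW"):
        raise ValueError(f"unknown variant: {variant}")
    readers, writers = Counter(), Counter()
    for _, block, op in accesses:
        if op == "r":
            readers[block] += 1
        elif op == "w":
            writers[block] += 1
        else:
            raise ValueError(f"unknown operation: {op}")
    if variant == "CRCW":
        return True  # concurrent reads and writes are both permitted
    for block in set(readers) | set(writers):
        if variant == "CREW" and writers[block] > 1:
            return False  # at most one writer per block
        if variant == "EREW" and readers[block] + writers[block] > 1:
            return False  # at most one access of any kind per block
    return True
```

For example, two processors reading block 7 in the same round pass the check for CRCW and CREW but not for EREW.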

The following two algorithms^[1] handle the case in which $P\leq B$ processors need to write to the same block simultaneously under the CREW and EREW variants. A first approach is to serialize the write operations: the processors write to the block one after the other. This results in a total of $P$ parallel block transfers. A second approach needs $O(\log(P))$ parallel block transfers and an additional block for each processor. The main idea is to schedule the write operations in a binary-tree fashion and gradually combine the data into a single block. In the first round, $P$ processors combine their blocks into $P/2$ blocks. Then $P/2$ processors combine the $P/2$ blocks into $P/4$ blocks. This procedure continues until all the data is combined in one block.
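The binary-tree combining scheme can be simulated sequentially as follows. This is a sketch under stated assumptions rather than the algorithm from the cited paper: the name `tree_combine` is hypothetical, each processor's additional block is modelled as a Python list in which `None` marks entries that processor did not write, and the inner loop stands in for merges that would run in parallel within one round.

```python
def tree_combine(partial_blocks):
    """Combine per-processor copies of one block in a binary-tree pattern.

    partial_blocks[i] is processor i's private copy of the target block;
    entries processor i did not write are None.
    Returns (combined_block, number_of_rounds).
    """
    blocks = [list(b) for b in partial_blocks]
    p = len(blocks)
    if p == 0:
        raise ValueError("need at least one processor")
    rounds = 0
    stride = 1
    while stride < p:
        # Merges within a round touch disjoint pairs of blocks, so in the
        # PEM model they could all be performed in the same parallel step.
        for left in range(0, p - stride, 2 * stride):
            right = left + stride
            blocks[left] = [
                r if r is not None else l
                for l, r in zip(blocks[left], blocks[right])
            ]
        stride *= 2
        rounds += 1
    return blocks[0], rounds
```

The number of rounds is $\lceil \log_2 P \rceil$. If each round costs a constant number of parallel block transfers, as assumed in this sketch, the total matches the $O(\log(P))$ bound stated above.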

Examples

Multiway partitioning

Let $M=\{m_{1},...,m_{d-1}\}$ be a vector of $d-1$ pivots sorted in increasing order. Let $A$ be an unordered set of $N$ elements. A $d$-way partition^[1] of $A$ is a set $\Pi =\{A_{1},...,A_{d}\}$, where $\cup _{i=1}^{d}A_{i}=A$ and $A_{i}\cap A_{j}=\emptyset$ for $1\leq i<j\leq d$. $A_{i}$ is called the $i$-th bucket. Every element of $A_{i}$ is greater than $m_{i-1}$ and smaller than $m_{i}$ (taking $m_{0}=-\infty$ and $m_{d}=+\infty$).

In the following algorithm,^[1] the input is partitioned into $N/P$-sized contiguous segments $S_{1},...,S_{P}$ in main memory. Processor $i$ primarily works on segment $S_{i}$. The multiway partitioning algorithm (PEM_DIST_SORT^[1]) uses a PEM prefix sum algorithm …

| Algorithm | Complexity | Constraints |
|---|---|---|
| … | $O\left({\textrm {sort}}_{P}(N)\right)$ | $P\leq {\frac {N/B^{2}}{\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$ |
| Euler tour^[2] | $O\left({\textrm {sort}}_{P}(N)\right)$ | $P\leq {\frac {N}{B^{2}}},M=B^{O(1)}$ |
| Expression tree evaluation^[2] | $O\left({\textrm {sort}}_{P}(N)\right)$ | $P\leq {\frac {N}{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$ |
| Finding an MST^[2] | $O\left({\textrm {sort}}_{P}(\lvert V\rvert )+{\textrm {sort}}_{P}(\lvert E\rvert )\log {\tfrac {\lvert V\rvert }{PB}}\right)$ | $P\leq {\frac {\lvert V\rvert +\lvert E\rvert }{B^{2}\log B\cdot \log ^{O(1)}N}},M=B^{O(1)}$ |

Here ${\textrm {sort}}_{P}(N)$ is the time it takes to sort $N$ items with $P$ processors in the PEM model.
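Returning to the $d$-way partition defined earlier in this section, the bucket assignment itself can be sketched sequentially in a few lines of Python. This illustrates only the definition of the buckets; it is not PEM_DIST_SORT and does not model caches, blocks, or processors. The function name `multiway_partition` is hypothetical, the pivots are assumed to be sorted already, and placing an element equal to a pivot $m_{i}$ into bucket $A_{i}$ is a tie-breaking assumption.

```python
from bisect import bisect_left

def multiway_partition(items, pivots):
    """Split items into len(pivots) + 1 buckets around sorted pivots.

    Bucket k (0-based) receives the items x with
    pivots[k-1] < x <= pivots[k], with -inf and +inf as outer bounds.
    """
    buckets = [[] for _ in range(len(pivots) + 1)]
    for x in items:
        # bisect_left returns the index of the first pivot >= x.
        buckets[bisect_left(pivots, x)].append(x)
    return buckets
```

For instance, `multiway_partition([5, 1, 9, 3, 7], [4, 8])` yields the three buckets `[1, 3]`, `[5, 7]` and `[9]`.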

References

1. Arge, Lars; Goodrich, Michael T.; Nelson, Michael; Sitchinava, Nodari (2008). "Fundamental parallel algorithms for private-cache chip multiprocessors". Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures - SPAA '08. New York, New York, USA: ACM Press. doi:10.1145/1378533.1378573. ISBN 9781595939739.
2. Cite error: The named reference :1 was invoked but never defined.
