Hypercube (communication pattern)

The $d$-dimensional hypercube is a network topology for parallel computers with $2^{d}$ processing elements. The topology allows for an efficient implementation of some basic communication primitives such as Broadcast, All-Reduce and Prefix sum.^[1] The processing elements are numbered from $0$ to $2^{d}-1$. Each processing element is adjacent to exactly those processing elements whose numbers differ from its own in exactly one bit. The algorithms described on this page make use of this structure.

Algorithm Outline

Most of the communication primitives presented in this article share a common template.^[2] Initially, each processing element possesses one message that must reach every other processing element during the course of the algorithm. The following pseudo code sketches the necessary communication steps. Here, Initialization, Operation and Output are placeholders that depend on the given communication primitive (see next section).

Input: message  $m$ .
Output: depends on Initialization, Operation and Output.
Initialization
 $s:=m$ 
for  $0\leq k<d$  do
     $y:=i{\text{ XOR }}2^{k}$ 
    Send  $s$  to  $y$ 
    Receive  $m$  from  $y$ 
    Operation $(s,m)$ 
endfor
Output

Each processing element iterates over its neighbors (the expression $i{\text{ XOR }}2^{k}$ negates the $k$ -th bit in $i$ 's binary representation, therefore obtaining the numbers of its neighbors). During an iteration, each processing element exchanges a message with the neighbor and processes the received message afterwards. The processing operation depends on the communication primitive.
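
The template can be tried out without a parallel machine. The following minimal sketch simulates all $2^{d}$ processing elements sequentially in Python; the function name `hypercube_template` and the use of a per-step snapshot to model the simultaneous exchange are assumptions of this sketch, not part of the template itself.

```python
def hypercube_template(messages, operation):
    """Sequentially simulate the template on p = 2**d processing elements.

    messages[i] is the initial message of element i; operation(s, m)
    returns the new local state after receiving m.
    """
    p = len(messages)
    d = p.bit_length() - 1
    assert p == 1 << d, "number of processing elements must be a power of two"
    state = list(messages)                 # s := m on every element
    for k in range(d):
        # All elements exchange at the same time, so every element must
        # read its neighbor's state from before this step.
        snapshot = list(state)
        for i in range(p):
            y = i ^ (1 << k)               # neighbor along dimension k
            state[i] = operation(state[i], snapshot[y])
    return state
```

With addition as the operation, every element ends up with the sum of all initial messages, for example `hypercube_template([1, 2, 3, 4], lambda s, m: s + m)`.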

Communication Primitives

Prefix sum

At the beginning of a prefix sum operation, each processing unit $i$ owns a message $m_{i}$. At the end, each processing unit $i$ should hold $\bigoplus _{0\leq j\leq i}m_{j}$, where $\oplus$ is an associative operation. The following pseudo code describes the algorithm.

input: message  $m_{i}$  of processor  $i$ .
output: prefix sum  $\bigoplus _{0\leq j\leq i}m_{j}$  of processor  $i$ .
 $x:=m_{i}$  
 $\sigma :=m_{i}$ 
for  $0\leq k\leq d-1$  do
     $y:=i{\text{ XOR }}2^{k}$ 
    Send  $\sigma$  to  $y$ 
    Receive  $m$  from  $y$ 
     $\sigma :=\sigma \oplus m$ 
    if bit  $k$  in  $i$  is set then  $x:=x\oplus m$ 
endfor

A hypercube of dimension $d$ can be split into two hypercubes of dimension $d-1$. In the following, the sub cube of all nodes whose number has a leading 0 in its binary representation is called the 0-sub cube; the remaining nodes form the 1-sub cube. After the prefix sum has been computed in both sub cubes, the total sum of the elements in the 0-sub cube still has to be added to every element of the 1-sub cube, because by definition the processing elements in the 0-sub cube have a lower rank than those in the 1-sub cube. In the implementation, each node therefore stores, in addition to its prefix sum (variable $x$), the sum over all elements in its sub cube (variable $\sigma$). This way, all nodes in the 1-sub cube can obtain the total sum over the 0-sub cube in every step.

The run-time contains a factor of $\log p$ for $T_{\text{start}}$ and a factor of $n\log p$ for $T_{\text{byte}}$: $T(n,p)=(T_{\text{start}}+nT_{\text{byte}})\log p$.
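
The prefix sum algorithm can be checked with a small sequential simulation. The sketch below keeps the two variables $x$ and $\sigma$ for every processing element; the name `hypercube_prefix_sum` and the list-based bookkeeping are assumptions made for illustration. It applies the operator in the same argument order as the pseudo code, so it has only been checked with a commutative operator such as integer addition.

```python
def hypercube_prefix_sum(messages, op):
    """Simulate the hypercube prefix sum for p = 2**d elements."""
    p = len(messages)
    d = p.bit_length() - 1
    assert p == 1 << d, "number of processing elements must be a power of two"
    x = list(messages)        # prefix result of each element
    sigma = list(messages)    # total over the element's current sub cube
    for k in range(d):
        # Every element receives the sub cube total of its neighbor
        # along dimension k (values from before this step).
        received = [sigma[i ^ (1 << k)] for i in range(p)]
        for i in range(p):
            m = received[i]
            sigma[i] = op(sigma[i], m)
            if i & (1 << k):          # element lies in the 1-sub cube
                x[i] = op(x[i], m)
    return x
```

For eight elements holding the values 1 to 8, `hypercube_prefix_sum(list(range(1, 9)), lambda a, b: a + b)` yields the running sums of the input.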

Gossip / All-Reduce

Gossip operations start with each processing element having a message $m_{i}$. After the operation is finished, each processing element knows the messages of all other processing elements, that is, it holds the concatenation $x:=m_{0}\cdot m_{1}\cdots m_{p-1}$. The operation can be implemented following the algorithm template.

input: message  $m_{i}$  at processing unit  $i$ .
output: all messages  $m_{0}\cdot m_{1}\cdots m_{p-1}$ .
 $x:=m_{i}$ 
for  $0\leq k<d$  do
     $y:=i{\text{ XOR }}2^{k}$ 
    Send  $x$  to  $y$ 
    Receive  $x'$  from  $y$ 
     $x:=x\cdot x'$ 
endfor

With each iteration the transferred message doubles in length. This leads to a run-time of $T(n,p)\approx \sum _{j=0}^{d-1}(T_{\text{start}}+n\cdot 2^{j}T_{\text{byte}})=\log(p)T_{\text{start}}+(p-1)nT_{\text{byte}}$ .

The same principle can be applied to the All-Reduce operation: instead of concatenating the two messages, each step combines them with the reduction operation. The result is a Reduce operation whose result is known to all processing units. On hypercubes, this modified Gossip reduces the number of communications compared to a Reduce followed by a Broadcast.
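
Both variants can be illustrated with a short sequential simulation. In the sketch below, the function names are hypothetical, and the gossip variant places the lower-numbered sub cube first when concatenating so that the result is ordered by sender; that ordering is an addition of this sketch and not required by the pseudo code.

```python
def _dimension(p):
    d = p.bit_length() - 1
    assert p == 1 << d, "number of processing elements must be a power of two"
    return d

def hypercube_gossip(messages):
    """Every element ends with the list of all messages, ordered by sender."""
    p = len(messages)
    x = [[m] for m in messages]
    for k in range(_dimension(p)):
        snapshot = [list(v) for v in x]    # state before the exchange
        for i in range(p):
            y = i ^ (1 << k)
            # The transferred list doubles in length in every step.
            low, high = (i, y) if i < y else (y, i)
            x[i] = snapshot[low] + snapshot[high]
    return x

def hypercube_all_reduce(messages, op):
    """Every element ends with the reduction of all messages under op."""
    p = len(messages)
    x = list(messages)
    for k in range(_dimension(p)):
        snapshot = list(x)
        for i in range(p):
            x[i] = op(snapshot[i], snapshot[i ^ (1 << k)])
    return x
```

In the All-Reduce variant the message length stays constant, whereas the gossip lists grow to $p$ entries.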

All-to-All

Here every processing element has a separate message for each of the other processing elements.

input: message $m_{ij}$  at processing element  $i$  to processing element  $j$ .
for  $d>k\geq 0$  do
   Receive from processing element  $i{\text{ XOR }}2^{k}$ :
       all messages for my  $k$ -dimensional sub cube
   Send to processing element  $i{\text{ XOR }}2^{k}$ :
       all messages for its  $k$ -dimensional sub cube
endfor

With each iteration, a message that has not yet arrived comes closer to its destination by one dimension. So only $d=\log {p}$ steps are needed. In every step, each processing element sends $p/2$ messages. In the first iteration, half of the messages are not meant for the element's own sub cube. In every following step the sub cube is only half the size, but in the previous step exactly the same number of messages arrived from another processing element.

This results in a run-time of $T(n,p)\approx \log {p}(T_{\text{start}}+{\frac {p}{2}}nT_{\text{byte}})$ .
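
The routing can be simulated sequentially to see that every message arrives and that each processing element forwards exactly $p/2$ messages per step. The following sketch is an illustration under assumptions: the name `hypercube_all_to_all`, the tuple layout of buffered messages, and the inclusion of a message from each element to itself are choices of this sketch.

```python
def hypercube_all_to_all(messages):
    """messages[i][j] is the message from element i to element j.

    Returns received, where received[j][i] == messages[i][j].
    """
    p = len(messages)
    d = p.bit_length() - 1
    assert p == 1 << d, "number of processing elements must be a power of two"
    # Each buffered entry is (destination, source, payload).
    buffers = [[(j, i, messages[i][j]) for j in range(p)] for i in range(p)]
    for k in reversed(range(d)):
        bit = 1 << k
        next_buffers = [[] for _ in range(p)]
        for i in range(p):
            sent = 0
            for dest, src, payload in buffers[i]:
                if (dest ^ i) & bit:
                    # Destination lies in the neighbor's k-dimensional sub cube.
                    next_buffers[i ^ bit].append((dest, src, payload))
                    sent += 1
                else:
                    next_buffers[i].append((dest, src, payload))
            assert sent == p // 2          # p/2 messages leave in every step
        buffers = next_buffers
    received = [[None] * p for _ in range(p)]
    for i in range(p):
        for dest, src, payload in buffers[i]:
            assert dest == i               # every message reached its target
            received[i][src] = payload
    return received
```

The result is the transpose of the input matrix of messages.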

ESBT-Broadcast

The ESBT-broadcast (Edge-disjoint Spanning Binomial Tree) algorithm^[3] is a pipelined broadcast algorithm with optimal runtime for clusters with hypercube network topology. The algorithm embeds $d$ edge-disjoint binomial trees in the hypercube, such that each neighbor of processing element $0$ is the root of a spanning binomial tree on $2^{d}-1$ nodes. To broadcast a message, the source node splits its message into $k$ chunks of equal size and cyclically sends them to the roots of the binomial trees. Upon receiving a chunk, the binomial trees broadcast it.

The runtime of this algorithm is as follows. In each step, the source node sends one of its $k$ chunks to a binomial tree. Broadcasting the chunk within the binomial tree takes $d$ steps. Thus, it takes $k$ steps to distribute all chunks and additionally $d$ steps until the last binomial tree broadcast has finished, resulting in $k+d$ steps overall. Therefore, the runtime for a message of length $n$ is $T(n,p,k)=\left({\frac {n}{k}}T_{\text{byte}}+T_{\text{start}}\right)(k+d)$. With the optimal number of chunks $k^{*}={\sqrt {\frac {nd\cdot T_{\text{byte}}}{T_{\text{start}}}}}$, the optimal runtime of the algorithm is $T^{*}(n,p)=n\cdot T_{\text{byte}}+\log(p)\cdot T_{\text{start}}+2{\sqrt {n\log(p)\cdot T_{\text{start}}\cdot T_{\text{byte}}}}$.
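
The optimum can be checked numerically. Expanding the product gives $T(n,p,k)=nT_{\text{byte}}+dT_{\text{start}}+{\frac {nd}{k}}T_{\text{byte}}+kT_{\text{start}}$, and the two terms that depend on $k$ are equal at the minimum. The sketch below evaluates the model in Python; the function names and the example parameter values in the usage note are hypothetical and chosen only for illustration, and $k$ is treated as a real number rather than rounded to an integer.

```python
import math

def esbt_runtime(n, p, k, t_start, t_byte):
    """Model runtime for a message of n bytes split into k chunks."""
    d = math.log2(p)
    return (n / k * t_byte + t_start) * (k + d)

def esbt_optimal_chunks(n, p, t_start, t_byte):
    """Real-valued number of chunks that minimizes esbt_runtime."""
    d = math.log2(p)
    return math.sqrt(n * d * t_byte / t_start)

def esbt_optimal_runtime(n, p, t_start, t_byte):
    """Runtime at the optimal k: the two k-dependent terms coincide."""
    d = math.log2(p)
    return n * t_byte + d * t_start + 2 * math.sqrt(n * d * t_start * t_byte)
```

For instance, with the assumed values `n = 1_000_000`, `p = 1024`, `t_start = 1e-5` and `t_byte = 1e-9`, evaluating `esbt_runtime` at `esbt_optimal_chunks(...)` agrees with `esbt_optimal_runtime(...)` up to rounding.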

Construction of the Binomial Trees

Figure: A $3$-dimensional hypercube with three ESBTs embedded.

This section describes how to construct the binomial trees systematically. First, construct a single binomial spanning tree on $2^{d}$ nodes as follows. Number the nodes from $0$ to $2^{d}-1$ and consider their binary representation. Then the children of each node are obtained by negating single leading zeroes. This results in a single binomial spanning tree. To obtain $d$ edge-disjoint copies of the tree, translate and rotate the nodes: for the $k$-th copy of the tree, apply a XOR operation with $2^{k}$ to each node. Afterwards, right rotate all nodes by $k$ digits. The resulting binomial trees are edge-disjoint and therefore fulfill the requirements for the ESBT-broadcasting algorithm.
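
A translate-and-rotate construction of this kind can be checked mechanically. The sketch below is one concrete variant under stated assumptions, not necessarily the exact convention of the cited paper: it builds the single binomial tree by negating leading zeroes as described, but for the $k$-th copy it applies XOR with the top bit $2^{d-1}$ and then rotates the $d$-bit labels left by $k+1$ positions. This particular bit choice and all function names are assumptions of the sketch. With it, tree $k$ is rooted at $2^{k}$, node $0$ is a leaf in every tree, and no directed edge occurs in two trees.

```python
def rotl(v, r, d):
    """Rotate the d-bit number v left by r positions."""
    r %= d
    mask = (1 << d) - 1
    return ((v << r) | (v >> (d - r))) & mask

def binomial_tree_edges(d):
    """Directed (parent, child) edges of the binomial tree rooted at 0."""
    edges = []
    for v in range(1 << d):
        # Leading zeroes of v sit at positions bit_length(v) .. d-1.
        for j in range(v.bit_length(), d):
            edges.append((v, v | (1 << j)))
    return edges

def esbt_edges(d, k):
    """Edges of the k-th relabeled copy (assumed variant, rooted at 2**k)."""
    top = 1 << (d - 1)
    return [
        (rotl(a ^ top, k + 1, d), rotl(b ^ top, k + 1, d))
        for a, b in binomial_tree_edges(d)
    ]
```

Disjointness here is meant for directed edges: for $d=3$, the edge between nodes $1$ and $3$ is used from $1$ to $3$ by one tree and from $3$ to $1$ by another.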

References

[1] Grama, A. (2003). Introduction to Parallel Computing (2nd ed.). Addison Wesley. ISBN 978-0201648652.
[2] Foster, I. (1995). Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison Wesley. ISBN 0201575949.
[3] Johnsson, S.L.; Ho, C.-T. (1989). "Optimum broadcasting and personalized communication in hypercubes". IEEE Transactions on Computers. 38 (9): 1249–1268. doi:10.1109/12.29465. ISSN 0018-9340.
