Parallel breadth-first search
Serial breadth-first search
The breadth-first search algorithm is a way to explore the nodes of a graph layer by layer, starting from a given node s: s forms layer 0, and all direct neighbors of s form layer 1. The common sequential algorithm uses two data structures, one for the frontier (the current layer) and one for the next frontier (the layer to be reached next); after each layer traversal, the next frontier becomes the frontier.[S08] The following pseudocode outlines this idea, with the data structures for the frontier and next frontier called FS and NS respectively.
define bfs_sequential(graph(V,E), source s):
    for all v in V do
        d[v] = -1;
    d[s] = 0; level = 0; FS = {}; NS = {};
    push(s, FS);
    while FS !empty do
        for u in FS do
            for each neighbour v of u do
                if d[v] = -1 then
                    push(v, NS);
                    d[v] = level + 1;
        FS = NS, NS = {}, level = level + 1;
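For illustration, a direct C++ rendering of this pseudocode might look as follows. This is a sketch, not code from a cited implementation; the adjacency-list representation of the graph is an assumption.

#include <vector>

// Sequential BFS: returns the BFS level of every vertex, -1 if unreachable.
std::vector<int> bfs_sequential(const std::vector<std::vector<int>>& adj, int s) {
    std::vector<int> d(adj.size(), -1);   // d[v] = -1 marks "not yet visited"
    d[s] = 0;
    int level = 0;
    std::vector<int> FS = {s}, NS;        // current and next frontier
    while (!FS.empty()) {
        for (int u : FS)
            for (int v : adj[u])
                if (d[v] == -1) {         // first time v is reached
                    NS.push_back(v);
                    d[v] = level + 1;
                }
        FS.swap(NS);                      // the next frontier becomes the frontier
        NS.clear();
        ++level;
    }
    return d;
}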
Breadth-first search is an important and useful graph algorithm, but it has proven difficult to parallelize. Nevertheless, practical parallel formulations exist, and the following sections present some of their core ideas.
First step of parallelization
A simple and intuitive solution is the classical Parallel Random Access Machine (PRAM) approach, which simply extends the sequential algorithm shown above. The two for-loops can be executed in parallel, for which the distance update and the update of the next-frontier data structure need to be atomic. These two operations also introduce races: a vertex can be a neighbor of more than one vertex in the frontier, in which case its distance may be updated more than once. Although such races waste resources and lead to unnecessary overhead, they do not affect the correctness of BFS. Moreover, one barrier synchronization is needed per level, so the number of parallel phases is proportional to the graph diameter d.[paper]
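The following C++/OpenMP sketch illustrates this simple level-synchronous parallelization; it is an illustration of the idea above, not code from the cited paper. The loop over the frontier runs in parallel, an atomic compare-and-swap makes the distance update and the frontier insertion race-free, and the implicit barrier at the end of the parallel region provides the once-per-level synchronization.

// compile with: g++ -std=c++17 -fopenmp parallel_bfs.cpp
#include <atomic>
#include <vector>

std::vector<int> bfs_parallel(const std::vector<std::vector<int>>& adj, int s) {
    std::vector<std::atomic<int>> d(adj.size());
    for (auto& x : d) x.store(-1);
    d[s].store(0);
    std::vector<int> FS = {s}, NS;
    int level = 0;
    while (!FS.empty()) {
        #pragma omp parallel
        {
            std::vector<int> local;              // per-thread next-frontier buffer
            #pragma omp for nowait               // parallel loop over the frontier
            for (size_t i = 0; i < FS.size(); ++i)
                for (int v : adj[FS[i]]) {
                    int expected = -1;           // atomically claim an unvisited v
                    if (d[v].compare_exchange_strong(expected, level + 1))
                        local.push_back(v);
                }
            #pragma omp critical                 // merge thread-local buffers
            NS.insert(NS.end(), local.begin(), local.end());
        }                                        // implicit barrier: one per level
        FS.swap(NS);
        NS.clear();
        ++level;
    }
    std::vector<int> out;
    for (auto& x : d) out.push_back(x.load());
    return out;
}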
This simple parallelization's asymptotic complexity is the same as that of the sequential algorithm, but there is room for optimization along three key directions:[paper]
1. load balancing of the edge visits
2. mitigating the cost of barrier synchronization
3. improving the locality of memory references
In recent years, parallel BFS algorithms have been developed for a variety of parallel systems. The approaches taken on these systems are highlighted in the following sections.
Parallel BFS with shared memory
Compared to parallel BFS with distributed memory, shared memory provides higher memory bandwidth and lower latency. Because developers do not need to program explicit message passing, the overhead of messages is avoided.[1]
However, the memory accesses and the work distribution of BFS are highly irregular, so load balancing is crucial for parallel BFS on shared memory. On systems with non-uniform memory access (NUMA), and on multicore and multiprocessor systems in general, exploiting data locality can also speed up the parallel traversal.
Many parallel BFS algorithms on shared memory can be divided into two types: container-centric approaches and vertex-centric approaches.[2] In the container-centric approach, two data structures hold the current frontier and the next vertex frontier; the next vertex frontier is switched to the current frontier at the end of each step. These containers can be private to a processing entity (such as a thread), which favors data locality but requires extra load-balancing mechanisms. Alternatively, they can be global, which provides implicit load balancing; but then threads work on them concurrently, and more effort is needed for synchronization.
The data structure of the containers can also be optimized. The typical data structure in serial BFS and in some parallel BFS implementations is the FIFO queue, which is simple and fast: insertion and deletion cost only O(1) time.
Another alternative is the bag structure.[3] Insertion into a bag takes O(1) amortized time, as fast as a FIFO queue, and O(log n) worst-case time. Furthermore, the union of two bags takes Θ(log n) time, where n is the number of elements in the smaller bag, and splitting a bag also takes Θ(log n) time. With the bag structure, a number of vertices (determined by a granularity parameter) are stored in one bag, and the bag becomes the unit of parallel work. A reducer can be combined with bags to write vertices in parallel and traverse them efficiently.
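A minimal sketch of the pennant and bag idea from [3] follows; it is simplified (vertex payloads, bag splitting, memory management and the reducer are omitted). A pennant of size 2^k supports O(1) union with another pennant of the same size, and a bag stores at most one pennant per size, so insertion behaves like incrementing a binary counter.

struct Pennant {                  // holds exactly 2^k vertices
    int vertex;
    Pennant* left  = nullptr;     // root of a complete binary tree of 2^k - 1 nodes
    Pennant* right = nullptr;     // unused at the root, used inside the tree
};

// Union of two pennants of equal size 2^k yields one of size 2^(k+1) in O(1).
Pennant* pennant_union(Pennant* x, Pennant* y) {
    y->right = x->left;
    x->left  = y;
    return x;
}

struct Bag {                      // slot k holds a pennant of size 2^k or nothing,
    Pennant* slot[32] = {};       // analogous to the bits of a binary counter

    void insert(Pennant* p) {     // O(1) amortized, O(log n) worst case
        int k = 0;
        while (slot[k] != nullptr) {          // "carry": union equal-size pennants
            p = pennant_union(slot[k], p);
            slot[k] = nullptr;
            ++k;
        }
        slot[k] = p;
    }
};

A single vertex enters the bag as a pennant of size one; the union of two whole bags proceeds like binary addition over the slot arrays, which is where the Θ(log n) union bound quoted above comes from.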
Vertex-centric approaches treat each vertex as a parallel entity, which enables parallel iteration: each vertex is assigned to a processing entity. This approach may only work well if the graph depth is very low. It is well suited for GPUs when every thread is mapped to exactly one vertex,[2] and it is also suitable for NUMA machines because of the opportunity to enhance data locality and use linear memory accesses.
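A minimal vertex-centric sketch (illustrative; the names are not taken from the cited papers) makes the per-vertex parallelism explicit: every vertex is inspected in every level, and only frontier vertices do work, which is why the approach pays off only for graphs of low depth. For brevity this sketch tolerates the benign duplicate writes discussed in the parallelization section above; a stricter version would use atomics as in the earlier example.

#include <vector>

void bfs_vertex_centric(const std::vector<std::vector<int>>& adj,
                        int s, std::vector<int>& d) {
    d.assign(adj.size(), -1);
    d[s] = 0;
    bool changed = true;
    for (int level = 0; changed; ++level) {
        changed = false;
        // every vertex is a parallel entity; only frontier vertices do work
        #pragma omp parallel for reduction(||: changed)
        for (long u = 0; u < (long)adj.size(); ++u) {
            if (d[u] != level) continue;    // u is not in the current frontier
            for (int v : adj[u])
                if (d[v] == -1) {           // concurrent writers store the same
                    d[v] = level + 1;       // value: the benign race from above
                    changed = true;
                }
        }
    }
}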
Parallel BFS with distributed memory
1-D partitioning
1-D partitioning, which is based on partitioning the vertices, is the simplest way to map BFS onto distributed memory. Load balancing remains an important issue for the data partition: each processing unit with distributed memory (e.g., a processor) should be in charge of approximately the same number of vertices and their outgoing edges. For example, each processing unit can store the adjacency information of its local vertices, with one row per vertex containing the destination indices of that vertex's outgoing edges.
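For example, with a block distribution, ownership can be computed directly from the vertex number. This is an illustrative sketch (the block distribution is one of several options, and the helper names are assumptions); the find_owner helper reappears in the multithreaded pseudocode below.

// Block distribution: each of p processors owns a contiguous vertex range
// and the adjacency rows of those vertices.
int find_owner(long v, long n, int p) {
    long per = (n + p - 1) / p;                   // ceil(n / p) vertices per processor
    return (int)(v / per);
}

long to_local(long v, long n, int p) {
    long per = (n + p - 1) / p;
    return v - (long)find_owner(v, n, p) * per;   // index into the local rows
}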
Unlike shared-memory BFS, in distributed-memory BFS a neighbor of a vertex in one processor's current frontier may be stored on another processor. As a result, each processor is responsible for telling those processors about the traversal status by sending them messages, and each processor must also receive messages from all other processors in order to construct its local next vertex frontier. One all-to-all communication, in which every entity has different messages for every other, is therefore necessary in each step to exchange the current frontier and the next vertex frontier.
The following pseudocode of 1-D distributed-memory BFS, based on[4], shows more details:
define 1_D_distributed_memory_BFS( graph(V,E), source s):
    //normal initialization
    for all v in V do
        d[v] = -1;
    d[s] = 0; level = 0; FS = {}; NS = {};
    //begin BFS traversal
    while True do:
        FS = {the set of local vertices with current level}
        //all vertices traversed
        if FS = {} for all processors then:
            terminate the while loop
        //construct the NS based on local vertices in current frontier
        NS = {neighbors of vertices in FS, both local and not local vertices}
        //synchronization: all-to-all communication
        for 0 <= j < p do:
            N_j = {vertices in NS owned by processor j}
            send N_j to processor j
            receive N_j_rcv from processor j
        //combine the received messages to form the local next vertex frontier, then update the level for them
        NS_rcv = Union(N_j_rcv)
        for v in NS_rcv and d[v] == -1 do
            d[v] = level + 1
        level = level + 1
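The per-level all-to-all exchange of this pseudocode could be realized with MPI roughly as follows. This is a simplified sketch under stated assumptions: vertices are long values, outgoing[j] already holds the neighbors discovered locally that are owned by processor j, and deduplication plus the level update are left to the caller.

// compile with: mpic++ -std=c++17 bfs1d.cpp
#include <algorithm>
#include <mpi.h>
#include <vector>

std::vector<long> exchange(const std::vector<std::vector<long>>& outgoing,
                           MPI_Comm comm) {
    int p; MPI_Comm_size(comm, &p);
    std::vector<int> sendcnt(p), recvcnt(p), sdispl(p + 1), rdispl(p + 1);
    for (int j = 0; j < p; ++j) sendcnt[j] = (int)outgoing[j].size();
    // every processor first learns how much it will receive from the others
    MPI_Alltoall(sendcnt.data(), 1, MPI_INT, recvcnt.data(), 1, MPI_INT, comm);
    for (int j = 0; j < p; ++j) {
        sdispl[j + 1] = sdispl[j] + sendcnt[j];
        rdispl[j + 1] = rdispl[j] + recvcnt[j];
    }
    std::vector<long> sendbuf(sdispl[p]), recvbuf(rdispl[p]);
    for (int j = 0; j < p; ++j)
        std::copy(outgoing[j].begin(), outgoing[j].end(),
                  sendbuf.begin() + sdispl[j]);
    // the actual all-to-all communication step of the pseudocode
    MPI_Alltoallv(sendbuf.data(), sendcnt.data(), sdispl.data(), MPI_LONG,
                  recvbuf.data(), recvcnt.data(), rdispl.data(), MPI_LONG, comm);
    return recvbuf;   // all frontier candidates owned by this processor
}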
Combined with multithreading, the following pseudocode of 1-D distributed-memory BFS, based on[5], additionally specifies thread-local stacks and thread barriers:
define 1_D_distributed_memory_BFS_with_threads( graph(V,E), source s):
    //normal initialization
    for all v in V do
        d[v] = -1;
    level = 1; FS = {}; NS = {};
    //find the index of the owner processor of source vertex s
    pu_s = find_owner(s);
    if pu_s = index_pu then
        push(s, FS);
        d[s] = 0;
    //message initialization
    for 0 <= j < p do
        sendBuffer_j = {}   // p shared message buffers
        recvBuffer_j = {}   // for MPI communication
        thrdBuffer_i_j = {} // thread-local stack for thread i
    //begin BFS traversal
    while FS != {} do
        //traverse vertices and find owners of neighbor vertices
        for each u in FS in parallel do
            for each neighbor v of u do
                pu_v = find_owner(v)
                push(v, thrdBuffer_i_(pu_v))
        Thread Barrier
        //combine thread stacks to form sendBuffer
        for 0 <= j < p do
            merge thrdBuffer_i_j in parallel
        //all-to-all communication
        All-to-all collective step with master thread:
            1. send data in sendBuffer
            2. receive and aggregate newly visited vertices into recvBuffer
        Thread Barrier
        //update level for newly visited vertices
        for each u in recvBuffer in parallel do
            if d[u] == -1 then
                d[u] = level
                push(u, NS_i)
        Thread Barrier
        //aggregate NS_i and form new FS
        FS = Union(NS_i)
        Thread Barrier
        level = level + 1
2-D partitioning
1. "Parallel Breadth-First Search on Distributed Memory Systems"
- 2D partitioning of processors
2. "A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L"
- 2D partitioning of graph edges - pseudocode
two-phases of 2D partitioning(more detials): 1. expand 2. fold 3. more details in "Distributed-Memory Breadth-First Search on Massive Graphs"
advantadges of 2D partitioning over 1D partitioning:
- that the processor-column and processorrow communications involve R and C processors, respectively; for 1D partitioning, all P processors are involved in the communication operation.("A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L")
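As a sketch of the mechanics (illustrative; the R × C grid shape and all names are assumptions, not taken from the cited papers), the communicators that carry the two phases can be created by splitting the global communicator; the expand phase then communicates within a processor column of R participants and the fold phase within a processor row of C participants.

#include <mpi.h>

// Build the R x C processor grid used by 2-D partitioning.
void make_grid(MPI_Comm world, int C, MPI_Comm* row_comm, MPI_Comm* col_comm) {
    int rank;
    MPI_Comm_rank(world, &rank);
    int r = rank / C;                        // row index in the grid
    int c = rank % C;                        // column index in the grid
    MPI_Comm_split(world, r, c, row_comm);   // C processors per row (fold)
    MPI_Comm_split(world, c, r, col_comm);   // R processors per column (expand)
}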
Implementation optimization strategies
Direction optimization
1. "A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L"
- In a bi-directional search, the search starts from both source and destination vertices and continues until a path connecting the source and destination is found.
2. "Distributed-Memory Breadth-First Search on Massive Graphs"
- direction optimization
- top-down BFS (pseudocode)
- combination of bottom-up and top-down BFS (how); see the sketch after this list
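The bottom-up step that these papers combine with the top-down step can be sketched as follows (illustrative; the adjacency list of an undirected graph serves as both in- and out-edges). Instead of the frontier scanning its outgoing edges, every unvisited vertex scans its incoming edges and stops at the first parent found in the frontier. The common heuristic switches to this step when the frontier covers a large fraction of the edges and switches back when it shrinks again; the exact threshold is implementation-specific.

#include <vector>

// One bottom-up level: returns how many vertices were discovered.
long bottom_up_step(const std::vector<std::vector<int>>& adj,
                    std::vector<int>& d, int level) {
    long discovered = 0;
    for (size_t v = 0; v < adj.size(); ++v) {
        if (d[v] != -1) continue;       // already visited
        for (int u : adj[v])            // scan in-edges for a frontier parent
            if (d[u] == level) {
                d[v] = level + 1;
                ++discovered;
                break;                  // stop early: one parent suffices
            }
    }
    return discovered;
}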
Load balance
1. "Parallel Breadth-First Search on Distributed Memory Systems"
- a reasonably load-balanced graph traversal is achieved by randomly shuffling all the vertex identifiers prior to partitioning; see the sketch below
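A sketch of this shuffling (the names are illustrative): generate a random permutation of the vertex identifiers and rename every vertex before partitioning, so that consecutive identifier blocks no longer correspond to clustered, possibly high-degree regions of the input.

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

std::vector<long> shuffled_labels(long n, unsigned seed = 42) {
    std::vector<long> perm(n);
    std::iota(perm.begin(), perm.end(), 0);            // 0, 1, ..., n-1
    std::shuffle(perm.begin(), perm.end(), std::mt19937(seed));
    return perm;   // vertex v is renamed perm[v] prior to partitioning
}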
Data structures
"Besser: ganz kurz _benennen_+Referenz"
1. CSR
1. "Parallel Breadth-First Search on Distributed Memory Systems" - ‘compressed sparse row’ (CSR) - doublycompressed sparse columns (DCSC) - more details in "Distributed-Memory Breadth-First Search on Massive Graphs" 2. "Scalable GPU Graph Traversal" - detailed information for CSR - combination with prefix sum 3. "Task-based Parallel Breadth-First Search in Heterogeneous Environments"
2. Bag
- "A Work-Efficient Parallel Breadth-First Search Algorithm(or How to Cope with the Nondeterminism of Reducers)"
3. Bitmap
- "Distributed-Memory Breadth-First Search on Massive Graphs" - (new paper)"Distributed-Memory Breadth-First Search on Massive Graphs" introduced in "Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems" - "Level-Synchronous Parallel Breadth-First Search Algorithms For Multicore and Multiprocessor Systems" - "Scalable GPU Graph Traversal"
External memory BFS
"More work is needed here, please volunteer and write a paragraph."
BFS parallelization with GPU
"More work is needed here, please volunteer and write a paragraph."
GPU
"More work is needed here, please volunteer and write a paragraph."
Combination of GPU and CPU
"More work is needed here, please voluteer and write a paragraph."
Benchmarks
Graph500 is the first benchmark for data-intensive supercomputing problems.[1] The benchmark first generates a list of edge tuples, each with two endpoints. Kernel 1 then constructs an undirected graph, in which edge weights are not assigned if only kernel 2 runs afterwards. Users can choose to run BFS in kernel 2 and/or single-source shortest paths in kernel 3 on the constructed graph. The results of these kernels are verified and their running times are measured.
Graph500 also provides two reference implementations for kernels 2 and 3. In the reference BFS, the exploration of vertices simply sends messages to target processors to inform them of visited neighbors; there are no extra load-balancing methods. For the synchronization, an AML barrier (AML, the Active Messages Library, is an SPMD communication library built on top of MPI3 and intended for fine-grained applications such as Graph500) ensures a consistent traversal after each layer. Users may also implement their own BFS algorithm; the choice of BFS is not constrained, as long as the output BFS tree is correct.
The correctness of a result is checked by comparison with the reference BFS result. Because only 64 sampled search keys are run through kernels 2 and 3 sequentially, a result is also considered correct when it differs from the reference result only because a search key was not sampled. The results for the 64 search keys are used to compute a mean and variance, with which the performance of a single search is measured.
Unlike Top500, the performance metric of Graph500 is traversed edges per second (TEPS), defined as
TEPS(n) = m / timeK2(n),
where timeK2(n) is the measured execution time of kernel 2 (or kernel 3) and m is the number of undirected edges in the traversed component of the graph. For example, a run that traverses a component with m = 1.6 × 10^9 edges in 2 seconds achieves 0.8 GTEPS.
More information about Graph500 can be found at https://graph500.org.
References
- ^ Bader, David A., and Kamesh Madduri. "Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2." 2006 International Conference on Parallel Processing (ICPP'06). IEEE, 2006.
- ^ a b Berrendorf, Rudolf, and Mathias Makulla. "Level-synchronous parallel breadth-first search algorithms for multicore and multiprocessor systems." FC 14 (2014): 26-31.
- ^ Leiserson, Charles E., and Tao B. Schardl. "A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers)." Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures. ACM, 2010.
- ^ Yoo, Andy, et al. "A scalable distributed parallel breadth-first search algorithm on BlueGene/L." Proceedings of the 2005 ACM/IEEE conference on Supercomputing. IEEE Computer Society, 2005.
- ^ Buluç, Aydin, and Kamesh Madduri. "Parallel breadth-first search on distributed memory systems." Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011.