Parallel breadth-first search

The breadth-first search algorithm is a simple way to explore the vertices of a graph layer by layer. It is a basic algorithm in graph theory which is used as a building block of many other graph algorithms. For instance, BFS is used by Dinic's algorithm to find the maximum flow in a graph. Moreover, BFS is also one of the kernel algorithms in the Graph500 benchmark, a benchmark for data-intensive supercomputing problems.[1]

In the conventional sequential BFS algorithm, two data structures are created to store the frontier and the next frontier. The frontier contains all vertices that have the same distance (also called the "level") from the source vertex; these are the vertices to be explored in the current step of BFS. Their neighbors that have not been explored yet are discovered and put into the next frontier. At the beginning of the algorithm, a given source vertex s is the only vertex in the frontier. All direct neighbors of s are visited in the first step, and they form the next frontier. After each layer traversal, the next frontier is switched to the frontier and the newly discovered vertices are stored in the new next frontier. The following pseudo-code outlines the idea, in which the data structures for the frontier and the next frontier are called FS and NS respectively.

    define bfs_sequential(graph(V,E), source s):
        for all v in V do
            d[v] = -1;
        d[s] = 0; level = 1; FS = {}; NS = {};
        push(s, FS);
        while FS !empty do
            for u in FS do
                for each neighbour v of u do
                    if d[v] = -1 then
                        push(v, NS);
                        d[v] = level;
            FS = NS, NS = {}, level = level + 1;
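The pseudo-code above translates almost line by line into a runnable program. The following is a minimal sketch in C++ (the function name and the adjacency-list representation are illustrative assumptions, not part of any reference implementation):

    #include <vector>

    // Returns d, where d[v] is the BFS level of vertex v, or -1 if v is unreachable.
    std::vector<int> bfs_sequential(const std::vector<std::vector<int>>& adj, int s) {
        std::vector<int> d(adj.size(), -1);          // d[v] = -1 marks "not discovered"
        d[s] = 0;
        std::vector<int> FS{ s }, NS;                // frontier and next frontier
        int level = 1;
        while (!FS.empty()) {
            for (int u : FS)
                for (int v : adj[u])
                    if (d[v] == -1) {                // v is discovered for the first time
                        NS.push_back(v);
                        d[v] = level;
                    }
            FS.swap(NS);                             // the next frontier becomes the frontier
            NS.clear();
            ++level;
        }
        return d;
    }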

First step of parallelization

As a simple and intuitive solution, the classic Parallel Random Access Machine (PRAM) approach is just an extension of the sequential algorithm shown above: the two for-loops can be executed in parallel, while the update of the next frontier and the update of the distance need to be atomic. There are, however, two problems with this simple parallelization. First, the distance-checking and distance-updating operations introduce benign races: because a vertex can be the neighbor of several vertices in the frontier, its distance may be examined and updated more than once. Although these races waste resources and lead to unnecessary overhead, they do not affect the correctness of BFS. Second, despite the speedup of each layer traversal due to parallel processing, a barrier synchronization is needed after every layer. This layer-by-layer synchronization means that the number of communication steps required is O(d), where O is the big O notation and d is the diameter of the graph.
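As a sketch of this simple parallelization, the following C++/OpenMP fragment (OpenMP is an assumption chosen for illustration; the PRAM model itself is abstract) parallelizes the loop over the frontier. A compare-and-swap makes the distance check-and-update atomic, so each vertex enters the next frontier exactly once even when several threads examine it, and the implicit barrier at the end of the parallel loop is exactly the per-layer synchronization discussed above:

    #include <omp.h>
    #include <atomic>
    #include <cstdint>
    #include <vector>

    std::vector<int> bfs_parallel(const std::vector<std::vector<int>>& adj, int s) {
        std::vector<std::atomic<int>> d(adj.size());
        for (auto& x : d) x.store(-1);
        d[s].store(0);
        std::vector<int> FS{ s };
        int level = 1;
        while (!FS.empty()) {
            // one next-frontier buffer per thread, merged after the barrier
            std::vector<std::vector<int>> NS_local(omp_get_max_threads());
            #pragma omp parallel for schedule(dynamic, 64)
            for (std::int64_t i = 0; i < (std::int64_t)FS.size(); ++i) {
                for (int v : adj[FS[i]]) {
                    int expected = -1;
                    // only the thread winning the CAS pushes v; a losing thread
                    // has merely re-examined an already-discovered vertex
                    if (d[v].compare_exchange_strong(expected, level))
                        NS_local[omp_get_thread_num()].push_back(v);
                }
            }   // implicit barrier here: the layer-by-layer synchronization
            FS.clear();
            for (auto& buf : NS_local)               // serial merge, for simplicity
                FS.insert(FS.end(), buf.begin(), buf.end());
            ++level;
        }
        std::vector<int> out(adj.size());
        for (std::size_t i = 0; i < adj.size(); ++i) out[i] = d[i].load();
        return out;
    }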

The asymptotic complexity of this simple parallelization is the same as that of the sequential algorithm in the worst case, but some optimizations can be made to achieve a better BFS parallelization, for example:

1. load-balancing for edge visiting

2. mitigating barrier synchronization

3. improving the locality of memory references

Much research on parallel BFS has been done for different parallel machine models. The approaches taken on various systems are highlighted in the following sections.

Parallel BFS with shared memory

Compared to parallel BFS with distributed memory, shared memory provides higher memory bandwidth and lower latency. Because developers do not need to program message passing, the overhead of messages is avoided.[2]

However, the memory accesses and the work distribution of BFS have been shown to be highly irregular, so it is very important to keep parallel BFS on shared memory load-balanced. On systems with non-uniform memory access (NUMA), and on multicore and multiprocessor systems in general, exploiting data locality can also speed up the parallel process.

Many parallel BFS algorithms on shared memory can be divided into two types: container-centric approaches and vertex-centric approaches.[3] In a container-centric approach, two data structures are created to store the current frontier and the next frontier; the next frontier is switched to the current frontier at the end of each step. These containers can be private to a processing entity (such as a thread), which supports data locality but requires an extra load-balancing mechanism. Alternatively, they can be global, which provides implicit load balancing; but then the threads work on them concurrently and more effort is required for synchronization.

Besides, the data structure of the containers can be optimized. The typical data structure in serial BFS and some parallel BFS implementations is the FIFO queue, which is simple and fast: insertion and deletion take only O(1) time.

Another alternative is the bag structure.[4] Insertion into a bag takes O(log n) time in the worst case, but only O(1) amortized time, which is as fast as a FIFO queue. Furthermore, the union of two bags takes Θ(log n) time, where n is the number of elements in the smaller bag, and the bag-split operation also takes Θ(log n) time. With the bag structure, a number of vertices (chosen according to the desired granularity) are stored in one bag, and the bag becomes the parallel entity. A reducer can be combined with the bag structure to write vertices in parallel and traverse them efficiently.
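A compact sketch of the pennant-based bag described in the cited paper[4] is shown below (simplified: memory management, the split operation, and the reducer are omitted). A pennant of size 2^k is a tree whose root has only a left child, which is a complete binary tree of the remaining 2^k - 1 nodes; a bag keeps at most one pennant of each size in a backbone array, so insertion behaves like incrementing a binary counter:

    #include <array>
    #include <cstddef>

    struct Node { int value; Node* left = nullptr; Node* right = nullptr; };

    // Union of two pennants of the same size 2^k gives one pennant of size 2^(k+1).
    Node* pennant_union(Node* x, Node* y) {
        y->right = x->left;
        x->left  = y;
        return x;
    }

    struct Bag {
        static constexpr std::size_t R = 64;   // backbone size; supports up to 2^64 elements
        std::array<Node*, R> S{};              // S[k] is either null or a pennant of size 2^k

        // O(1) amortized, O(log n) worst case, like a binary-counter increment.
        void insert(int v) {
            Node* x = new Node{ v };           // singleton pennant
            std::size_t k = 0;
            while (S[k] != nullptr) {          // "carry" the pennant upwards
                x = pennant_union(S[k], x);
                S[k++] = nullptr;
            }
            S[k] = x;
        }
    };

The union of two bags proceeds like binary addition on the two backbones, using pennant_union as the carry step, which yields the Θ(log n) bound quoted above.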

The vertex-centric approach treats each vertex as a parallel entity, which enables parallel iteration: every vertex is assigned to a parallel entity. Since the whole vertex set is examined in each iteration, this approach might only work well if the graph depth is very low. It is well suited to GPUs, where every thread can be mapped to exactly one vertex.[3] It is also suitable for NUMA machines, since it makes it possible to enhance data locality and linear memory access.
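The following C++/OpenMP sketch (hypothetical names; a GPU version would map each loop iteration to one hardware thread) illustrates one vertex-centric level step: every vertex scans its own neighbor list and joins the new frontier if one of its neighbors was discovered in the previous level. Because all vertices are examined in every step, work proportional to the vertex count is paid per level even for tiny frontiers, which is why the approach favors graphs of very low depth:

    #include <cstdint>
    #include <vector>

    // One level step; returns true if any vertex was newly discovered.
    // in_prev/in_next mark frontier membership with one byte per vertex, avoiding
    // the bit-packing of std::vector<bool>, which is unsafe to write in parallel.
    bool level_step(const std::vector<std::vector<int>>& adj,
                    const std::vector<char>& in_prev, std::vector<char>& in_next,
                    std::vector<int>& d, int level) {
        bool changed = false;
        #pragma omp parallel for reduction(||: changed)
        for (std::int64_t v = 0; v < (std::int64_t)adj.size(); ++v) {
            if (d[v] != -1) continue;          // already discovered
            for (int u : adj[v])
                if (in_prev[u]) {              // a neighbor is in the frontier
                    d[v] = level;              // only vertex v's own iteration writes to v
                    in_next[v] = 1;
                    changed = true;
                    break;
                }
        }
        return changed;
    }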

Parallel BFS with distributed memory

1-D partitioning

1D partitioning is the simplest way to combine parallel BFS with distributed memory. It is based on vertex partitioning. Once again, load balancing is an important issue for the data partition: to achieve it, each processor should be in charge of approximately the same number of vertices and outgoing edges. For the data storage, each processor can store an adjacency matrix of its local vertices, in which each row is the list of outgoing edges of one vertex, represented by the indices of the destination vertices.
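For instance, with n vertices and p processors, a contiguous 1D block partition and its owner function might look as follows (a sketch; find_owner is the name used in the multithreaded pseudo-code below):

    // Processor j owns the ceil(n/p) consecutive vertex indices starting at
    // j*ceil(n/p), together with their rows of outgoing edges.
    int find_owner(long long v, long long n, int p) {
        long long block = (n + p - 1) / p;   // ceil(n/p) vertices per processor
        return (int)(v / block);
    }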

Unlike shared-memory BFS, a neighbor of a vertex held by one processor may be stored on another processor. As a result, each processor is responsible for informing those processors about the traversal status by sending them messages. Moreover, each processor must also process the messages from all other processors to construct its local next frontier. Obviously, one all-to-all communication (in which each entity has different messages for all the others) is necessary in each step to exchange the current frontier and the next frontier.

The following pseudo-code of 1-D distributed-memory BFS, which comes from the paper,[5] shows more details. The algorithm was originally designed for IBM BlueGene/L systems, which have a 3D torus network architecture. Because synchronization is the main extra cost of parallelized BFS, the authors also developed a scalable all-to-all communication scheme based on point-to-point communications, and then reduced the number of point-to-point communications by taking advantage of the machine's high-bandwidth torus network.

    define 1_D_distributed_memory_BFS( graph(V,E), source s):
        //normal initialization
        for all v in V do
            d[v] = -1;
        d[s] = 0; level = 0; FS = {}; NS = {};
        //begin BFS traversal
        while True do:
            FS = {the set of local vertices whose distance equals level}
            //all vertices traversed
            if FS = {} for all processors then:
                terminate the while loop
            //construct NS from the local vertices in the current frontier
            NS = {neighbors of vertices in FS, both local and not local}
            //synchronization: all-to-all communication
            for 0 <= j < p do:
                N_j = {vertices in NS owned by processor j}
                send N_j to processor j
                receive N_j_rcv from processor j
            //combine the received messages to form the local next frontier, then update its levels
            NS_rcv = Union(N_j_rcv)
            for v in NS_rcv and d[v] == -1 do
                d[v] = level + 1
            level = level + 1

Combined with multithreading, the following pseudo-code of 1D distributed-memory BFS, which comes from the paper,[6] also specifies the thread stacks and thread barriers.

    define 1_D_distributed_memory_BFS_with_threads( graph(V,E), source s):
        //normal initialization
        for all v in V do
            d[v] = -1;
        level = 1; FS = {}; NS = {};
        //find the index of the owner processor of source vertex s
        pu_s = find_owner(s);
        if pu_s = index_pu then
            push(s, FS);
            d[s] = 0;
        //message initialization
        for 0 <= j < p do
            sendBuffer_j = {}   // p shared message buffers
            recvBuffer_j = {}   // for MPI communication
            thrdBuffer_i_j = {} // thread-local stack for thread i
        //begin BFS traversal
        while FS != {} do
            //traverse vertices and find the owners of neighbor vertices
            for each u in FS in parallel do
                for each neighbor v of u do
                    pu_v = find_owner(v)
                    push(v, thrdBuffer_i_(pu_v))
            Thread Barrier
            //combine thread stacks to form the sendBuffer
            for 0 <= j < p do
                merge thrdBuffer_i_j in parallel
            //all-to-all communication
            All-to-all collective step with master thread:
                1. send data in sendBuffer
                2. receive and aggregate newly visited vertices into recvBuffer
            Thread Barrier
            //update the level of newly visited vertices
            for each u in recvBuffer in parallel do
                if d[u] == -1 then
                    d[u] = level
                    push(u, NS_i)
            Thread Barrier
            //aggregate NS_i and form the new FS
            FS = Union(NS_i)
            Thread Barrier
            level = level + 1

2-D partitioning

Because the BFS algorithm can be expressed in terms of the adjacency matrix of the graph, the natural 2D decomposition of the matrix is also an option to consider. In 2D partitioning, each processor has a 2D index (i,j). The edges and vertices are assigned to all processors by a 2D block decomposition, and each processor stores a sub-adjacency matrix.

In general, parallel edge processing based on 2D partitioning can be organized into two communication phases: an "expand" phase and a "fold" phase.[6]

In the "expand" phase, if the edge list of a given vertex is a column of the adjacency matrix, then for each vertex v in the frontier, the owner of v is responsible for telling the other processors in its processor column that v is visited. After this communication, each processor knows which vertices make up the next frontier.[5]

In the "fold" phase, the vertices in the resulting next frontier are sent to their owner processors to form the new frontier locally. With 2D partitioning, these processors are in the same processor row.[5]

In 2D partitioning, only the processors of one column or of one row participate in communication in the "expand" and "fold" phases respectively.[5] This is the advantage of 2D partitioning over 1D partitioning, in which all processors are involved in the all-to-all communication. Besides, 2D partitioning is also more flexible for load balancing, which makes a scalable and storage-efficient approach much easier to achieve.
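A much simplified sketch of the block mapping (an illustrative assumption of a plain R x C blocking; the cited schemes[5][6] refine this layout) shows why only one processor column or row is involved in each phase:

    // Processors form an R x C grid; processor (i,j) stores the block of the
    // adjacency matrix covering its range of rows and columns.
    struct Proc2D { int i, j; };

    Proc2D owner_2d(long long u, long long v, long long n, int R, int C) {
        long long rows = (n + R - 1) / R;    // block height, ceil(n/R)
        long long cols = (n + C - 1) / C;    // block width, ceil(n/C)
        return { (int)(u / rows), (int)(v / cols) };
    }

Under this blocking, the edge list of a vertex v (one matrix column) is spread over the R processors of a single block column, so the "expand" phase only involves that column; sending discovered vertices back to their owners stays within one processor row of C processors in the "fold" phase.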

Implementation optimization strategies

Aside from the basic ideas of parallel BFS, optimization strategies are an important aspect of speeding up parallel BFS and improving its efficiency. Several optimizations for parallel BFS already exist, such as direction optimization, load-balancing mechanisms, and improved data structures.

Direction optimization

In the original top-down BFS, each vertex in the frontier must examine all of its neighbors. This is sometimes not effective when the graph has a low effective diameter but some vertices have a much higher degree than average, such as in a small-world graph, since a great number of attempts are wasted on checking vertices that have already been visited.[7]

In the paper,[7] the authors introduce a bottom-up BFS, in which each unvisited vertex only needs to check whether any of its potential parents is in the frontier. This can be determined efficiently if the frontier is represented by a bitmap. Compared to top-down BFS, bottom-up BFS reduces the failed checks by having each vertex examine its potential parents itself, which also prevents contention.

However, bottom-up BFS requires serializing the work on each vertex, and it only works better when a large fraction of the vertices are in the frontier. As a result, a direction-optimizing BFS should combine top-down and bottom-up BFS: the traversal starts in the top-down direction, switches to bottom-up BFS when the number of frontier vertices exceeds a given threshold, and switches back when it falls below it.[7]
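The following C++ sketch (illustrative; the published heuristic of [7] compares the edges to be examined in each direction against tuned parameters) shows a bottom-up step over a frontier bitmap and a simple switching rule:

    #include <cstddef>
    #include <vector>

    // Switch to bottom-up once the frontier is "heavy": most top-down edge
    // checks would then hit already-visited vertices. alpha is a tuning knob.
    bool use_bottom_up(long long frontier_edges, long long total_edges,
                       double alpha = 14.0) {
        return frontier_edges > (long long)(total_edges / alpha);
    }

    // One bottom-up step: every undiscovered vertex scans its neighbors and
    // stops at the first one found in the current frontier.
    void bottom_up_step(const std::vector<std::vector<int>>& adj,
                        const std::vector<char>& in_frontier,
                        std::vector<char>& next_frontier,
                        std::vector<int>& parent) {
        for (std::size_t v = 0; v < adj.size(); ++v) {
            if (parent[v] != -1) continue;       // already discovered
            for (int u : adj[v])
                if (in_frontier[u]) {            // a potential parent is in the frontier
                    parent[v] = (int)u;          // v writes only to itself: no contention
                    next_frontier[v] = 1;
                    break;                       // early exit saves the failed checks
                }
        }
    }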

Load balance

Load balancing is very important not only in parallel BFS but in all parallel algorithms, because balanced work improves efficiency. In fact, almost all designers of parallel BFS algorithms should observe and analyze the work partitioning of their algorithm and provide a load-balancing mechanism for it.

Randomization is a useful and simple way to achieve load balancing. For instance, in the paper,[6] the graph is traversed after randomly shuffling all vertex identifiers prior to partitioning.
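A sketch of such a relabeling (hypothetical function name) simply draws a random permutation of the vertex identifiers before the partition is computed, so that high-degree vertices end up spread roughly evenly over the processors:

    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    // relabel[v] is the new identifier of vertex v.
    std::vector<long long> shuffle_vertex_ids(long long n, unsigned seed) {
        std::vector<long long> relabel(n);
        std::iota(relabel.begin(), relabel.end(), 0LL);   // 0, 1, ..., n-1
        std::mt19937_64 rng(seed);
        std::shuffle(relabel.begin(), relabel.end(), rng);
        return relabel;
    }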

Data structures

There are some special data structures that parallel BFS can benefit from, such as CSR (Compressed Sparse Row), the bag structure, and bitmaps.

In CSR, all adjacencies of a vertex are sorted and compactly stored in a contiguous chunk of memory, with the adjacencies of vertex i+1 stored next to those of vertex i. CSR is extremely fast because accessing the adjacency of a vertex costs only constant time, but it is only space-efficient for 1D partitioning.[6] More information about CSR can be found in [8]. For 2D partitioning, DCSC (doubly compressed sparse columns) for hyper-sparse matrices is more suitable; more information can be found in [9].
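A minimal CSR sketch follows (illustrative names): one offsets array delimits, inside a single contiguous targets array, the neighbor list of each vertex, so accessing any vertex's adjacency costs constant time:

    #include <utility>
    #include <vector>

    struct CSRGraph {
        std::vector<long long> offsets;  // size n+1; neighbors of i are targets[offsets[i] .. offsets[i+1])
        std::vector<int> targets;        // all neighbor lists, stored back to back
    };

    CSRGraph build_csr(int n, const std::vector<std::pair<int,int>>& edges) {
        CSRGraph g;
        g.offsets.assign(n + 1, 0);
        for (auto [u, v] : edges) ++g.offsets[u + 1];                 // count out-degrees
        for (int i = 0; i < n; ++i) g.offsets[i + 1] += g.offsets[i]; // prefix sums
        g.targets.resize(edges.size());
        std::vector<long long> next(g.offsets.begin(), g.offsets.end() - 1);
        for (auto [u, v] : edges) g.targets[next[u]++] = v;           // scatter edges
        return g;
    }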

In the paper,[4] the authors develop a new data structure called the bag structure. Insertion into a bag takes O(1) amortized time and the union of two bags takes Θ(log n) time; the bag-split operation also takes Θ(log n) time. With the bag structure, parallel BFS can write the vertices of a layer into a single data structure in parallel and later traverse them efficiently in parallel.[4]

Moreover, a bitmap is also a very useful data structure to remember which vertices have already been visited, be it in bottom-up BFS[10] or simply for checking whether vertices have been visited in top-down BFS.[8]
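A bitmap of this kind can be made safe for concurrent marking with an atomic fetch-or, as in the following sketch (one bit per vertex; illustrative, not taken from the cited implementations):

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct Bitmap {
        std::vector<std::atomic<std::uint64_t>> words;   // 64 vertices per word
        explicit Bitmap(std::size_t bits) : words((bits + 63) / 64) {}

        // Returns true if this call set the bit, false if it was already set.
        bool test_and_set(std::size_t i) {
            std::uint64_t mask = 1ULL << (i % 64);
            return (words[i / 64].fetch_or(mask) & mask) == 0;
        }
        bool test(std::size_t i) const {
            return (words[i / 64].load() >> (i % 64)) & 1ULL;
        }
    };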


Benchmarks

Graph500 is the first benchmark for data-intensive supercomputing problems.[1] This benchmark first generates an edge tuple with two endpoints. Kernel 1 then constructs an undirected graph, in which edge weights are not assigned if only kernel 2 runs afterwards. Users can choose to run BFS as kernel 2 and/or Single-Source Shortest Path as kernel 3 on the constructed graph. The results of those kernels are verified and their running time is measured.

Graph500 also provides two reference implementations for kernels 2 and 3. In the reference BFS, the exploration of vertices simply sends messages to the target processors to inform them of visited neighbors. There is no extra load-balancing method. For the synchronization, an AML (Active Messages Library, an SPMD communication library built on top of MPI3 and intended for fine-grain applications such as Graph500) barrier ensures consistent traversal after each layer. The reference BFS is only used for correctness verification of the results; users should therefore implement their own BFS algorithm based on their hardware. The choice of BFS is not constrained, as long as the output BFS tree is correct.

The correctness of the results is based on a comparison with the results of the reference BFS. Because only 64 search keys are sampled to run kernel 2 and/or kernel 3, a result is also considered correct when it differs from the reference result only because a search key is not among the samples. The kernel is also run sequentially for these 64 search keys to compute the mean and the variance, with which the performance of a single search is measured.

Unlike TOP500, the performance metric in Graph500 is Traversed Edges Per Second (TEPS), which is defined as:

TEPS(n) = m / timeK2(n), where timeK2(n) is the measured execution time of kernel 2 (or kernel 3) and m is the number of undirected edges in the traversed component of the graph. For example, a run that traverses a component with m = 5×10^9 undirected edges in 2 seconds achieves 2.5×10^9 TEPS, i.e. 2.5 GTEPS.

More information about Graph500 can be found at https://graph500.org.

References

  1. ^ a b Graph500
  2. ^ "Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2.", Bader, David A., and Kamesh Madduri. 2006 International Conference on Parallel Processing (ICPP'06). IEEE, 2006.
  3. ^ a b "Level-synchronous parallel breadth-first search algorithms for multicore and multiprocessor systems.", Berrendorf, Rudolf, and Mathias Makulla. FC 14 (2014): 26-31.
  4. ^ a b c "A work-efficient parallel breadth-first search algorithm (or how to cope with the nondeterminism of reducers).", Leiserson, Charles E., and Tao B. Schardl. Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures. ACM, 2010.
  5. ^ a b c d "A scalable distributed parallel breadth-first search algorithm on BlueGene/L.", Yoo, Andy, et al. Proceedings of the 2005 ACM/IEEE conference on Supercomputing. IEEE Computer Society, 2005.
  6. ^ a b c d "Parallel breadth-first search on distributed memory systems.", Buluç, Aydin, and Kamesh Madduri. Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2011.
  7. ^ a b c "Direction-optimizing breadth-first search.", Beamer, Scott, Krste Asanović, and David Patterson. Scientific Programming 21.3-4 (2013): 137-148.
  8. ^ a b "Scalable GPU Graph Traversal", Merrill, Duane, Michael Garland, and Andrew Grimshaw. Acm Sigplan Notices. Vol. 47. No. 8. ACM, 2012.
  9. ^ "On the representation and multiplication of hypersparse matrices.", Buluç, Aydin, and John R. Gilbert. 2008 IEEE International Symposium on Parallel and Distributed Processing. IEEE, 2008.
  10. ^ "Distributed-memory breadth-first search on massive graphs.", Buluç, Aydin, et al. arXiv preprint arXiv:1705.04590 (2017).