
Testing high-performance computing applications


High-performance computing applications, which run on massively parallel supercomputers, consist of concurrent programs designed using multi-threaded, multi-process models. The applications may consist of various constructs (threads, local processes, distributed processes, etc.) with varying degrees of parallelism. Although highly concurrent programs use design patterns, models and principles similar to those of sequential programs, unlike sequential programs they typically exhibit non-deterministic behavior. As the number of interactions between the various parallel constructs increases, the probability of bugs increases exponentially. Race conditions, data races, deadlocks, missed signals and livelock are common error types.

Challenges

Broadly speaking, parallel programs fall into two categories: explicitly parallel and implicitly parallel. If parallel language constructs for process creation, communication and synchronization are used explicitly when developing a program, it is explicitly parallel. If a tool or parallelizing compiler is used to convert a serial program into a parallel one, it is implicitly parallel. Both categories are equally prone to bugs.

Heisenbugs

When a concurrent application is tested, it is expected to execute correctly on every possible thread schedule of the underlying operating system. However, traditional testing methods detect few bugs in parallel applications, chiefly because of the Heisenbug problem.[1] A Heisenbug is an error that changes its behavior or disappears completely when an attempt is made to isolate and probe it via a debugger, by adding constructs such as synchronization requests or delay statements. Such bugs therefore go undetected and unfixed.

Non-repeatability

Another issue arises from the unpredictable behavior of the scheduler, which varies with the load on the system and cannot be controlled manually. To counter the indeterminism introduced by the operating system's scheduler, the program must be executed many times under various execution environments; even then, reproducing a bug is not guaranteed. In most runs the program executes correctly, and the bug's effect becomes visible only when specific conditions are met. Non-repeatability is thus a major roadblock to detecting errors in concurrent programs. As an example, consider the following.

#include <pthread.h>

pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

void *thread1(void *t)
{
    pthread_mutex_lock(&a);
    pthread_mutex_lock(&b);
    // do some work
    // ...
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return NULL;
}

void *thread2(void *t)
{
    pthread_mutex_lock(&b);   /* locks taken in the opposite order */
    pthread_mutex_lock(&a);
    // do some work
    // ...
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return NULL;
}

This code can clearly deadlock: if thread1 acquires mutex a while thread2 acquires mutex b, each thread then blocks waiting for the mutex held by the other. Yet the deadlock occurs only in some runs of the program; in other runs, it completes successfully.

Detecting Probe Effect

The probe effect is seen when delay statements are inserted into parallel programs that have synchronization problems. Like Heisenbugs, it alters the behavior of the software: inserting, removing or altering delays can make a non-functioning concurrent program work, or cause a properly working one to fail. Detecting the source of a probe effect poses a great challenge in testing parallel applications.
The main difference between the probe effect and Heisenbugs is that Heisenbugs appear when additional delay statements and/or synchronization requests are added to a concurrent application during testing, while the probe effect appears when the developer adds delay statements to a concurrent application with poor synchronization.

Testing Strategies

The differences between sequential and concurrent programs lead to differences in their testing strategies. Existing strategies for sequential programs can be modified to suit concurrent applications, and new strategies have been developed. Conventionally, testing includes designing test cases and checking that the program "works" as expected; errors in specification, functionality, etc. are detected by actually running the application and subjecting it to methods such as functional, white-box, black-box and grey-box testing.[2]

In addition to these tests, static analysis is increasingly used to detect errors in high performance software. The source code is analyzed with methods such as data flow analysis, control flow analysis, cyclomatic complexity, thread escape analysis and static slicing to find problems in parallel applications. Using static analysis before functionality testing can save valuable time for the developer and tester: it can detect not only what the error is but also where it originates. Static analysis techniques can detect problems such as missing or improper synchronization, predict the occurrence of deadlocks, and find post-wait errors in rendezvous requests. The following are some strategies for testing parallel applications.

Deterministic Scheduling / Reproducible Testing [1]

The indeterminacy of scheduling originates from two sources.

1. Time slicing:
The scheduler makes a context switch at regular time intervals. This interval depends on the speed of the individual processors, the state of the memory-cache hierarchy and the load on the system. Even on the same processor, under the same load, the interval varies slightly due to minor variations in the frequency of the clock oscillator.
2. Page faults:
The scheduler also suspends a program when a page fault occurs, letting other threads proceed while the system fetches the required page from disk into memory. Since the occurrence of page faults depends on the other processes running in the system, the timing of a context switch is indeterminate.

To make concurrent programs repeatable, an external scheduler is used. The program under test is instrumented with calls to this scheduler, made at the beginning and end of each thread and before every synchronization request. This scheduler, which is independent of the OS scheduler, selectively blocks threads of execution by maintaining a semaphore associated with each thread, such that only one thread is ready for execution at any given time. It thereby converts the parallel, non-deterministic application into a serial execution sequence, achieving repeatability. The number of scheduling decisions made by the serializing scheduler is given by:

(N × K / P) × (N + P)!

where
N = number of threads,
K = potential context switch points,
P = budget of pre-emptive context switches.
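
A minimal sketch of such a serializing scheduler, assuming POSIX threads and semaphores; the names (sched_point, run, done, schedule) and the fixed schedule itself are illustrative inventions, not the interface of any particular tool:

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

#define NTHREADS 2

/* One semaphore per thread; the external scheduler posts exactly one of
 * them at a time, so only one thread is ever runnable. */
static sem_t run[NTHREADS];
static sem_t done;                /* signalled at every scheduling point */

/* Instrumentation hook, called at thread start/end and before every
 * synchronization request: yield to the scheduler, wait to be re-selected. */
static void sched_point(int tid)
{
    sem_post(&done);
    sem_wait(&run[tid]);
}

static void *worker(void *arg)
{
    int tid = (int)(long)arg;
    sem_wait(&run[tid]);          /* wait until first scheduled */
    sched_point(tid);             /* instrumented synchronization point */
    printf("thread %d ran\n", tid);
    sem_post(&done);              /* thread-end notification */
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    sem_init(&done, 0, 0);
    for (long i = 0; i < NTHREADS; i++) {
        sem_init(&run[i], 0, 0);
        pthread_create(&t[i], NULL, worker, (void *)i);
    }
    /* A fixed, serialized schedule: each entry names the only thread
     * allowed to run up to its next scheduling point. */
    int schedule[] = {0, 1, 0, 1};
    for (int s = 0; s < 4; s++) {
        sem_post(&run[schedule[s]]);
        sem_wait(&done);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Replaying the same schedule array reproduces the same interleaving on every run; enumerating different arrays explores the serialized schedules counted above.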

Feedback Directed Testing

More accurate results than with deterministic scheduling alone can be obtained by an alternate, feedback-directed approach. A few properly placed pre-emptions in a concurrent program can quickly detect many data-race bugs.[1] Bugs have also been observed to occur in clusters: if one bug is found, there is a high probability that more bugs exist in the same region of code. The concurrent application is therefore tested in multiple passes. Each pass identifies a section of code that may contain bugs, and this information is fed back to the next pass to detect more errors in the identified region. In this way, data races are detected by exploring different scheduling paths, calls to the new scheduler are inserted at racing program locations, and, by allowing the racing locations to execute in different scheduling orders, the results are analyzed for unexpected behavior.

Timing-Related Testing

This strategy is employed to ensure that the application is not prone to the probe effect. Sources of errors causing the probe effect range from task-creation issues to synchronization and communication problems. Requirements of timing-related tests are:[3]

  • Delay durations must be varied
  • Delay points must cover the necessary program locations
  • Delay statements must be inserted, removed or relocated to induce the probe effect

This strategy defines the number of test cases per input data set as:

nC1 + nC2 + … + nCn = 2^n − 1

where n = total number of synchronization, process creation and communication calls.

This number grows exponentially with n. To reduce the number of test cases, either the Deterministic Execution Method (DET) or the Multiple Execution Technique (MET) is used. Various issues must be handled for successful timing-related testing:

  • Achieving delayed execution: Adding delays is straightforward; a typical sleep() statement, supported by almost all languages, can be used to insert them (see the sketch below).
  • Deciding where to insert delays: The program locations where delays must be inserted to induce the probe effect are known as delay points. Since timing-related test cases aim to detect synchronization, communication and thread-creation errors, delay statements are added just in front of these statements.

Advantages of this strategy include:
  • Easy to implement on multiple processors without any need to order the synchronization requests
  • No need to generate a concurrency graph
  • More effective for fault detection
  • Fewer total test cases, yet greater code coverage, due to static analysis
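
Below is a minimal sketch of achieving delayed execution, assuming POSIX threads and usleep(); the delay durations and the chosen delay point are arbitrary illustrative choices:

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static int shared;

static void *worker(void *arg)
{
    useconds_t delay = *(useconds_t *)arg;
    usleep(delay);               /* inserted delay; duration varies per test case */
    pthread_mutex_lock(&m);      /* delay point: the synchronization request under test */
    shared++;
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    useconds_t d1 = 0, d2 = 1000;   /* one timing test case from the delay-duration set */
    pthread_create(&t1, NULL, worker, &d1);
    pthread_create(&t2, NULL, worker, &d2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return shared == 2 ? 0 : 1;     /* check the outcome under this timing */
}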

All Du-Path Testing

This method applies the concept of definition-use (du) pairs to determine the paths to be tested.
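
For illustration, a definition-use pair in a small hypothetical C function; all-du-path testing requires covering every loop-free path from each definition to each of its uses:

int f(int a)
{
    int x = a * 2;        /* D: definition of x */
    int y;
    if (a > 0)
        y = x + 1;        /* U1: use of x on the true branch  */
    else
        y = x - 1;        /* U2: use of x on the false branch */
    return y;             /* the du-paths D->U1 and D->U2 each need a test case */
}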

Verification Strategies

Software verification is a process that confirms the software is working correctly and performing its intended task. It helps build confidence that the software indeed performs the task for which it was developed.

Test Calculations

Test calculations are a straightforward method: input is given to the system to generate a result that is already known. The input-result pair can be obtained from previous empirical results and/or manual calculations.[4] Although this is an intuitive way to verify the system, a tester cannot rely on test calculations as the only verification method. They test the entire system globally, so they can only be performed once all modules of the application are ready and integrated, and they only detect whether bugs are present, giving no information on the number, location or source of the bugs. Test calculations should therefore form only a part of the complete testing plan.
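
A minimal sketch in C, where compute_result() is a stand-in for the real application (an assumption for illustration) and the comparison uses a relative tolerance, since floating-point results are rarely bit-identical:

#include <math.h>
#include <stdio.h>

/* Stand-in for the application under test; replace with the real entry point. */
static double compute_result(double input)
{
    return sqrt(input);
}

/* Test calculation: compare the computed output against an input-result
 * pair that is known in advance from manual calculation or earlier runs. */
static int check_known_result(double input, double expected, double tol)
{
    double actual = compute_result(input);
    return fabs(actual - expected) <= tol * fabs(expected);
}

int main(void)
{
    /* Known pair from a manual calculation: sqrt(2) = 1.4142135623730951 */
    if (check_known_result(2.0, 1.4142135623730951, 1e-12))
        printf("test calculation passed\n");
    else
        printf("test calculation FAILED\n");
    return 0;
}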

Symmetry Tests

These tests are primarily used for scientific simulations built as high performance computing applications. Often the output of the simulation cannot be predicted; but because such simulations are based on scientific laws, there exist symmetries in the theory that must also be honored by a correct simulation. Thus, by varying the input conditions along lines of symmetry and comparing the obtained results with the original results, the existence of bugs can be detected.[4]
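
A sketch of such a test, assuming a toy simulate() kernel whose theory demands invariance under reversal of the input; in practice simulate() would be the real simulation and the symmetry would come from the underlying physical law:

#include <math.h>
#include <stdio.h>

#define N 8

/* Toy stand-in for the simulation under test; element-wise squaring is
 * trivially symmetric under reversal of the input array. */
static void simulate(const double in[N], double out[N])
{
    for (int i = 0; i < N; i++)
        out[i] = in[i] * in[i];
}

/* Symmetry test: run the simulation on the input and on its mirror image,
 * then check that the outputs are mirror images of each other, within a
 * tolerance. A violation signals a bug even though the "correct" output
 * itself is unknown. */
static int check_reversal_symmetry(const double in[N], double tol)
{
    double rev_in[N], out[N], rev_out[N];
    for (int i = 0; i < N; i++)
        rev_in[i] = in[N - 1 - i];
    simulate(in, out);
    simulate(rev_in, rev_out);
    for (int i = 0; i < N; i++)
        if (fabs(out[i] - rev_out[N - 1 - i]) > tol)
            return 0;                 /* symmetry violated */
    return 1;
}

int main(void)
{
    double in[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("%s\n", check_reversal_symmetry(in, 1e-12)
                       ? "symmetry holds" : "symmetry violated");
    return 0;
}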

[Figure 1: Data distribution]

Moreover, in scientific computing most data lies in the central region of the simulation conditions, which makes it difficult to perform boundary-value testing[2] with real experimental data. Thus, the center of the simulation (for example, the data value of 10 in Figure 1) is shifted to one of the boundaries so as to test boundary-condition problems effectively.

Parallel Implementation Tests

Parallel implementation tests are usually used for applications designed with distributed-memory programming models such as message passing. These tests are applied to programs that often use regular grids of processors.[4] A common error in such distributed parallel programs, global summation, is described below.

Global Summation

Many parallel databases use distributed parallel processing to execute queries. An aggregate function such as sum is executed with the following strategy:[5]

  • Compute the sum locally and concurrently at each processor, using the data in its associated disk partition.
  • Add these local subtotals to obtain the final result.

The final result may contain some error because each processor rounds off its local result. Hence, it is necessary to show that the aggregate sum is decomposition-independent. To test this, one processor is dedicated to the summing operation and all other processors send it their local data. The summing processor computes the sum serially, and this result is compared with the result obtained by global summation. If the two values agree, the parallel implementation is said to be decomposition-independent.
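
A sketch of such a decomposition-independence check, simulating the processors and their disk partitions with array chunks in plain C (an assumption made to keep the example self-contained):

#include <math.h>
#include <stdio.h>

#define NDATA 1000000
#define NPROC 8                       /* simulated processors / partitions */

int main(void)
{
    static double data[NDATA];
    for (long i = 0; i < NDATA; i++)
        data[i] = sin((double)i);     /* arbitrary test data */

    /* Serial reference: a single dedicated processor sums everything. */
    double serial = 0.0;
    for (long i = 0; i < NDATA; i++)
        serial += data[i];

    /* Parallel decomposition: local subtotals per partition, then a
     * global add of the subtotals, as the query engine would do. */
    double global = 0.0;
    long chunk = NDATA / NPROC;       /* NDATA divides evenly here */
    for (int p = 0; p < NPROC; p++) {
        double local = 0.0;           /* computed "on processor p" */
        for (long i = p * chunk; i < (p + 1) * chunk; i++)
            local += data[i];
        global += local;
    }

    /* The two sums should differ only by rounding; a larger discrepancy
     * means the result depends on the decomposition. */
    double tol = 1e-9 * fabs(serial);
    printf("serial=%.15g parallel=%.15g: %s\n", serial, global,
           fabs(serial - global) <= tol ? "decomposition-independent"
                                        : "decomposition-dependent");
    return 0;
}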

Available Tools

  1. Microsoft CHESS[1] – Checker for System Software
This tool eliminates the non-determinacy in testing concurrent programs introduced by the underlying operating system's scheduler, using the deterministic scheduling technique. It keeps track of previously executed schedule paths and guarantees that a new schedule path is executed on each run.

See also

References

  1. Thomas Ball, Sebastian Burckhardt, Peli de Halleux, Madanlal Musuvathi, Shaz Qadeer (May–June 2011). "Predictable and Progressive Testing of Multithreaded Code". IEEE Software. 28 (3).
  2. The Art of Software Testing, Second Edition. John Wiley and Sons. 2004. p. 256. ISBN 978-0-471-46912-4.
  3. Cheer-Sun Yang, Lori L. Pollock (1997). "The Challenges in Automated Testing of Multithreaded Programs". Proceedings of the 14th International Conference on Testing Computer Software.
  4. Stephen Booth, David Henty (2004). "Verification strategies for high performance computing software". Software Engineering for High Performance Computing System (HPCS) Applications, W3S Workshop, 26th International Conference on Software Engineering. doi:10.1049/ic:20040413.
  5. Korth, Henry. Database System Concepts. McGraw-Hill.
