User:Amber-project.eu/Fault injection

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by Amber-project.eu (talk | contribs) at 14:36, 29 May 2009 (Created page with '6. Fault Injection A critical issue in the development of a resilient computer system is the validation of its fault-handling mechanisms . Ineffective or unintended…').

6. Fault Injection

A critical issue in the development of a resilient computer system is the validation of its fault-handling mechanisms. Ineffective or unintended operation of these mechanisms can significantly impair the dependability of a computer system. Assessing the effectiveness and verifying the correctness of fault-handling mechanisms in computer systems is therefore of vital importance. Fault injection is an important experimental technique for assessment and verification of fault-handling mechanisms. It allows researchers and system designers to study how computer systems react and behave in the presence of faults. Fault injection is used in many contexts and can serve different purposes, such as:

• Assess the effectiveness, i.e., fault coverage, of software- and hardware-implemented fault-handling mechanisms.
• Study error propagation and error latency in order to guide the design of fault-handling mechanisms.
• Test the correctness of fault-handling mechanisms.
• Measure the time it takes for a system to detect or to recover from errors.
• Test the correctness of fault-handling protocols in distributed systems.
• Verify failure mode assumptions for components or subsystems.

Over the years, many researchers have addressed the problem of validating fault-handling mechanisms by fault injection. Numerous papers on assessment and verification of fault-tolerant systems or individual mechanisms, and on fault injection tools, have been published. This chapter gives an overview of the current state of the art and selected important historical achievements in the area of fault injection. Fault injection can in principle be carried out in two ways: faults can be injected either in a real system or in a model of a system. By a real system we mean a physical computer system, either a prototype or a commercial product. System models for fault injection experiments can be built using two basic techniques: software simulation and hardware emulation.
Hence, there are three main approaches to fault injection:

• Injection of faults into real systems;
• Software simulation-based fault injection; and
• Hardware emulation-based fault injection.

There are several techniques that can be used for injection of faults into real systems. These can be divided into three main categories: hardware-implemented fault injection, software-implemented fault injection, and radiation-based fault injection. Software simulation-based fault injection can be performed in simulators operating at different levels of abstraction, such as device level, gate level, functional block level, instruction set architecture (ISA) level, and system level. Hardware emulation-based fault injection uses models of hardware circuits implemented in large Field Programmable Gate Array (FPGA) circuits. These models can provide a highly detailed, almost perfect hardware representation of the system that is being verified or assessed. Before we discuss and compare different fault injection techniques in detail, we must first introduce some terms and basic concepts. We use target system as a generic term for the system under assessment or verification. The target system executes a workload, which is determined by the program executed by the target system and the data processed by the program. The faults injected during the experiments constitute the faultload. We distinguish between a fault injection experiment and a fault injection campaign. A fault injection experiment corresponds to injecting one fault and observing, or recording, how the target system behaves in the presence of that fault. To gain statistical confidence in the assessment or the verification of a target system, we need to collect data from many fault injection experiments. A series of fault injection experiments conducted on a target system is called a fault injection campaign. Fault injection techniques can be compared and characterized on the basis of several different properties.
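The experiment/campaign distinction just introduced can be sketched in a few lines. Everything here is illustrative (the dictionary-based target state and the "detected"/"escaped" oracle outcomes are our own assumptions, not from any particular tool): a campaign is simply a loop over single-fault experiments, from which a coverage estimate is computed.

```python
def run_experiment(workload, fault):
    """One fault injection experiment: inject a single fault into a fresh
    copy of the target state and record how the target behaves."""
    state = dict(workload["initial_state"])          # fresh state per experiment
    state[fault["location"]] ^= 1 << fault["bit"]    # inject one bit-flip fault
    return {"fault": fault, "outcome": workload["oracle"](state)}

def run_campaign(workload, faultload):
    """A campaign: a series of experiments whose outcomes are collected,
    e.g., to estimate the fault coverage of a detection mechanism."""
    results = [run_experiment(workload, f) for f in faultload]
    detected = sum(r["outcome"] == "detected" for r in results)
    return detected / len(results)                   # estimated coverage
```

For example, with an oracle that only detects corruption of one register, a faultload split across two registers yields a coverage estimate of 0.5.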
The following properties are applicable to all types of fault injection techniques:

• Controllability – ability to control the injection of faults in time and space.
• Observability – ability to observe and record the effects of an injected fault.
• Repeatability – ability to repeat a fault injection experiment and obtain the same result.
• Reproducibility – ability to reproduce the results of a fault injection campaign.
• Reachability – ability to reach possible fault locations inside an integrated circuit, or within a program.
• Fault representativeness – how accurately the faultload represents real faults.
• Workload representativeness – how accurately the workload represents real system usage.
• System representativeness – how accurately the target system represents the real system.

The main advantage of performing fault injection in a real system is that the actual implementation of the fault-handling mechanisms is assessed and verified. Thus, system representativeness is usually higher when using a physical system compared to using software simulation or hardware emulation. On the other hand, fault models used in simulation-based and emulation-based fault injection can usually imitate real faults more accurately than artificial faults injected into a real system. Hence, fault representativeness is often higher for simulation-based and emulation-based fault injection. Also, controllability, observability, repeatability, reproducibility and reachability are normally higher in simulation-based and emulation-based fault injection compared to fault injection in real systems. However, there are also several drawbacks and limitations to simulation-based and emulation-based fault injection. The development of the simulation/emulation model can increase development cost. Performing software simulations with accurate simulation models can be very time-consuming.
In fact, simulating a large amount of system activity (e.g., the execution of several million lines of source code) may not be feasible using a highly detailed model of the target system. In simulation-based fault injection, it is therefore essential to make a trade-off between simulation time, on one hand, and the accuracy of the fault model(s) and the system model, on the other hand. Time overhead is much lower in hardware emulation-based fault injection than in software simulation-based fault injection. However, it may still be a concern in hardware emulation-based fault injection, e.g., in the verification of real-time systems. All fault injection techniques have specific drawbacks and advantages. Several researchers have therefore proposed hybrid fault injection approaches, in which different techniques are combined in order to increase the scope and confidence in the verification or the assessment of a target system. The remainder of this chapter is organised as follows. In the next section, we present tools and techniques for injection of hardware faults. Section 6.2 presents techniques for injection of software design and implementation faults. Section 6.3 presents fault injection techniques for assessment and testing of protocols used in fault-tolerant distributed systems. In Section 6.4, we list books and surveys of fault injection techniques. Finally, Section 6.5 concludes the chapter.

6.1. Techniques for injecting hardware faults

In this section, we describe techniques for injecting or emulating hardware faults. The first three subsections deal with fault injection into real systems, covering hardware-implemented fault injection, software-implemented fault injection, and radiation-based fault injection. Three additional subsections cover software simulation-based fault injection, hardware emulation-based fault injection and hybrid fault injection.
Hence, hardware faults can be injected or emulated by all the techniques mentioned in the introduction of this chapter.

6.1.1. Hardware-implemented fault injection

Hardware-implemented fault injection includes three techniques: pin-level fault injection, power supply disturbances, and test port-based fault injection. In pin-level fault injection, faults are injected via probes connected to electrical contacts of integrated circuits or discrete hardware components. This method was used as early as the 1950s for generating fault dictionaries for system diagnosis. Many experiments and studies using pin-level fault injection were carried out during the 1980s and early 1990s. Several pin-level fault injection tools were developed at that time, for example, MESSALINE [Arlat 90] and RIFLE [Madeira 94]. A key feature of these tools was that they supported fully automated fault injection campaigns. The increasing level of integration of electronic circuits has rendered the pin-level technique obsolete as a general method for evaluating fault-handling mechanisms in computer systems. The method is, however, still valid for assessment of systems where faults in electrical connectors pose a major problem, such as automotive and industrial embedded systems. Power supply disturbances (PSDs) are rarely used for fault injection because of low repeatability. They have been used mainly as a complement to other fault injection techniques in the assessment of error detection mechanisms for small microprocessors [Karlsson 91, Miremadi 95, Rajabzadeh 04]. The impact of PSDs is usually much more severe than the impact of other commonly used injection techniques, e.g., those that inject single bit-flips, since PSDs tend to affect many bits and thereby a larger part of the system state. Interestingly, some error detection mechanisms show lower fault coverage for PSDs than for single bit-flip errors [Rajabzadeh 04].
Test port-based fault injection encompasses techniques that use test ports to inject faults in microprocessors. Many modern microprocessors are equipped with built-in debugging and testing features, which can be accessed through special I/O ports, known as test access ports (TAPs), or just test ports. Test ports are defined by standards such as the IEEE-ISTO 5001-2003 (Nexus) standard [IEEE-ISTO 5001 03] for real-time debugging, the IEEE 1149.1 Standard Test Access Port and Boundary-Scan Architecture (JTAG) [IEEE Std 1149.1 01], and the Background Debug Mode (BDM) facility. Nexus and JTAG are standardized solutions used by several semiconductor manufacturers, while BDM is a proprietary solution for debugging developed by Freescale, Inc. Tools for test port-based fault injection are usually implemented on top of an existing commercial microprocessor debug tool, since such tools contain all functions and drivers that are needed to access a test port. The type of faults that can be injected via a test port depends on the debugging and testing features supported by the target microprocessor. Normally, faults can be injected in all registers in the instruction set architecture (ISA) of the microprocessor. BDM and Nexus also allow injection of faults in main memory. Test ports could also be used to access hardware structures in the microarchitecture that are invisible to the programmer. However, information on how to access such hardware structures is usually not disclosed by manufacturers of microprocessors. Tools that support test port-based fault injection include GOOFI [Aidemark 01] and INERTE [Yuste 03]. GOOFI supports both JTAG-based and Nexus-based fault injection, while INERTE is specifically designed for Nexus-based fault injection. An environment for BDM-based fault injection is described in [Rebaudengo 99].
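In outline, such a tool halts the processor, reads and corrupts a target location through the port, and resumes execution. The following is a minimal sketch against a hypothetical debug-port API; a real tool would drive a JTAG/BDM/Nexus probe through a vendor debug library, and the dictionary below merely stands in for the target's ISA registers.

```python
class DebugPort:
    """Hypothetical test-access-port interface (illustrative only)."""
    def __init__(self, registers):
        self.registers = dict(registers)   # simulated target register file
        self.running = True

    def halt_at(self, address):            # (i) set breakpoint, wait for halt
        self.breakpoint = address
        self.running = False

    def read(self, location):              # (ii) read the target location
        return self.registers[location]

    def write(self, location, value):      # (iii) write the faulty value back
        self.registers[location] = value

    def resume(self):                      # (iv) resume program execution
        self.running = True

def inject_bit_flip(port, breakpoint, location, bit):
    """The four-step test-port injection: halt, read, corrupt, resume."""
    port.halt_at(breakpoint)
    value = port.read(location)
    port.write(location, value ^ (1 << bit))
    port.resume()
```

The four methods correspond one-to-one to the four steps of a test-port injection; everything else (automation of campaigns, logging of outcomes) would live in the surrounding tool.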
Injecting a fault via a test port involves four major steps: i) setting a breakpoint via the test port and waiting for the program to reach the breakpoint, ii) reading the value of the target location (a register or memory word) via the test port, iii) manipulating this value and then writing the new, faulty value back to the target location, and iv) resuming the program execution via a command sent to the test port. The time overhead for injecting a fault depends on the speed of the test port. JTAG and BDM are low-speed ports, whereas Nexus ports can be of four different classes with different speeds. The simplest Nexus port (Class 1) is a JTAG port, which uses serial communication and therefore only needs 4 pins. Ports compliant with Nexus Class 2, 3 or 4 use separate input and output ports, known as auxiliary ports. These are parallel ports that use several pins for data transfer. The actual number of data pins is not fixed by the Nexus standard, but for Class 3 and 4 ports the standard recommends 4 to 16 data pins for the auxiliary output port and 1 to 4 data pins for the auxiliary input port. The main advantage of test port-based fault injection is that faults can be injected internally in microprocessors without making any alterations to the system’s hardware or software. Compared to software-implemented fault injection, it provides better or equal capabilities of emulating real hardware faults. Finally, advanced Nexus ports (Class 3 and 4) provide outstanding possibilities for data collection and for observing the impact of injected faults within a microprocessor. Existing tools have not fully exploited these possibilities. Hence, microprocessors with high-speed Nexus ports constitute interesting targets for the development of new fault injection tools, which potentially can achieve much better observability than existing tools do.

6.1.2.
Software-implemented fault injection of hardware faults

Software-implemented fault injection encompasses techniques that inject faults through software executed on the target system. There are basically two approaches that we can use to emulate hardware faults by software: run-time injection and pre run-time injection. In run-time injection, faults are injected while the target system executes a workload. This requires a mechanism that i) stops the execution of the workload, ii) invokes a fault injection routine, and iii) restarts the workload. Thus, run-time injection incurs a significant run-time overhead. In pre run-time injection, faults are introduced by manipulating either the source code or the binary image of the workload before it is loaded into memory. Pre run-time injection usually incurs less run-time overhead than run-time injection, but the total time for conducting a fault injection campaign is usually longer for pre run-time injection since it needs more time for preparing each fault injection experiment. There are several fault injection tools that can emulate the effects of hardware faults by software, but they use different techniques for injecting faults and support different fault models. Most of these tools use run-time injection, since it provides better opportunities for emulating hardware faults than pre run-time injection. Software-implemented fault injection relies on the assumption that the effects of real hardware faults can be emulated either by manipulating the state of the target system via run-time injection, or by modifying the target workload through pre run-time injection. The validity of this approach varies depending on the fault type and where the fault occurs. Consider, for example, emulation of a soft error, i.e., a bit-flip error induced by the strike of a high-energy particle. Flipping bits in main memory or processor registers can easily be done by software.
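Such a software-emulated soft error amounts to a single exclusive-or on the stored value. A minimal sketch (the function name and default word width are our own illustrative choices):

```python
def flip_bit(word, bit, width=32):
    """Emulate a single-event upset: complement one bit of a register or
    memory-word image, masking the result to the given word width."""
    return (word ^ (1 << bit)) & ((1 << width) - 1)
```

Applying the same flip twice restores the original value, which is one reason single bit-flips are so convenient to inject and to reason about.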
On the other hand, the effect of a bit-flip in a processor’s internal control logic can be difficult, if not impossible, to emulate accurately by software manipulations. Emulating a permanent hardware fault requires a more elaborate set of manipulations than emulating a transient fault. For example, the emulation of a stuck-at fault in a memory word or a processor register would require a sequence of manipulations performed every time the designated word or register is read by a machine instruction. On the other hand, a transient fault requires only a single manipulation. The time overhead imposed by fault emulation thus varies for different fault types. Here we describe seven tools that are capable of emulating hardware faults through software. These tools represent important steps in the development of software-implemented fault injection for emulation of hardware faults. The tools are FIAT [Barton 90], FERRARI [Kanawati 92], FINE [Kao 93], DEFINE [Kao 95], FTAPE [Tsai 96], DOCTOR [Han 95], and Xception [Carreira 98]. These tools use different approaches to emulating hardware faults and implement partly different fault models. Some of the tools also provide support for emulating software faults, as we describe in Section 6.2. Researchers started to investigate software-implemented fault injection in the late 1980s. In the beginning, the focus was on developing techniques for emulating the effects of hardware faults. Work on emulation of software faults started a few years later. One of the first tools that used software to emulate hardware faults was FIAT [Barton 90], developed at Carnegie Mellon University. FIAT injected faults by corrupting either the code or the data area of a program’s memory image during run-time. Three fault types were supported: zero-a-byte, set-a-byte and two-bit compensation. The last fault type involved complementing any 2 bits in a 32-bit word.
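The three FIAT fault types can be written as plain bit operations on a 32-bit word. This is a sketch of the fault models only, not of FIAT's actual memory-image patching; note that complementing two bits leaves the word's parity unchanged.

```python
MASK32 = 0xFFFFFFFF

def zero_a_byte(word, byte):
    """Clear one byte (byte 0 = least significant) of a 32-bit word."""
    return word & ~(0xFF << (8 * byte)) & MASK32

def set_a_byte(word, byte):
    """Set one byte of a 32-bit word to all ones."""
    return (word | (0xFF << (8 * byte))) & MASK32

def two_bit_compensation(word, bit_a, bit_b):
    """Complement any two distinct bits; the word's parity is preserved."""
    assert bit_a != bit_b
    return (word ^ (1 << bit_a) ^ (1 << bit_b)) & MASK32
```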
Injection of single-bit errors was not considered, because the memory of the target system was protected by parity. More advanced techniques for emulation of hardware faults were included in FERRARI [Kanawati 92], developed at the University of Texas, and in FINE [Kao 93], developed at the University of Illinois. Both these tools supported emulation of transient and permanent hardware faults in systems based on SPARC processors from Sun Microsystems. FERRARI could emulate three types of faults: address line, data line, and condition code faults, while FINE emulated faults in main memory, CPU registers and the memory bus. DEFINE [Kao 95], which was an extension of FINE, supported fault injection in distributed systems and introduced two new fault models for intermittent faults and communication faults. DOCTOR [Han 95] is a fault injection tool developed at the University of Michigan targeting distributed real-time systems. It supports three fault types: memory faults, CPU faults and communication faults. The memory faults can affect a single bit, two bits, one byte, or multiple bytes. The target bit(s)/byte(s) can be set, reset or toggled. The CPU faults emulate faults in processor registers, the op-code decoding unit, and the arithmetic logic unit. The communication faults can cause messages to be lost, altered, duplicated or delayed. DOCTOR can inject transient, intermittent and permanent faults, and uses run-time injection for the transient and intermittent faults. Permanent faults are emulated using pre run-time injection. FTAPE [Tsai 96] is a fault injector aimed at benchmarking of fault-tolerant commercial systems. It was used to assess and test several prototypes of fault-tolerant computers for on-line transaction processing. FTAPE emulates the effects of hardware faults in the CPU, main memory and I/O units. The CPU faults include single and multiple bit-flips, as well as zeroing or setting of CPU registers.
The memory faults include single and multiple bit and zero/set faults in main memory. The I/O faults include SCSI and disk faults. FTAPE was developed at the University of Illinois in cooperation with Tandem Computers. Xception [Carreira 98] is a fault injection tool developed at the University of Coimbra. This tool uses the debugging and performance monitoring features available in advanced microprocessors to inject faults. Thus it injects faults in a way which is similar to test port-based fault injection. The difference is that Xception controls the setting of breakpoints and performs the fault injections via software executed on the target processor rather than sending commands to a test port. Xception injects faults through exception handlers executing in kernel mode, which can be triggered by the following events: op-code fetch from a specified address, operand load from a specified address, operand store to a specified address, and a specified time elapsed since start-up. These triggers can be used to inject both permanent and transient faults. Xception can emulate hardware faults in various functional units of the processor, such as the integer unit, the floating point unit and the address bus. It can also emulate memory faults, including stuck-at-zero, stuck-at-one and bit-flip faults. Xception is unique because it is the only tool mentioned in this section that has been developed into a commercial tool. The Xception tool is still sold by Critical Software, Coimbra, which released the first commercial version of the tool in 1999.

6.1.3. Radiation-based fault injection

Modern electronic integrated circuits and systems are sensitive to various forms of external disturbances such as electromagnetic interference and particle radiation. One way of validating a fault-tolerant system is thus to expose the system to such disturbances.
Although computer systems often are used in environments where they can be subjected to electromagnetic interference (EMI), it is not common to use such disturbances to validate fault tolerance mechanisms. The main reason for this is that EMI injections are difficult to control and repeat. In [Arlat 03], EMI was used along with three other fault injection techniques to evaluate error detection mechanisms in a computer node in a distributed real-time system. A primary goal of this study was to compare the impact of pin-level fault injection, EMI, heavy-ion radiation and software-implemented fault injection. The study showed that the EMI injections tended to “favour” one particular error detection mechanism. For some of the fault injection campaigns, almost all faults were detected by one specific CPU-implemented error detection mechanism, namely spurious interrupt detection. This illustrates the difficulty in using EMI as a fault injection method. A growing reliability concern for computer systems is the increasing susceptibility of integrated circuits to soft errors, i.e., bit-flips caused when highly ionizing particles hit sensitive regions within a circuit. Soft errors have been a concern for electronics used in space applications since the 1970s. In space, soft errors are caused by cosmic rays, i.e., highly energetic heavy-ion particles. Heavy ions are not a direct threat to electronics at ground level and airplane flight altitudes, because they are absorbed when they interact with Earth’s atmosphere. However, recent circuit technology generations have become increasingly sensitive to high-energy neutrons, which are generated in the upper atmosphere when cosmic rays interact with the atmospheric gases. Such high-energy neutrons are a major source of soft errors in ground-based and aviation applications using modern integrated circuits.
All modern microprocessors manufactured in technologies with feature sizes below 90 nm are therefore equipped with fault tolerance mechanisms to cope with soft errors. To assess the efficiency of such fault tolerance mechanisms, semiconductor manufacturers are now regularly testing their circuits by exposing them to ionising particles. In such tests, it is common to use proton radiation produced by a particle accelerator. One example of such a test can be found in [Kellington 07], which reports on recent proton testing of the IBM POWER6 processor. The sensitivity of integrated circuits to heavy-ion radiation can be exploited for assessing the efficiency of fault-handling mechanisms. In [Gunneflo 89] and [Karlsson 91], results from fault injection experiments conducted by exposing circuits to heavy-ion radiation from a Californium-252 source are reported. This method was also used in the previously mentioned study [Arlat 03], in which the impact of four different fault injection techniques was compared. In this study, the main processor as well as the communication processor of a node in a distributed system was exposed to heavy-ion radiation. The results showed that the impact of the soft errors injected by the heavy ions varied extensively and that they activated many different error detection mechanisms in the target system. Finally, we note that radiation-based fault injection has very low, or non-existent, repeatability. Due to low controllability, it is not possible to precisely synchronize the activity of the target system with the time and the location of an injection in radiation-based fault injection. Thus it is not possible to repeat an individual experiment. However, the ability to statistically reproduce results over many fault injection campaigns is usually high in particle radiation experiments. Both repeatability and reproducibility are low for EMI-based fault injection.

6.1.4.
Simulation-based fault injection

As mentioned in the introduction, simulation-based fault injection can be performed at different levels of abstraction, such as the device level, logic level, functional block level, instruction set architecture (ISA) level, and system level. Simulation models at different abstraction layers are often combined in so-called mixed-mode simulations to overcome limitations imposed by the time overhead incurred by detailed simulations. FOCUS [Choi 92] is an example of a simulation environment that combines device-level and gate-level simulation for fault sensitivity analysis of circuit designs with respect to soft errors. At the logic level and the functional block level, circuits are usually described in a hardware description language (HDL) such as VHDL or Verilog. Several tools have been developed that support automated fault injection experiments with HDL models, e.g., MEFISTO [Jenn 94] and the tool described in [Delong 96]. Recently, several studies aimed at assessing the soft error vulnerability of complex high-performance processors have been conducted using simulation-based fault injection. In [Wang 06], a novel low-cost approach for tolerating soft errors in the execution core of a high-performance processor is evaluated by combining simulations in a detailed Verilog model with an ISA-level simulator. This approach allowed the authors to study the impact of soft errors for seven SPEC2000 integer benchmarks through simulation. DEPEND [Goswami 97] is a tool for simulation-based fault injection at the functional level aimed at evaluating architectures of fault-tolerant computers. A simulation model in DEPEND consists of a number of interconnected modules, or components, such as CPUs, communication channels, disks, software systems, and memory. DEPEND is intended for validating system architectures in early design phases and serves as a complement to probabilistic modelling techniques such as Markov and Petri net models.
DEPEND provides the user with predefined components and fault models, but also allows the user to create new components and new fault models, e.g., the user can use any probability distribution for the time to failure of a component.

6.1.5. Hardware emulation-based fault injection

The advent of large Field Programmable Gate Array (FPGA) circuits has provided new opportunities for conducting model-based fault injection with hardware circuits. Circuits designed in a hardware description language (HDL) are usually tested and verified using software simulation. Even if a powerful computer is used in such simulations, it may take considerable time to verify and test a complex circuit adequately. To speed up the test and verification process, techniques have been developed where HDL designs are tested by hardware emulation in a large FPGA circuit. This technique also provides excellent opportunities for conducting fault injection experiments. Hardware emulation-based fault injection has all the advantages of simulation-based fault injection, such as high controllability and high repeatability, but requires less time for conducting a fault injection experiment compared to using software simulation. The use of hardware emulation for studying the impact of faults was first proposed in [Kwang-Ting 99]. The authors of that paper used the method for fault simulation, i.e., for assessing the fault coverage of test patterns used in production testing. Fault injection can be performed in hardware emulation models through compile-time reconfiguration and run-time reconfiguration. Here, reconfiguration refers to the process of adding hardware structures to the model which are necessary to perform the experiments. In compile-time reconfiguration, these hardware structures are added by instrumentation of the HDL models. An approach for compile-time instrumentation for injection of single event upsets (soft errors) is described in [Civera 03].
This work presents different instrumentation techniques that allow injection of transient faults in sequential memory elements as well as in microprocessor-based systems. One disadvantage of compile-time reconfiguration is that the circuit must be re-synthesised for each reconfiguration, which can impose a severe overhead on the time it takes to conduct a fault injection campaign. In order to avoid re-synthesizing the target circuit, a technique for run-time reconfiguration is proposed in [Antoni 03]. This technique relies on directly modifying the bit-stream that is used to program the FPGA circuit. By exploiting partial reconfiguration capabilities available in some FPGA circuits, this technique achieved substantial time-savings compared to other emulation-based approaches to fault injection. A tool for conducting hardware emulation-based fault injection called FADES is presented in [Andrés 06]. This tool uses run-time reconfiguration and can inject several different types of transient faults, including bit-flip, pulse, and delay faults, as well as faults that cause digital signals to assume voltage levels between “1” and “0”.

6.1.6. Hybrid approaches for injecting hardware faults

Hybrid approaches to fault injection combine several fault injection techniques to improve the accuracy and scope of the verification, or the assessment, of a target system. An approach for combining software-implemented emulation of hardware faults and simulation-based fault injection is presented in [Guthoff 95]. In this approach, the physical target system is run until the program execution hits a fault injection trigger, which causes the physical system to halt. The architected state of the physical system is then transferred to the simulation model, in which a fault is injected, e.g., in the non-visible parts of the microarchitecture. The simulator is run until the effects of the fault have stabilized in the architected state of the simulated processor.
This state is then transferred back to the physical system, which subsequently is restarted so that the system-level effects of the fault can be determined. An extension of the FERRARI tool which allows it to control a hardware fault injector is described in [Kanawati 95]. The hardware fault injector can inject logic-0/logic-1 faults into the memory bus lines of a SPARC 1 based workstation. The authors used the hardware fault injector to study the sensitivity of the computer in different operational modes. The results showed that the system was more likely to crash from bus faults when the processor operated in kernel mode, compared to when it operated in user mode. This study showed that it is feasible to extend a tool for software-implemented fault injection with other techniques at reasonable cost, since many of the central functions of a tool are independent of the injection technique. A more recent tool that supports the use of different fault injection techniques is NFTAPE [Stott 00], developed at the University of Illinois. This tool is aimed at injecting faults in distributed systems using a technique called LightWeight Fault Injectors (LWFI). The purpose of the LWFI is to separate the implementation of the fault injector from the rest of the tool. NFTAPE provides a standardized interface for the LWFIs, which simplifies the integration and use of different types of fault injectors. NFTAPE has been used with several types of fault injectors using hardware-implemented, software-implemented, and simulation-based fault injection.

6.2. Techniques for injecting or emulating software faults

Software faults are currently the dominant source of computer system failures. Making computer systems resilient to software faults is therefore highly desirable in many application domains. Much effort has been invested by both academia and industry in the development of techniques that can tolerate and handle software faults.
In this context, fault injection plays an important role in assessing the efficiency of these techniques. Hence, several attempts have been made to develop fault injection techniques that can accurately imitate the impact of real software faults. The current state-of-the-art techniques in this area rely exclusively on software-implemented fault injection. There are two fundamental approaches to injecting software faults into a computer system: fault injection and error injection [Durães 06]. Fault injection imitates mistakes of programmers by changing the code executed by the target system, while error injection attempts to emulate the consequences of software faults by manipulating the state of the target system. Regardless of the injection technique, the main challenge is to find fault sets or error sets that are representative of real software faults. Other important challenges include the development of methods that allow software faults to be injected without access to the source code, and techniques for reducing the time it takes to perform an injection campaign. First, we discuss emulation of software faults by error injection, and then software fault injection.

6.2.1. Emulating software faults by error injection

There are two common techniques for emulating software faults by error injection: program state manipulation and parameter corruption. Program state manipulation involves changing variables, pointers and other data stored in main memory or CPU registers. Parameter corruption corresponds to modifying the parameters of functions, procedures and system calls. The latter is also known as API parameter corruption and falls under the category of robustness testing, which is discussed in Chapter 8. Here we discuss techniques for emulating software faults by program state manipulation.
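The basic idea of program state manipulation can be sketched in a few lines of code. The sketch below is an illustration only, not a description of any of the tools discussed in this chapter: it treats a byte array as a stand-in for a target's memory image and injects a single-bit error into it, the simplest and most widely used error model.

```python
import random

def flip_bit(memory: bytearray, byte_addr: int, bit: int) -> None:
    """Inject a single-bit error into a memory image (program state manipulation)."""
    memory[byte_addr] ^= 1 << bit

def inject_random_error(memory: bytearray, rng: random.Random) -> tuple[int, int]:
    """Pick a random location and bit position, emulating a bit-flip error model."""
    addr = rng.randrange(len(memory))
    bit = rng.randrange(8)
    flip_bit(memory, addr, bit)
    return addr, bit

# One run of a hypothetical injection campaign: corrupt the state once,
# then the target would be executed and its outcome compared to a golden run.
state = bytearray(b"\x00" * 16)   # stand-in for a data segment or register file
addr, bit = inject_random_error(state, random.Random(42))
```

In a real tool the corruption is of course applied to the live state of the target system (registers, stack, heap) via a debugger interface or injected code, and the injection is triggered by a time-out or a breakpoint rather than performed up front.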
Many of the tools that we described in conjunction with emulation of hardware faults through software-implemented fault injection, e.g., FIAT [Barton 90], FERRARI [Kanawati 92], FTAPE [Tsai 96], DOCTOR [Han 95] and Xception [Carreira 98], can potentially be used to emulate software faults, since they are designed to manipulate the system state. However, none of these tools provides explicit support for defining errors that emulate software faults, and the representativeness of the injected faults is therefore questionable.

An approach for generating representative error sets that emulate real software faults is presented in [Christmansson 96]. This approach was based on a study of software faults encountered in one release of a large IBM operating system product. Based on their knowledge of the observed faults, the authors developed a procedure for generating a representative error set for error injection. The study addressed four important questions related to the emulation of software faults by error injection: what error model(s) should be used; where should errors be injected; when should errors be injected; and how should a representative operational profile (workload) be designed? This work shows the feasibility of generating representative error sets when data on software faults is available.

An experimental comparison between fault and error injection is presented in [Christmansson 98]. Fault and error injection experiments were carried out on a safety-critical real-time control application. A total of 200 assignment, checking and interface faults were injected by mutating the source code, which was written in C. The failure symptoms produced by these faults were compared with the failure symptoms produced by bit-flip errors injected in processor registers, and in the data and stack areas of the main memory. A total of 675 errors were injected. A comparison of the failure distributions was made for eight different workload activations (test cases).
The authors conclude that the choice of test case caused greater variations in the distribution of the failure symptoms than the choice of fault type when fault injection was used. On the other hand, for error injection the choice of error type caused greater variations in the failure distribution than the choice of test case. There were also significant differences between the failure distributions obtained with fault injection and with error injection. The authors claim that these differences occurred because a time-based trigger was used to control the error injections. They also claim that the fault types considered could be emulated more or less perfectly by using a breakpoint-based trigger, although no experimental evidence is presented to support this claim. This study points out that it may be difficult to find error sets that emulate software faults accurately, and that the selection of the test case (workload activation) is as important as the selection of the fault/error model for the outcome of an injection campaign.

6.2.2. Techniques for injection of software faults

An obvious way to inject software faults into a system is to manipulate the source code, object code or machine code. Such manipulations are known as mutations. Mutations have been used extensively in the area of program testing as a method for evaluating the effectiveness of different test methods. They have also been used for the assessment of fault-handling mechanisms. The studies presented in [Ng 96] and [Ng 99] inject software faults into an operating system through simple mutation of the object code. The primary goal of the fault models used in these studies was to generate a wide variety of operating system crashes, rather than to achieve a high degree of representativeness with respect to real software faults. FINE [Kao 93] and DEFINE [Kao 95] were among the first tools that supported emulation of software faults by mutations.
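The principle behind mutation-based injection can be illustrated with a small sketch. The operator names and patterns below are hypothetical simplifications applied to source text; real tools such as those discussed here apply far richer operator sets, often at the object- or machine-code level.

```python
import re

# Illustrative mutation operators (simplified, hypothetical): each maps a
# code fragment to a faulty variant that mimics a common programmer mistake.
MUTATION_OPERATORS = {
    "negate_condition": (r"==", "!="),    # condition-check fault
    "wrong_constant":   (r"\b0\b", "1"),  # assignment fault (wrong initial value)
}

def mutate(source: str, operator: str) -> str:
    """Apply one mutation operator to the first matching location in the source."""
    pattern, replacement = MUTATION_OPERATORS[operator]
    return re.sub(pattern, replacement, source, count=1)

original = "if x == limit: x = 0"
mutant = mutate(original, "negate_condition")
# mutant now reads "if x != limit: x = 0" -- a condition-check fault
```

An injection campaign then builds one mutant per fault, runs the target's workload against each mutant, and classifies the resulting failure symptoms.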
The mutation technique used by FINE and DEFINE requires access to assembly language listings of the target program. The tools emulate four types of software faults: initialization, assignment, condition-check, and function faults. These fault models were defined based on experience collected from studies of field failure data. An interesting technique for the emulation of software faults by mutations at the machine-code level, called the Generic Software Fault Injection Technique (G-SWFIT), is presented in [Durães 02]. This technique analyses the machine code to identify locations that correspond to high-level language constructs that often result in design faults. The main advantage of G-SWFIT is that software faults can be emulated without access to the source code. A set of operators for the injection of representative software faults using G-SWFIT was presented in [Durães 03]. These operators were derived from a field failure data study of more than 500 real software faults. These two works jointly represent a unique contribution, since they provide the first fault injection environment that can inject software faults which have been shown to be representative of real software faults. They also constitute the foundation of a methodology for the definition of faultloads based on software faults for dependability benchmarking, presented in [Durães 06].

6.3. Techniques for testing protocols for fault-tolerant distributed systems

Several fault injection tools and frameworks have been developed for testing fault-handling protocols in distributed systems. The aim of this type of testing is to reveal design and implementation flaws in the tested protocol. The tests are performed by manipulating the content and/or the delivery of messages sent between nodes in the target system. We call this message-based fault injection. It resembles the robustness testing discussed in Chapter 8 in the sense that the faults are injected into the inputs of the target system.
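The essence of message-based fault injection is an interposed layer that tampers with message delivery. The sketch below is a minimal illustration under assumed names (it does not correspond to any specific tool in this chapter): an injector sits between sender and receiver and applies one of the classic message fault types.

```python
import random

class MessageFaultInjector:
    """Sketch of an interposed message fault injector (illustrative names).

    Given an outgoing message, it returns the list of messages actually
    delivered, emulating omission, duplication or content corruption.
    """

    def __init__(self, mode, rng=None):
        self.mode = mode
        self.rng = rng or random.Random()

    def inject(self, message: bytes) -> list:
        if self.mode == "omit":        # message omission: nothing delivered
            return []
        if self.mode == "duplicate":   # message delivered several times
            return [message, message]
        if self.mode == "corrupt":     # content corruption: flip one bit
            payload = bytearray(message)
            payload[self.rng.randrange(len(payload))] ^= 0x01
            return [bytes(payload)]
        return [message]               # fault-free pass-through
```

Timing faults (delaying or reordering messages) would additionally require a delivery queue with a scheduler, which is omitted here for brevity; real tools also let the tester trigger these manipulations from a test-case specification rather than unconditionally.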
A careful definition of the failure mode assumptions is crucial in the design of distributed fault-handling protocols. The failure mode assumptions provide a model of how faults in different subsystems (computing nodes, communication interfaces, and networks) affect a distributed system. A failure mode thus describes the impact of subsystem failures on a distributed system. Commonly assumed failure modes include Byzantine failures, timing failures, omission failures, crash failures, fail-stop failures and fail-signal failures. At the system level, these subsystem failures correspond to faults. Hence, tools for message-based fault injection aim to inject faults that correspond to different subsystem failure modes.

The experimental environment for fault tolerance algorithms (EFA) [Echtle 92] is an early example of a fault injector for message-based fault injection. The EFA environment provides a fault injection language that the protocol tester uses to specify the test cases. The tool inserts fault injectors in each node of the target system and can implement several different fault types, including message omissions, sending a message several times, generating spontaneous messages, changing the timing of messages, and corrupting the contents of messages. A similar environment is provided by the DOCTOR tool [Han 95], which can cause messages to be lost, altered, duplicated or delayed.

Specifying test cases is a key problem in the testing of distributed fault-handling protocols. A technique for deriving test cases from Petri-net models of protocols in the EFA environment is described in [Echtle 95]. An approach for deriving test cases from an execution-tree description of a protocol is described in [Avresky 96]. A framework for testing distributed applications and communication protocols called ORCHESTRA is described in [Dawson 95, Dawson 96]. This tool inserts a probe/fault injection (PFI) layer between any two consecutive layers in a protocol stack.
The PFI layer can inject deterministic and randomly generated faults into both outgoing and incoming messages. ORCHESTRA was used in a comparative study of six commercial implementations of the TCP protocol, reported in [Dawson 97]. The failure of a distributed protocol often depends on the global state of the distributed system. It is therefore desirable for a tester to be able to control the global state of the target system. This involves controlling the states of a number of individually executing nodes, which is a challenging problem. Two tools that address this problem are CESIUM [Alvarez 97] and LOKI [Chandra 00].

A fault injection environment for the testing of Web services, called WS-FIT, is presented in [Looker 05a]. This fault injector can decode and inject meaningful faults into SOAP messages. It uses an instrumented SOAP API that includes hooks allowing manipulation of both incoming and outgoing messages. A comparison of this method with fault injection by code insertion is presented in [Looker 05b].

6.4. Surveys and books on fault injection

Surveys of fault injection tools and techniques can be found in [Clark 95, Hsueh 97]. Both of these surveys are more than ten years old and are therefore partly outdated. Two books that address fault injection are Software Fault Injection: Inoculating Programs Against Errors by Jeffrey Voas and Gary McGraw [Voas 98] and Fault Injection Techniques and Tools for Embedded Systems Reliability Evaluation, edited by Benso and Prinetto [Benso 03].

6.5. Concluding remarks

We have provided a preliminary overview of the state-of-the-art and historical achievements in the field of fault injection. The overview covers tools and techniques for injecting three main fault types: physical hardware faults, software design and implementation faults, and faults affecting messages in distributed systems. Our overview shows that many fault injection tools and techniques were proposed from the late 1980s until the first years of the new millennium.
Rather few new fault injection tools and techniques have been described in the literature during the last four years. This does not mean that the interest in, or need for, fault injection has diminished. Instead, we see a clear trend that researchers and practitioners now put more focus on using fault injection to verify and assess systems and individual fault-handling mechanisms than on developing new tools and techniques. Over the years, numerous papers have been published that describe the results of such assessment and verification exercises. We would like to emphasize that the goal of this overview has not been to provide a complete record of all such experiments. A more detailed survey and analysis of fault injection techniques will be conducted during the AMBER project and presented in a future deliverable.