Predictive failure analysis

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by 121.45.220.187 (talk) at 01:33, 9 May 2012 (Removed obsolete warning-- there are references now). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Predictive Failure Analysis (PFA) refers to computer mechanisms that analyse trends in corrected errors to predict future failures of hardware components and proactively enable mechanisms to avoid them. PFA was originally the name of a proprietary IBM technology for monitoring the likelihood that a hard disk drive would fail, although the term is now used generically for a variety of technologies for judging the imminent failure of CPUs, memory and I/O devices [1]. See also first failure data capture.

Disks

The term PFA was introduced in 1992, when it was applied to the IBM 0662-S1x drive (a 1052 MB Fast-Wide SCSI-2 disk running at 5400 rpm), and it was the industry's first such technology. The technology is based on measuring several key (mainly mechanical) parameters of the drive unit, for example the flying height of the heads. The drive firmware compares the parameters against predefined thresholds and evaluates the health status. If the drive appears likely to fail soon, a notification is sent to the disk controller. The major drawbacks of the technology were its binary result and its unidirectional communication: the notification was sent by the drive firmware, and the only status visible to the host was the presence or absence of a notification. The technology was later merged with IntelliSafe to form the Self-Monitoring, Analysis, and Reporting Technology (SMART).
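
The threshold comparison described above can be illustrated with a minimal sketch. It is not IBM's firmware logic: the parameter names, units and limit values below are hypothetical and chosen only for illustration. What it shows is the shape of the mechanism, in which several measurements go in and a single yes/no flag comes out.

```python
from typing import Dict, NamedTuple, Optional


class Limit(NamedTuple):
    """Acceptable range for one monitored parameter (either bound may be absent)."""
    low: Optional[float] = None
    high: Optional[float] = None


# Hypothetical parameters and limits, for illustration only.
LIMITS: Dict[str, Limit] = {
    "head_flying_height_nm": Limit(low=40.0),   # too low suggests head wear
    "spin_up_time_ms": Limit(high=9000.0),      # too slow suggests motor trouble
    "seek_error_rate": Limit(high=0.01),
}


def failure_predicted(measurements: Dict[str, float],
                      limits: Dict[str, Limit] = LIMITS) -> bool:
    """Return True if any measured parameter lies outside its limit.

    The result is deliberately a single boolean: the host learns only
    whether a notification is present, not which parameter caused it.
    """
    for name, limit in limits.items():
        value = measurements.get(name)
        if value is None:
            continue  # parameter not measured on this pass
        if limit.low is not None and value < limit.low:
            return True
        if limit.high is not None and value > limit.high:
            return True
    return False
```

Because the function returns only a boolean, a caller cannot tell a marginal drive from one that is far outside its limits, which is the binary-result drawback noted above.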

Processor and memory

High counts of intermittent RAM errors corrected by ECC can be predictive of future DIMM failures [2], so memory and CPU caches that show them can be automatically taken offline to avoid future errors.
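
One way such a policy could work is sketched below, as an illustration rather than a description of any particular platform. The class name, the component labels, the error limit and the time window are all assumptions. The sketch counts corrected errors per component over a sliding time window and marks a component as offlined once the count exceeds the limit.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Set


class CorrectedErrorMonitor:
    """Track corrected errors per component and offline repeat offenders.

    Hypothetical policy: a component is offlined when more than
    `max_errors` corrected errors fall within `window_seconds`.
    """

    def __init__(self, max_errors: int = 3, window_seconds: float = 86400.0):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = defaultdict(deque)
        self.offlined: Set[str] = set()

    def record(self, component: str, timestamp: float) -> bool:
        """Record one corrected error; return True if it triggers offlining."""
        if component in self.offlined:
            return False  # already retired, nothing more to do
        events = self._events[component]
        events.append(timestamp)
        # Discard events that have aged out of the sliding window.
        while events and timestamp - events[0] > self.window_seconds:
            events.popleft()
        if len(events) > self.max_errors:
            self.offlined.add(component)
            del self._events[component]
            return True
        return False
```

For example, with `max_errors=2` and a 100-second window, three corrected errors on the same (hypothetically labelled) memory page at t = 0, 10 and 20 would offline it on the third event, whereas the same three errors spaced 200 seconds apart would not.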

References

  1. ^ Intel Corp (2011). "Intel Xeon Processor E7 Family: supporting next generation RAS servers". White paper.
  2. ^ Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber (2009). "DRAM Errors in the Wild: A Large-Scale Field Study". Proceedings SIGMETRICS, 2009.

See also