Floating-point arithmetic
Floating-point is a means of representing real numbers in terms of digits or bits in a computer or calculator, similar to how scientific notation is used to represent exact values. A floating-point number is often stored as three parts:
- A significand or mantissa (indicating the digits that define the number's magnitude)
- An exponent or scale (indicating the position of the radix point)
- A sign (indicating whether the number is positive or negative)
Computation with floating-point numbers plays a very important role in an enormous variety of applications in science, engineering, and industry. The ability to perform floating point operations is an important measure of performance for computers intended for such applications. The extent of this ability is measured in "FLOPS" (FLoating-point Operations Per Second).
Floating-point numbers are intended to model the mathematical real numbers. But while the real numbers form a continuum that can be subdivided without limit, floating-point numbers have only finite resolution: they can only represent discrete points on the real number line. (In common "double precision" representation, consecutive points differ by about 1 part in 10^16.) That is, they can only represent a subset of the reals. Because of this, floating-point numbers are sometimes thought of as just an approximation to a real number, or as representing a real number to within some tolerance, but this is not correct. A floating-point number (that is, a string of bits in a computer) represents a real number exactly. It just might not be the real number that is the intended value of the situation, if that value is not in the representable subset.
Basics
A floating-point representation requires, first of all, a choice of base or radix for the significand, and a choice of the number of digits in the significand. In this article, the base will be denoted by b, and the number of digits, or precision, by p. The significand is a number consisting of p digits in radix b, so each digit lies in the range 0 through b-1. A base of 2 (that is, binary representation) is nearly always used in computer hardware, though some computers use b=10 or b=16. A base of 10 (that is, decimal representation) is used in the familiar scientific notation.
As an example, the revolution period of Jupiter's moon Io could be represented in scientific notation as 1.528535047 × 10^5 seconds. The string of digits "1528535047" is the significand, and the exponent is 5.
This value could be represented in many different ways. For example, it could be written as any of
- 1528.535047 × 10^2
- 1528535047. × 10^-4
- 0.000001528535047 × 10^11
A benefit of scientific notation is that it avoids representations with many leading zeros by allowing the decimal point to be put in a more efficient place. Floating-point notation mandates a specific place for the point (just after the leftmost nonzero digit) and lets the exponent handle the rest. So, in this case, the correct representation is 1.528535047 × 10^5.
This, plus the requirement that the leftmost digit of the significand be nonzero, is called normalization. By doing this, one no longer needs to express the point explicitly; the exponent provides that information. In decimal floating-point notation with precision of 10, the revolution period of Io is simply e=5; s=1528535047.
Some people (and some computer representations) prefer a different convention for the presumed location of the point, such as to the left of the leftmost digit. This simply offsets the exponent.
When a (nonzero) floating-point number is normalized, its leftmost digit is nonzero. The value of the significand obeys 1 ≤ s < b. (Zero cannot be represented in normalized floating-point notation, and computers have to deal with it separately.)
The mathematical value of a floating-point number is usually s.ssssssss...sss × b^e.
In binary radix, the significand is a string of bits (1's and 0's) of length p, of which the leftmost bit is 1. The number π, represented in binary, is
- 11.0010010000111111011010101000100010000101101000110000100011010011... but is
- 11.0010010000111111011011 when rounded to a precision of 24 bits.
In binary floating-point, this is e=1 ; s=110010010000111111011011.
The actual real number that a floating-point number represents is the number that one could obtain by placing an infinite number of zeros to the right of the significand:
- e=1; s=1100100100001111110110110000000000000000...
This number with a 24-bit significand has a decimal value of (exactly!)
- 3.1415927410125732421875, whereas the true value of π is
- 3.1415926535897932384626433832795...
The result of rounding π to 24-bit binary floating-point differs from the true value by about 0.03 parts per million, and matches the decimal representation of π in only the first 7 digits. (Accuracy of about 7 digits is the rule of thumb for rounding of real numbers to 24-bit floating-point.)
The problem is that floating-point numbers with a limited number of digits can represent only a subset of the reals, and that π is not in that subset. The act of rounding a real number to a floating-point representation consists of finding the representable real number that is closest to the given real number.
One doesn't need numbers as sophisticated as π to exhibit this phenomenon. The decimal number 0.1 is not representable in binary floating-point of any finite precision. The exact binary representation would have a "1100" sequence continuing endlessly:
- e=-4; s=1100110011001100110011001100110011..., but when rounded to 24 bits it becomes
- e=-4; s=110011001100110011001101 which is actually 0.100000001490116119384765625 in decimal.
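This can be observed directly. The following C sketch (assuming that float is IEEE-754 single precision, as on most modern hardware; the exact printed digits depend on the C library) prints the value that is actually stored for 0.1:

#include <stdio.h>

int main(void)
{
    float x = 0.1f;            /* rounded to the nearest 24-bit binary value */
    /* Printing with many digits shows the value actually stored,
       not the decimal string "0.1" written in the source.          */
    printf("%.30f\n", x);      /* typically 0.100000001490116119384765625000 */
    return 0;
}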
Banker's rounding
When the bits after the last available result bit are 1000000000... exactly (that is, it is known that there are an infinite number of zeros after the initial 1), rounding upward or downward is equally accurate. In this case, the technique of Banker's rounding is usually used: the rounding is done in the direction that makes the low-order bit of the resulting significand zero, that is, that makes the significand even.
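C99's rint() applies this rule under the default rounding mode, so a small sketch can make the tie-breaking visible (a hedged illustration; link with -lm where required):

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Under the default round-to-nearest mode, rint() breaks ties
       by choosing the even neighbor ("banker's rounding").         */
    printf("%.1f\n", rint(0.5));   /* 0.0, not 1.0 */
    printf("%.1f\n", rint(1.5));   /* 2.0 */
    printf("%.1f\n", rint(2.5));   /* 2.0, not 3.0 */
    return 0;
}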
Mantissa
The word mantissa is often used as a synonym for significand. Purists may not consider this usage to be correct, since the mantissa is traditionally defined as the fractional part of a logarithm, while the characteristic is the integer part. This terminology comes from the way logarithm tables were used before computers became commonplace: log tables were actually tables of mantissas. In that traditional sense, the mantissa of a number's logarithm is the logarithm of its significand.
Computer representation
- This section describes some general issues, but mostly follows the IEEE standard.
To represent a floating-point number in a computer datum, the exponent has to be encoded into a bit field. Since the exponent could be negative, one could use two's complement representation. Instead, a fixed constant is added to the exponent, with the intention that the result be a positive number capable of being packed into a fixed bit field. For the common 32 bit "single precision" or "float" format of the IEEE standard, this constant is 127, so the exponent is said to be represented in "excess 127" format. The result of this addition is placed in an 8 bit field.
Since the leftmost significand bit of a (normalized) floating-point number is always 1, that bit is not actually placed in the computer datum. The computer's hardware acts as though that "1" had been provided. This is the "implicit bit" or "hidden bit" of the IEEE standard. Because of this, a 24 bit significand is actually placed in a 23 bit field.
Finally, a sign bit is required. This is set to 1 to indicate that the entire floating-point number is negative, or 0 to indicate that it is positive. (In the past, some computers have used a kind of two's complement encoding for the entire number, rather than simple "sign/magnitude" format.)
The entire floating-point number is packed into a 32 bit word, with the sign bit leftmost, followed by the exponent in excess 127 format in the next 8 bits, followed by the significand (without the hidden bit) in the rightmost 23 bits.
For the approximation to π, we have
- sign=0 ; e=1 ; s=110010010000111111011011 (including the hidden bit)
- e+127 = 128 = 10000000 in (8 bit) binary
- final 32 bit result = 0 10000000 10010010000111111011011 = 0x40490FDB
As noted above, this number is not really π, but is exactly 3.1415927410125732421875.
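The layout can be verified with a short C sketch (assuming 32-bit IEEE single precision; memcpy is used to reinterpret the bits without a type-punning cast):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    float pi = 3.14159265358979f;    /* rounds to the 24-bit approximation of pi */
    uint32_t bits;
    memcpy(&bits, &pi, sizeof bits); /* copy the 32 raw bits, no conversion */

    unsigned sign     = bits >> 31;          /* 1 bit                      */
    unsigned exponent = (bits >> 23) & 0xFF; /* 8 bits, excess-127 format  */
    unsigned fraction = bits & 0x7FFFFF;     /* 23 bits, hidden bit removed */

    printf("word     = 0x%08X\n", bits);     /* expected: 0x40490FDB       */
    printf("sign     = %u\n", sign);         /* 0                          */
    printf("e + 127  = %u (so e = %d)\n", exponent, (int)exponent - 127);  /* 128, e = 1 */
    printf("fraction = 0x%06X\n", fraction); /* 0x490FDB                   */
    return 0;
}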
In the common 64 bit "double precision" or "double" format of the IEEE standard, the offset added to the exponent is 1023, and the result is placed into an 11 bit field. The precision is 53 bits. After removal of the hidden bit, 52 bits remain. The result comprises 1+11+52=64 bits. The approximation to π is
- sign=0 ; e=1 ; s=11001001000011111101101010100010001000010110100011000 (including the hidden bit)
- e+1023 = 1024 = 10000000000 in (11 bit) binary
- final 64 bit result = 0 10000000000 1001001000011111101101010100010001000010110100011000 = 0x400921FB54442D18
This number is exactly 3.141592653589793115997963468544185161590576171875.
Overflow, underflow, and zero
The necessity to pack the offset exponent into a fixed-size bit field places limits on the exponent. For the standard 32 bit format, e+127 must fit into an 8 bit field, so −127 ≤ e ≤ 128. The values −127 and +128 are reserved for special meanings, so the actual range for normalized floating-point numbers is −126 ≤ e ≤ 127. This means that the smallest normalized number is
- e=−126 ; s=100000000000000000000000
which is about 1.18 × 10^−38, and is represented in hexadecimal as 00800000. The largest representable number is
- e=+127 ; s=111111111111111111111111
which is about 3.4 × 10^38, and is represented in hexadecimal as 7F7FFFFF. For double precision the range is about 2.2 × 10^−308 to 1.8 × 10^308.
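These limits are exposed in C through <float.h>; a brief sketch (the constant names are standard C, and the printed magnitudes should match the figures above):

#include <stdio.h>
#include <float.h>

int main(void)
{
    /* Smallest and largest positive normalized values for each format. */
    printf("FLT_MIN = %e\n", FLT_MIN);   /* about 1.18e-38  */
    printf("FLT_MAX = %e\n", FLT_MAX);   /* about 3.40e+38  */
    printf("DBL_MIN = %e\n", DBL_MIN);   /* about 2.23e-308 */
    printf("DBL_MAX = %e\n", DBL_MAX);   /* about 1.80e+308 */
    return 0;
}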
Any floating-point computation that gives a result (after rounding to a representable value) higher than the upper limit is said to overflow. Under the IEEE standard, such a result is set to a special value, "infinity", which has the appropriate sign bit, the reserved exponent +128, and a significand field of all zeros. Such numbers are generally printed as "+INF" or "-INF".
Floating-point hardware is generally designed to handle operands of infinity in a reasonable way, such as
- (+INF) + (+7) = (+INF)
- (+INF) × (-2) = (-INF)
A floating-point computation that (after rounding) gives a nonzero result lower than the lower limit is said to underflow. This could happen, for example, if 10^−25 is multiplied by 10^−25 in single precision. Under the IEEE standard, the reserved all-zeros exponent field is used, and the significand is set as follows.
First, if the number is zero, it is represented by an all-zeros exponent field and a significand field of all zeros. This means that zero is represented in hexadecimal as 00000000.
Otherwise, if normalizing the number would lead to an exponent below −126, it is only normalized until the exponent is −126. That is, instead of shifting the significand bits left until the leftmost bit is 1, they are shifted only until the exponent reaches −126 (the smallest normalized exponent). For example, the smallest non-underflowing number is
- e=−126 ; s=1.00000000000000000000000 (about 1.18 × 10^−38)
A number 1/16th as large would be
- e=−130 ; s=1.00000000000000000000000 (about 7.3 × 10^−40)
If it is partially normalized, one gets
- e=−126 ; s=0.00010000000000000000000
This does not have a leading bit of 1, so the "hidden bit" mechanism cannot be used. What is done is to store the 23 significand bits directly, without assuming a leading 1, since there is no guarantee that it is 1. This means that the precision is at most 23 bits, not 24. The exponent field is stored as all zeros (the reserved pattern), and the hardware interprets such a number with an effective exponent of −126 and no hidden bit. The final representation is
- 0 00000000 00010000000000000000000 = 00080000 in hexadecimal
Whenever the exponent field is all zeros, the bits are interpreted in this special format. Such a number is said to be "denormalized" (a "denorm" for short), or, in more modern terminology, "subnormal".
The smallest possible (subnormal) nonzero number is
- 0 00000000 00000000000000000000001 = 00000001 in hexadecimal
- e=−126 ; s=0.00000000000000000000001
which is 2^−149, or about 1.4 × 10^−45.
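A short C sketch (assuming IEEE single precision) builds this smallest subnormal directly from its bit pattern:

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void)
{
    uint32_t bits = 0x00000001;    /* all-zero exponent field, lowest fraction bit set */
    float smallest;
    memcpy(&smallest, &bits, sizeof smallest);
    printf("%g\n", smallest);      /* about 1.4e-45, that is, 2^-149 */
    return 0;
}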
The handling of the number zero can be seen to be a completely ordinary case of a subnormal number.
The creation of denormalized numbers is often called "gradual underflow". As numbers get extremely small, significand bits are slowly sacrificed. The alternative is "sudden underflow", in which any number that can't be normalized is simply set to zero by the computer hardware. Gradual underflow is difficult for computer hardware to handle, so hardware often uses software to assist it, through interrupts. This can create a performance penalty, and where this is critical sudden underflow might be used.
Behavior of computer arithmetic
The standard behavior of computer hardware is to round the ideal (infinitely precise) result of an arithmetic operation to the nearest representable value, and give that representation as the result. In practice, there are other options. IEEE-754-compliant hardware allows one to set the rounding mode to any of the following:
- round to nearest (the default; by far the most common mode)
- round up (toward +∞; negative results round toward zero)
- round down (toward −∞; negative results round away from zero)
- round toward zero (sometimes called "chop" mode; it is similar to the common behavior of float-to-integer conversions, which convert −3.9 to −3)
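On systems whose C library and compiler honor the IEEE rounding modes, C99's <fenv.h> can switch among them at run time; a hedged sketch (FENV_ACCESS support varies by compiler, and the volatile qualifiers discourage constant folding):

#include <stdio.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    volatile double x = 1.0, y = 3.0;

    fesetround(FE_DOWNWARD);     /* round toward minus infinity */
    printf("%.17g\n", x / y);    /* slightly below one third */

    fesetround(FE_UPWARD);       /* round toward plus infinity */
    printf("%.17g\n", x / y);    /* slightly above one third */

    fesetround(FE_TONEAREST);    /* restore the default mode */
    return 0;
}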
In the default rounding mode the IEEE 754 standard mandates the round-to-nearest behavior described above for all fundamental algebraic operations, including square root. ("Library" functions such as cosine and log are not mandated.) This means that IEEE-compliant hardware's behavior is completely determined in all 32 or 64 bits.
The mandated behavior for dealing with overflow and underflow is that the appropriate result is computed, taking the rounding mode into consideration, as though the exponent range were infinitely large. If that resulting exponent can't be packed into its field correctly, the overflow/underflow action described above is taken.
The arithmetical distance between two consecutive representable floating point numbers is called an "ULP", for Unit in the Last Place. For example, the numbers represented by 45670123 and 45670124 hexadecimal differ by one ULP. An ULP is about 10^−7 in single precision, and 10^−16 in double precision. The mandated behavior of IEEE-compliant hardware is that the result be within one-half of an ULP.
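The ULP sizes quoted above can be measured with the standard C99 functions nextafterf and nextafter, which return the adjacent representable number; a brief sketch:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Distance from 1.0 to the next representable number above it. */
    printf("single-precision ULP at 1.0: %g\n", nextafterf(1.0f, 2.0f) - 1.0f); /* about 1.19e-07 */
    printf("double-precision ULP at 1.0: %g\n", nextafter(1.0, 2.0) - 1.0);     /* about 2.22e-16 */
    return 0;
}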
Computer handling of floating point
For ease of presentation and understanding, decimal radix with 7 digit precision will be used in the examples. The fundamental principles are the same in any radix or precision.
To add or subtract two numbers, they must have their decimal (or binary) points lined up. This is done by comparing the exponent fields and shifting the smaller number to the right. In the example below, the second number is shifted right by three digits because it has the smaller exponent. It is unnormalized at this point. The actual addition is then performed:
  e=5; s=1.234567     (123456.7)
+ e=2; s=1.017654     (101.7654)

  e=5; s=1.234567
+ e=5; s=0.001017654  (after shifting)
--------------------
  e=5; s=1.235584654  (true sum: 123558.4654)
This is the "true" result, relative to the exact meaning of the incoming operands, but it has too many digits. It must be rounded to seven digits (to match the precision) and then normalized if necessary. The final result is e=5; s=1.235585 (that is, 123558.5, with a rounding error of +.000346)
It should be noted that the low 3 digits of the second operand (654) are essentially lost. Except for their possible influence on carrying and rounding, they have no effect. This is loss of significance. Whenever numbers of different magnitudes are added or subtracted, the lowest digits (or bits) of the smaller one are lost. In the most serious case of this, the smaller operand can be totally "absorbed" (that is, it has no effect at all):
  e=5; s=1.234567
+ e=-3; s=9.876543

  e=5; s=1.234567
+ e=5; s=0.00000009876543 (after shifting)
----------------------
  e=5; s=1.23456709876543 (true sum)
  e=5; s=1.234567         (after rounding/normalization)
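The same absorption occurs in binary arithmetic. In IEEE double precision the spacing between representable numbers near 10^16 is 2, so adding 1 has no effect at all; a short C sketch:

#include <stdio.h>

int main(void)
{
    double big = 1.0e16;    /* exactly representable; spacing between doubles here is 2.0 */
    double sum = big + 1.0; /* the 1.0 is completely absorbed by rounding */
    printf("%s\n", sum == big ? "equal" : "different");   /* prints "equal" */
    return 0;
}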
Another problem of loss of significance occurs when two nearly equal numbers are subtracted.
  e=1; s=3.141600
- e=1; s=3.141593
----------------
  e=1; s=0.000007  (true difference)
  e=-5; s=7.000000 (after rounding/normalization)
Nearly all of the digits of the normalized result are meaningless. This is cancellation. It occurs when nearly equal numbers are subtracted, or numbers of opposite sign but nearly equal magnitude are added. Although the trailing digits are zero, their value could be anything. The numbers entering the calculation are presumably not known to be exact values, so the calculation might have been described as
  e=1; s=3.141600??????...
- e=1; s=3.141593??????...
----------------
  e=1; s=0.000007??????
  e=-5; s=7.??????
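A hedged C sketch of the same effect in binary: the subtraction below is exact, but it strips away the accurate leading digits and leaves earlier rounding error as the leading digits of the result:

#include <stdio.h>

int main(void)
{
    double x = 1.0e-12;
    double y = (1.0 + x) - 1.0;   /* mathematically equal to x */
    /* 1.0 + x was already rounded; the subtraction exposes that error,
       so y typically agrees with x only in its first few digits.       */
    printf("x = %.17g\n", x);
    printf("y = %.17g\n", y);
    return 0;
}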
To multiply, the significands are multiplied while the exponents are added, and the result is rounded and normalized.
  e=3; s=4.734612
× e=5; s=5.417242
-----------------------
  e=8; s=25.648538980104 (true product)
  e=8; s=25.648539       (after rounding)
  e=9; s=2.5648539       (after normalization)
Division is done similarly, but it is more complicated.
There are no cancellation or absorption problems with multiplication or division, though overflow and underflow problems may occur, and small errors may accumulate as operations are performed repeatedly. In practice, the way these operations are carried out in digital logic can be quite complex. (see Booth's multiplication algorithm and digital division)
The enormous complexity of modern division algorithms once led to a famous error. [1] An early version of the Intel Pentium chip was shipped with a division instruction that, on rare occasions, gave slightly incorrect results. Many computers had been shipped before the error was discovered. Until the defective computers were replaced, patched versions of compilers were developed that could avoid the failing cases.
Exceptional values and exceptions under the IEEE standard
In addition to the "infinity" value that is produced when an overflow occurs, there is a special value "NaN" ("not a number") that is produced by such operations as taking the square root of a negative number. NaN is encoded with the reserved exponent of 128 (or 1024), and a significand field that distinguishes it from infinity.
The intention of the INF and NaN values is that, under the most common circumstances, they can just propagate from one operation to the next (any operation with NaN as an operand produces NaN as a result), and they only need to be attended to at a point that the programmer chooses.
In addition to the creation of exceptional values, there are "events" that may occur, though some of them are quite benign:
- An overflow occurs as described previously, producing an infinity.
- An underflow occurs as described previously, producing a denorm.
- A zerodivide occurs whenever a divisor is zero, producing an infinity of the appropriate sign. (The sign of zero is meaningful here.) Note that a very small but nonzero divisor can still cause an overflow and produce an infinity.
- An "operand error" occurs whenever a NaN has to be created. This occurs whenever any operand to an operation is a NaN, or some other obvious thing happens, such a sqrt(-2.0) or log(-1.0).
- An "inexact" event occurs whenever the rounding of a result changed that result from the true mathematical value. This occurs almost all the time, and is usually ignored. It is looked at only in the most exacting applications.
Computer hardware is typically able to raise exceptions ("traps") when these events occur. How this is done is very system-dependent. Usually all exceptions are masked (disabled). Sometimes overflow, zerodivide, and operand error are enabled.
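In C99, these events are also visible as sticky status flags through <fenv.h>, whether or not traps are enabled; a hedged sketch (assumes an IEEE-754 floating-point environment, and FENV_ACCESS support varies by compiler):

#include <stdio.h>
#include <math.h>
#include <fenv.h>

#pragma STDC FENV_ACCESS ON

int main(void)
{
    volatile double zero = 0.0, neg = -1.0;

    feclearexcept(FE_ALL_EXCEPT);
    volatile double inf_result     = 1.0 / zero;  /* raises the divide-by-zero flag */
    volatile double invalid_result = sqrt(neg);   /* raises the invalid-operation flag */
    (void)inf_result; (void)invalid_result;

    if (fetestexcept(FE_DIVBYZERO)) printf("zerodivide occurred\n");
    if (fetestexcept(FE_INVALID))   printf("operand error (invalid) occurred\n");
    return 0;
}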
Accuracy problems
Because floating-point numbers cannot faithfully mimic the real numbers, and floating-point operations cannot faithfully mimic true arithmetic operations, there are many problems that arise in writing mathematical software that uses floating-point. First, while addition and multiplication are both commutative (a+b = b+a and a×b = b×a), they are not associative. Using 7-digit decimal arithmetic:
  1234.567 + 45.67844 = 1280.245
  1280.245 + 0.0004   = 1280.245
but
  45.67844 + 0.0004   = 45.67884
  45.67884 + 1234.567 = 1280.246
They are also not distributive:
  1234.567 × 3.333333 = 4115.223
  1.234567 × 3.333333 = 4.115223
  4115.223 + 4.115223 = 4119.338
but
  1234.567 + 1.234567 = 1235.802
  1235.802 × 3.333333 = 4119.340
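The same failure of associativity is easy to provoke in binary double precision, where the spacing of representable numbers near 2^53 is 2; a short C sketch:

#include <stdio.h>

int main(void)
{
    double a = 9007199254740992.0;   /* 2^53: the spacing between doubles here is 2 */
    double b = 1.0, c = 1.0;

    double left  = (a + b) + c;   /* each +1 is absorbed: result 2^53     */
    double right = a + (b + c);   /* 1 + 1 = 2 survives: result 2^53 + 2  */

    printf("(a + b) + c = %.1f\n", left);
    printf("a + (b + c) = %.1f\n", right);
    printf("%s\n", left == right ? "equal" : "not equal");   /* "not equal" */
    return 0;
}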
Aside from that, the rounding actions that are performed after each arithmetic operation lead to inaccuracies that can accumulate in unpredictable ways. Consider the 24-bit (single precision) representation of (decimal) 0.1 that was given previously:
- e=-4; s=110011001100110011001101 (0.100000001490116119384765625 exactly)
The square of that is
- .010000000298023226097399174250313080847263336181640625 exactly
The representable number closest to this is
- e=-7; s=101000111101011100001011 (.010000000707805156707763671875 exactly)
But the representable number closest to 0.01 itself is
- e=-7; s=101000111101011100001010 (.00999999977648258209228515625 exactly)
What this means is that, in C or C++ or similar languages, the following calculation will return false instead of true:
((float) 1 / (float) 10) * ((float) 1 / (float) 10) == (float) 1 / (float) 100
In addition to loss of significance, inability to represent things like π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:
- Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy. This is perhaps the most common and serious accuracy problem.
- Conversions to integer are unforgiving: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than round.
- Limited exponent range: results might overflow, yielding infinity.
- Testing for safe division is problematical: Checking that the divisor is not zero does not guarantee that a division will not overflow and yield infinity.
- Comparison for exact equality of two numbers is problematical. Programmers often perform comparisons within some tolerance, but that doesn't necessarily make the problem go away.
Minimizing the effect of accuracy problems
Because of the problems noted above, naive use of floating point arithmetic can lead to many problems. A good understanding of numerical analysis is essential to the creation of robust floating point software. The subject is actually quite complicated, and the reader is referred to the references at the bottom of this article.
In addition to careful design of programs, careful handling by the compiler is essential. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area. See the external references at the bottom of this article.
Floating point arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of Io or the mass of the proton), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact. An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation. See: http://www2.hursley.ibm.com/decimal/. The "decimal" data type of the C# programming language, the BigDecimal class of Java, and the IEEE 854 standard are designed to avoid the problems of binary floating point, and to make the arithmetic behave as expected when numbers are printed in decimal.
Double precision floating point arithmetic is more accurate than just about any physical measurement one could make. For example, it could indicate the distance from the Earth to the Moon with an accuracy of about 50 nanometers. So, if one were designing an integrated circuit chip with 100 nanometer features, that stretched from the Earth to the Moon, double precision arithmetic would be somewhat problematical, but only somewhat.
What makes floating point arithmetic troublesome is that people write mathematical algorithms that perform operations an enormous number of times, and so small errors grow. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed if they are to work well.
People often carry expectations from their mathematics training into the field of floating point computation. For example, exact algebraic identities are known to hold for the real numbers, and eigenvectors are known to be degenerate if the eigenvalues are equal. Such facts can't be counted on when the quantities involved are the result of floating point computation.
While a treatment of the techniques for writing high-quality floating-point software is far beyond the scope of this article, here are a few simple tricks:
The use of the equality test (if (x==y) ...) is usually not a good idea when it is based on expectations from pure mathematics. Such things are sometimes replaced with "fuzzy" tests (if (abs(x-y) < 1.0E-13) ...). The wisdom of doing this varies greatly. It is often better to organize the code in such a way that such tests are unnecessary.
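A hedged sketch of such a "fuzzy" test in C; the helper name and the tolerance are illustrative only, and a relative tolerance is used so that the test behaves sensibly across magnitudes:

#include <stdio.h>
#include <math.h>

/* Illustrative helper: true when x and y agree to within a relative
   tolerance rel (with an absolute floor for values near zero).       */
static int nearly_equal(double x, double y, double rel)
{
    double scale = fmax(fabs(x), fabs(y));
    return fabs(x - y) <= rel * fmax(scale, 1.0);
}

int main(void)
{
    double a = 0.1 * 0.1;
    double b = 0.01;
    printf("a == b       : %s\n", a == b ? "true" : "false");                 /* false */
    printf("nearly_equal : %s\n", nearly_equal(a, b, 1e-12) ? "true" : "false"); /* true */
    return 0;
}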
An awareness of when loss of significance can occur is useful. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000. A typical addition would then be something like
    3253.671
+      3.141276
---------------
    3256.812
The low 3 digits of the addends are effectively lost. The Kahan summation algorithm may be used to reduce the errors.
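A hedged C sketch of the Kahan summation algorithm applied to that situation (the addend values are illustrative; the essential part is the compensation variable c):

#include <stdio.h>

/* Kahan summation: carry the rounding error of each addition in a
   separate compensation term and feed it back into the next step.   */
static double kahan_sum(const double *a, int n)
{
    double sum = 0.0, c = 0.0;        /* c accumulates the lost low-order part */
    for (int i = 0; i < n; i++) {
        double y = a[i] - c;          /* correct the addend by the previous error */
        double t = sum + y;           /* low-order digits of y may be lost here   */
        c = (t - sum) - y;            /* recover (algebraically zero) what was lost */
        sum = t;
    }
    return sum;
}

int main(void)
{
    /* Many addends, each approximately 3, as in the example above. */
    enum { N = 100000 };
    static double a[N];
    double naive = 0.0;
    for (int i = 0; i < N; i++) {
        a[i] = 3.0 + 1.0e-8 * i;
        naive += a[i];
    }
    /* The two results typically differ in their low-order digits. */
    printf("naive sum : %.10f\n", naive);
    printf("kahan sum : %.10f\n", kahan_sum(a, N));
    return 0;
}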
Another thing that can be done is to rearrange the computation in a way that is mathematically equivalent but less prone to error. As an example, Archimedes approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. For the circumscribed polygon, with t(0) = 1/√3 and with 6 × 2^i × t(i) approximating π, the recurrence can be written in either of two algebraically equivalent forms:
- first form: t(i+1) = (√(t(i)² + 1) − 1) / t(i)
- second form: t(i+1) = t(i) / (√(t(i)² + 1) + 1)
Here is a computation using IEEE "double" (53 bits of significand precision) arithmetic:
  i   6 × 2^i × t_i, first form   6 × 2^i × t_i, second form
  0   3.4641016151377543863       3.4641016151377543863
  1   3.2153903091734710173       3.2153903091734723496
  2   3.1596599420974940120       3.1596599420975006733
  3   3.1460862151314012979       3.1460862151314352708
  4   3.1427145996453136334       3.1427145996453689225
  5   3.1418730499801259536       3.1418730499798241950
  6   3.1416627470548084133       3.1416627470568494473
  7   3.1416101765997805905       3.1416101766046906629
  8   3.1415970343230776862       3.1415970343215275928
  9   3.1415937488171150615       3.1415937487713536668
 10   3.1415929278733740748       3.1415929273850979885
 11   3.1415927256228504127       3.1415927220386148377
 12   3.1415926717412858693       3.1415926707019992125
 13   3.1415926189011456060       3.1415926578678454728
 14   3.1415926717412858693       3.1415926546593073709
 15   3.1415919358822321783       3.1415926538571730119
 16   3.1415926717412858693       3.1415926536566394222
 17   3.1415810075796233302       3.1415926536065061913
 18   3.1415926717412858693       3.1415926535939728836
 19   3.1414061547378810956       3.1415926535908393901
 20   3.1405434924008406305       3.1415926535900560168
 21   3.1400068646912273617       3.1415926535898608396
 22   3.1349453756585929919       3.1415926535898122118
 23   3.1400068646912273617       3.1415926535897995552
 24   3.2245152435345525443       3.1415926535897968907
 25                               3.1415926535897962246
 26                               3.1415926535897962246
 27                               3.1415926535897962246
 28                               3.1415926535897962246

The true value is 3.1415926535897932385...
While the two forms of the recurrence formula are clearly equivalent, the first subtracts 1 from a number extremely close to 1, leading to huge cancellation errors. Note that, as the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.
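A hedged C sketch of the computation, using the reconstruction of the recurrence given above; the two update lines are algebraically equivalent but behave very differently numerically:

#include <stdio.h>
#include <math.h>

int main(void)
{
    /* Archimedes-style recurrence for pi via circumscribed polygons,
       starting from a hexagon: t(0) = 1/sqrt(3), pi ~ 6 * 2^i * t(i). */
    double t1 = 1.0 / sqrt(3.0);   /* first form:  t <- (sqrt(t*t + 1) - 1) / t */
    double t2 = t1;                /* second form: t <- t / (sqrt(t*t + 1) + 1) */
    double scale = 6.0;

    for (int i = 0; i <= 28; i++) {
        printf("%2d  %.19f  %.19f\n", i, scale * t1, scale * t2);
        /* At large i the first form loses all accuracy and may
           eventually degenerate (for example to 0/0, giving NaN).     */
        t1 = (sqrt(t1 * t1 + 1.0) - 1.0) / t1;   /* cancels: sqrt(...) is close to 1 */
        t2 = t2 / (sqrt(t2 * t2 + 1.0) + 1.0);   /* no cancellation */
        scale *= 2.0;
    }
    return 0;
}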
A few nice properties
One can sometimes take advantage of a few nice properties:
- Any integer strictly less than 2^24 can be exactly represented in the single precision format, and any integer strictly less than 2^53 can be exactly represented in the double precision format. Furthermore, any reasonable power of 2 times such a number can be represented. This property is sometimes used in purely integer applications, to get 53-bit integers on machines that have double precision floats but only 32-bit integers.
- The bit representations are monotonic, as long as exceptional values are avoided and the signs are handled properly. Floating point numbers are equal if and only if their integer bit representations are equal. Comparisons for larger or smaller can be done with integer comparisons on the bit patterns, as long as the signs match. However, the actual floating point comparisons provided by hardware typically have much more sophistication in dealing with exceptional values.
- To a rough approximation, the bit representation of a floating point number is proportional to its base 2 logarithm, with an average error of about 3%. (This is because the exponent field is in the more significant part of the datum.) This can be exploited in some applications, such as volume ramping in digital sound processing.
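A hedged C sketch of the monotonicity property for positive single-precision values: incrementing the 32-bit pattern by one steps to the next representable float, so integer comparison of the patterns agrees with floating-point comparison:

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <math.h>

int main(void)
{
    float x = 1.5f;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);

    bits += 1;                       /* the next bit pattern */
    float y;
    memcpy(&y, &bits, sizeof y);

    /* y should be the immediate successor of x among representable floats. */
    printf("%d\n", y == nextafterf(x, 2.0f));   /* prints 1 */
    printf("%d\n", y > x);                      /* prints 1: integer order matches float order */
    return 0;
}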
IEEE standard
The IEEE has standardized the computer representation for binary floating-point numbers in IEEE 754. This standard is followed by almost all modern machines. Notable exceptions include IBM Mainframes, which support IBM's own format (in addition to IEEE 754 data types), and Cray vector machines, where the T90 series had an IEEE version, but the SV1 still uses Cray floating-point format.
The standard allows for many different precision levels, of which the 32 bit ("single") and 64 bit ("double") are by far the most common, since they are supported in common programming languages. Computer hardware (for example, the Intel Pentium series and the Motorola 68000 series) often provides an 80 bit format, with 15 exponent bits and 64 significand bits, with no hidden bit. There is controversy about the failure of most programming languages to make these hardware formats available to programmers (with some notable exceptions such as the D programming language). Software vendors may also provide additional extended formats, such as the H-P "quad" format (1 sign bit, 15 exponent bits, and 113 significand bits, 1 of which is hidden.)
As of 2000, the IEEE 754 standard is under revision. See IEEE 754r.
See also
- Significant digits
- Fixed-point arithmetic
- Computable number
- IEEE Floating Point Standard
- IBM Floating Point Architecture
- FLOPS
- −0 (number)
- half precision – single precision – double precision – quad precision – minifloat
- Scientific notation
- Numerical Recipes
References
External links
- An edited reprint of the paper What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg, published in the March, 1991 issue of Computing Surveys.
- David Bindel’s Annotated Bibliography on computer support for scientific computation.
- Donald Knuth. The Art of Computer Programming, Volume 2: Seminumerical Algorithms, Third Edition. Addison-Wesley, 1997. ISBN 0-201-89684-2. Section 4.2: Floating Point Arithmetic, pp.214 – 264.
- Press et al. Numerical Recipes in C++: The Art of Scientific Computing, ISBN 0-521-75033-4.
- Kahan, William and Darcy, Joseph (2001). How Java’s floating-point hurts everyone everywhere. Retrieved Sep. 5, 2003 from http://www.cs.berkeley.edu/~wkahan/JAVAhurt.pdf.
- Introduction to Floating point calculations and IEEE 754 standard by Jamil Khatib
- Survey of Floating-Point Formats This page gives a very brief summary of floating-point formats that have been used over the years.