Double-precision floating-point format
In computing, double precision is a floating-point computer number format, usually binary, that occupies 8 bytes (64 bits in modern computers) in computer memory.
In IEEE 754-2008 the 64-bit base 2 format is officially referred to as binary64. It was called double in IEEE 754-1985.
One of the first programming languages to provide single- and double-precision floating-point data types was Fortran. Before the widespread adoption of IEEE 754-1985, the representation and properties of the double float data type depended on the computer manufacturer and computer model.
Double-precision floating point provides a relative precision of about 16 decimal digits and a magnitude range from about 10^−308 to about 10^+308. In computers that have 64-bit floating-point arithmetic units, most numerical computing is done in double-precision floating point, since the use of single precision provides little speed advantage.[1][2]
Double precision is known as double in C, C++ and Java.[3]
IEEE 754 double precision binary floating-point format: binary64
The IEEE 754 standard specifies a binary64 as having:
- Sign bit: 1 bit
- Exponent width: 11 bits
- Significand precision: 53 bits (52 explicitly stored)
The format is written with the significand having an implicit leading bit of value 1, unless the exponent is stored with all zeros. Thus only 52 bits of the significand appear in the memory format, but the total precision is 53 bits (approximately 16 decimal digits, since 53 × log10(2) ≈ 15.955). The bits are laid out as follows: the sign bit occupies bit 63, the exponent bits 62 to 52, and the significand (fraction) bits 51 to 0.
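The three fields can be pulled out of a value with a few shifts and masks. The following is a minimal sketch, assuming a Python implementation whose `float` is a binary64 (true on all common platforms); the helper name `split_double` is hypothetical and introduced only for illustration.

```python
import struct


def split_double(x: float) -> tuple[int, int, int]:
    """Return (sign, stored_exponent, stored_fraction) of a binary64 value."""
    # Reinterpret the 8 bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    exponent = (bits >> 52) & 0x7FF       # 11 bits, still biased
    fraction = bits & ((1 << 52) - 1)     # 52 explicitly stored bits
    return sign, exponent, fraction


print(split_double(1.0))    # (0, 1023, 0)
print(split_double(-2.0))   # (1, 1024, 0)
```

The implicit leading 1 never shows up in the returned fraction, which is why 1.0 has a stored fraction of zero.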
Exponent encoding
The double-precision binary floating-point exponent is encoded using an offset-binary representation with a zero offset of 1023, known as the exponent bias in the IEEE 754 standard.
- Emin = 001H−3FFH = −1022
- Emax = 7FEH−3FFH = 1023
- Exponent bias = 3FFH = 1023
Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 1023 has to be subtracted from the stored exponent. The value of Emax is 1023 instead of 1024 because an exponent consisting of all 1's is considered a special case[4].
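As an illustration of the subtraction described above, the sketch below recovers the true exponent of a normal value. It assumes Python's `float` is a binary64; `true_exponent` and `BIAS` are hypothetical names chosen for this example.

```python
import math
import struct

BIAS = 0x3FF  # 1023


def true_exponent(x: float) -> int:
    """Unbiased exponent of a normal binary64 value."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    stored = (bits >> 52) & 0x7FF
    if stored in (0x000, 0x7FF):
        # All zeros and all ones are reserved encodings, not ordinary exponents.
        raise ValueError("zero, subnormal, infinity or NaN")
    return stored - BIAS


print(true_exponent(1.0))     # 0
print(true_exponent(0.25))    # -2
print(true_exponent(1e308))   # 1023

# Cross-check: math.frexp scales the significand into [0.5, 1),
# so its exponent is one larger than the IEEE 754 exponent.
assert true_exponent(6.5) == math.frexp(6.5)[1] - 1
```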
The stored exponents 000H and 7FFH are interpreted specially.
Exponent | Significand zero | Significand non-zero | Equation |
---|---|---|---|
000H | zero, −0 | subnormal numbers | (−1)^signbit × 2^−1022 × 0.significandbits |
001H, ..., 7FEH | normalized value | normalized value | (−1)^signbit × 2^(exponentbits−1023) × 1.significandbits |
7FFH | ±infinity | NaN (quiet, signalling) | |
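The table translates fairly directly into code. The sketch below decodes a 64-bit pattern into an exact rational value using Python's `fractions` module; the function name `decode_binary64` is hypothetical, and the sign of −0 is not preserved because a `Fraction` has no signed zero.

```python
from fractions import Fraction


def decode_binary64(bits: int):
    """Decode a 64-bit pattern following the table above.

    Returns an exact Fraction for finite values and a string for
    infinities and NaNs.
    """
    sign = -1 if bits >> 63 else 1
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)

    if exponent == 0x7FF:
        if fraction == 0:
            return "+infinity" if sign > 0 else "-infinity"
        return "NaN"
    if exponent == 0x000:
        # Zero or subnormal: no implicit leading 1, fixed exponent of -1022.
        return sign * Fraction(fraction, 1 << 52) * Fraction(2) ** -1022
    # Normal: implicit leading 1, exponent stored with a bias of 1023.
    return sign * (1 + Fraction(fraction, 1 << 52)) * Fraction(2) ** (exponent - 1023)


print(decode_binary64(0x4000000000000000))   # 2
print(decode_binary64(0x7FF0000000000000))   # +infinity
print(decode_binary64(0x0000000000000001) == Fraction(1, 2**1074))   # True
```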
The minimum positive (subnormal) value is 2^−1074 ≈ 5 × 10^−324. The minimum positive normal value is 2^−1022 ≈ 2.225 × 10^−308. The maximum representable value is ≈ 1.79769 × 10^308.
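These limits can be reproduced from the powers of two involved. The following sketch assumes Python's `float` is a binary64 and uses only the standard `math` and `sys` modules; the variable names are chosen for illustration.

```python
import math
import sys

min_subnormal = math.ldexp(1.0, -1074)           # 2^-1074
min_normal = math.ldexp(1.0, -1022)              # 2^-1022
max_double = math.ldexp(2.0 - 2.0**-52, 1023)    # (2 - 2^-52) * 2^1023

print(min_subnormal)   # 5e-324
print(min_normal)      # 2.2250738585072014e-308
print(max_double)      # 1.7976931348623157e+308

# The interpreter reports the same normal-range limits.
assert min_normal == sys.float_info.min
assert max_double == sys.float_info.max
# Halving the smallest subnormal rounds to zero: nothing lies in between.
assert min_subnormal / 2 == 0.0
```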
Double precision examples
These examples are given in bit representation, in hexadecimal, of the floating point value. This includes the sign, (biased) exponent, and significand.
3ff0 0000 0000 0000 = 1
3ff0 0000 0000 0001 = 1.0000000000000002, the next higher number > 1
3ff0 0000 0000 0002 = 1.0000000000000004
4000 0000 0000 0000 = 2
c000 0000 0000 0000 = −2
7fef ffff ffff ffff ≈ 1.7976931348623157 × 10^308 (max double precision)
0000 0000 0000 0000 = 0
8000 0000 0000 0000 = −0
7ff0 0000 0000 0000 = infinity
fff0 0000 0000 0000 = −infinity
3fd5 5555 5555 5555 ≈ 1/3
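The examples above can be checked by reinterpreting the hexadecimal digits as a big-endian double. This is a small sketch, assuming Python's `float` is a binary64; `from_hex` is a hypothetical helper name.

```python
import struct


def from_hex(hex_digits: str) -> float:
    """Interpret 16 hexadecimal digits (spaces allowed) as a binary64 value."""
    raw = bytes.fromhex(hex_digits.replace(" ", ""))
    (value,) = struct.unpack(">d", raw)   # big-endian: sign bit comes first
    return value


for pattern in (
    "3ff0 0000 0000 0000",
    "3ff0 0000 0000 0001",
    "c000 0000 0000 0000",
    "7fef ffff ffff ffff",
    "8000 0000 0000 0000",
    "7ff0 0000 0000 0000",
    "3fd5 5555 5555 5555",
):
    print(pattern, "=", repr(from_hex(pattern)))
```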
By default, 1/3 rounds down, instead of up as in single precision, because of the odd number of bits in the significand (53, versus 24 in single precision). The bits beyond the rounding point are 0101..., which is less than 1/2 of a unit in the last place.
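The direction and size of this rounding error can be confirmed with exact rational arithmetic. The sketch below assumes Python's `float` is a binary64 with the default round-to-nearest mode; the variable names are illustrative.

```python
from fractions import Fraction

third = Fraction(1, 3)
stored = Fraction(1 / 3)        # exact value of the double nearest to 1/3

# The exponent of 1/3 is -2, so one unit in the last place is 2^(-2-52) = 2^-54.
ulp = Fraction(1, 2**54)

print(stored < third)           # True: the double lies below 1/3
print((third - stored) / ulp)   # 1/3, i.e. the discarded bits 0101... of one ulp
```

Because the discarded part is one third of a unit in the last place, which is below one half, round-to-nearest truncates it.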
Each of the 52 stored bits of the significand, bit 51 to bit 0, represents a fractional value that starts at 0.5 and halves for each bit (the implicit leading bit contributes the value 1), as follows:
bit 51 = 0.5
bit 50 = 0.25
bit 49 = 0.125
bit 48 = 0.0625
bit 47 = 0.03125
. .
bit 0 = 2^−52 ≈ 0.0000000000000002220446 (≈ 2.220446e-16)
In more detail:
Given the hexadecimal representation 3fd5 5555 5555 5555,
Sign = 0x0
Exponent = 0x3fd = 1021
Exponent Bias = 1023 (above)
Mantissa = 0x5 5555 5555 5555
Value = 2^(Exponent − Exponent Bias) × 1.Mantissa (note that the Mantissa must not be converted to decimal here)
= 2^−2 × (0x15 5555 5555 5555 × 2^−52)
= 2^−54 × 0x15 5555 5555 5555
= 0.333333333333333314829616256247390992939472198486328125
≈ 1/3
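The same derivation can be carried out with exact integer and rational arithmetic. This is a sketch, assuming Python's `float` is a binary64; it relies on the fact that `Decimal(float)` converts the stored binary value exactly, without rounding.

```python
from decimal import Decimal
from fractions import Fraction

bits = 0x3FD5555555555555
exponent = (bits >> 52) & 0x7FF        # 0x3fd = 1021
mantissa = bits & ((1 << 52) - 1)      # 0x5555555555555
significand = (1 << 52) | mantissa     # 0x15555555555555, implicit 1 restored

# Value = 2^(exponent - bias) * (significand * 2^-52)
value = Fraction(significand, 2**52) * Fraction(2) ** (exponent - 1023)

assert value == Fraction(0x15555555555555, 2**54)
assert value == Fraction(1 / 3)        # same number the hardware stores for 1/3

print(Decimal(1 / 3))
# 0.333333333333333314829616256247390992939472198486328125
```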
See also
- IEEE Standard for Floating-Point Arithmetic (IEEE 754)
- ISO/IEC 10967, Language Independent Arithmetic
- Primitive data type
- Numerical stability
- Single precision floating-point format
References
- "Are doubles faster than floats in c#?"
- "C/C++ V7.0 (for AIX): Single-precision and double-precision performance" "With these architectures, ... single-precision instructions are ... are executed with the same speed as double-precision operations."
- http://java.sun.com/docs/books/tutorial/java/nutsandbolts/datatypes.html
- http://steve.hollasch.net/cgindex/coding/ieeefloat.html