Double-precision floating-point format

In computing, double precision is a computer number format, usually binary floating-point, that occupies 8 bytes (64 bits on modern computers) in computer memory.

In IEEE 754-2008 the 64-bit base 2 format is officially referred to as binary64. It was called double in IEEE 754-1985.

One of the first programming languages to provide single- and double-precision floating-point data types was Fortran. Before the widespread adoption of IEEE 754-1985, the representation and properties of the double float data type depended on the computer manufacturer and computer model.

Double-precision floating point provides a relative precision of about 16 decimal digits and a magnitude range from about 10⁻³⁰⁸ to about 10⁺³⁰⁸. In computers that have 64-bit floating-point arithmetic units, most numerical computing is done in double-precision floating point, since the use of single precision provides little speed advantage.^[1]^[2]

Double precision is known as double in C, C++ and Java.^[3]

IEEE 754 double precision binary floating-point format: binary64

The IEEE 754 standard specifies a binary64 as having:

Sign bit: 1 bit
Exponent width: 11 bits
Significand precision: 53 bits (52 explicitly stored)

The format is written with the significand having an implicit leading bit of value 1, unless the stored exponent is all zeros. Thus only 52 bits of the significand appear in the memory format, but the total precision is 53 bits (approximately 16 decimal digits, $\log_{10}(2^{53})\approx 15.955$). The bits are laid out, from most to least significant, as the sign (bit 63), the exponent (bits 62 to 52) and the significand fraction (bits 51 to 0).
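As a minimal sketch of this layout, the following Python snippet splits a double into its three fields. The helper name `split_fields` is hypothetical, and the snippet assumes that Python's `float` is an IEEE 754 binary64 value, which holds on common platforms but is not guaranteed by the language.

```python
import struct


def split_fields(x):
    """Return (sign, stored_exponent, fraction) for a binary64 value.

    Hypothetical helper for illustration; assumes float is binary64.
    """
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                        # 1 bit
    stored_exponent = (bits >> 52) & 0x7FF   # 11 bits, still biased
    fraction = bits & ((1 << 52) - 1)        # 52 explicitly stored bits
    return sign, stored_exponent, fraction


for value in (1.0, -2.0, 1.5):
    sign, exponent, fraction = split_fields(value)
    print(f"{value!r}: sign={sign} exponent=0x{exponent:03x} "
          f"fraction=0x{fraction:013x}")
```

The implicit leading bit never appears in `fraction`: for 1.5 only the top stored bit is set, and the leading 1 has to be restored when the value is reconstructed.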

Exponent encoding

The double-precision binary floating-point exponent is encoded using an offset binary representation with a zero offset of 1023, known as the exponent bias in the IEEE 754 standard.

E_min = 001_H−3FF_H = −1022
E_max = 7FE_H−3FF_H = 1023
Exponent bias = 3FF_H = 1023

Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 1023 has to be subtracted from the stored exponent. The value of E_max is 1023 instead of 1024 because an exponent consisting of all 1's is considered a special case^[4].

The stored exponents 000_H and 7FF_H are interpreted specially.

Exponent	Significand zero	Significand non-zero	Equation
000_H	zero, −0	subnormal numbers	(−1)^signbit×2⁻¹⁰²²× 0.significandbits
001_H, ..., 7FE_H	normalized value		(−1)^signbit×2^{exponentbits−1023}×1.significandbits
7FF_H	±infinity	NaN (quiet, signalling)
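The table can be read as a decoding procedure. The sketch below is one possible way to write it in Python, using exact rational arithmetic so that no rounding hides a mistake. The names `decode` and `bits_of` are hypothetical, and the code assumes Python's `float` is binary64. Because `Fraction` has no signed zero, −0 decodes to an exact 0 here.

```python
import math
import struct
from fractions import Fraction

BIAS = 1023


def decode(bits):
    """Decode a 64-bit pattern following the table above.

    Returns a Fraction for finite values and a float for infinities and NaN.
    """
    sign = -1 if bits >> 63 else 1
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0x7FF:
        # All-ones exponent: infinity if the significand is zero, else NaN.
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, fixed exponent of -1022.
        return sign * Fraction(fraction, 1 << 52) * Fraction(2) ** -1022
    # Normalized value: implicit leading 1, exponent with the bias removed.
    return sign * (1 + Fraction(fraction, 1 << 52)) * Fraction(2) ** (exponent - BIAS)


def bits_of(x):
    """Return the 64-bit pattern of a float as an integer."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return bits


# Cross-check against the platform's own doubles.
for x in (1.0, -2.0, 0.1, 5e-324):
    assert decode(bits_of(x)) == Fraction(x)
```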

The minimum positive (subnormal) value is 2⁻¹⁰⁷⁴ ≈ 5 × 10⁻³²⁴. The minimum positive normal value is 2⁻¹⁰²² ≈ 2.225 × 10⁻³⁰⁸. The maximum representable value is ≈ 1.79769 × 10³⁰⁸.
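These limits follow from the field widths, and can be checked with a short Python sketch. It assumes that Python's `float` is binary64, so that `sys.float_info` describes the same format; the variable names are illustrative only.

```python
import sys
from fractions import Fraction

# Smallest subnormal: fraction = 1, fixed exponent -1022, so 2^-52 * 2^-1022.
min_subnormal = Fraction(1, 2 ** 1074)
# Smallest normal: significand 1.0 with the minimum exponent E_min = -1022.
min_normal = Fraction(1, 2 ** 1022)
# Largest finite: significand of all ones (2 - 2^-52) with E_max = 1023.
max_finite = (2 - Fraction(1, 2 ** 52)) * 2 ** 1023

print(float(min_subnormal))
print(float(min_normal))
print(float(max_finite))

# On a binary64 platform these agree with the interpreter's reported limits.
assert float(min_normal) == sys.float_info.min
assert float(max_finite) == sys.float_info.max
```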

Double precision examples

These examples are given in bit representation, in hexadecimal, of the floating point value. This includes the sign, (biased) exponent, and significand.

3ff0 0000 0000 0000   = 1
3ff0 0000 0000 0001   = 1.0000000000000002, the next higher number > 1
3ff0 0000 0000 0002   = 1.0000000000000004
4000 0000 0000 0000   = 2
c000 0000 0000 0000   = −2

7fef ffff ffff ffff   ≈ 1.7976931348623157 × 10³⁰⁸ (max double precision)

0000 0000 0000 0000   = 0
8000 0000 0000 0000   = −0

7ff0 0000 0000 0000   = infinity
fff0 0000 0000 0000   = -infinity

3fd5 5555 5555 5555   ≈ 1/3

By default, 1/3 rounds down rather than up as it does in single precision, because the significand has an odd number of bits (53). The bits beyond the rounding point are 0101..., which is less than 1/2 of a unit in the last place.
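The bit patterns listed above can be checked with a few lines of Python. This is a sketch under the assumption that Python's `float` is binary64; `from_hex` is a hypothetical helper name. The last part measures how far the stored value falls below 1/3, in units in the last place.

```python
import struct
from fractions import Fraction


def from_hex(pattern):
    """Interpret a hexadecimal string (spaces allowed) as a big-endian binary64."""
    (value,) = struct.unpack(">d", bytes.fromhex(pattern))
    return value


assert from_hex("3ff0 0000 0000 0000") == 1.0
assert from_hex("3ff0 0000 0000 0001") == 1.0000000000000002
assert from_hex("4000 0000 0000 0000") == 2.0
assert from_hex("c000 0000 0000 0000") == -2.0

one_third = from_hex("3fd5 5555 5555 5555")
assert one_third == 1 / 3

# Exact rounding error of the stored value relative to the true 1/3.
error = Fraction(1, 3) - Fraction(one_third)
ulp = Fraction(1, 2 ** 54)  # spacing of doubles in the interval [1/4, 1/2)
assert 0 < error < ulp / 2  # stored value lies below 1/3: it was rounded down
print(error / ulp)
```

The printed ratio is the discarded tail 0.0101... in binary, read as a fraction of one unit in the last place.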

Each of the 52 stored bits of the significand, bit 51 down to bit 0, represents a value that starts at 0.5 and halves for each bit; the value 1 is carried by the implicit leading bit:

bit 51 = 0.5
bit 50 = 0.25
bit 49 = 0.125
bit 48 = 0.0625
bit 47 = 0.03125
.
.
bit  0 = ~0.0000000000000002220446 (~2.220446e-16)

In more detail:

Given the hexadecimal representation 3fd5 5555 5555 5555,
  Sign = 0x0
  Exponent = 0x3fd = 1021
  Exponent Bias = 1023 (above)
  Mantissa = 0x5 5555 5555 5555
  Value = 2^{(Exponent − Exponent Bias)} × 1.Mantissa – Note the Mantissa must not be converted to decimal here
        = 2⁻² × (0x15 5555 5555 5555 × 2⁻⁵²)
        = 2⁻⁵⁴ × 0x15 5555 5555 5555
        = 0.333333333333333314829616256247390992939472198486328125
        ≈ 1/3
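
The same calculation can be reproduced with exact arithmetic. The sketch below, with illustrative variable names and assuming Python's `float` is binary64, rebuilds the value from the bit pattern and compares it with the decimal expansion given above.

```python
from decimal import Decimal
from fractions import Fraction

bits = 0x3FD5555555555555
exponent = (bits >> 52) & 0x7FF        # 0x3fd = 1021
mantissa = bits & ((1 << 52) - 1)      # 0x5555555555555
significand = (1 << 52) | mantissa     # 0x15555555555555, implicit 1 restored

# Value = 2^(exponent - bias) * (significand * 2^-52), kept as an exact fraction.
value = Fraction(significand, 2 ** 52) * Fraction(2) ** (exponent - 1023)

assert value == Fraction(0x15555555555555, 2 ** 54)
assert value == Fraction(1 / 3)        # same number as the double nearest 1/3

# Decimal(float) converts exactly, so this prints the full expansion.
print(Decimal(1 / 3))
```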

References

[1] "Are doubles faster than floats in c#?"
[2] "C/C++ V7.0 (for AIX): Single-precision and double-precision performance" "With these architectures, ... single-precision instructions are ... are executed with the same speed as double-precision operations."
[3] http://java.sun.com/docs/books/tutorial/java/nutsandbolts/datatypes.html
[4] http://steve.hollasch.net/cgindex/coding/ieeefloat.html
