
Double-precision floating-point format

From Wikipedia, the free encyclopedia
This is an old revision of this page, as edited by 81.0.116.244 (talk) at 16:11, 3 April 2010 (Double precision examples: added sample of minimal fraction like it is in http://en.wikipedia.org/wiki/Single_precision_floating-point_format). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

In computing, double precision is a computer number format, usually binary floating-point, that occupies 8 bytes (64 bits on modern computers) in computer memory.

In IEEE 754-2008 the 64-bit base 2 format is officially referred to as binary64. It was called double in IEEE 754-1985.

One of the first programming languages to provide single- and double-precision floating-point data types was Fortran. Before the widespread adoption of IEEE 754-1985, the representation and properties of the double float data type depended on the computer manufacturer and computer model.

Double precision floating point provides a relative precision of about 16 decimal digits and a magnitude range from about 10^−308 to about 10^+308. In computers that have 64-bit floating-point arithmetic units, most numerical computing is done in double-precision floating point, since the use of single-precision provides little speed advantage.[1][2]
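As a quick check of these figures, the sketch below reads the limits that Python reports for its `float` type. It assumes a platform where `float` is an IEEE 754 binary64 value, which holds for mainstream CPython builds but is not guaranteed by the language.

```python
import sys

# Assumption: the platform's C double (and so Python's float) is binary64.
info = sys.float_info

print(info.mant_dig)  # significand precision in bits
print(info.dig)       # decimal digits guaranteed to survive a round trip
print(info.max)       # largest finite value
print(info.min)       # smallest positive normal value
```

On such a platform `info.dig` is 15 rather than 16, because it counts only the digits that are always preserved; the "about 16" figure is the average precision.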

Double precision is known as double in C, C++ and Java.[3]

IEEE 754 double precision binary floating-point format: binary64

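The IEEE 754 standard specifies a binary64 as having:

  • Sign bit: 1 bit
  • Exponent width: 11 bits
  • Significand precision: 53 bits (52 explicitly stored)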

The format is written with the significand having an implicit leading bit of value 1, unless the exponent is stored as all zeros. Thus only 52 bits of the significand appear in the memory format, but the total precision is 53 bits (approximately 16 decimal digits, since 53 × log10(2) ≈ 15.955). The bits are laid out as follows, from most to least significant: sign (bit 63), exponent (bits 62 to 52), fraction (bits 51 to 0).
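A minimal sketch of pulling the three fields out of a value, assuming the usual binary64 layout with the sign in bit 63, the exponent in bits 62 to 52, and the fraction in bits 51 to 0. The helper name `split_fields` is hypothetical and used only for illustration.

```python
import struct

def split_fields(x: float) -> tuple:
    """Return (sign, stored exponent, fraction) of x as a binary64 value."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # 11 bits, still biased
    fraction = bits & ((1 << 52) - 1)   # 52 explicitly stored bits
    return sign, exponent, fraction

print(split_fields(1.0))   # (0, 1023, 0)
print(split_fields(-2.0))  # (1, 1024, 0)
```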

Exponent encoding

The double precision binary floating-point exponent is encoded using an offset binary representation, with the zero offset being 1023; this offset is known as the exponent bias in the IEEE 754 standard.

  • Emin = 001H−3FFH = −1022
  • Emax = 7FEH−3FFH = 1023
  • Exponent bias = 3FFH = 1023

Thus, as defined by the offset binary representation, in order to get the true exponent the offset of 1023 has to be subtracted from the stored exponent. The value of Emax is 1023 instead of 1024 because an exponent consisting of all 1's is considered a special case[4].
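The subtraction can be sketched as follows. The function name `true_exponent` is an assumption made for this example, and the two reserved stored exponents are rejected rather than interpreted.

```python
import struct

BIAS = 0x3FF  # 1023

def true_exponent(x: float) -> int:
    """Return the unbiased exponent of a normal binary64 value."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    stored = (bits >> 52) & 0x7FF
    if stored in (0x000, 0x7FF):
        # Reserved encodings: zero, subnormal, infinity or NaN.
        raise ValueError("stored exponent is reserved")
    return stored - BIAS

print(true_exponent(1.0))   # 0
print(true_exponent(0.25))  # -2
```

Applied to the largest finite value and the smallest positive normal value, this returns Emax = 1023 and Emin = −1022 respectively.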

The stored exponents 000H and 7FFH are interpreted specially.

Exponent          Significand zero    Significand non-zero       Equation
000H              zero, −0            subnormal numbers          (−1)^signbit × 2^−1022 × 0.significandbits
001H, ..., 7FEH   normalized value    normalized value           (−1)^signbit × 2^(exponentbits − 1023) × 1.significandbits
7FFH              ±infinity           NaN (quiet, signalling)
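The rules in the table can be written out as a small decoder. This is a sketch for illustration, not a reference implementation: the name `decode` is hypothetical, and every NaN payload is collapsed to Python's single `math.nan`.

```python
import math

def decode(bits: int) -> float:
    """Evaluate a 64-bit pattern using the three cases in the table."""
    sign = -1.0 if bits >> 63 else 1.0
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if exponent == 0x7FF:
        # All ones: infinity if the significand is zero, otherwise NaN.
        return sign * math.inf if fraction == 0 else math.nan
    if exponent == 0x000:
        # Zero or subnormal: no implicit leading 1, exponent fixed at -1022.
        return sign * math.ldexp(fraction, -1022 - 52)
    # Normalized: restore the implicit leading 1 and remove the bias.
    return sign * math.ldexp((1 << 52) | fraction, exponent - 1023 - 52)

print(decode(0x3FF0000000000000))  # 1.0
print(decode(0xC000000000000000))  # -2.0
```

The extra `− 52` in each scale factor accounts for treating the significand as an integer instead of a binary fraction.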

The minimum positive (subnormal) value is 2^−1074 ≈ 5 × 10^−324. The minimum positive normal value is 2^−1022 ≈ 2.225 × 10^−308. The maximum representable value is ≈ 1.79769 × 10^308.
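These three limits can be reproduced from powers of two. The sketch assumes Python's `float` is binary64. The closed form (2 − 2^−52) × 2^1023 for the maximum is not stated in the text above; it follows from an all-ones significand combined with the largest normal exponent.

```python
import math

min_subnormal = math.ldexp(1.0, -1074)           # 2^-1074
min_normal = math.ldexp(1.0, -1022)              # 2^-1022
max_finite = math.ldexp(2.0 - 2.0 ** -52, 1023)  # (2 - 2^-52) * 2^1023

print(min_subnormal)  # 5e-324
print(min_normal)     # 2.2250738585072014e-308
print(max_finite)     # 1.7976931348623157e+308
```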

Double precision examples

These examples give the bit representation of the floating-point value in hexadecimal. This includes the sign, (biased) exponent, and significand.

3ff0 0000 0000 0000   = 1
3ff0 0000 0000 0001   = 1.0000000000000002, the next higher number > 1
3ff0 0000 0000 0002   = 1.0000000000000004
4000 0000 0000 0000   = 2
c000 0000 0000 0000   = −2
7fef ffff ffff ffff   ≈ 1.7976931348623157 × 10^308 (max double precision)
0000 0000 0000 0000   = 0
8000 0000 0000 0000   = −0
7ff0 0000 0000 0000   = infinity
fff0 0000 0000 0000   = −infinity
3fd5 5555 5555 5555   ≈ 1/3
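The patterns in this list can be checked by reinterpreting the hexadecimal digits as a big-endian double. The helper `from_hex` is a hypothetical name chosen for this sketch, which assumes Python's `float` is binary64.

```python
import struct

def from_hex(pattern: str) -> float:
    """Interpret 16 hex digits (spaces allowed) as a big-endian binary64."""
    raw = bytes.fromhex(pattern.replace(" ", ""))
    if len(raw) != 8:
        raise ValueError("expected exactly 16 hexadecimal digits")
    (value,) = struct.unpack(">d", raw)
    return value

for pattern in ("3ff0 0000 0000 0001", "c000 0000 0000 0000",
                "7fef ffff ffff ffff", "3fd5 5555 5555 5555"):
    print(pattern, "=", from_hex(pattern))
```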

By default, 1/3 rounds down in double precision, rather than up as in single precision, because the significand has an odd number of bits (53). The bits beyond the rounding point are 0101..., which is less than 1/2 of a unit in the last place.
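The difference in rounding direction can be observed with exact rational arithmetic. This sketch assumes that `struct` format `d` is binary64 and format `f` is binary32, and that packing to `f` rounds to nearest.

```python
import struct
from fractions import Fraction

third = Fraction(1, 3)

# Exact rational value of the double nearest to 1/3.
as_double = Fraction(1 / 3)

# Round to single precision by packing into 4 bytes, then read it back.
(single,) = struct.unpack(">f", struct.pack(">f", 1 / 3))
as_single = Fraction(single)

print(as_double < third)  # True: double precision rounds down
print(as_single > third)  # True: single precision rounds up
```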


Each of the 52 stored bits of the significand, bit 51 to bit 0, represents a value, starting at 0.5 and halving for each bit; the implicit leading bit contributes the value 1. The values are as follows:

implicit bit = 1
bit 51 = 0.5
bit 50 = 0.25
bit 49 = 0.125
bit 48 = 0.0625
.
.
bit  0 = ~0.0000000000000002220446 (~2.220446e-16)

In more detail:

Given the hexadecimal representation 3fd5 5555 5555 5555,
  Sign = 0x0
  Exponent = 0x3fd = 1021
  Exponent Bias = 1023 (above)
  Mantissa = 0x5 5555 5555 5555
  Value = 2^(Exponent − Exponent Bias) × 1.Mantissa (note that the Mantissa must not be converted to decimal here)
        = 2^−2 × (0x15 5555 5555 5555 × 2^−52)
        = 2^−54 × 0x15 5555 5555 5555
        = 0.333333333333333314829616256247390992939472198486328125
        ≈ 1/3
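The final steps of this derivation can be reproduced exactly, since both `fractions.Fraction` and `decimal.Decimal` convert a float without rounding. The variable names are illustrative only, and the sketch assumes Python's `float` is binary64.

```python
from decimal import Decimal
from fractions import Fraction

# Implicit leading 1 followed by the 52 stored bits 0x5 5555 5555 5555.
significand = 0x15_5555_5555_5555

# 2^-54 * significand, as in the last line of the derivation.
exact = Fraction(significand, 2 ** 54)

print(exact == Fraction(1 / 3))  # True
print(Decimal(1 / 3))            # full decimal expansion of the stored value
```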

See also

References