
Double-precision floating-point format

From Wikipedia, the free encyclopedia

Double-precision floating-point format (sometimes called FP64 or float64) is a computer number format, usually occupying 64 bits in computer memory; it represents a wide dynamic range of numeric values by using a floating radix point.

Floating point is used to represent fractional values, or when a wider range is needed than is provided by fixed point (of the same bit width), even if at the cost of precision. Double precision may be chosen when the range or precision of single precision would be insufficient.

In the IEEE 754-2008 standard, the 64-bit base-2 format is officially referred to as binary64; it was called double in IEEE 754-1985. IEEE 754 specifies additional floating-point formats, including 32-bit base-2 single precision and, more recently, base-10 representations.

One of the first programming languages to provide single- and double-precision floating-point data types was Fortran. Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the computer manufacturer and computer model, and upon decisions made by programming-language implementers. E.g., GW-BASIC's double-precision data type was the 64-bit MBF floating-point format.

IEEE 754 double-precision binary floating-point format: binary64

Double-precision binary floating-point is a commonly used format on PCs, due to its wider range over single-precision floating point, in spite of its performance and bandwidth cost. It is commonly known simply as double. The IEEE 754 standard specifies a binary64 as having:

  • Sign bit: 1 bit
  • Exponent: 11 bits
  • Significand precision: 53 bits (52 explicitly stored; the leading bit of a normal significand is always 1 and therefore does not need to be stored)

The sign bit determines the sign of the number (including when this number is zero, which is signed).

The exponent field is an 11-bit unsigned integer from 0 to 2047, in biased form: a stored value of 1023 represents an exponent of zero. Actual exponents range from −1022 to +1023 because the stored values 0 (all 0 bits) and 2047 (all 1 bits) are reserved for special quantities: signed zeros and subnormal numbers, and infinities and NaNs, respectively.

The 53-bit significand precision gives from 15 to 17 significant decimal digits precision (2^−53 ≈ 1.11 × 10^−16). If a decimal string with at most 15 significant digits is converted to IEEE 754 double-precision representation, and then converted back to a decimal string with the same number of digits, the final result should match the original string. If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number.[1]
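This round-trip behaviour can be checked directly. The following C sketch (illustrative only; the example values 0.1 and 0.123456789012345 are arbitrary) prints a double with 17 significant digits, parses it back, and confirms that the original value is recovered exactly:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* 17 significant digits are enough to recover any double exactly. */
    double original = 0.1;                        /* arbitrary example value */
    char buf[64];
    snprintf(buf, sizeof buf, "%.17g", original);
    double recovered = strtod(buf, NULL);
    printf("%s round-trips exactly: %s\n", buf, recovered == original ? "yes" : "no");

    /* Conversely, a decimal string with at most 15 significant digits
       survives a decimal -> double -> decimal round trip. */
    const char *decimal = "0.123456789012345";    /* 15 significant digits */
    snprintf(buf, sizeof buf, "%.15g", strtod(decimal, NULL));
    printf("%s -> %s\n", decimal, buf);
    return 0;
}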

The format is written with the significand having an implicit integer bit of value 1. With the 52 bits of the fraction (F) significand appearing in the memory format, the total precision is therefore 53 bits (approximately 16 decimal digits, 53 log10(2) ≈ 15.955). The bits are laid out as follows: bit 63 is the sign bit, bits 62–52 hold the biased exponent, and bits 51–0 hold the fraction.

The real value assumed by a given 64-bit double-precision datum with a given biased exponent $e$ and a 52-bit fraction is

$(-1)^{\text{sign}} (1.b_{51} b_{50} \ldots b_{0})_2 \times 2^{e-1023}$

or

$(-1)^{\text{sign}} \left(1 + \sum_{i=1}^{52} b_{52-i} 2^{-i}\right) \times 2^{e-1023}$
Between 2^52 = 4,503,599,627,370,496 and 2^53 = 9,007,199,254,740,992 the representable numbers are exactly the integers. For the next range, from 2^53 to 2^54, everything is multiplied by 2, so the representable numbers are the even ones, etc. Conversely, for the previous range from 2^51 to 2^52, the spacing is 0.5, etc.

The spacing as a fraction of the numbers in the range from 2^n to 2^(n+1) is 2^(n−52). The maximum relative rounding error when rounding a number to the nearest representable one (the machine epsilon) is therefore 2^−53.
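A short C sketch (illustrative only, assuming an IEEE 754 platform) makes the spacing concrete: near 2^53 the spacing of representable doubles is exactly 1, and DBL_EPSILON from <float.h> gives the spacing between 1 and the next larger double (2^−52, twice the 2^−53 rounding bound quoted above):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void) {
    /* Between 2^52 and 2^53 the representable doubles are exactly the integers,
       and just above 2^53 only the even integers remain representable. */
    double x = 9007199254740992.0;   /* 2^53 */
    printf("2^53     = %.1f\n", x);
    printf("2^53 + 1 = %.1f (rounds back to 2^53)\n", x + 1.0);
    printf("2^53 + 2 = %.1f (even integers stay exact)\n", x + 2.0);

    /* DBL_EPSILON is the spacing between 1.0 and the next larger double (2^-52);
       half of it, 2^-53, is the maximum relative rounding error quoted above. */
    printf("DBL_EPSILON               = %.17g\n", DBL_EPSILON);
    printf("nextafter(1.0, 2.0) - 1.0 = %.17g\n", nextafter(1.0, 2.0) - 1.0);
    return 0;
}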

The 11-bit width of the exponent allows the representation of normal numbers roughly between 10^−308 and 10^308 (that is, from 2^−1022 up to, but not including, 2^1024), with full 15–17 significant decimal digits of precision. If the magnitude of a result reaches 2^1024, the operation overflows and returns ∞ or −∞; results with magnitude below 2^−1022 underflow gradually through subnormal numbers, and sufficiently small results (below about 2^−1075) round to 0.
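A brief C illustration of this behaviour (assuming an IEEE 754 platform; the hexadecimal floating-point constants, a C99 feature, give the largest finite and smallest normal doubles exactly):

#include <stdio.h>
#include <math.h>

int main(void) {
    double max_finite = 0x1.fffffffffffffp+1023;  /* largest finite double */
    double min_normal = 0x1p-1022;                /* smallest positive normal double */

    /* Doubling the largest finite value overflows to infinity. */
    printf("max_finite * 2 = %g (isinf: %d)\n", max_finite * 2.0, isinf(max_finite * 2.0));

    /* Halving the smallest normal value underflows gradually into the subnormal range. */
    printf("min_normal / 2 = %g (still nonzero, subnormal)\n", min_normal / 2.0);

    /* Far enough below the smallest subnormal, the result rounds to zero. */
    printf("min_normal * 2^-53 = %g\n", min_normal * 0x1p-53);
    return 0;
}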

Exponent encoding

The double-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 1023; this offset is also known as the exponent bias in the IEEE 754 standard. Examples of such representations would be:

e = 00000000001₂ = 001₁₆ = 1: 2^(1 − 1023) = 2^−1022 (smallest exponent for normal numbers)
e = 01111111111₂ = 3ff₁₆ = 1023: 2^(1023 − 1023) = 2^0 (zero offset)
e = 10000000101₂ = 405₁₆ = 1029: 2^(1029 − 1023) = 2^6
e = 11111111110₂ = 7fe₁₆ = 2046: 2^(2046 − 1023) = 2^1023 (highest exponent)

The exponents 000₁₆ and 7ff₁₆ have a special meaning:

  • 00000000000₂ = 000₁₆ is used to represent a signed zero (if F = 0) and subnormal numbers (if F ≠ 0); and
  • 11111111111₂ = 7ff₁₆ is used to represent ∞ (if F = 0) and NaNs (if F ≠ 0),

where F is the fractional part of the significand.

Except for the above exceptions, the entire double-precision number is described by:

$(-1)^{\text{sign}} \times 2^{e-1023} \times 1.\text{fraction}$
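As an illustration, the following C sketch (assuming an IEEE 754 platform and using the common memcpy idiom for bit reinterpretation; the value −341.0 is an arbitrary example) extracts the three fields of a double and prints the biased and unbiased exponent:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double x = -341.0;                 /* arbitrary example value */
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);    /* reinterpret the 64 bits of the double */

    unsigned sign     = (unsigned)(bits >> 63);
    unsigned exponent = (unsigned)((bits >> 52) & 0x7FF);   /* 11-bit biased exponent */
    uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;          /* 52-bit fraction */

    printf("bits     = 0x%016llx\n", (unsigned long long)bits);
    printf("sign     = %u\n", sign);
    printf("exponent = 0x%03x (unbiased: %d)\n", exponent, (int)exponent - 1023);
    printf("fraction = 0x%013llx\n", (unsigned long long)fraction);
    return 0;
}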

Endianness

Although many processors use little-endian storage for all types of data (integer, floating point), there are a number of hardware architectures where floating-point numbers are represented in big-endian form while integers are represented in little-endian form.[2] There are ARM processors that have mixed-endian floating-point representation for double-precision numbers: each of the two 32-bit words is stored as little-endian, but the most significant word is stored first. VAX floating point stores little-endian 16-bit words in big-endian order. Because there have been many floating-point formats with no network standard representation for them, the XDR standard uses big-endian IEEE 754 as its representation. It may therefore appear strange that the widespread IEEE 754 floating-point standard does not specify endianness.[3] Theoretically, this means that even standard IEEE floating-point data written by one machine might not be readable by another. However, on modern standard computers (i.e., implementing IEEE 754), one may safely assume that the endianness is the same for floating-point numbers as for integers, making the conversion straightforward regardless of data type. Small embedded systems using special floating-point formats may be another matter, however.
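The following C sketch (illustrative only; it simply dumps bytes with memcpy) compares the in-memory byte order of a double with that of a 64-bit integer holding the same bit pattern; on most modern machines the two agree:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double d = 1.0;                        /* bit pattern 0x3FF0000000000000 */
    uint64_t i = 0x3FF0000000000000ULL;    /* the same pattern as an integer */
    unsigned char db[8], ib[8];
    memcpy(db, &d, 8);                     /* dump the bytes as stored in memory */
    memcpy(ib, &i, 8);

    printf("double  bytes:");
    for (int k = 0; k < 8; k++) printf(" %02x", db[k]);
    printf("\ninteger bytes:");
    for (int k = 0; k < 8; k++) printf(" %02x", ib[k]);
    printf("\n%s\n", memcmp(db, ib, 8) == 0
           ? "floating-point and integer endianness agree"
           : "floating-point and integer endianness differ");
    return 0;
}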

Double-precision examples

 0000 0000 0000 0000   = +0
 8000 0000 0000 0000   = −0
 3ff0 0000 0000 0000   = 1
 3ff0 0000 0000 0001   ≈ 1.0000000000000002 (smallest double greater than 1)
 4000 0000 0000 0000   = 2
 c000 0000 0000 0000   = −2
 7fef ffff ffff ffff   ≈ 1.7976931348623157 × 10^308 (max double)
 0000 0000 0000 0001   ≈ 4.9406564584124654 × 10^−324 (min subnormal positive double)
 0010 0000 0000 0000   ≈ 2.2250738585072014 × 10^−308 (min normal positive double)
 3fd5 5555 5555 5555   ≈ 1/3
 7ff0 0000 0000 0000   = +∞
 fff0 0000 0000 0000   = −∞
 7ff0 0000 0000 0001   = NaN (sNaN on most processors, such as x86 and ARM)
 7ff8 0000 0000 0000   = NaN (qNaN on most processors, such as x86 and ARM)

Encodings of qNaN and sNaN are not completely specified in IEEE 754 and depend on the processor. Most processors, such as those in the x86 and ARM families, use the most significant bit of the significand field to indicate a quiet NaN; this is what is recommended by IEEE 754. The PA-RISC processors use the bit to indicate a signaling NaN.
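A small C sketch (illustrative only; the exact bit pattern is platform-dependent, as noted above) prints the encoding of the default quiet NaN returned by nan(""):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(void) {
    double qnan = nan("");             /* a quiet NaN */
    uint64_t bits;
    memcpy(&bits, &qnan, sizeof bits);
    /* On x86 and ARM this typically prints 0x7ff8000000000000: the most
       significant fraction bit is set, marking the NaN as quiet. */
    printf("quiet NaN bits: 0x%016llx\n", (unsigned long long)bits);
    return 0;
}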

By default, 1/3 rounds down, instead of up like single precision, because of the odd number of bits in the significand.

In more detail:

Given the hexadecimal representation 3FD5 5555 5555 5555₁₆,
  Sign = 0
  Exponent = 3FD₁₆ = 1021
  Exponent bias = 1023 (see above)
  Fraction = 5 5555 5555 5555₁₆
  Value = 2^(Exponent − Exponent bias) × 1.Fraction – note that Fraction must not be converted to decimal here
        = 2^−2 × (15 5555 5555 5555₁₆ × 2^−52)
        = 2^−54 × 15 5555 5555 5555₁₆
        = 0.333333333333333314829616256247390992939472198486328125
        ≈ 1/3
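The computation can be reproduced with a few lines of C (illustrative only; assumes an IEEE 754 platform and a correctly rounding printf):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void) {
    double third = 1.0 / 3.0;
    uint64_t bits;
    memcpy(&bits, &third, sizeof bits);

    printf("hex encoding : 0x%016llx\n", (unsigned long long)bits);  /* 0x3fd5555555555555 */
    /* Prints the exact 54-digit decimal expansion shown above, padded with zeros. */
    printf("decimal value: %.60f\n", third);
    return 0;
}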

Execution speed with double-precision arithmetic

Using double-precision floating-point variables and mathematical functions (e.g., sin, cos, atan2, log, exp and sqrt) is slower than working with their single-precision counterparts. One area of computing where this is a particular issue is parallel code running on GPUs. For example, when using NVIDIA's CUDA platform, calculations with double precision take, depending on the hardware, approximately 2 to 32 times as long to complete compared to those done using single precision.[4]

Implementations

Doubles are implemented in many programming languages in different ways such as the following. On processors with only dynamic precision, such as x86 without SSE2 (or when SSE2 is not used, for compatibility purposes) and with extended precision used by default, software may have difficulty fulfilling some requirements.

C and C++

C and C++ offer a wide variety of arithmetic types. Double precision is not required by the standards (except by the optional annex F of C99, covering IEEE 754 arithmetic), but on most systems, the double type corresponds to double precision. However, on 32-bit x86 with extended precision by default, some compilers may not conform to the C standard and/or the arithmetic may suffer from double rounding.[5]
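A minimal C sketch (illustrative only) that queries how a given compiler evaluates floating-point expressions, using the FLT_EVAL_METHOD macro introduced by C99:

#include <stdio.h>
#include <float.h>

int main(void) {
#ifdef FLT_EVAL_METHOD
    /* 0: evaluate operations in the precision of their type;
       1: evaluate float operations in double precision;
       2: evaluate everything in long double precision (typical of x87);
      -1: indeterminate. */
    printf("FLT_EVAL_METHOD = %d\n", (int)FLT_EVAL_METHOD);
#endif
    printf("double: %zu bytes, %d significand bits\n", sizeof(double), DBL_MANT_DIG);
    return 0;
}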

Fortran

Fortran provides several integer and real types, and the 64-bit type real64, accessible via Fortran's intrinsic module iso_fortran_env, corresponds to double precision.

Common Lisp

Common Lisp provides the types SHORT-FLOAT, SINGLE-FLOAT, DOUBLE-FLOAT and LONG-FLOAT. Most implementations provide SINGLE-FLOATs and DOUBLE-FLOATs, with the other types being appropriate synonyms. Common Lisp provides exceptions for catching floating-point underflows and overflows, and the inexact floating-point exception, as per IEEE 754. Infinities and NaNs are not described in the ANSI standard; however, several implementations do provide these as extensions.

Java

Java before version 1.2 required every implementation to be IEEE 754 compliant. Version 1.2 allowed implementations to use extra precision in intermediate computations on platforms such as x87. The modifier strictfp was therefore introduced to enforce strict IEEE 754 computations.

JavaScript

As specified by the ECMAScript standard, all arithmetic in JavaScript shall be done using double-precision floating-point arithmetic.[6]

Notes and references

  1. ^ William Kahan (1 October 1997). "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic" (PDF). Archived (PDF) from the original on 8 February 2012.
  2. ^ Savard, John J. G. (2018) [2005], "Floating-Point Formats", quadibloc, archived from the original on 2018-07-03, retrieved 2018-07-16
  3. ^ "pack – convert a list into a binary representation". Archived from the original on 2009-02-18. Retrieved 2009-02-04.
  4. ^ "Nvidia's New Titan V Pushes 110 Teraflops From A Single Chip". Tom's Hardware. 2017-12-08. Retrieved 2018-11-05.
  5. ^ "Bug 323 – optimized code gives strange floating point results". gcc.gnu.org. Archived from the original on 30 April 2018. Retrieved 30 April 2018.
  6. ^ ECMA-262 ECMAScript Language Specification (PDF) (5th ed.). Ecma International. p. 29, §8.5 The Number Type. Archived (PDF) from the original on 2012-03-13.