Talk:Bfloat16 floating-point format
WikiProject Computing | C-class | Low-importance
Error in bit layout illustration
The file Bfloat16_format.svg shows the wrong bit indices (the first field is labeled 0–7 when it should be 0–6). I tried uploading a newer version with that fixed; it shows up, but the SVG itself is still the old version.
Maybe it's because I used the exact same filename, though I still believe this behavior is incorrect. I tried changing the filename and reuploading, but that version is not showing up at all, and now I can't upload newer versions either. Maybe it was flagged for moderation.
I don't know. I give up. The uploading experience is really bad. --nachokb (talk) 09:21, 27 September 2018 (UTC)
- It seems to be correct now. I suppose that your issue was due to caching. Vincent Lefèvre (talk) 09:49, 27 September 2018 (UTC)
History
It would be good to have a section explaining where this format was introduced and a history of its adoption. — Steven G. Johnson (talk) 20:04, 3 June 2021 (UTC)
Add additional information about rounding
Edit request: The user below has requested that an edit be made to Bfloat16 floating-point format. That user has an actual or apparent conflict of interest; please review the request below and make the edit if it is well sourced, neutral, and follows other Wikipedia guidelines and policies.
- Specific text to be added or removed
1. Use Shortened instead of Truncated
To avoid confusion with the rounding mechanism, the first change is to replace "truncated" with "shortened" when describing bfloat16, e.g. "The bfloat16 format, being a shortened IEEE 754 single-precision 32-bit float, …". This way the text cannot be misread as referring to truncation rounding during conversion.
2. Add a Format Conversion Section
Add a format conversion section to detail the rounding mechanism. The proposed text is as follows:
Rounding and Conversion
The most common use case is conversion between binary32 and bfloat16. The following section describes the conversion process and its rounding scheme. Note that other format conversions to or from bfloat16 are possible, for example between int16 and bfloat16.
a. From IEEE 754 binary32 (32-bit floating point) to bfloat16
When bfloat16 was first introduced as a storage format[1], the conversion from IEEE 754 binary32 (32-bit floating point) to bfloat16 was truncation (round-to-zero). Later, once bfloat16 became an input format for matrix multiplication units, the conversion could use various rounding mechanisms. Unfortunately, the rounding mechanisms supported by hardware are platform-dependent, due to the lack of an industry standard: for example, the Google TPU uses round-to-nearest-even[2], Arm uses round-to-nearest-odd[3], and NVIDIA supports four rounding schemes outlined in IEEE 754, including round-to-nearest-even[4].
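A minimal C sketch of the two most common narrowing conversions described above, assuming the usual bit-manipulation approach; the function names are illustrative, not taken from any particular library, and real implementations usually add explicit NaN handling:

    #include <stdint.h>
    #include <string.h>

    /* Truncation (round-to-zero): keep the upper 16 bits of the binary32 pattern. */
    static uint16_t f32_to_bf16_truncate(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);      /* reinterpret the float's bit pattern */
        return (uint16_t)(bits >> 16);
    }

    /* Round-to-nearest-even: add a bias derived from the bit that becomes the new
       LSB, then truncate.  NaN inputs would need an explicit check in production
       code, since the addition can carry a NaN's payload over into an infinity. */
    static uint16_t f32_to_bf16_rne(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        uint32_t lsb = (bits >> 16) & 1u;    /* LSB of the resulting bfloat16 */
        bits += 0x7FFFu + lsb;               /* ties round toward the even LSB */
        return (uint16_t)(bits >> 16);
    }

The biased-add trick is a common software realization of round-to-nearest-even; hardware implementations may differ in details such as NaN and subnormal handling.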
b. From bfloat16 to IEEE 754 binary32 (32-bit floating point)
Since IEEE 754 binary32 can represent every bfloat16 value exactly, the conversion simply pads the mantissa with 16 zero bits.
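A sketch of the widening direction under the same assumptions (illustrative function name):

    #include <stdint.h>
    #include <string.h>

    /* Widening is exact: place the 16 bfloat16 bits in the upper half of a
       binary32 pattern; the 16 low-order mantissa bits are zero. */
    static float bf16_to_f32(uint16_t h)
    {
        uint32_t bits = (uint32_t)h << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }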
[1] Joshua V. Dillon, Ian Langmore, Dustin Tran, Eugene Brevdo, Srinivas Vasudevan, Dave Moore, Brian Patton, Alex Alemi, Matt Hoffman, Rif A. Saurous (2017-11-28). "TensorFlow Distributions" (Report). arXiv:1711.10604. Bibcode:2017arXiv171110604D. Accessed 2018-05-23. Quote: "All operations in TensorFlow Distributions are numerically stable across half, single, and double floating-point precisions (as TensorFlow dtypes: tf.bfloat16 (truncated floating point), tf.float16, tf.float32, tf.float64). Class constructors have a validate_args flag for numerical asserts."
[2] Google. "The bfloat16 numerical format". https://cloud.google.com/tpu/docs/bfloat16
[3] Arm. "BFloat16 processing for Neural Networks on Armv8-A". https://community.arm.com/arm-community-blogs/b/ai-and-ml-blog/posts/bfloat16-processing-for-neural-networks-on-armv8_2d00_a
[4] NVIDIA. CUDA Math API, "1.3.5. Bfloat16 Precision Conversion and Data Movement". https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____BFLOAT16__MISC.html#group__CUDA__MATH____BFLOAT16__MISC_1gf532ce241c3f0b983136d3b130ce0cf3
- Reason for the change: To reflect that industry vendors now support various rounding mechanisms other than truncation.
- References supporting change: NVIDIA, ARM
2A00:79E1:ABC:12B:BDC7:5327:147A:E174 (talk) 22:09, 18 July 2023 (UTC)
Disclosure: the requestor works at Google, and some content of the edit request is related to Google's product. 2A00:79E1:ABC:12B:BDC7:5327:147A:E174 (talk) 22:09, 18 July 2023 (UTC)