@raldone01
Created December 14, 2024 23:06
Floating numbers

IEEE 754 Floating Point Formats

$2^{\text{bit\_precision}} = 10^{\text{decimal\_digits}}$

$\text{bit\_precision} = \log_{2}(10) \cdot \text{decimal\_digits}$

$\text{decimal\_digits} = \log_{10}(2) \cdot \text{bit\_precision}$
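As a quick check of the conversion above, here is a minimal Python sketch. The function names `bits_to_digits` and `digits_to_bits` are made up for this illustration; only the two logarithm factors come from the formulas.

```python
import math


def bits_to_digits(bit_precision: float) -> float:
    """decimal_digits = log10(2) * bit_precision"""
    return math.log10(2) * bit_precision


def digits_to_bits(decimal_digits: float) -> float:
    """bit_precision = log2(10) * decimal_digits"""
    return math.log2(10) * decimal_digits


if __name__ == "__main__":
    # Significand precisions (including the implicit bit) of a few binary formats.
    for name, bits in [("f16", 11), ("f32", 24), ("f64", 53)]:
        print(f"{name}: {bits} bits -> {bits_to_digits(bits):.3f} decimal digits")
```

The two functions are inverses of each other, so `digits_to_bits(bits_to_digits(n))` returns `n` up to floating point rounding.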

| Type | Sign | Exponent | Significand | Total Bits | Exponent Bias | Bits Precision | Number of Decimal Digits | Wiki |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f16 Half (IEEE 754-2008) | 1 | 5 | 10 (11) | 16 | 15 | 11 | ~3.3 | https://en.wikipedia.org/wiki/Half-precision_floating-point_format |
| f32 Single | 1 | 8 | 23 (24) | 32 | 127 | 24 | ~7.2 | https://en.wikipedia.org/wiki/Single-precision_floating-point_format |
| f64 Double | 1 | 11 | 52 (53) | 64 | 1023 | 53 | ~15.9 | https://en.wikipedia.org/wiki/Double-precision_floating-point_format |
| x86 extended precision | 1 | 15 | 64 | 80 | 16383 | 64 | ~19.2 | Don't care legacy anyway. |
| f128 Quad | 1 | 15 | 112 (113) | 128 | 16383 | 113 | ~34.0 | https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format |
| f256 Octuple | 1 | 19 | 236 (237) | 256 | 262143 | 237 | ~71.3 | https://en.wikipedia.org/wiki/Octuple-precision_floating-point_format |
| bfloat16 | 1 | 8 | 7 (8) | 16 | 127 | 8 | ~2.4 | https://en.wikipedia.org/wiki/Bfloat16_floating-point_format |

| Type | Sign | Combination | Significand continuation | Total Bits | Exponent Bias | Bits Precision | Number of Decimal Digits | Wiki |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| decimal64 | 1 | 13 | 50 | 64 | ? | ~53 | 16 | https://en.wikipedia.org/wiki/Decimal64_floating-point_format |
| decimal128 | 1 | 17 | 110 | 128 | ? | ~112 | 34 | https://en.wikipedia.org/wiki/Decimal128_floating-point_format |
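For the binary formats, the exponent bias column follows from the exponent width as $2^{e-1} - 1$. The sketch below checks that against the listed values and splits an f32 into its sign, exponent, and fraction fields. The helper names `exponent_bias` and `split_f32` are hypothetical and chosen only for this example. It does not cover the decimal formats, whose combination field is encoded differently.

```python
import struct


def exponent_bias(exponent_bits: int) -> int:
    """Bias of a binary format with the given exponent width: 2^(e-1) - 1."""
    return 2 ** (exponent_bits - 1) - 1


def split_f32(x: float) -> tuple[int, int, int]:
    """Return (sign, biased exponent, fraction) of x encoded as an f32."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31                # 1 sign bit
    exponent = (raw >> 23) & 0xFF   # 8 exponent bits, still biased
    fraction = raw & 0x7FFFFF       # 23 stored significand bits
    return sign, exponent, fraction


if __name__ == "__main__":
    for name, bits in [("f16", 5), ("f32", 8), ("f64", 11), ("f128", 15), ("f256", 19)]:
        print(f"{name}: exponent bits = {bits}, bias = {exponent_bias(bits)}")

    sign, exponent, fraction = split_f32(-2.5)
    print(sign, exponent - exponent_bias(8), fraction)
```

Subtracting the bias from the stored exponent gives the actual power of two, which is why `1.0` is stored with a biased exponent of 127 in f32.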