This is the talk page for discussing improvements to the IEEE 754 article. This is not a forum for general discussion of the article's subject.
Article policies
|
Find sources: Google ( books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL |
Archives: 1 · Auto-archiving period: 365 days
This article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects.
The lede says "Many hardware floating-point units use the IEEE 754 standard". I suspect this is a significant understatement. Are there any hardware floating point implementations today which use other formats? RoySmith (talk) 18:46, 24 June 2023 (UTC)
I reverted a recent edit regarding exponent range. The complication is that IEEE 754 binary formats have one significand bit before the binary point, whereas some other formats have the binary point on the left. If you don't account for that, the exponent appears one higher than it actually is. That is, if you need to compare to other formats, you will be off by one bit, or about 0.30 decimal digits. But if you don't need to do that comparison, then it is fine. Note that, unless I forgot, the decimal formats have the decimal point on the right of the significand. Gah4 ( talk) 07:33, 9 September 2023 (UTC)
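As a concrete illustration of that off-by-one between conventions: C's (and Python's) frexp normalizes the significand to [0.5, 1), i.e. the binary point on the left, so it reports an exponent one higher than the IEEE 754 [1, 2) convention does for the same value. A small sketch:

```python
import math

# frexp uses the "point on the left" convention: 0.5 <= m < 1
m, e = math.frexp(6.0)
print(m, e)         # 0.75 3  ->  6.0 == 0.75 * 2**3
# IEEE 754 normalizes to 1 <= m < 2, so its exponent is one lower:
print(m * 2, e - 1) # 1.5 2   ->  6.0 == 1.5 * 2**2
```

Comparing exponent ranges across formats without fixing on one convention gives exactly the one-bit discrepancy described above.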
In an exercise I was conducting to learn more about the more novel GPU FP formats, I stumbled upon some inconsistencies in the table describing some properties of IEEE-754 FP formats and submitted an edit. I found that some entries in the column Decimal E max were not consistent with the values stated in the respective pages for the individual formats binary128, binary64, binary32 and binary16, but the binary256 entry was consistent. My edits were reverted by @ Gah4, hence I am initiating this discussion topic (on Gah4's suggestion).
I created a spreadsheet page to redo the calculations as a fun pastime and came to the same result as the binaryXX pages in Wikipedia so I made the edit.
As an example, the binary128 page states the largest regular number that can be represented is 2^16383 × (2 − 2^−112) ≈ 1.1897314953572317650857593266280070162×10^4932. The log10 of this value is ≈ 4932 + log10(1.18973…), which is larger than 4932. The table on the page I edited states a number smaller than 4932.
Since my spreadsheet used the same formulas for each binary format and agreed with the binaryXX pages as well as one of the entries on the IEEE-754 page I felt encouraged that my calculations were right.
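For what it's worth, the check does not even need a spreadsheet; here is a short Python sketch of the same calculation (the (p, emax) parameters are taken from the binaryXX pages, and the function name is mine):

```python
import math

def log10_max(p, emax):
    """log10 of the largest finite value of an IEEE 754 binary format
    with p significand bits and maximum exponent emax:
    MAX = 2**emax * (2 - 2**(1 - p))."""
    return emax * math.log10(2) + math.log10(2 - 2.0 ** (1 - p))

# (p, emax) per the binary interchange formats
for name, p, emax in [("binary16", 11, 15), ("binary32", 24, 127),
                      ("binary64", 53, 1023), ("binary128", 113, 16383),
                      ("binary256", 237, 262143)]:
    print(name, round(log10_max(p, emax), 2))
```

For binary128 this gives ≈ 4932.08, i.e. larger than 4932, consistent with the argument above.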
Images of the current version of the table, my suggested edited table, and a screenshot of my sheet are provided. The spreadsheet itself is provided as a URL link.
I'd love to hear your comments. Nsmeds ( talk) 16:17, 10 September 2023 (UTC)
Here is a playground where I intend to suggest a slightly modified version of the table. As it stands right now there are some repetitions that can be avoided and some info that could be added. Have a bit of patience and I will have a suggestion in a few days. (Editing tables is a pain)
I want to add information about subnormal numbers and compact some information. I will try not to make too many changes here, but instead make small edits in a personal sandbox and then larger updates here when I think a discussion could be useful. I also need to work around the unfortunate wrapping in some places and the fact that some columns are unnecessarily wide.
Here is the current table:
Name | Common name | Base | Significand digits [a] | Decimal digits [b] | Exponent bits | log10 MAX | Exponent bias [1] | E min | E max | Notes |
---|---|---|---|---|---|---|---|---|---|---|
binary16 | Half precision | 2 | 11 | 3.31 | 5 | 4.51 | 2^4 − 1 = 15 | −14 | +15 | Interchange |
binary32 | Single precision | 2 | 24 | 7.22 | 8 | 38.23 | 2^7 − 1 = 127 | −126 | +127 | Basic binary |
binary64 | Double precision | 2 | 53 | 15.95 | 11 | 307.95 | 2^10 − 1 = 1023 | −1022 | +1023 | Basic binary |
binary128 | Quadruple precision | 2 | 113 | 34.02 | 15 | 4931.77 | 2^14 − 1 = 16383 | −16382 | +16383 | Basic binary |
binary256 | Octuple precision | 2 | 237 | 71.34 | 19 | 78913.2 | 2^18 − 1 = 262143 | −262142 | +262143 | Interchange |
decimal32 | | 10 | 7 | 7 | 7.58 | 97 − 2.2·10^−15 | 101 | −95 | +96 | Interchange |
decimal64 | | 10 | 16 | 16 | 9.58 | 385 − 2.2·10^−33 | 398 | −383 | +384 | Basic decimal |
decimal128 | | 10 | 34 | 34 | 13.58 | 6145 − 2.2·10^−69 | 6176 | −6143 | +6144 | Basic decimal |
Note that in the table above, the minimum exponents listed are for normal numbers; the special subnormal number representation allows even smaller numbers to be represented (with some loss of precision). For example, the smallest positive number that can be represented in binary64 is 2^−1074; contributions to the −1074 figure include the E min value −1022 and all but one of the 53 significand bits (2^(−1022 − (53 − 1)) = 2^−1074).
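The binary64 figure is easy to confirm directly, since Python floats are binary64 on all common platforms; a quick sketch:

```python
import math

# Smallest positive binary64 value: 2**(Emin - (p - 1)) = 2**(-1022 - 52)
smallest = math.ldexp(1.0, -1074)   # exact construction of 2**-1074
print(smallest)                     # 5e-324, a subnormal
print(smallest / 2)                 # 0.0 -- anything smaller underflows to zero
# For comparison, the smallest positive *normal* value is 2**-1022:
print(math.ldexp(1.0, -1022))       # 2.2250738585072014e-308
```

Everything between 2^−1074 and 2^−1022 is representable only as a subnormal, with progressively fewer significant bits.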
Decimal digits is the precision of the format expressed in terms of an equivalent number of decimal digits. It is computed as digits × log10(base). E.g., binary128 has approximately the same precision as a 34-digit decimal number.
log10 MAX is a measure of the range of the encoding. Its integer part is the largest exponent shown on the output of a value in scientific notation with one leading digit in the significand before the decimal point (e.g. 1.698·10^38 is near the largest value in binary32, and 9.999999·10^96 is the largest value in decimal32). Nsmeds ( talk) 19:00, 13 September 2023 (UTC)
Suggestion for a revised table:
Column groups: Digits and Decimal digits fall under Significand; Min, Max, and Bias under Exponent; MAXVAL through MINVAL>0 (subnorm) under Properties [c].

Name | Common name | Radix | Digits [d] | Decimal digits [e] | Min | Max | Bias [1] | MAXVAL | log10 MAXVAL | MINVAL>0 (normal) | MINVAL>0 (subnorm) | Notes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
binary16 | Half precision | 2 | 11 | 3.31 | −14 | +15 | 15 | 65504 | 4.8 | 6.10·10^−5 | 5.96·10^−8 | Interchange |
binary32 | Single precision | 2 | 24 | 7.22 | −126 | +127 | 127 | 3.40·10^38 | 38.5 | 1.18·10^−38 | 1.40·10^−45 | Basic binary |
binary64 | Double precision | 2 | 53 | 15.95 | −1022 | +1023 | 1023 | 1.80·10^308 | 308.2 | 2.23·10^−308 | 4.94·10^−324 | Basic binary |
binary128 | Quadruple precision | 2 | 113 | 34.02 | −16382 | +16383 | 16383 | 1.19·10^4932 | 4932.0 | 3.36·10^−4932 | 6.48·10^−4966 | Basic binary |
binary256 | Octuple precision | 2 | 237 | 71.34 | −262142 | +262143 | 262143 | 1.61·10^78913 | 78913.2 | 2.48·10^−78913 | 2.25·10^−78984 | Interchange |
decimal32 | | 10 | 7 | 7 | −95 | +96 | 101 | ≈1.0·10^97 | 97 − 2.2·10^−15 | 1·10^−95 | 1·10^−101 | Interchange |
decimal64 | | 10 | 16 | 16 | −383 | +384 | 398 | ≈1.0·10^385 | 385 − 2.2·10^−33 | 1·10^−383 | 1·10^−398 | Basic decimal |
decimal128 | | 10 | 34 | 34 | −6143 | +6144 | 6176 | ≈1.0·10^6145 | 6145 − 2.2·10^−69 | 1·10^−6143 | 1·10^−6176 | Basic decimal |
Note that in the table above, the minimum exponent listed is for normal binary numbers; the subnormal representation allows values of smaller magnitude to be represented, at some loss of precision. The decimal formats do not define a "subnormal" form as such, but numbers with a leading 0 in the significand and the minimal exponent of the format can be seen as analogous to the subnormals of the binary formats.
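That decimal analogy can be sketched by hand with Python's decimal module (the DIGITS/EMIN constants below just restate the decimal32 parameters from the table; the module's default context is not a strict decimal32):

```python
from decimal import Decimal

DIGITS, EMIN = 7, -95           # decimal32: 7 digits, minimum exponent -95

# Smallest positive normal value: 1.000000 * 10**-95
min_normal = Decimal(10) ** EMIN
# "Subnormal" analog: 0.000001 * 10**-95, i.e. 1 * 10**(-95 - 6)
min_subnormal = Decimal(10) ** (EMIN - (DIGITS - 1))
print(min_normal)       # 1E-95
print(min_subnormal)    # 1E-101
```

This matches the MINVAL>0 columns for decimal32: the leading zeros in the significand trade precision for the extra six powers of ten of range.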
Decimal digits is the precision of the format expressed in terms of an equivalent number of decimal digits. It is computed as digits × log10(base). E.g., binary128 has approximately the same precision as a 34-digit decimal number.
log10 MAXVAL is a measure of the range of the encoding. Its integer part is the largest exponent shown on the output of a value in scientific notation with one leading digit in the significand before the decimal point (e.g. 1.698·10^38 is near the largest value in binary32, and 9.999999·10^96 is the largest value in decimal32). The value in the table is rounded towards zero.
References
There is a recent edit noting that IEEE-754 values are sortable as sign-magnitude. I believe this is true for most sign-magnitude floating point formats, at least for normalized values (it can fail when values are allowed to be unnormalized). (I am not sure about denormals, though.) The PDP-10 floating point format uses two's complement on the whole word for negative values, such that they are comparable using integer compare instructions. Not many processors supply a sign-magnitude compare operation, though. Gah4 ( talk) 20:38, 10 November 2023 (UTC)
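The usual trick for getting a total order out of the sign-magnitude encoding with plain unsigned integer compares is to flip all bits of negative values and only the sign bit of non-negative ones. A Python sketch for binary64 (the function name is mine; NaNs are excluded, and subnormals do sort correctly here):

```python
import struct

def sortable_bits(x: float) -> int:
    """Map a binary64 bit pattern to an unsigned integer whose
    ordering matches the numeric ordering (NaNs excluded)."""
    u = struct.unpack("<Q", struct.pack("<d", x))[0]
    # Negative: flip every bit; non-negative: set the sign bit.
    return u ^ 0xFFFFFFFFFFFFFFFF if u >> 63 else u | 0x8000000000000000

vals = [3.5, -2.0, 5e-324, -5e-324, 0.0, 1.0]
print(sorted(vals, key=sortable_bits) == sorted(vals))  # True
```

Hardware with only integer compares can use the same transform; it also shows why −0.0 sorts just below +0.0 rather than equal to it.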
It lacks many details for professionals, and it lacks simplicity as well. Even after all these years, they can't decide on an audience. It is really important. Boh39083 ( talk) 05:03, 19 November 2023 (UTC)
I reverted a change to the decimal exponent values. I believe the revert is right, because of the way the values are defined, but I'm starting this section in case someone wants to discuss it, as I noted in the edit summary. Gah4 ( talk) 17:20, 19 January 2024 (UTC)
@ Nsmeds: Though I agree that an introduction in IEEE 754#History would be useful, there are several issues with what was added in 1210457800, so I'm going to revert this change (mainly because of the first point below):
BTW, in section History of Floating-point arithmetic, there is a paragraph on the standardization, which could serve as a basis:
Initially, computers used many different representations for floating-point numbers. The lack of standardization at the mainframe level was an ongoing problem by the early 1970s for those writing and maintaining higher-level source code; these manufacturer floating-point standards differed in the word sizes, the representations, and the rounding behavior and general accuracy of operations. Floating-point compatibility across multiple computing systems was in desperate need of standardization by the early 1980s, leading to the creation of the IEEE 754 standard once the 32-bit (or 64-bit) word had become commonplace. This standard was significantly based on a proposal from Intel, which was designing the i8087 numerical coprocessor; Motorola, which was designing the 68000 around the same time, gave significant input as well.
— Vincent Lefèvre ( talk) 01:54, 27 February 2024 (UTC)