This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Spitzak, you've repeatedly asserted that MySQL UTF8mb3 and CESU-8 are exactly the same in the edit comments. I believe you, but I can't follow you, because the source materials seem to say otherwise, and the citations seem insufficient.
In Unicode Technical Report #26, CESU-8 is explicitly defined to support supplemental characters: "In CESU-8, supplementary characters are represented as six-byte sequences". Whereas the MySQL 8.0 Reference Manual explicitly states that supplemental characters are not supported: "Supports BMP characters only (no support for supplementary characters)". And the MySQL 3.23, 4.0, 4.1 Reference Manual (when utf8mb3 first appears, as "utf8") says the same: "The ucs2 and utf8 character sets do not support supplementary characters that lie outside the BMP."
How do you reconcile these conflicting definitions of CESU-8 and utf8mb3? Is one of them wrong, or do they require further interpretation? If so, is that cited somewhere? I checked the citations, but I'm not seeing how they back up what you're saying -- they only seem to note that utf8mb3 doesn't support supplemental characters. If what you're saying is in fact true, I think further explication is needed beyond saying it is so, because the MySQL docs and UTR#26 seem to suggest that utf8mb3 and CESU-8 are definitionally different, at least when perused by a non-expert like myself trying to learn about the subject.
While I think the introductory paragraph is trying to shed some light, "many programs" is vague and uncited. Nor is it cited that MySQL is definitively one of those many programs, or that MySQL "transforms UCS-2 codes to three bytes or fewer" for utf8mb3. Does it? How do we know?
If what you're trying to say is that when UTF-16 supplemental characters are converted to UTF-8 as though they are UCS-2 (and not UTF-16), the result is what came to be called CESU-8, then I think you also need to say that while utf8mb3 is not intended to support supplemental characters at all, it functionally operates as CESU-8 if they are present. And ideally that should be backed up with a citation, or an example sufficient to demonstrate that this article is not the only place where one will find this assertion.
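For illustration, the conversion described above can be sketched in a few lines of Python. This is a minimal sketch of the CESU-8 scheme as UTR #26 defines it (each UTF-16 code unit encoded separately, as though it were a UCS-2 character); the function name `encode_cesu8` is hypothetical, and the sketch says nothing about what MySQL actually does with such input.

```python
def encode_cesu8(text: str) -> bytes:
    """Encode text by treating every UTF-16 code unit as a UCS-2 character.

    Hypothetical helper for illustration only. BMP characters come out
    identical to UTF-8; each supplementary character becomes two
    three-byte sequences (six bytes), one per surrogate.
    """
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        if unit < 0x80:
            # One byte: 0xxxxxxx
            out.append(unit)
        elif unit < 0x800:
            # Two bytes: 110xxxxx 10xxxxxx
            out.append(0xC0 | (unit >> 6))
            out.append(0x80 | (unit & 0x3F))
        else:
            # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx (surrogates land here)
            out.append(0xE0 | (unit >> 12))
            out.append(0x80 | ((unit >> 6) & 0x3F))
            out.append(0x80 | (unit & 0x3F))
    return bytes(out)


supplementary = "\U00010400"  # U+10400, outside the BMP
print(supplementary.encode("utf-8").hex(" "))   # f0 90 90 80
print(encode_cesu8(supplementary).hex(" "))     # ed a0 81 ed b0 80
```

For text containing only BMP characters the two encodings produce the same bytes, which is why the difference only shows up once supplementary characters are involved.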
And, even if you're right that utf8mb3 and CESU-8 (and Oracle UTF8) are technically identical, it's still not correct to say that "MySQL calls [UTF-16 supplemental characters converted to UTF-8 as though they were UCS-2 characters] utf8mb3", because MySQL quite clearly defines utf8mb3 as being BMP-only; so MySQL is not "calling" anything involving supplemental characters utf8mb3.
Having now been trying to understand this for hours, I think this Oracle document explains it pretty well: "The UTF8 character set encodes characters in one, two, or three bytes...If supplementary characters are inserted into a UTF8 database...the supplementary characters are treated as two separate, user-defined characters that occupy 6 bytes." If what you're saying is correct (and I don't know that it is, because I don't have anything authoritative saying so), then it sounds like this could be equally applicable to utf8mb3. The article could make that clear, if properly cited or demonstrated.
TL;DR: It's not accurate to describe utf8mb3 as having any representation of supplemental characters, even if it technically can do so as described by CESU-8, because it is defined otherwise. Further, claiming utf8mb3 is technically identical to CESU-8 warrants citation or demonstration, and the claim would benefit from greater clarity. Ivanxqz ( talk) 00:45, 15 September 2020 (UTC)
I decided to rewrite the CESU-8 section for what I think is greater clarity and accuracy. I included that CESU-8 in utf8mb3 is possible (though unsupported), on the basis of Spitzak's claim that it's the case. I noted that it needs a citation. I think it's not actually true, though, on the basis of Bernt's counter-demonstration at Talk:CESU-8#Comments, which I also just verified myself, and also the original references regarding utf8mb3 in the previous version, but I'll leave it for now. ( Spitzak? Can you show somewhere why your claim that utf8mb3 can support supplemental characters via CESU-8 is accurate?)
I also gave utf8mb3 its own section again, since it is definitionally not CESU-8, even if technically it's the same thing (which, again, I don't think it is). It's like saying that Mountain Standard Time and Pacific Daylight Time are the same thing; they represent the exact same time of day in California and Arizona in the winter, but they're not the same thing, because they have different definitions. Ivanxqz ( talk) 10:53, 16 September 2020 (UTC)
https://softwareengineering.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful
Under "Adoption": "Internally in software usage is even lower, with UCS-2, UTF-16, and UTF-32 in use, particularly in the Windows API, but also by Python". What I don't like about this is that the Windows API only has a Unicode API for one encoding (plus legacy code pages). It used to be UCS-2 (in now-discontinued Windows versions; I believe they all are), but it's now UTF-16. And it doesn't have direct indexing to Unicode characters, so what follows isn't too helpful (it's outdated, from the UCS-2 era): "This is due to a belief that direct indexing of code points is more important than 8-bit compatibility". I think we should concentrate first on the main alternative to UTF-8 in use, UTF-16, then possibly explain programming languages. Since there are many, and that text misrepresents Python (it also stores Latin1 internally), maybe just leave it out? Just as text on other encodings such as GB 18030 was moved to another page, possibly we need not mention all UTF-8 alternatives, or what all programming languages do, e.g. Python, as it's not strictly about adoption, rather non-adoption? comp.arch ( talk) 12:34, 26 March 2021 (UTC)
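The point about Python's internal storage can be checked informally. The following is a rough sketch that assumes CPython 3.3 or later, where strings use a flexible internal representation; the helper name `bytes_per_char` is hypothetical, and the measurement relies on `sys.getsizeof`, so other Python implementations may behave differently.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate how many bytes one character occupies inside a str.

    Hypothetical helper: compares the sizes of two strings made only of
    `ch`, with lengths n and 2n, so the fixed object header cancels out.
    Assumes CPython 3.3+ (flexible string representation).
    """
    short = sys.getsizeof(ch * n)
    long_ = sys.getsizeof(ch * (2 * n))
    return (long_ - short) // n


for label, ch in [
    ("ASCII", "a"),
    ("Latin-1", "\u00e9"),
    ("other BMP", "\u0100"),
    ("supplementary", "\U00010400"),
]:
    print(label, bytes_per_char(ch))
```

On CPython this suggests three storage widths rather than a single fixed encoding: one byte per character for strings limited to the Latin-1 range, two for other BMP text, and four once a supplementary character is present.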