This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
http://www.theregister.co.uk/2013/10/04/verity_stob_unicode/ is a secondary source, published in a well-known technology magazine, that comments on the UTF-8 Everywhere publication, and there is no indication that it is affiliated with the authors of the latter. Please explain how it does not count as a valid source for the above claim. Thanks. 82.80.119.226 ( talk) 17:00, 15 June 2015 (UTC)
The "manifesto" is not a bad piece, but it's one group's opinion and not a very compelling case for "UTF-8 everywhere". Each of the three principal encodings has strengths and weaknesses; one should feel free to use the one that best meets the purpose at hand. Since all three are supported by well-vetted open source libraries, one can safely choose at will. The manifesto's main point is that encoding UTF-16 is tricky; but it's no more tricky than UTF-8, and I've seen home-rolled code making mistakes with either. This really is a tempest in a tea-cup. -- Elphion ( talk) 01:53, 17 June 2015 (UTC)
The manifesto's FAQ #4 is a good example of what the authors get wrong. The question asks whether any encoding is allowable internally, and the authors respond (reasonably) that they have nothing against this. But then they argue that std::string is used for all kinds of different encodings, and we should just all agree on UTF-8. First, this doesn't answer the question, which in context is talking about UTF-8 vs UTF-16: nobody would use std::string (a sequence of bytes) for UTF-16 -- you would use std::wstring instead. Second, std::string has a precise meaning: it is a sequence of bytes, period. It knows nothing of Unicode, or of the UTF-8 encoding, or how to advance a character, or how to recognize illegal encodings, etc., etc. If you are processing a UTF-8 string, you should use a class that is UTF-8 aware -- one specifically designed for UTF-8 that does know about such things. Passing your UTF-8 string to the outside as std::string on the assumption that the interface will treat it as Unicode is just asking for trouble, and asking the world to adopt that assumption (despite the definition of std::string) is naive and will never come to pass. It will only foster religious wars. If you need to pass a UTF-8 string as std::string to the outside (to the OS or another library), test and document the interface between your UTF-8 string and the outside world for clarity. The manifesto's approach only muddies the waters.
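To make the std::string point concrete, here is a minimal C++ sketch (the decode_next helper is hypothetical, written only for illustration; it is not part of the standard library and assumes valid input): the container counts bytes, and anything code-point-aware has to be layered on top of it.

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical helper: decode one UTF-8 code point starting at index 'i',
// advancing 'i' past its continuation bytes. Assumes well-formed input.
static char32_t decode_next(const std::string& s, std::size_t& i) {
    unsigned char b = static_cast<unsigned char>(s[i++]);
    if (b < 0x80) return b;                         // 1-byte (ASCII)
    int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : 1;
    char32_t cp = b & (0x3F >> extra);              // keep the lead byte's payload bits
    while (extra-- > 0)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    return cp;
}

int main() {
    std::string s = "\xC3\xA9";                     // U+00E9 'é' encoded as UTF-8
    std::cout << s.size() << "\n";                  // prints 2: std::string counts bytes
    std::size_t i = 0, chars = 0;
    while (i < s.size()) { decode_next(s, i); ++chars; }
    std::cout << chars << "\n";                     // prints 1: one code point
}
```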
The argument that UTF-8 is the "least weak" of all the encodings is silly. The differences in "weakness" are minuscule and mostly in the eyes of the beholder. As a professional programmer dealing with Unicode, you should know all of these and be prepared to deal with them. The important question instead is which encoding best suits your purposes as a programmer for the task at hand. As long as your end result is a valid Unicode string (in whatever encoding) and you communicate this to whatever interface you're sending it to, nobody should have cause for complaint. The interface may need to alter the string to conform to the expectations on the other side (Windows, e.g.). It is precisely the interfaces where expectations on both sides need to be spelled out. Leaving them to convention is the wrong approach.
I would say that all three major encodings being international standards endorsed by multiple actors is sufficient warrant for using them.
-- Elphion ( talk) 18:24, 18 June 2015 (UTC)
I don't think it belongs in the article. sverdrup ( talk) 12:11, 29 June 2015 (UTC)
Moving the following material to Talk (it was originally in the lead of the article and has been commented out since August 2014):
UTF-8 is also increasingly being used as the default character encoding in operating systems, programming languages, APIs, and software applications.{{cn}}
Possible citations: I'm sure you could find lots more, but these are good samples showing that the low-level API is UTF-8:
https://developer.apple.com/library/mac/qa/qa1173/_index.html
https://developer.gnome.org/pango/stable/pango-Layout-Objects.html#pango-layout-set-text
http://wayland.freedesktop.org/docs/html/protocol-spec-interface-wl_shell_surface.html (look for set_title)
End moved material -- Elphion ( talk) 12:35, 10 September 2015 (UTC)
Graph indicates that UTF-8 (light blue) exceeded other main encodings of text on the Web, that by 2010 it was nearing 50% prevalent, and up to 85% by August 2015.[2]
While the 85% statement may be true, the graph doesn't indicate any such thing. NotYourFathersOldsmobile ( talk) 22:13, 10 September 2015 (UTC)
112.134.187.248 made a whole lot of misinformed edits. With any luck he will look here. Sorry, but it was almost impossible to fix without a complete revert, though he was correct exactly once, when he used "will" instead of "could"; I will try to keep that.
1. An Indic code point cannot take 6 bytes. He is confusing things with CESU-8, I think, though even that will encode these as 3 bytes, as they are still in the BMP.
2. You cannot pass ASCII to a function expecting UTF-16. Declaring the arguments to be byte arrays does not help, and I doubt any sane programmer would do that either. Therefore it is true that UTF-16 requires new APIs.
3. Many attempts to claim there is some real-world chance of Asian script being larger in UTF-8 than UTF-16, despite obvious measurements of real documents that show this does not happen. Sorry you are wrong. Find a REAL document on-line that is larger. Stripping all the markup and newlines does not count.
4. Deletion of the fact that invalid byte sequences can be stored in UTF-8, by somehow pretending that they will magically not happen. Sorry, try actually programming before you believe such utter rubbish. Especially odd because some other edits indicate that he thinks invalid sequences will happen and are somehow a disadvantage of UTF-8.
5. Belief that markup tags using non-ASCII can negate the fact that markup makes UTF-8 smaller. Sorry, but markup contains far more slashes and angle brackets and spaces and quotes and lots and lots of ASCII-only tags, so this will just not happen, no matter how much you wish it could.
6. Claim that invalid sequences of 4 bytes are somehow a problem, while ignoring invalid sequences in UTF-16 and all other invalid sequences in UTF-8. This is despite earlier edits where he basically pretends invalid sequences magically don't happen. Sorry you can't have it both ways.
7. Complete misunderstanding of why multibyte sequences in UTF-16 cause more problems than in UTF-8: because they are RARE. Believe me, NOBODY screws up UTF-8 by "confusing it with ASCII" because they will locate their mistake the very first time a non-ASCII character is used. That is the point of this advantage.
Spitzak ( talk) 17:34, 19 October 2015 (UTC)
UTF-8 does not require the arguments to be nul-terminated. They can be passed with a length in code units. And UTF-16 can be nul-terminated as well (with a 16-bit zero character), or it can also be passed with a length in code units. Please stop confusing nul termination. The reason an 8-bit API can be used for both ASCII and UTF-8 is that sizeof(*pointer) is the same for both arguments. For many languages, including C, you cannot pass an ASCII array and a UTF-16 array as the same argument, because sizeof(*pointer) is 1 for ASCII and 2 for UTF-16. This is why all structures and APIs have to be duplicated.
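A small sketch of the pointer-size point, assuming C/C++ and illustrative function names: the same 8-bit signature serves ASCII and UTF-8 unchanged, while UTF-16 text needs a parallel function with a different code-unit type.

```cpp
#include <cstddef>
#include <cstdio>

// An existing "8-bit" API: it only cares about 8-bit code units, so it
// works for ASCII and for UTF-8 without any change to its signature.
std::size_t byte_length(const char* s) {
    std::size_t n = 0;
    while (s[n] != '\0') ++n;      // 8-bit nul terminator
    return n;
}

// UTF-16 cannot be passed to the function above: its code units are 16-bit,
// so a parallel API with a different pointer type is required.
std::size_t unit_length16(const char16_t* s) {
    std::size_t n = 0;
    while (s[n] != u'\0') ++n;     // 16-bit nul terminator
    return n;
}

int main() {
    std::printf("%zu\n", byte_length("hello"));         // ASCII: 5 bytes
    std::printf("%zu\n", byte_length("h\xC3\xA9llo"));  // UTF-8: 6 bytes
    std::printf("%zu\n", unit_length16(u"h\u00E9llo")); // UTF-16: 5 code units
}
```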
Everybody seems to be completely unable to comprehend how vital the ability to handle invalid sequences is. The problem is that many systems (the most obvious are Unix and Windows filenames) do not prevent invalid sequences. This means that unless your software can handle invalid sequences and pass them to the underlying system, you cannot access some possible files! For example you cannot make a program that will rename files with invalid names to have valid names. Far more important is the ability to defer error detection to much later when it can be handled cleanly. Therefore you need a method to take a fully arbitrary sequence of code units and store it in your internal encoding. I hope it is obvious that the easiest way to do this with invalid UTF-8 is to keep it as an array of code units. What is less obvious is that using standard encodings you can translate invalid UTF-16 to "invalid" UTF-8 by using the obvious 3-byte encoding of any erroneous surrogate half. This means that UTF-8 can hold both invalid Unix and invalid Windows filenames. The opposite is not true unless you diverge in serious ways from how the vast majority of UTF-8/UTF-16 translators work by defining some really odd sequences as valid/invalid.
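A minimal sketch of the conversion described above (a simplified illustration, with an illustrative function name; roughly the idea that the WTF-8 convention later formalized): a 16-bit code unit, including a lone surrogate, is emitted with the ordinary 3-byte pattern, so it round-trips through an 8-bit string.

```cpp
#include <string>

// Encode one 16-bit code unit as UTF-8, using the normal 3-byte pattern even
// for surrogates (0xD800-0xDFFF). This deliberately produces "invalid" UTF-8
// so that the original unit is preserved. Note: this sketch encodes every
// unit independently; a WTF-8-style converter would first combine valid
// surrogate pairs into one 4-byte sequence and use this form only for lone
// surrogates.
void append_utf16_unit(std::string& out, char16_t u) {
    if (u < 0x80) {
        out += static_cast<char>(u);
    } else if (u < 0x800) {
        out += static_cast<char>(0xC0 | (u >> 6));
        out += static_cast<char>(0x80 | (u & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (u >> 12));
        out += static_cast<char>(0x80 | ((u >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (u & 0x3F));
    }
}
// Example: a lone U+D800 becomes the bytes ED A0 80.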
The idea that SGML can be shorter if you just use enough tags that contain Chinese letters is ridiculous. I guess if the *only* markup was <tag/> (i.e. no arguments, and as short as possible, with only 3 ASCII characters per markup), and if every single tag used contained only Chinese letters, and they averaged more than 3 characters per tag, then the markup would be shorter in UTF-16. I hope you can see how totally stupid any such suggestion is.
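For a concrete (if artificial) data point, counting the bytes of one tiny marked-up fragment bears this out; the arithmetic below is illustrative only, not a measurement of real documents.

```cpp
#include <iostream>
#include <string>

int main() {
    // "<tag>漢字</tag>": 11 ASCII characters plus two CJK characters.
    std::string    utf8  = "<tag>\xE6\xBC\xA2\xE5\xAD\x97</tag>"; // UTF-8 bytes
    std::u16string utf16 = u"<tag>\u6F22\u5B57</tag>";            // UTF-16 code units

    // ASCII costs 1 byte vs 2; each CJK character costs 3 bytes vs 2.
    std::cout << utf8.size()      << " bytes as UTF-8\n";   // 11*1 + 2*3 = 17
    std::cout << utf16.size() * 2 << " bytes as UTF-16\n";  // 11*2 + 2*2 = 26
}
```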
I and others mentioned an easy way for you to refute the size argument: point to an actual document on-line that is shorter in UTF-16 versus UTF-8. Should be easy, go do it. Claiming "OR" for others' inability to show any examples is really low. Spitzak ( talk) 06:13, 2 November 2015 (UTC)
175.157.213.232 ( talk) 11:56, 6 November 2015 (UTC) 175.157.213.232 ( talk) 12:00, 6 November 2015 (UTC) 175.157.213.232 ( talk) 12:48, 6 November 2015 (UTC)
References
In this section it is stated:
"Modified UTF-8 uses the 2-byte overlong encoding of U+0000 (the NUL character), 11000000 10000000 (hex C0 80), rather than 00000000 (hex 00). This allows the byte 00 to be used as a string terminator."
It does not explain why a single byte 00000000 cannot be used as a string terminator. Given that all other bytes beginning with 0 are treated as ASCII, why should 00 be any different?
FreeFlow99 ( talk) 11:53, 8 September 2015 (UTC)
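For illustration, a minimal sketch of why the raw byte matters to nul-terminated consumers (assuming a C-style, strlen-based consumer): an interior U+0000 stored as a raw 00 byte ends the string early, while the C0 80 form of Modified UTF-8 leaves the 00 byte free to mean only "end of string".

```cpp
#include <cstdio>
#include <cstring>

int main() {
    // U+0000 inside the data, encoded as a raw 00 byte: strlen() and every
    // other nul-terminated consumer stops there and loses the tail.
    const char raw[]      = "ab\x00" "cd";
    // The same text with the interior NUL in the Modified UTF-8 form C0 80:
    // the only 00 byte left is the terminator, so the tail survives.
    const char modified[] = "ab\xC0\x80" "cd";

    std::printf("%zu\n", std::strlen(raw));       // 2: truncated at the raw NUL
    std::printf("%zu\n", std::strlen(modified));  // 6: whole string preserved
}
```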
Generating tables similar to the ones in UTF-8#Examples — suzukaze ( t・ c) 06:21, 25 February 2016 (UTC)
arguing against 'undid revision 715814058'...'not an improvement':
unicode character encoding form: utf-8, utf-16, utf-32.
unicode simple character encoding scheme: utf-8, utf-16be, utf-16le, utf-32be, utf-32le.
unicode compound character encoding scheme: utf-16, utf-32.
"avoid the complications of endianness and byte order marks in the alternative utf-16 and utf-32 encodings"
character encoding forms don't have endianness and byte order marks, so this is about character encoding schemes.
mentioning one simple character encoding scheme and ignoring all others is misleading; it's trying to compare the three character encoding forms in character encoding scheme context.
this confusion is because of name ambiguity (e.g. the utf-32 cef corresponds to the utf-32 ces, utf-32be ces and utf-32le ces; the utf-8 cef only corresponds to the utf-8 ces (big endian)).
revision 715814058 mentions all simple character encoding schemes in places discussing character encoding scheme information (endianness, byte order marks, backward compatibility).
177.157.17.126 ( talk) 23:18, 29 April 2016 (UTC)
" 'avoid the complications of endianness' by using a separate protocol that is more complicated than endianness."
Yes, that is exactly right. Errors due to mismatch in numerical byte order between software and hardware were so common that a term (endianness) was coined to describe the phenomenon, and over time people learned to pay attention to the problem. One approach is to communicate the information in the transfer protocol. Another is to add structure to the data so that the information becomes self-documenting, and therefore less reliant on the accuracy of the transfer description. The latter is the approach of UTF-8. There is a trade-off: the computational cost of the extra structure versus the added resilience against error. This is a fundamental tension in all digital processes (including things like DNA).
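To make the trade-off concrete, a small illustrative sketch: the same code point has two possible byte layouts in UTF-16, which a reader has to be told about (or guess, e.g. from a BOM), whereas its UTF-8 form is a single, order-independent byte sequence.

```cpp
#include <cstdio>

int main() {
    // U+20AC (the euro sign) serialized three ways.
    const unsigned char utf16be[] = {0x20, 0xAC};        // big-endian byte order
    const unsigned char utf16le[] = {0xAC, 0x20};        // same code point, bytes reversed
    const unsigned char utf8[]    = {0xE2, 0x82, 0xAC};  // one sequence, no BE/LE variants

    // A UTF-16 reader must know which of the first two layouts it is looking
    // at; the UTF-8 bytes are unambiguous regardless of machine byte order.
    for (unsigned char b : utf16be) std::printf("%02X ", b); std::printf("\n");
    for (unsigned char b : utf16le) std::printf("%02X ", b); std::printf("\n");
    for (unsigned char b : utf8)    std::printf("%02X ", b); std::printf("\n");
}
```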
You're free to generalize endianness to describe things it was never meant to address, but statements like "utf-32le is little-endian, utf-8 is big-stripped-order, utf-16le is little-endian-big-stripped-order", and similar suggestions farther up, are (in my opinion) confusing rather than enlightening.
-- Elphion ( talk) 14:27, 3 May 2016 (UTC)
Section WTF-8 says "The source code samples above work this way, for instance." – I can't find these samples, where are they? -- Unhammer ( talk) 06:25, 24 September 2015 (UTC)
The legend states: Red cells must never appear in a valid UTF-8 sequence. The first two (C0 and C1) could only be used for an invalid "overlong encoding" of ASCII characters (i.e., trying to encode a 7-bit ASCII value between 0 and 127 using two bytes instead of one; see below). The remaining red cells indicate start bytes of sequences that could only encode numbers larger than the 0x10FFFF limit of Unicode.
Why are cells F8-FB (for invalid 5-byte sequences) in a separate, darker red colour? They don't seem conceptually distinct from F5-F7 or FC-FF.
74.12.94.137 ( talk) 11:03, 12 May 2016 (UTC)
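For reference, a minimal sketch of the rule the legend describes (as far as I can tell, the darker shade is a presentational choice of the table rather than a difference in the standard): C0/C1 could only start overlong encodings, F5-F7 could only start 4-byte sequences above U+10FFFF, F8-FD were the 5- and 6-byte lead bytes of the pre-2003 definition, and FE/FF are never lead bytes at all.

```cpp
#include <cstdio>

// True if 'b' can never appear as the lead byte of a valid UTF-8 sequence.
static bool invalid_lead_byte(unsigned char b) {
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}

int main() {
    for (int b = 0xC0; b <= 0xFF; ++b)
        if (invalid_lead_byte(static_cast<unsigned char>(b)))
            std::printf("%02X ", b);
    std::printf("\n");   // C0 C1 F5 F6 F7 F8 F9 FA FB FC FD FE FF
}
```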
The article contains the sentence "Many Windows programs (including Windows Notepad) add the bytes 0xEF, 0xBB, 0xBF at the start of any document saved as UTF-8." Yet when I dumped a UTF-8 file which had been edited by someone on a Windows machine I found that it was actually UTF-16. Every other byte was null. There is no purpose in putting a BOM on a UTF-8 file and although many Windows users talk about "UTF-8 with BOM", is it not the case that all the files they are talking about are actually encoded in UTF-16? I don't have a Windows machine past Vista but looking up the Windows API function IsTextUnicode() I found the remarkable comment with the test IS_TEXT_UNICODE_ODD_LENGTH: "The number of characters in the string is odd. A string of odd length cannot (by definition) be Unicode text." So for the Windows API (today), Unicode and UTF-16 are synonymous. Is it therefore correct to change the sentence above to reflect this? Chris55 ( talk) 14:42, 30 June 2016 (UTC)
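For what it's worth, a rough sketch of the kind of sniffing involved (purely illustrative; this is not the algorithm Notepad or IsTextUnicode() actually uses): the UTF-8 BOM is the fixed byte sequence EF BB BF, the UTF-16 BOMs are FF FE / FE FF, and the "every other byte was null" pattern is the usual giveaway of UTF-16 text containing mostly ASCII-range characters.

```cpp
#include <cstddef>
#include <cstdio>

// Illustrative-only sniffing of the cases discussed above.
enum Guess { UTF8_BOM, UTF16LE_BOM, UTF16BE_BOM, LIKELY_UTF16, UNKNOWN };

Guess sniff(const unsigned char* p, std::size_t n) {
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return UTF8_BOM;
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return UTF16LE_BOM;
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return UTF16BE_BOM;
    // "Every other byte was null" suggests UTF-16 holding ASCII-range text.
    std::size_t nulls = 0;
    for (std::size_t i = 0; i < n; ++i) if (p[i] == 0) ++nulls;
    return (n > 0 && nulls * 3 > n) ? LIKELY_UTF16 : UNKNOWN;
}

int main() {
    const unsigned char notepad_utf8[] = {0xEF, 0xBB, 0xBF, 'h', 'i'};
    const unsigned char utf16le_text[] = {0xFF, 0xFE, 'h', 0, 'i', 0};
    std::printf("%d %d\n", sniff(notepad_utf8, sizeof notepad_utf8),
                           sniff(utf16le_text, sizeof utf16le_text)); // 0 1
}
```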
In the article, there was no mention that UTF stands for Unicode Transformation Format [1]. This could go either at the start of the article, in parentheses after the bolded "UTF-8", or here.
-- Quasi Quantum x ( talk) 19:24, 2 August 2016 (UTC)
References
A page about UTF-8 should start by giving the definition of UTF-8, not by exploring the whole history of how it developed. History should come later. The current organization will cause many people to misunderstand UTF-8.
-- Andrew Myers ( talk) 23:05, 21 February 2017 (UTC)
Hello fellow Wikipedians,
I have just modified 2 external links on UTF-8. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.
This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).
Cheers.— InternetArchiveBot ( Report bug) 18:56, 5 April 2017 (UTC)
Why do so many email addresses have "UTF-8" in them?
For example: "=?utf-8?Q?FAIR?= <fair@fair.org>"
Dagme ( talk) 06:09, 21 April 2017 (UTC)
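For context, that string is a MIME "encoded-word" (RFC 2047) used in the display-name part of the header rather than in the address itself; the sketch below (illustrative only) just splits it into its parts, where "utf-8" names the charset and "Q" means quoted-printable.

```cpp
#include <iostream>
#include <sstream>
#include <string>

int main() {
    // RFC 2047 encoded-word layout: =?charset?encoding?encoded-text?=
    std::string word = "=?utf-8?Q?FAIR?=";
    std::string inner = word.substr(2, word.size() - 4);  // strip "=?" and "?="
    std::istringstream parts(inner);
    std::string charset, encoding, text;
    std::getline(parts, charset, '?');
    std::getline(parts, encoding, '?');
    std::getline(parts, text, '?');
    std::cout << charset << " / " << encoding << " / " << text << "\n"; // utf-8 / Q / FAIR
}
```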
I am trying to fix text that seems to be confused about how the backward compatibility with ASCII works. I thought it was useful to mention extended ASCII, but that seems to have produced much more confusion, including a whole introduction of the fact that UTF-8 is very easy to detect and that invalid sequences can be substituted with other encodings, something I have been trying to point out over and over and over, but which really is not a major feature of UTF-8 and does not belong here.
The underlying statement is that a program that looks at text, copying it to an output, and that only specially treats some ASCII characters, will "work" with UTF-8. This is because it will see some odd sequences of bytes with the high bit set, and maybe "think" they are a bunch of letters in ISO-8859-1 or whatever, but since it does not treat any of those letters specially, it will copy them unchanged to the output, thus preserving the UTF-8 encoded characters. The simplest example is printf, which looks for a '%' in the format string and copies every other byte to the output without change (it will also copy the bytes unchanged from a "%s" string, so that string can contain UTF-8). Printf works perfectly with UTF-8 because of this and does not need any awareness of encoding added to it. Another obvious example is slashes in filenames.
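A minimal demonstration of that pass-through behaviour (the string contents are just an example):

```cpp
#include <cstdio>

int main() {
    // printf only treats '%' specially; the UTF-8 bytes for "héllo, 世界"
    // are copied to the output untouched, so the encoding survives even
    // though printf has no idea it is handling UTF-8.
    const char* name = "h\xC3\xA9llo, \xE4\xB8\x96\xE7\x95\x8C";
    std::printf("name = %s\n", name);
}
```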
My initial change was to revert some text that implied that a program that "ignores" bytes with the high bit set would work. The exact meaning of "ignore" is hard to say but I was worried that this implied that the bytes were thrown away, which is completely useless as all non-ASCII text would vanish without a trace.
Code that actually thinks the text is ISO-8859-1 and does something special with those values will screw up with UTF-8. For instance, a function that capitalizes all the text and thinks it knows how to capitalize ISO-8859-1 will completely destroy the UTF-8. However, there are a LOT of functions that don't do this; for instance, a lot of "capitalization" only works on the ASCII letters and thus works with UTF-8 (where "work" means "it did not destroy the data").
I would appreciate it if anybody could figure out a shorter wording that everybody agrees on and that clearly says this. The current text is really bloated and misleading. Spitzak ( talk) 18:59, 6 June 2017 (UTC)
We kind of discussed this in 2008; see Archive 1 and User_talk:JustOnlyJohn... AnonMoos ( talk) 13:29, 7 June 2017 (UTC)
Hello fellow Wikipedians,
I have just modified one external link on UTF-8. Please take a moment to review my edit. If you have any questions, or need the bot to ignore the links, or the page altogether, please visit this simple FaQ for additional information. I made the following changes:
When you have finished reviewing my changes, you may follow the instructions on the template below to fix any issues with the URLs.
This message was posted before February 2018. After February 2018, "External links modified" talk page sections are no longer generated or monitored by InternetArchiveBot. No special action is required regarding these talk page notices, other than regular verification using the archive tool instructions below. Editors have permission to delete these "External links modified" talk page sections if they want to de-clutter talk pages, but see the RfC before doing mass systematic removals. This message is updated dynamically through the template {{source check}} (last update: 5 June 2024).
Cheers.— InternetArchiveBot ( Report bug) 01:57, 23 January 2018 (UTC)
I made an edit yesterday in which I added, to the table at the top of the section "Description", a column with the number of code points for each "number of bytes". But the information I entered there was wrong because it didn't take into account invalid code points, and fortunately the edit was quickly reverted by Spitzak. He said that I shouldn't do this because "it will lead to an endless argument about whether invalid code points count". But I wonder if we could do something a bit different that would avoid that problem and still provide helpful information.
I see that the article starts out with the statement that there are 1,112,064 valid code points, which consist of the 1,114,112 total of my proposed column, less 2,048 invalid ones. The validity rules are fairly complex and I haven't seen them summarized by saying which of those 2,048 invalid code points have two, three, or four bytes. Is there a reference with that information? If so, we could add to the table in the article the three right-hand columns of the following table, or keep this as a separate table if adding the new columns would make the table in the article too wide:
Number of bytes | Total code points | Invalid | Valid
---|---|---|---
1 | 128 | 0 | 128
2 | 1,920 | ??? | ???
3 | 63,488 | ??? | ???
4 | 1,048,576 | ??? | ???
Total: | 1,114,112 | 2,048 | 1,112,064
If the number 2,048 currently in the article is correct, then this information should be uncontroversial. But we need to know how many of those 2,048 invalid code points belong to each number of bytes. Does someone have a reference with that information?
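For what it's worth, the 2,048 excluded code points are exactly the UTF-16 surrogates U+D800-U+DFFF, all of which fall in the 3-byte range, so the arithmetic comes out as sketched below (illustrative code, not taken from a specific reference):

```cpp
#include <cstdio>

int main() {
    // Code points reachable by each UTF-8 sequence length, per RFC 3629.
    // The only code points excluded from the totals are the 2,048 UTF-16
    // surrogates U+D800-U+DFFF, all of which lie in the 3-byte range.
    struct Row { int bytes; long first, last, invalid; };
    const Row rows[] = {
        {1, 0x0000,  0x007F,   0},     //       128 total
        {2, 0x0080,  0x07FF,   0},     //     1,920 total
        {3, 0x0800,  0xFFFF,   2048},  //    63,488 total, 61,440 valid
        {4, 0x10000, 0x10FFFF, 0},     // 1,048,576 total
    };
    long total = 0, valid = 0;
    for (const Row& r : rows) {
        long n = r.last - r.first + 1;
        std::printf("%d-byte: %7ld total, %7ld valid\n", r.bytes, n, n - r.invalid);
        total += n;
        valid += n - r.invalid;
    }
    std::printf("all: %ld total, %ld valid\n", total, valid);  // 1,114,112 / 1,112,064
}
```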
Ennex2 ( talk) 01:44, 7 November 2018 (UTC)
Spitzak ( talk) 02:24, 7 November 2018 (UTC)
Ennex2 ( talk) 02:45, 7 November 2018 (UTC)
The current article is too free with the term "mandatory". Yes, WHATWG calls it mandatory for WHATWG projects. It is not mandatory in a legal sense, or even a standards sense. The references to WHATWG sources need to make that clear. -- Elphion ( talk) 20:03, 15 November 2018 (UTC)
The table recently added to describe UTF-1 is interesting, but it should go in UTF-1, not here. Here it suffices to say what the first paragraph says, together with the link to UTF-1. -- Elphion ( talk) 15:24, 23 July 2016 (UTC)
I think that the part about UTF-1 is essential context, especially given the story about how UTF-8 was invented by one superhuman genius (in a diner, on a placemat...). How can you appreciate what problem was being solved, what the respective contributions were, without a juxtaposition of the various stages? If you don't care about what UTF-1 looked like, then you also don't really care about the history of UTF-8. Without it, you might as well delete the whole section. — RFST ( talk) 14:25, 1 March 2017 (UTC)
In the section "Comparison with single-byte encodings", a bullet point mentions an obvious fact:
"It is possible in UTF-8 (or any other multi-byte encoding) to split or truncate a string in the middle of a character", but this is flagged with [citation needed].
This is certainly true in languages such as C where strings are stored as an array of bytes. So I will add that proviso and remove the citation needed. — Preceding unsigned comment added by 138.207.235.63 ( talk • contribs) 02:45, 4 August 2019 (UTC)
138.207. and @Spitzak: watch the WP:Signatures policy, both of you.
By the way, I removed from the article two instances of older applications that can… (or cannot…) gibberish. Please, write articles based on facts, not advocacy.
Incnis Mrsi ( talk) 07:19, 5 August 2019 (UTC)
Hi Ita140188, I see you've moved important information out of the lead, including the graph.
At a minimum, I find it important to have the graph above the fold, in the lead. Per MOS:LEAD: "It gives the basics in a nutshell and cultivates interest in reading on—though not by teasing the reader or hinting at what follows."
94.5% shows a clear majority while also showing that about 5% use something else (potentially 430 million people), without going into details; so people might ask, "what are those 5% using?" There is space to go into more detail and show that the remainder is very divided. If you do not know that, you could be excused for thinking that 1 in 20 uses some ONE other encoding.
The missing 5% is roughly equal to the populations of Russia, Japan, Egypt and Thailand combined, all with their own good reasons to avoid Unicode, and for all we know those countries (and no others) could be avoiding UTF-8 in favour of old legacy encodings that do not cover what Unicode/UTF-8 can cover (but in fact all of those countries have high UTF-8 use, and no country has UTF-8 use much lower than 90% on the web).
I checked on mobile, and there you have to press to expand other sections. On my wide-screen desktop monitor, there is empty space that could well be filled with the graph.
Most people are not going to scroll past the lead. And I would argue that the MOST important information about an encoding is whether it is used or not, and what the alternatives are.
You may not be aware of the UTF8 Everywhere Manifesto: "In particular, we believe that the very popular UTF-16 encoding (often mistakenly referred to as ‘widechar’ or simply ‘Unicode’ in the Windows world) has no place [except]". comp.arch ( talk) 19:01, 6 May 2020 (UTC)
This article has seen significant work recently to try to elevate the important aspects of the subject and reduce the amount of coverage on trivia. One such change has been reverted on the grounds that "BOM killing usage of UTF-8 is very well documented". Of course the material in question has been unreferenced ever since it was added. I don't dispute that people using e.g. Windows Notepad in the middle of the decade were very annoyed by this, but it truly isn't an important enough aspect of the subject today to warrant its own subheading. All that we need to do is note that historically some software insisted on adding BOMs to UTF-8 files and that this caused interoperability issues, with a good reference. We currently lack the latter entirely, but we should at least restore the reduced version of the content such that we aren't inflating what is basically a historic bug that has no impact on the vast majority of uses of the spec. Chris Cunningham (user:thumperward) ( talk) 17:08, 4 September 2020 (UTC)
The Google-developed programming language Go defines a datatype called rune. A rune is "an int32 containing a Unicode character of 1,2,3, or 4 bytes". It is not clear from the documentation whether a rune contains the Unicode character number (code point) or the usual UTF-8 encoding used in Go. Testing reveals that a rune appears to be the Unicode character number.
I found a good reference to confirm this at https://blog.golang.org/strings, so this information should be added prominently to this article and similar articles that are missing it. It can be quite frustrating to read about runes in Go and not have this information. David Spector ( talk) 00:42, 4 September 2020 (UTC)