This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
I have re-read the XML specification section on encoding [1], and I cannot find anything in it that supports the idea that UTF-8 is preferred over UTF-16. This was a long discussion, with W3C explicitly deciding to be neutral between those two (which I personally think is a mistake). I think it's correct to say that UTF-8 is the most used variant on the Web, but I'm neither aware of a relevant published statistic nor aware of what the situation is in non-Web situations. -- Alvestrand ( talk) 05:56, 19 May 2009 (UTC)
The definition of a byte is a character - NOT a tuple of 8 bits. That is an octet. This article should be rewritten using the correct usage: octet when a group of 8 bits is meant, and byte when a character (e.g. a UTF-8 character) is meant. 93.219.160.251 ( talk) 14:09, 27 June 2009 (UTC)Martin
Octet is more of a French word than it is an English word. Since "byte" is used in almost all contexts by actual English-speaking computer professionals, it should be preferred in this article, unless it creates ambiguity (which I don't see). AnonMoos ( talk) 20:21, 27 June 2009 (UTC)
Since this talk page has passed 100 Kbytes, I'm setting up archiving for it. -- Alvestrand ( talk) 14:23, 27 June 2009 (UTC)
Regarding "UTF-16 requires surrogates; an offset of 0x10000 is subtracted, so the bit pattern is not identical with UTF-8": does this mean that the offset is subtracted in UTF-8, or does it mean that it is subtracted in UTF-16 and not in UTF-8? —The preceding unsigned comment was added by Plugwash ( talk • contribs) 2004-12-04 22:38:47 (UTC).
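For illustration, here is a minimal C sketch (U+10437 is just an assumed example code point) showing where the 0x10000 offset comes in: it is applied when forming the UTF-16 surrogate pair, while UTF-8 packs the code point's bits directly with no offset.
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t cp = 0x10437;               /* an example code point above U+FFFF */

    /* UTF-16: subtract the 0x10000 offset, then split into surrogates */
    uint32_t v  = cp - 0x10000;          /* 20-bit value */
    unsigned hi = 0xD800 | (v >> 10);    /* high (lead) surrogate  */
    unsigned lo = 0xDC00 | (v & 0x3FF);  /* low (trail) surrogate  */

    /* UTF-8: the code point's bits are used as-is, no offset */
    unsigned char u8[4] = {
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F)
    };

    printf("UTF-16: %04X %04X\n", hi, lo);                                /* D801 DC37 */
    printf("UTF-8:  %02X %02X %02X %02X\n", u8[0], u8[1], u8[2], u8[3]);  /* F0 90 90 B7 */
    return 0;
}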
How was the choice of which characters would fit in two bytes made? Simply by Unicode ordering of them? Wouldn't that be obviously seen as suboptimal? It seems strange that not even a single ISCII-based layout could fit there (e.g. defaulting to Devanagari), but there was room for (if Unicode order is the case) practically every conceivable precomposed Latin character, even though they are used sparingly in some languages using Latin script, as well as Coptic, Syriac, Armenian, and Tāna (the latter apparently used by the 300,000 inhabitants of the Maldives). In each case the language is penalised already by a 100% increase in size from 1 byte to 2 bytes, yet that's at least not worse than using UTF-16, or is better due to the 1-byte space and such - but an avoidable 200% only for such widely used ones? I presume it also includes N'Ko? 78.0.213.113 ( talk) 14:33, 3 July 2009 (UTC)
Shouldn't there be a description of how the algorithm for generating and reading UTF-8 stuff works? (explaining how the bytes represent codepoints, the surrogate thing etc) -- TiagoTiago ( talk) 05:13, 23 July 2009 (UTC)
- A badly-written (and not compliant with current versions of the standard) UTF-8 parser could accept a number of different pseudo-UTF-8 representations and convert them to the same Unicode output. This provides a way for information to leak past validation routines designed to process data in its eight-bit representation.
That's kinda like saying the disadvantage of living in houses is that a poorly built house can crumble over you... -- TiagoTiago ( talk) 05:23, 23 July 2009 (UTC)
FWIW, I believe that entire section of the document about replacement as an error handling strategy either is WP:OR or violates WP:NOTGUIDE unless there's a reliable cite for the "popular" or "more useful" solutions that happen to violate the "standard" method, which is to error out.
I should explain what I was trying to do in my wording change. When there's an error I move forward one byte and then scan the data stream until I get to a non "tail" byte. For example, we have 41 96 8B 8E 9B 40. It happens to be a CP1252 string "A–‹Ž›@". The possible ways to decode this are:
When it comes to U+DCxx-style replacement, both of them output U+0041 U+DC96 U+DC8B U+DC8E U+DC9B U+0040, and this method, or interpreting the bytes as CP1252, is likely more useful.
All of these are valid as far as I can see as they all "protect against decoding invalid sequences." I'm more concerned about the OR/NOTGUIDE aspect but on the other hand it is rather useful. -- Marc Kupper| talk 01:37, 6 October 2009 (UTC)
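To make the forward-scan idea above concrete, here is a rough C sketch (not anyone's canonical implementation; decoding of valid multi-byte sequences is omitted for brevity). It maps each byte that cannot be decoded to U+DC00 + byte and simply advances one byte, which reproduces the output quoted above for 41 96 8B 8E 9B 40.
#include <stdio.h>
#include <stdint.h>

static void emit(uint32_t cp) { printf("U+%04X ", cp); }

/* Rough sketch: decodes ASCII bytes and replaces every byte that cannot
 * start or continue a valid sequence with U+DC00 + byte. */
static void decode(const uint8_t *s, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint8_t b = s[i];
        if (b < 0x80)
            emit(b);                 /* valid single-byte (ASCII) */
        else
            emit(0xDC00 + b);        /* error: replace and move forward one byte */
    }
    printf("\n");
}

int main(void) {
    const uint8_t data[] = { 0x41, 0x96, 0x8B, 0x8E, 0x9B, 0x40 };
    decode(data, sizeof data);       /* U+0041 U+DC96 U+DC8B U+DC8E U+DC9B U+0040 */
    return 0;
}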
I'm confused by the regular references above to the issues surrounding scanning backwards. Why would a decoder need to scan backwards? The only times I could see a need to back up are
When I use the term "push-back" it usually means one of two things. 1) The source UTF-8 data is on a device that has the concept of a byte position pointer and allows us to reposition that pointer to a byte boundary. A disk file is a typical example of this. 2) We are receiving UTF-8 data from a device. Normally we only read or receive bytes. However, this device (or its driver) allows us to push at least three bytes back into it so that they are later read back FIFO compared to the bytes we pushed. Windows pipes are like this but I don't think Unix pipes support push-back.
Some devices do not support look-ahead or push-back. Network sockets (Windows or Unix) and TTY/serial ports (Windows or Unix) are examples of this. Sometimes I deal with devices that usually allow look-ahead or push-back but in specific cases won't. I assume they never support these and instead use a three-state storage.
With none of these would I ever need to scan backwards. -- Marc Kupper| talk 02:02, 8 October 2009 (UTC)
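For what it's worth, a byte-at-a-time decoder needs no look-ahead, push-back, or backward scanning at all. A minimal C sketch (illustrative only; the struct and function names are made up, and overlong/out-of-range checks are omitted) that keeps the pending state in a small struct:
#include <stdint.h>
#include <stdio.h>

/* Illustrative incremental decoder state: no look-behind, no push-back. */
struct u8state {
    uint32_t cp;     /* code point being assembled */
    int      need;   /* continuation bytes still expected */
};

/* Feed one byte; returns 1 and sets *out when a code point completes,
 * 0 while more bytes are needed, -1 on an invalid byte. */
static int feed(struct u8state *st, uint8_t b, uint32_t *out) {
    if (st->need == 0) {
        if (b < 0x80)                { *out = b; return 1; }
        else if ((b & 0xE0) == 0xC0) { st->cp = b & 0x1F; st->need = 1; }
        else if ((b & 0xF0) == 0xE0) { st->cp = b & 0x0F; st->need = 2; }
        else if ((b & 0xF8) == 0xF0) { st->cp = b & 0x07; st->need = 3; }
        else return -1;              /* stray continuation byte, or FE/FF */
        return 0;
    }
    if ((b & 0xC0) != 0x80) { st->need = 0; return -1; }
    st->cp = (st->cp << 6) | (b & 0x3F);
    if (--st->need == 0) { *out = st->cp; return 1; }
    return 0;
}

int main(void) {
    const uint8_t in[] = { 0xE2, 0x82, 0xAC, 0x41 };   /* "€A" */
    struct u8state st = { 0, 0 };
    uint32_t cp;
    for (size_t i = 0; i < sizeof in; i++)
        if (feed(&st, in[i], &cp) == 1) printf("U+%04X\n", cp);   /* U+20AC, U+0041 */
    return 0;
}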
Somebody added two "who?" citations. The "who" is "people who keep posting incorrect information to this page" and decidedly NOT an "informative source". I don't know how to correctly state this but we certainly do NOT want a reference to somebody saying this.
The one about CP1252 is an attempt to stop the continuous posting of "UTF-8 sucks I can do everything with 1-byte codes" posters. They do, from their pov, have a complaint, so I think a paragraph is deserved there.
The one about "fixed size" is to stop the repeated posting of "strlen(x) is really slow with UTF-8". These people have not tried programming with UTf-8 and are making very dangerous and incorrect assumptions about how it should work. These people are dangerous in that they are usually smart enough to actually implement what they have in mind, resulting in very slow, lossy, and insecure software. This has been edited repeatedly by me and many others to delete or correct any indication that distances in a string somehow must be measured in "characters" but for awhile it was almost continuous maintenence beceause so many "helpful" people kept trying to "fix" this. It has slowed down some, possibly because Windows programmers are realizing that UTF-16 is variable length yet Windows uses "number of words" to measure the length, but I also think the current wording pointing out that it is the fault of ASCII-era documentation has helped as well. Therefore I would like to keep the wording as it appears to be discouraging incorrect edits. The paragraph itself must remain as a disadvantage, as it is true that "number of characters" is harder to determine with UTF-8, but I want to make sure it is clear that this is a MINOR disadvantage and not the "makes programming impossible" problem some think it is.
On further thought it does seem the whole advantages/disadvantages section is too long and pointless. People are well aware of the advantages, and the only alternative even considered today is UTF-16; all single-byte encodings, alternative multi-byte encodings, and UCS-2 are effectively dead. There are, however, some useful facts, information, and references in there that may be worth preserving:
Spitzak ( talk) 22:21, 8 October 2009 (UTC)
The section on the sizeof operator states that it "yields the size (in bytes) of its operand." It further says that when sizeof is applied to a char, unsigned char, or signed char (or a qualified version thereof), the result is 1. Thus it all hinges on that section. Curiously, the section then goes on with examples that use the alloc() function, which is not documented in this standard. Thus ANSI C89's strlen() returns bytes. -- Marc Kupper| talk 23:33, 9 October 2009 (UTC)
The table after the first paragraph of the section titled "Description" has various fields of bits underlined in the examples. The underlining seems random; although only non-control bits seem to be underlined, I can see no pattern as to which non-control bits are underlined, and there is no explanation.
What does this underlining mean? If it has meaning the meaning should be explained in the accompanying text. If it has no meaning it should be removed. -- Dan Griscom ( talk) 12:03, 12 October 2009 (UTC)
I have to keep reinserting the advantage that errors in UTF-16 can be stored in UTF-8, but the converse is false. This is the ENTIRE reason I am being such a pain in the ass about keeping this page up to date and removing incorrect arguments against UTF-8. I NEED UTF-8 APIs to libraries and systems; UTF-16 APIs are USELESS because of this fact. I'm sorry I can't find a good reference but believe me this is incredibly vital! Spitzak ( talk) 23:51, 15 January 2010 (UTC)
I've proposed that this section be moved to Unicode equivalence, the article dealing with Unicode normalisation. The problems that the section describes are really nothing to do with encoding; any encoding implementation that arbitrarily alters precomposed or decomposed characters will cause the same issue. Similarly, the statement that "Correct handling of UTF-8 requires preserving the raw byte sequence" is incorrect. If my (UTF-8) application uses NFKD, then it will be quite "correct" for it to not preserve incoming UTF-8 byte sequences. Any arguments? -- Perey ( talk) 17:21, 18 February 2010 (UTC)
Spitzak removed [3] a statement. It is not important whether some WP editors think that measurement in Unicode code points is necessary, or do not, or even consider it harmful. It is used. In some places bytes may be preferred ( offsets in memory) or visible characters (in text editors). But the default method in many languages (e.g. Perl or JavaScript) is to count code points.
Perl example:
#!/usr/bin/perl -n
chomp;
print "length(\"$_\") = ".length($_)."\n";
JS/HTML example:
<html>
<head><title>String length calculator</title></head>
<body>
<form>String: <input type="TEXT" name="s" size="160"/></form>
<p><a href="javascript:alert('length(\u0022'+document.forms[0].s.value+'\u0022) = '+document.forms[0].s.value.length);">
compute size</a></p>
</body>
</html>
Result, both in Perl with proper locale and in JS:
length("…") = 1 length("…̄") = 2
Incnis Mrsi ( talk) 08:34, 30 March 2010 (UTC)
Setting my Unicode/UTF-8 fonts to DejaVu, I can't view the example glyph in the 4th column of the table in the description (it looks like Japanese kanji, but my browser generally shows that just fine). I'm guessing the problem is that the Wikipedia style sheet specifies a font by name, because if I change that font to Impact the page still doesn't change. This is what the glyph is supposed to look like: http://www.fileformat.info/info/unicode/char/24b62/index.htm —Preceding unsigned comment added by 173.66.61.35 ( talk) 16:44, 17 May 2010 (UTC)
Hello. The tables currently displayed are difficult to digest. You must calculate the outcome instead of just reading the info. A much better representation would be like this: [start]-[end], [start]-[end] ranges. What happens now is this: there is a range 128-2,047 in the table, which is very confusing because 128 is not a valid Unicode, and 128 is not represented in binary in two bytes. 128 is a single byte of 10000000 (bits). That is, 162 != 0xC2A2. I mean, what happens is that instead of explaining how 162 transforms into 49826, you put 162 on both sides of the equation. That is, you say 162 = 162 instead of 162 = 49826 & 49664 (which is how you can calculate it and discover what value it maps to). Wvxvw ( talk) 18:25, 22 May 2010 (UTC)
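For what it's worth, the transformation the table is compressing can be shown in a few lines of C: U+00A2 (162, the cent sign) is split across the 110xxxxx 10xxxxxx template, giving the bytes C2 A2, which read together as a big-endian 16-bit number are 0xC2A2 = 49826.
#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t cp = 0xA2;                     /* U+00A2, decimal 162 */
    uint8_t b1 = 0xC0 | (cp >> 6);          /* 110xxxxx <- top 5 bits */
    uint8_t b2 = 0x80 | (cp & 0x3F);        /* 10xxxxxx <- low 6 bits */
    printf("%02X %02X = %u\n", b1, b2, (unsigned)((b1 << 8) | b2));
    /* prints: C2 A2 = 49826 */
    return 0;
}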
If you want a complete list of ranges of code points that will display on current browsers, even the list would be massive. There are also odd invalid codes in between valid ones. The small number shown with MS Character Map is only a small fraction of the code points that will display, especially if you add Chinese. There are also a large number of characters such as pF as a single code point, for picofarads, together with dozens of other electrical symbols, Roman numerals to XII in lower and upper case as one code point, the alphabet (English only, luckily) encircled, digits 1-20 encircled, and so forth. I have so far tried everything up to 0xFFFFF on several character sets in common use and on three browsers, Word, Publisher and Notepad, and all agreed. (I wrote a short program to do it.) If that is the sort of list you had in mind, leave me a note and give me a couple of weeks and I'll list them for you. Euc ( talk) 01:40, 6 July 2010 (UTC)
I have serious doubts about this section of bulleted text, currently in the article to detail a disadvantage of UTF-8 that's supposedly rare in practice:
It has this footnote...
...and for completeness, the footnote has this {{Clarify}} note.
My problem with this claim isn't the use of Notepad; it's the claim that "This rarely happens in real documents". The supposed proof of this fact is that a certain HTML page took more space as UTF-16 than as UTF-8. HTML just happens to be a format that uses ASCII for its structuring, as is most XML. But this doesn't mean that "real" documents don't use unmixed non-ASCII characters! Any file format that doesn't use ASCII for its non-content parts (plain text, binary document formats) would absolutely take more space in UTF-8 than UTF-16, if their primary script were in the U+0800–U+FFFF range. Now, if you can prove that "real" plain-text or binary-format documents in Chinese, Japanese or Hindi scripts are "rarely" found... -- Perey ( talk) 13:58, 27 May 2010 (UTC)
Here is a test: Text chosen is the Japanese Wikipedia entry for George Bush [4]. This was chosen because I guessed it would be fairly long and not be computer-related. UTF-8 output was saved, size in UTF-8 was counted with "wc -c" (though just the size of the file would work), while size in UTF-16 was counted with "wc -m" multiplied by 2 (I assumed there were no non-BMP characters). All the files have plain LF as a newline, not CR+LF.
Spitzak ( talk) 19:29, 30 May 2010 (UTC)
Shift-JIS: 21807 bytes; UTF-16: 27104 bytes; UTF-8: 30076 bytes
Big-5: 51712 bytes; UTF-16: 94288 bytes; UTF-8: 56926 bytes
Big-5: 58168 bytes; UTF-16: 80880 bytes; UTF-8: 89372 bytes
I should add another test that shows why all this argument is completely pointless. This is again with the "edit" text for the George Bush page:
UTF-8: 31317
UTF-16: 35028 (35032 when using iconv) (+11.8%)
bzip2 of UTF-8: 10351
bzip2 of UTF-16: 10797 (+4.3%)
The thing is that modern compression produces a far smaller file than either encoding, and since it relies on patterns it removes almost all the difference in the source encoding sizes. Spitzak ( talk) 17:33, 31 May 2010 (UTC)
Note: After much staring I figured out why iconv added 4 bytes over wc -m: it added the BOM at the start, and my source had one more newline at the end. Spitzak ( talk) 17:40, 31 May 2010 (UTC)
UTF-8 is not my specialty so I will leave this for someone else to think about. Code points above 10FFFF (up through 1FFFFF) are not valid codes (or defined by Unicode for other purposes) but require only a four-byte sequence, not 5 as the table suggests. 1FFFFF would be coded as hex digits F7 BF BF BF: 4 bytes, not 5. Euc ( talk) 01:21, 6 July 2010 (UTC)
U+024B62 appears extremely uncommon (google "024B62") and unprintable. Can we get a better four-byte character to use as an example?
The table now looks great, except that it's too wide (mainly due to the small hexadecimal numbers in the red cells along the bottom). Unfortunately, most fonts have non-proportional (fixed-width) numerical digit characters 0-9, even when the rest of the font is proportional (variable width), so I'm not sure how to fix the problem... AnonMoos ( talk) 18:12, 3 December 2010 (UTC)
Anyway, I just now thought of adding <small> tags, and it works for now as a temporary hack (at least the table now fits within a "maximized" browser-window on my computer)... AnonMoos ( talk) 21:56, 3 December 2010 (UTC)
Recent edits from user:AnonMoos (e.g. [5]) have pushed the notion that the UTF-8 encoding scheme was designed to avoid the byte values FE and FF (allegedly to avoid the possibility of BOM or anti-BOM appearing). But that's not listed in the design criteria, and I suspect the absence of FE and FF in Thompson's scheme for 31-bit values is quite fortuitous. Is there a reference to the contrary? -- Elphion ( talk) 14:48, 8 December 2010 (UTC)
I agree that there is no indication that FE/FF were skipped in order to avoid the BOM. At the time UTF-8 was being designed I don't think the BOM value was even defined. Also, all proposed uses of FE/FF would never have them next to each other (as they are both start bytes), so confusion with the BOM would still be impossible.
I suspect the reason FE/FF were not defined was that they did not know what the best scheme for using them would be. I can think of several:
Spitzak ( talk) 18:55, 8 December 2010 (UTC)
Yes, BOM was already specified in 1992, but as AnonM acknowledges, it seems not to have played much of a part in the development of UTF-8 -- neither the article by Thompson and Pike linked above nor Pike's more informal account linked in our article's External links section mentions avoidance of FE and FF as a desideratum in the design. The key point of Thompson's design (and the main rationale for marking continuation bytes with "10") was the self-documenting structure of the codes, so that readers could synchronize easily with the beginning of characters; and that is precisely why it was adopted by the standard.
BOM is a minor point (though something of a headache for Unix). It is useful primarily in helping to identify the encoding, but since it is widely misused, robust programs do not rely on presence or absence of BOM, but use heuristic tests as well, as recommended by various standards. Once the coding is established, subsequent appearances of BOM in UTF-8 sequences (whether through transmission errors or careless manipulation) are simply errors. It's not a big deal, and the self-documenting nature of UTF-8 allows easy recovery. (In fact, one rationale for making FFFE and FFFF non-characters is that applications could safely use them as markers in string processing: there was never any sentiment that they should be avoided at all costs.)
In this article, I think it's fair to mention and even illustrate the extension of Thompson's scheme, pointing out that it maintains the self-documenting feature that was Thompson's main contribution; but that for larger values (as AnonM pointed out) it introduces FE and FF as lead bytes, unlike the standard version of UTF-8. I think it's fair in the discussion of standard UTF-8 to point out the absence of FE and FF and the consequent unlikelihood that a spurious BOM will occur, but let's not belabor that, as it was never a big deal in the design of UTF-8.
-- Elphion ( talk) 14:39, 9 December 2010 (UTC)
(1) It is clear from Pike's account that because of the ASCII backward compatibility they were already focused on the lead bit. So 0 meant a 1-byte code, and 1 a multibyte code -- and the natural place to distinguish lead bytes from continuation bytes is in the next bit. '0' is the obvious terminator for a sequence of '1's, so "10" is the obvious choice for the continuation marker. This rationale yields Thompson's scheme, and Pike's account gives no indication that FE or FF was a consideration at all. Your alternative scheme would have been another way to do it, but confers no real advantage -- and it lacks the transparent simplicity of Thompson's scheme. (2) The extension previously shown in the article is the only natural extension of Thompson's scheme: it preserves Thompson's design scheme and does the obvious thing when the byte-count requires more than one byte. (3) Of course Thompson and the standards committee had a good reason to stop where they did: Thompson stopped at 6 bytes because he was representing only the 31-bit UCS space; the standards committee at 4 bytes because they were representing only the 21-bit UTF-16 space. It had nothing to do with FE and FF. You can stop at whatever point you want to get the size space you want to represent -- that's the beauty of the scheme. -- Elphion ( talk) 04:36, 10 December 2010 (UTC)
Does anyone know who's responsible for the extension of Thompson's scheme to arbitrarily large values? It would be good to have a reference. (Was it perhaps even Thompson himself?) -- Elphion ( talk) 05:03, 10 December 2010 (UTC)
In a variable length coding system, each code must convey two things: (A) how long the code is, and (B) what the coded value is. Most such schemes rely to some degree on implicit information: certain ranges are handled in special ways, certain values have to be computed from special values. UTF-16 is a good example: words that fall in the range D800..DFFF signify two-word codes, whose value is computed from the bit pattern of the code using special constants. The various CJK multibyte schemes have similar properties.
Thompson's scheme for values > 127 is the first variable-length character encoding (that I know of) where both the length (A) and the value (B) are stored explicitly in the code itself: the byte count (A) is coded by a string of 1's, followed by a 0, followed by the numerical value (B) of the code. These are packed into the code bytes with the string of 1's starting in the high bit of the first byte, and the data continuing into the free bits of continuation bytes (whose first two bits are reserved and set to "10" to mark them as continuation bytes). Thus data items (A) and (B) are stored contiguously in the data bits of the code bytes (avoiding the continuation markers), and (B) is padded with 0's at the high end so that its LSBit lands in the LSBit of the last byte of the code.
That's a simple description of Thompson's scheme. Although it was designed for a space that requires a maximum of 6 bytes, it can represent arbitrarily high values. You can stop at 6 bytes if you're interested in representing the UCS space; you can stop at 3 bytes if you're interested in representing BMP; you can stop at 4 bytes if you're interested in the standard Unicode space. But the system itself is completely general. To quote the passage (not mine) that you chose to suppress: it is "sufficiently general to be extended indefinitely to any number of bytes and an unlimited number of bits".
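A sketch in C of the general pattern described above (the length as a run of leading 1 bits, the value right-aligned in the free bits). The encode() helper is purely illustrative, and byte counts beyond 4 are shown only to illustrate the extension, since standard UTF-8 stops at 4 bytes.
#include <stdio.h>
#include <stdint.h>

/* Illustrative encoder for the generalized Thompson scheme: the lead byte
 * carries a run of 1s giving the total byte count, continuation bytes are
 * 10xxxxxx, and the value is right-aligned in the remaining free bits.
 * A len-byte code carries 5*len+1 value bits (11, 16, 21, 26, 31, ...). */
static int encode(uint32_t v, uint8_t out[8]) {
    if (v < 0x80) { out[0] = (uint8_t)v; return 1; }
    int len = 2;
    while (len < 7 && v >= (1u << (len * 5 + 1))) len++;
    for (int i = len - 1; i > 0; i--) {          /* fill continuation bytes */
        out[i] = 0x80 | (v & 0x3F);
        v >>= 6;
    }
    out[0] = (uint8_t)((0xFF << (8 - len)) | v); /* lead byte: len 1s, a 0, then data */
    return len;
}

int main(void) {
    uint8_t b[8];
    uint32_t tests[] = { 0xA2, 0x20AC, 0x10437, 0x7FFFFFFF };
    for (int t = 0; t < 4; t++) {
        int n = encode(tests[t], b);
        printf("U+%X ->", (unsigned)tests[t]);
        for (int i = 0; i < n; i++) printf(" %02X", b[i]);
        printf("\n");   /* e.g. C2 A2, E2 82 AC, F0 90 90 B7, FD BF BF BF BF BF */
    }
    return 0;
}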
Obviously this is not true of standard UTF-8. The standards committee needed only a 4-byte-max subset to represent their restricted character space, whose size was already determined. The extensibility of Thompson's scheme is not "blocked" by the standard; it's simply not needed by the standard. Since they stop at 4 bytes, their encoding does not use FE or FF and contains the length of the code within the lead byte: nice but inessential additional properties. They don't negate the general extensibility of Thompson's scheme to larger spaces.
(And I confess ignorance: I see no similarity between Thompson and Huffman -- they are "codings" in completely different senses of the word.)
-- Elphion ( talk) 01:39, 11 December 2010 (UTC)
By the way, ISO-8859-1 characters 0x80 to 0xBF (C1 controls and upper punctuation) encode to a UTF-8 sequence with a 0xC2 byte (Â) followed by themselves... AnonMoos ( talk) 18:35, 27 December 2010 (UTC)
I removed some detail about what UTF-8 "can" represent. Our article on Code point calls the entire range 0..0x10FFFF of the Unicode space the Unicode "codepoints" even though some of those values (despite being representable as well-formed UTF-8 sequences) are not valid characters. It's not just the surrogates; there are unassigned characters, and there are other permanently reserved characters, like FFFE and FFFF in each plane. Once you start saying what "can't" be represented you need further fussy language about various exclusions. Let's leave it at "UTF-8 can represent all the codepoints, even though not all of those are legal characters." (This point is already made in the footnote to the first sentence of the article's second paragraph.) And in fact, many applications do use UTF-8 to represent non-legal characters. -- Elphion ( talk) 22:40, 30 April 2011 (UTC)
Second sentence says:
Hhh3h ( talk) 19:47, 13 May 2011 (UTC)
If four bytes are used to encode, say, ASCII NUL, what's wrong with that, other than wasting space? Is there some document which says this is not allowed and should be diagnosed? Why? Are these encodings reserved for future extension, and is such an extension really a good idea? (E.g. using a three-byte encoding of NUL to signal something else?) I think this just complicates decoders, because their state machine has to remember what length of code it is decoding: "oops, we got a zero, but we were decoding three bytes!". This is kind of like saying that 00 or 0.00 is no longer a valid way of writing zero. 24.85.131.247 ( talk) 20:34, 28 January 2012 (UTC)
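For context, RFC 3629 requires decoders to reject these "overlong" forms, mainly because accepting them would give characters such as NUL or '/' more than one byte representation and let them slip past byte-oriented filters. A minimal sketch of the kind of check involved (the helper name is made up; it assumes the decoder already knows the sequence length):
#include <stdio.h>
#include <stdint.h>

/* Illustrative check: after a 'len'-byte sequence has been decoded to
 * code point 'cp', reject it if a shorter encoding exists. */
static int is_overlong(uint32_t cp, int len) {
    static const uint32_t min_for_len[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    return len >= 2 && len <= 4 && cp < min_for_len[len];
}

int main(void) {
    printf("%d\n", is_overlong(0x0000, 2));  /* 1: C0 80 is an overlong NUL   */
    printf("%d\n", is_overlong(0x00A2, 2));  /* 0: C2 A2 is the shortest form */
    return 0;
}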
Spitzak ( talk) 00:20, 29 January 2012 (UTC)