This article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects:
This article states that "...There are a few, fairly rarely used codes that UTF-8 requires three bytes whereas UTF-16 requires only two..."; but it seems to me that most CJK characters take 3 bytes in UTF-8 but 2 bytes in UTF-16? 76.126.165.196 ( talk) 08:32, 25 February 2008 (UTC)
I think you are right. And western people generally don't care about that... -- 96.44.173.118 ( talk) 16:20, 13 December 2011 (UTC)
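The commenter's point checks out directly in Python (a quick illustration of the byte counts, not part of the discussion above):

```python
# Byte counts for a typical CJK character: UTF-8 needs three bytes for
# any BMP code point at or above U+0800, while UTF-16 covers the whole
# BMP with a single 16-bit code unit.
cjk = "\u6f22"  # 漢, U+6F22

assert len(cjk.encode("utf-8")) == 3      # three bytes: E6 BC A2
assert len(cjk.encode("utf-16-le")) == 2  # one 16-bit unit

# ASCII, by contrast, is where UTF-8 wins: one byte vs. two.
assert len("A".encode("utf-8")) == 1
assert len("A".encode("utf-16-le")) == 2
```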
This article appears to be a sub-page of Unicode, which is ok; but it should have an encyclopedic name that reflects its importance (that of an article on Unicode encodings, rather than some evaluative comparison). — donhalcon ╤ 16:26, 7 March 2006 (UTC)
Hex 110000, the grand total of 17 planes, obviously takes 21 bits, which fit comfortably into 3 bytes (24 bits). So why would anyone want to encode 21 bits in 32 bits? The fourth byte is entirely redundant. What, then, is the rationale behind having UTF-32 instead of "UTF-24"? Just a superstitious fear of odd numbers of bytes? dab (ᛏ) 12:47, 6 July 2006 (UTC)
See this page [1] which describes the encoding. Olivier Mengué | ⇄ 23:19, 22 May 2007 (UTC)
So what is the most popular encoding??? —Preceding unsigned comment added by 212.154.193.78 ( talk) 07:52, 15 February 2008 (UTC)
This seems to be a bit out of date. I just searched the reference library and cannot come up with anything in the current version of Mac OS regarding UTF-16. Since the cited material is two revisions old (10.3 vs. the current 10.5) AND since Mac OS understands UTF-8, the fact that it used UTF-16 in a previous version for INTERNAL system files is irrelevant. I suggest this be removed. Lloydsargent ( talk) 14:08, 24 April 2008 (UTC)
"A UTF-8 file that contains only ASCII characters is identical to an ASCII file"—Only with the strictest (too strict) reading is this true. A UTF-8 file could have a BOM, which then would not "[contain] only ASCII characters." Can someone re-word this without making it require such a strict reading yet still be simple? —Preceding unsigned comment added by 72.86.168.59 ( talk) 18:07, 3 September 2009 (UTC)
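The distinction the commenter is making can be shown in a few lines of Python (my illustration; `utf-8-sig` is Python's name for UTF-8 with a BOM):

```python
text = "plain ASCII text"

# Without a BOM, the UTF-8 bytes of ASCII-only text ARE the ASCII bytes,
# so the two files would be byte-identical.
assert text.encode("utf-8") == text.encode("ascii")

# With a BOM, three extra bytes (EF BB BF) are prepended, and the file
# is no longer identical to the ASCII file.
with_bom = text.encode("utf-8-sig")
assert with_bom == b"\xef\xbb\xbf" + text.encode("ascii")
```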
Spitzak, your summary in [2] is wrong. UTF-8 has 8-bit bytes (i.e. coded message alphabet has ≤ 256 symbols, more exactly, 243), but its code points are just Unicode ones. Surrogates are code points, they are not characters indeed. Please, recall the terminology. The word "character" is inappropriate. Incnis Mrsi ( talk) 19:47, 23 March 2010 (UTC)
I propose to delete this section. "UTF-5" and "UTF-6" are unimplemented vaporware; they were early entries in an IDNA competition which Punycode ultimately won. Doug Ewell 20:14, 20 September 2010 (UTC) —Preceding unsigned comment added by DougEwell ( talk • contribs)
Found this in the UTF-16 RFC: "Characters with values greater than 0x10FFFF cannot be encoded in UTF-16." ( http://www.ietf.org/rfc/rfc2781.txt) I'm wondering if this is a specific limit of UTF-16? Can characters above 0x10FFFF be encoded in UTF-8, for example? (Are there any such characters?) 86.186.87.85 ( talk) 10:08, 30 April 2012 (UTC)
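As an aside, the 0x10FFFF ceiling falls directly out of the surrogate-pair arithmetic: each half of a pair carries 10 bits, offset by 0x10000. A quick sketch (mine, for illustration):

```python
# Decode a UTF-16 surrogate pair: 10 payload bits in each half,
# plus the 0x10000 offset for code points beyond the BMP.
def decode_pair(high: int, low: int) -> int:
    assert 0xD800 <= high <= 0xDBFF, "high surrogate out of range"
    assert 0xDC00 <= low <= 0xDFFF, "low surrogate out of range"
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# The largest possible pair yields exactly U+10FFFF, so standard UTF-16
# simply has no way to represent anything above that.
assert decode_pair(0xDBFF, 0xDFFF) == 0x10FFFF
assert decode_pair(0xD800, 0xDC00) == 0x10000
```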
UTF-16 could theoretically encode code points up to U+FFFFFFFFFF. I'm not sure the UTF-1 examples are correct, but I replaced the 5-byte sequences with 4-byte ones. Modified UTF-16:
0000000000-000000FFFF = xxxx
0000010000-0000FFFFFF = D8xxxxxx
0001000000-0001FFFFFF = D9xxxxxx
0002000000-0002FFFFFF = DAxxxxxx
0003000000-0003FFFFFF = DBxxxxxx
0004000000-0004FFFFFF = DCxxxxxx
0005000000-0005FFFFFF = DDxxxxxx
0006000000-0006FFFFFF = DExxxxxx
0007000000-FFFFFFFFFF = DFxxxxxxxxxx
The UTF-24:
00000000-00FFFFFF = xxxxxx
01000000-FFFFFFFF = 00D8xxxxxxxx
Comparison:
Range               UTF-8  UTF-16  UTF-32  UTF-EBCDIC  UTF-1  UTF-24
00000000-0000007F     1      2       4         1         1      3
00000080-0000009F     2      2       4         1         1      3
000000A0-000003FF     2      2       4         2         2      3
00000400-000007FF     2      2       4         3         2      3
00000800-00003FFF     3      2       4         3         2      3
00004000-00004015     3      2       4         4         2      3
00004016-0000FFFF     3      2       4         4         3      3
00010000-00038E2D     4      4       4         4         3      3
00038E2E-0003FFFF     4      4       4         4         4      3
00040000-001FFFFF     4      4       4         5         4      3
00200000-003FFFFF     5      4       4         5         4      3
00400000-006C3725     5      4       4         6         4      3
006C3726-00FFFFFF     5      4       4         6         5      3
01000000-03FFFFFF     5      4       4         6         5      6
04000000-06FFFFFF     6      4       4         7         5      6
07000000-3FFFFFFF     6      6       4         7         5      6
40000000-4E199F35     6      6       4         8         5      6
4E199F36-FFFFFFFF     6      6       4         8         6      6
164.127.155.17 ( talk) 20:40, 6 March 2015 (UTC)
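For the curious, the "UTF-24" scheme sketched above (three big-endian bytes per code point, for everything up to 0xFFFFFF) can be expressed in a few lines of Python. To be clear, this is the commenter's hypothetical encoding, not a real or standardized one, and this sketch covers only the 3-byte range, not the 6-byte extension:

```python
# Hypothetical "UTF-24": every code point up to 0xFFFFFF is stored as
# exactly three big-endian bytes.  NOT a real encoding; illustration only.
def utf24_encode(code_points):
    out = bytearray()
    for cp in code_points:
        if not 0 <= cp <= 0xFFFFFF:
            raise ValueError("code point out of 3-byte range")
        out += cp.to_bytes(3, "big")
    return bytes(out)

def utf24_decode(data):
    if len(data) % 3:
        raise ValueError("length must be a multiple of 3")
    return [int.from_bytes(data[i:i + 3], "big")
            for i in range(0, len(data), 3)]

encoded = utf24_encode([0x41, 0x6F22, 0x10FFFF])
assert encoded.hex() == "000041006f2210ffff"
assert utf24_decode(encoded) == [0x41, 0x6F22, 0x10FFFF]
```

Note that every code point costs three bytes here, which is exactly why the comparison table above shows UTF-24 losing to UTF-8 below U+0800 and to UTF-16 below U+10000.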
/) or can be part of a longer sequence. This has a lot of drawbacks, hence UTF-8 and UTF-16 are designed in a way that such "double usage" of codewords never happens. -- RokerHRO ( talk) 08:10, 11 March 2016 (UTC)
I have just made this for "what if" and "for fun" interest... 2A01:119F:251:9000:6048:4CFB:87B3:44FA ( talk) 17:51, 22 March 2016 (UTC)
"Default" implies that "this is what you get if you don't declare the encoding." But the cited section doesn't say that. It seems to say that you should assume UTF-8 if there's no BOM. (I'm not clear what it should assume if there is a BOM.) Even if "UTF-8 is the default" is true in some technical sense, the paragraph would be clearer like this:
All XML processors must support both UTF-8 and UTF-16. If there is no encoding declaration and no byte order mark, the processor can safely assume the file is UTF-8. Thus a plain ASCII file is seen as UTF-8.
Isaac Rabinovitch ( talk) 19:27, 16 April 2022 (UTC)
A common misconception is that there is a need to "find the nth character" and that this requires a fixed-length encoding; however, in real use the number n is only derived from examining the n−1 characters, thus sequential access is needed anyway.
Could this be clarified, please? Surely with a fixed-length encoding the nth character can be determined to be at offset (n-1)(fixed encoding length) without having to examine the preceding characters? You can do "find the nth character" in a variable-length encoding (so a fixed-length encoding isn't necessary), but it's when you use a variable-length encoding that you do need to scan preceding characters. Or are we not talking about random access (which is what "find the nth character" would suggest is wanted) here?
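A small sketch (mine, not from the article) of the distinction being drawn here: with a fixed-length encoding like UTF-32 the nth character sits at a computable byte offset, while in UTF-8 finding the nth character means walking past the preceding characters' variable-length sequences:

```python
s = "a\u00e9\u6f22\U0001F600b"  # 1-, 2-, 3- and 4-byte UTF-8 characters

utf32 = s.encode("utf-32-le")  # explicit endianness, so no BOM
utf8 = s.encode("utf-8")

# Fixed-length: the nth character is at byte offset 4*n, no scan needed.
def nth_utf32(data: bytes, n: int) -> str:
    return data[4 * n : 4 * n + 4].decode("utf-32-le")

# Variable-length: walk the bytes, treating everything that is not a
# continuation byte (0b10xxxxxx) as the start of a character.
def nth_utf8(data: bytes, n: int) -> str:
    starts = [i for i, b in enumerate(data) if b & 0xC0 != 0x80]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[starts[n]:end].decode("utf-8")

assert nth_utf32(utf32, 2) == "\u6f22"
assert nth_utf8(utf8, 3) == "\U0001F600"
```

So random access by character index is indeed O(1) in a fixed-length form and O(n) in a variable-length one; the article's claim is only that, in practice, n usually comes from a scan that has already visited those bytes.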