This is the talk page for discussing improvements to the Unicode article. This is not a forum for general discussion of the article's subject.
Text and/or other creative content from this version of Unicode was copied or moved into incubator:Wp/nod/ᩀᩪᨶᩥᨣᩰ᩠ᨯ with this edit. The former page's history now serves to provide attribution for that content in the latter page, and it must not be deleted as long as the latter page exists.
I am adding new blocks & data to Wikidata now. Assuming no DAB needed here, the pages are:
DePiep ( talk) 16:10, 13 September 2022 (UTC)
A proposal is opened at WP:COMP § Taskforce WP Unicode –_proposal. Please take a look. DePiep ( talk) 09:35, 2 October 2022 (UTC)
The lead claims that there are currently 149 186 characters in the Standard. That's confusing! Is that actual characters, or does it include unprintable code points? I know what a code point is; my point is that the lead shouldn't confuse code points with characters. (I also argue that a "control character" isn't 'really' a character, not a grapheme, but that's a fight for somewhere else.) Writing about Unicode without an early, clear explanation of what a code point is, is, I think, awful pedagogy. In fact, I don't think code point - a fundamental aspect of Unicode - is even defined in the article!!!! Wow, just wow.
I also would like someone to verify that Unicode has characters for color. I believe that's wrong/false/misleading. I am aware that certain emoji can be modified by a code point to change some of its color. As far as I know, this is only true with a very small set of code points, and a very very small set of colors (I don't actually know if the colors are well-defined, I'd expect so, but...). These aren't colors, but are color modifiers for those other code points. 174.130.71.156 ( talk) 16:00, 13 December 2022 (UTC)
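For anyone who wants to check the modifier claim above, here is a minimal Python sketch. It assumes the code points in question are the five emoji modifiers at U+1F3FB through U+1F3FF; the helper name `describe` and the choice of base emoji are only for illustration. It lists the names the Unicode Character Database gives them and shows that a modified emoji is a sequence of two code points, not a single "color character".

```python
import unicodedata

# The five emoji modifier code points, U+1F3FB through U+1F3FF
# (assumed here to be the "color modifiers" the comment refers to).
MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single code point (illustrative helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


for m in MODIFIERS:
    print(describe(m))

# A modifier follows a base emoji as a separate code point.
base = "\U0001F44B"  # WAVING HAND SIGN
modified = base + MODIFIERS[0]
print(len(modified))  # number of code points in the sequence
```

The names printed by the database describe these as modifiers on a skin-tone scale, which supports reading them as modifiers of other code points rather than as colors in their own right.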
The offending sentence is: "The Unicode standard defines three and several other encodings exist, all in practice variable-length encodings." Sure, you could strain to interpret that to mean "all but UTF-32", but let's keep it clear. It clearly implies all encodings are variable length. Wikipedia's own article on UTF-32 says it is fixed length. (Because it only needs to use 21 of the 32 bits for Unicode code points, it is very inefficient and, afaik, rarely used.) But rarely used is not the same as "doesn't exist", and "all are variable" clearly implies it doesn't exist. I'd have to look again: are there really 3 variable Unicode encodings? I can only think of UTF-8 and UTF-16 (and some others that afaik are not "defined" in the Unicode standard, like GB18030, or that are obsolete, like UTF-7). Replace "all" with "all common encodings" or something similar, and mention UTF-32. 174.130.71.156 ( talk) 11:43, 15 December 2022 (UTC)
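The fixed-versus-variable distinction raised above can be checked with a short Python sketch. The function name `encoded_lengths` and the sample characters are my own choices; the little-endian codec names are used only so that no byte order mark is added to the counts.

```python
def encoded_lengths(ch: str) -> dict[str, int]:
    """Return the number of bytes one code point occupies in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" codecs are used so that no byte order mark is prepended.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch in ("A", "\u00E9", "\u20AC", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

For these samples UTF-8 uses 1 to 4 bytes and UTF-16 uses 2 or 4 bytes, while UTF-32 uses 4 bytes every time, so only the first two are variable length per code point.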
I object to the reversal by Peter M. Brown, citing WP:ITALICTITLE inappropriately. I'd say that the name, a noun, should not be in italics.
ITALICTITLE refers to the name of a work, i.e. the work itself (play, periodical, book). However, the Unicode standard is a standard, not a book etc., not even its publication. The Standard is an abstraction: the set of rules. It is a proper noun, full stop. The key is that the article title names the subject: the standard, not the book. DePiep ( talk) 17:04, 21 April 2023 (UTC)
I don't know if it would be manageable, but Unicode clearly does not have all commonly used symbols. A simple example is the very commonly used 'slash marks' used to count. Most reading this will be familiar with the sequence /, //, ///, ////, and //// with the crossmark (strike-through) diagonal (top left to bottom right) rather than horizontal. (This is typical in the USA, I understand European convention is slightly different). I request the editors to consider the addition of a list of missing (but documented) symbols.
40.142.183.146 ( talk) 11:49, 9 June 2023 (UTC)
Unicode 16 is set to release in September 2024. I think the following (con)scripts definitely need to be encoded:
94.180.80.9 ( talk) 07:31, 9 July 2023 (UTC)
@ Spitzak: In the text "for example, ḗ (precomposed e with macron and acute above) and ḗ (e followed by the combining macron above and combining acute above) should be rendered identically," the "e" is followed by two distinct combining characters, but they are rendered at a single location. I inserted a space to cause them to display as two separate characters, and Spitzak reverted the change with the comment "They are supposed to be combined". In context, I don't understand how it makes sense to combine them, since the text refers to them individually. -- Shmuel (Seymour J.) Metz Username:Chatul ( talk) 21:40, 16 October 2023 (UTC)
for example, ḗ (precomposed e with macron and acute above) and e followed by the combining macron above and combining acute above should be rendered identically,. Alternatively,
for example, ḗ (precomposed e with macron and acute above) and eōó (e followed by the combining macron above and combining acute above) should be rendered identically,. -- Shmuel (Seymour J.) Metz Username:Chatul ( talk) 19:25, 18 October 2023 (UTC)
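As background to the thread above, here is a small Python sketch of the equivalence the article sentence describes. It assumes the precomposed character is U+1E17 and that the combining marks are U+0304 and U+0301; the variable names are illustrative. The two spellings are different code point sequences, but normalization maps each onto the other.

```python
import unicodedata

precomposed = "\u1E17"        # LATIN SMALL LETTER E WITH MACRON AND ACUTE
decomposed = "e\u0304\u0301"  # e + COMBINING MACRON + COMBINING ACUTE ACCENT

# The raw strings differ: one code point versus three.
print(len(precomposed), len(decomposed))
print(precomposed == decomposed)

# NFC composes the sequence; NFD decomposes the single character.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)
```

This only demonstrates canonical equivalence at the code point level. Whether the two forms actually look the same on screen depends on the font and renderer, which is the part the discussion is about.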
Welcome, I want the Kurdistan flag on my keyboard 85.94.240.91 ( talk) 23:28, 2 November 2023 (UTC)
@ Spitzak, I'm also really not sure what you're talking about exactly: Microsoft seems to have the definition of "Unicode" in line with that of the rest of the world. [1] If they use "Unicode" as a shorthand for "UTF-16" sometimes (the way many people use it as a shorthand for "UTF-8"), then the page I just linked seems to do any theoretical disambiguation work, and doesn't really leave us wondering whether they're somehow creating an ambiguity problem for us to solve. Remsense 诉 02:28, 8 March 2024 (UTC)
isTextUnicode which returns false for UTF-8. There are a number of other examples where "Unicode" means the 16-bit interface. Spitzak ( talk) 06:19, 8 March 2024 (UTC)
In Microsoft Windows, the Unicode support is limited to UTF-16. -- Shmuel (Seymour J.) Metz Username:Chatul ( talk) 15:47, 8 March 2024 (UTC)
In the Codespace and code points section, it refers to "the interval ". I had to read it several times to figure out what was meant. I originally parsed "0,17" as a European-format decimal number, which made no sense. Eventually I figured out what was meant, but it wasn't at all obvious. There is nothing in the referenced Unicode 15 standard which uses that terminology, either. The use of a mismatched bracket and paren is a math construct which makes sense for real intervals, but is less commonly used in integer contexts. It will simply appear wrong to readers without a real analysis background.
May I suggest this might be more understandable replaced with "the range 0 : 1114111"? The origin of the latter number is available later in the sentence (with the hexadecimal number 0x10FFFF). Alternatively, a less obscure notation might be . Tarl N. ( discuss) 22:55, 11 May 2024 (UTC)
in the range from 0 to 1114111,... Tarl N. ( discuss) 08:01, 12 May 2024 (UTC)
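The arithmetic behind the two ways of writing the range can be spelled out in a few lines of Python. This is only a sketch: the constant names and the helper `in_codespace` are mine, and it assumes the usual description of the codespace as 17 planes of 65,536 code points each.

```python
import sys

PLANES = 17
PLANE_SIZE = 0x10000                 # 65,536 code points per plane
CODESPACE_SIZE = PLANES * PLANE_SIZE
MAX_CODE_POINT = CODESPACE_SIZE - 1  # last valid code point


def in_codespace(n: int) -> bool:
    """True if n lies in the inclusive range 0 to MAX_CODE_POINT."""
    return 0 <= n <= MAX_CODE_POINT


print(MAX_CODE_POINT, hex(MAX_CODE_POINT))
print(sys.maxunicode == MAX_CODE_POINT)
print(in_codespace(0x10FFFF), in_codespace(0x110000))
```

A half-open interval ending at 17 times 65,536 and the inclusive wording "from 0 to 1114111" describe the same set of integers, so the choice between them is purely about readability.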