This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Archive 1 | Archive 2 | Archive 3 | Archive 4 | Archive 5 | Archive 6 | Archive 7 |
Are there any characters assigned to the Unicode code points U+FDD0 - U+FDEF? -- 84.61.23.172 09:47, 23 June 2006 (UTC)
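(For later readers: U+FDD0–U+FDEF are permanently reserved "noncharacters", so nothing will ever be assigned there. A small Java sketch of the check — the class and helper names here are mine, not from any API; `isNoncharacter` is not a JDK method:)

```java
public class Noncharacters {
    // A code point is a Unicode "noncharacter" if it lies in U+FDD0..U+FDEF,
    // or if its low 16 bits are FFFE or FFFF (one such pair in every plane).
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    public static void main(String[] args) {
        System.out.println("U+FDD0 noncharacter: " + isNoncharacter(0xFDD0));
        // Noncharacters have no assigned character, so isDefined is false:
        System.out.println("U+FDD0 defined: " + Character.isDefined(0xFDD0));
        System.out.println("U+0041 noncharacter: " + isNoncharacter(0x0041));
    }
}
```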
To have one of the Unicode fonts available on your computer display the special Unicode characters found on web pages: if the special characters appear inside a table, chart, or box, specify class="Unicode" in the table's TR tag (or in each TD tag, but once per TR is easier than once per TD); in wiki table code, add it after the TR equivalent "|-" (like |- class="Unicode"). For individual characters, the template code {{Unicode|char}} can also be used; an HTML decimal or hexadecimal reference may be used in place of char. If a paragraph with many special Unicode characters needs to be displayed, then <p class="Unicode"> ... </p> can be used. Thanks. ~ Tarikash 22:42, 14 July 2006 (UTC).
Could this article use a bibliography or a list of books for further reading? Seems to me there are some good texts in print regarding Unicode.
wtf does blocks mean in this case? what is the problem with word processors and how will more "blocks" help? i've commented this out until it is explained. Plugwash 18:18, 13 August 2006 (UTC)
Hi
By blocks I was referring to code points allotted in the code chart for the Tamil language. The problem is not with word processors but with the allocation of spaces in the Unicode standard itself. The Tamil language has 12 vowels and 18 consonants. Simple math yields 216+12+18=246 characters. Tamil also has a special character called 'aytha ezhuthu'. Put together there are 247 letters of the alphabet. However, the powers that be at Unicode have decided that Tamil does not have to be allocated so many points. Instead they have allotted a few code points for joiners and modifiers. The problem arises when text is copied and pasted. The joiners are rendered as independent characters ('ku' is displayed as 'k'+'u', for instance). Illogical ordering of letters and modifiers is another problem.
Regards
C Ramesh —The preceding unsigned comment was added by 203.199.211.197 ( talk • contribs) 12:57, 14 August 2006 (UTC)
Hi Chovain
You are right. The 216 letters are vowel-consonant pairs, but they are all treated as individual letters, unlike in English.
I think the problem would be better understood with this illustration:
க = the letter ka
ு = the 'u' vowel sign
கு = the letter 'ku'
When I copy 'கு' from a Web page and paste it onto a word processor, it would appear as க ு (without the space between the two letters). The letter and the vowel sign must not appear as separate letters in Tamil. That's my biggest quibble with Unicode. It's perfect for Tamil text display but fails miserably when it comes to text representation in a word processor or text editor.
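(To make the structure concrete: the displayed syllable கு really is two Unicode code points, the consonant plus the combining vowel sign. A small Java illustration — the class name is mine:)

```java
public class TamilCluster {
    public static void main(String[] args) {
        // க (U+0B95, letter ka) + ு (U+0BC1, vowel sign u), rendered as கு
        String ku = "\u0B95\u0BC1";
        System.out.println("chars: " + ku.length());
        System.out.println("code points: " + ku.codePointCount(0, ku.length()));
    }
}
```

Whether the two code points appear as one glyph or two is up to the renderer, which is exactly the copy-and-paste problem described above.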
C Ramesh —The preceding unsigned comment was added by 203.199.211.197 ( talk • contribs) 15:29, 14 August 2006 (UTC)
Ok - that makes much more sense. I've rewritten the paragraph in question. Let me know what you think. Chovain 23:48, 15 August 2006 (UTC)
Hi Chovain
Thanks for the rewrite. It certainly provides a lot more clarity.
C Ramesh —The preceding unsigned comment was added by 203.199.211.197 ( talk • contribs) 12:25, 16 August 2006 (UTC)
Ramesh - PLEASE sign your comments with ~~~~. See WP:SIG if you don't know what I am talking about. Chovain 03:22, 17 August 2006 (UTC)
(Sorry for the indent mess below. I do not know what was an answer to what anymore. Mlewan 13:32, 22 August 2006 (UTC))
The Unicode support of Tamil is perfectly able to fulfill all user requirements (except perhaps some strange issues concerning text markup), but the software implementation needs somewhat more sophistication than visual order encodings like TSCII. Note that Thai got its visual order encoding grandfathered into Unicode, but most Unicode experts consider the Unicode Thai implementation an odd deviation, needing special-casing here and there (e.g. in UCA). -- Pjacobi 12:11, 17 August 2006 (UTC)
From the TSCII proposal (linked from tscii.org), it's pretty clear that TSCII encodes glyphs in order to make text processing easier for systems that can't compose க+ ு=கு. Representing glyphs means instead of கொ (letter ka, vowel sign o), one can use ெகா (vowel sign e, letter ka, vowel sign aa), dispensing with the need for one-glyph-per-consonant-vowel: the two look (almost) identical. I don't think "க ு" is valid Tamil. (OSX actually combines ு with the previous character, so it seems to be one character while it really is two). The comparison to æ is nonsensical - "æ" is semantically different from "ae", and the OS shouldn't display one as the other. However, while there is an fi ligature, OS X does go through your text and ligaturise fi when possible (in fonts with suitable glyphs). There used to be a bug where moving your cursor over a fi would move across the whole ligature - now it treats it properly, and the cursor sits between the f and i, in the middle of a glyph. A glyph is not necessarily the same as a character. For another example, look at Arabic - the OS has to do a lot of work to convert from characters to the right glyphs. If text looks wrong in MS Office, then you either need better fonts or a better word processor. Elektron 19:35, 24 August 2007 (UTC)
In the Issues section, both the [1] and [2] references ('alternatives to Unicode' and 'Thai problems in collation') link to the same dead page at IBM.
Can something be done to improve this table? The reader is left to figure out for themselves the correlation between the bolding and italicising and which codepoint ranges are included in the subsets. I presume that bold means it is in WGL-4, italics that it is in MES-1 (actually, there don't appear to be any examples of this), bold italics that it is in both WGL-4 and MES-1 and that all mentioned codepoint ranges are included in MES-2. Is this right? I'm still not sure why in the F0 line, 01-02 are given in parentheses. Perhaps there is another scheme (like using colours) to make all this clearer (and not quite so ugly)? Is there a particular reason the table was forced into a different font to the rest of the article? And finally, the notes [1] and [2] in the title don't seem to do anything (their content seems to only appear when you edit them). (I note from the history that the table was originally inserted by User:Crissov back in April.) Thylacoleo 03:06, 23 August 2006 (UTC)
I'm tempted to delete the entire section "Input methods". They are essentially unrelated to Unicode. -- Pjacobi 22:33, 25 August 2006 (UTC)
Would a more comprehensive listing of operating systems be of benefit?
Considering how HP is a party to the Consortium that attempts to be responsible for Unicode, maybe they can appear out of the blue, pipe up, and explain a simple way to translate my custom-designed HP laserfonts (originally composed in 1988 or 1989) to Unicode? (No, I don't use a PC and I don't use a Mac, and admit I am dealing with a fairly ordinary 68000 environment contained (and accessed) in a non-FAT, non-PC filesystem.)
The information here at Wikipedia is simply not explicit enough for me to translate my laserfonts to Unicode. There's got to be a way, but I need a lot more information than what is currently in the main article. And I shouldn't have to buy a PC running under Windows just to see the Unicodes.
Doesn't Unicode worry about the depths of margins into the bitmaps, or the heights and widths of the relevant data of the bitmaps?
Somebody should put together an article about the way Unicode stores its data. —The preceding unsigned comment was added by 198.177.27.18 ( talk • contribs) 06:47, 11 October 2006 (UTC)
I've been searching all over on how to update my PC's unicode registry, as very few non-keyboard characters appear as anything other than boxes, but to no avail. Can anyone tell me how I do this? (Also, might be something to incorporate into the article) -- Nintendorulez talk 20:58, 21 October 2006 (UTC)
I went to tidy up a recent edit to the "Origin and development" section of the article, and ended up rewriting the whole section. Here's what I came up with.
The main changes I'm aware of (it's late here) are mentioning ISO/IEC 2022 and round-trip compatibility. Does anyone think this should go in the article? Cheers, CWC (talk) 17:50, 27 October 2006 (UTC)
Can you really call UCS-2 "obsolete"? To me, "obsolete" means something that no longer is used. However MS SQL Server still uses UCS-2 internally, and that means that a lot of us use it indirectly every day. Our bank may use it, our HR System that pays our salary, some of our favourite web sites... Mlewan 06:28, 4 November 2006 (UTC)
The Java situation: Character handling in J2SE 5 is based on version 4.0 of the Unicode standard. This includes support for supplementary characters, which has been specified by the JSR 204 expert group and implemented throughout the JDK. See the article Supplementary Characters in the Java Platform, the Java Specification Request 204 or the Character class documentation for more information.
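(A short Java sketch of what "support for supplementary characters" means in practice: a code point outside the BMP occupies two UTF-16 units, and the JSR 204 code-point methods see through the surrogate pair. The class name and the choice of U+10400 DESERET CAPITAL LETTER LONG I as the example are mine:)

```java
public class Supplementary {
    public static void main(String[] args) {
        // U+10400 is a supplementary character (outside the BMP)
        String s = new String(Character.toChars(0x10400));
        System.out.println("UTF-16 units: " + s.length());
        System.out.println("code points: " + s.codePointCount(0, s.length()));
        System.out.println("first unit is surrogate: "
                + Character.isHighSurrogate(s.charAt(0)));
        System.out.println("codePointAt: U+"
                + Integer.toHexString(s.codePointAt(0)).toUpperCase());
    }
}
```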
The Microsoft OS situation
Windows 2000 introduced support for basic input, output, and simple sorting of supplementary characters. However, not all system components are compatible with supplementary characters. Also, supplementary characters are not supported in Windows 95/98/Me.
The MS SQL server situation
Pjacobi 18:37, 6 November 2006 (UTC)
Java keeps char as a 16-bit type and stores non-BMP codepoints as a surrogate pair. Code that processes strings character by character has to be rewritten to use codePointAt and similar methods which JSR 204 added to java.lang.String and java.lang.StringBuffer, but it's better to use a higher-level ICU4J facility such as BreakIterator. (See also the brief rationale for JSR 204 in Supplementary Characters in the Java Platform.) The days when any competent programmer could write production-quality text-processing tools from scratch are over.