This Wikipedia page has been superseded by Help talk:Multilingual support and is retained primarily for historical reference.
Discussion page about using Unicode in Wikipedia.
(Note that this page and the discussion page cover much the same topic, so you might want to read both. m.e. 11:02, 14 Jul 2004 (UTC))
From the Village Pump
This may be the wrong place to ask this, or it may be answered elsewhere, but can anyone tell me if and when the English Wiki will be changed over to UTF-8? I ask because it's hugely inconvenient to work with text that's full of ś's, but for some topics (Sanskrit and associated languages and subjects, in my case), there is no adequate alternative to using Unicode characters. This is true even if I eschew Devanagari and work in Roman script, because standardized Roman transliteration requires characters with diacritics that aren't available in Latin-1. कुक्कुरोवाच 20:51, 31 Mar 2004 (UTC)
Switching the entire project over to UTF-8 or leaving things in ISO-8859-1 are not the only two choices. It would be straightforward to add a user option for "Edit in UTF-8". When a logged-in user with this option set requests to edit a page, the server translates HTML character references to their UTF-8 equivalents. When the user submits their edit, the server translates non-ASCII (or non-ISO-8859-1) characters back into HTML character references for storage in the database. Users who don't set this option would see no difference. See my Editing in UTF-8 feature request. — Gdr 12:33 2004-04-01.
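That round trip can be sketched in a few lines of Python. This is a hedged illustration of the idea only, not MediaWiki code; the function names are made up, and it handles only decimal numeric references:

```python
import re

def references_to_utf8(text):
    # Expand decimal numeric character references (e.g. &#347;)
    # into the actual characters, for display in the edit box.
    return re.sub(r'&#(\d+);', lambda m: chr(int(m.group(1))), text)

def utf8_to_references(text):
    # On save, turn characters outside ISO-8859-1 back into
    # numeric references so the database stores plain Latin-1.
    return ''.join(c if ord(c) < 256 else '&#%d;' % ord(c) for c in text)
```

For example, `references_to_utf8('&#347;')` yields the character ś, and `utf8_to_references` reverses it, so users without the option set would see the stored references unchanged.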
Hi! I am a user from the French Wikipédia. I know that some of you were interested in the conversion to UTF-8. As you may want to test on your personal wiki before considering the switch, here is the software to convert the MySQL dump: http://mboquien.free.fr/wikiconvert/ . It converts:
What it doesn't do:
This is a rewrite of the version we used to convert the French wiki (which was really dirty). I rewrote it this afternoon and tested it on an old cur dump of the French wiki; everything seems to work as expected. As for the details: it depends on Qt (no trolling about the toolkit used, please) and I ran it on Mandrake 10.0. I was told that it also compiles out of the box on Slackware. If you use another distribution, you may need to tweak the Makefile to get the correct path for Qt (you should set QTDIR correctly before trying to compile). Needless to say, you need the Qt development packages installed. Using it is quite easy: the Makefile produces a wikiconvert executable, and to convert you just write ./wikiconvert < dump > converteddump (if you don't use ISO-8859-1, there is one line to change in wikiconvert.cpp, as explained in the source). On my computer (an Athlon XP 2000+ underclocked to 1.5 GHz), converting a 90 MB dump of cur takes about 100 seconds. You should ask for an uncompressed dump of cur for your test: the compressed dumps available at http://download.wikipedia.org/ are not suitable for conversion, because once converted, MySQL can't load the dump completely (apparently a problem with overly long lines, last time I tried).
I'd be very happy to get some feedback, and I would gladly accept patches to make the program faster or better. :) If you have any questions, you can reach me on #fr.wikipedia on Freenode or on my discussion page (French or English only, please). Med 09:41, 4 Apr 2004 (UTC)
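The core transformation such a converter performs is simple to state: reinterpret the dump's ISO-8859-1 bytes as UTF-8. A minimal Python sketch of that step (the function name is hypothetical; the real tool is C++/Qt and also has to cope with SQL escaping and dump structure):

```python
def latin1_line_to_utf8(line_bytes: bytes) -> bytes:
    # Decode the ISO-8859-1 bytes, then re-encode the text as UTF-8.
    # Every Latin-1 byte maps to exactly one code point, so decoding cannot fail.
    return line_bytes.decode('iso-8859-1').encode('utf-8')
```

For instance, the Latin-1 byte 0xE9 ('é') becomes the two UTF-8 bytes 0xC3 0xA9, which is also why converted dumps come out slightly larger than the originals.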
I think the ironic thing is that Wikipedia is already using Unicode. Tagging the pages as ISO-8859-1 and forcing users to use HTML entities just takes up more bandwidth and makes editing slower.
-浪人
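The bandwidth point is easy to quantify; a quick illustrative comparison (a sketch for a single character, not a measurement of real dumps):

```python
# Storage size of the single character 'ś' under each scheme.
as_entity = '&#347;'.encode('ascii')   # stored as an HTML character reference
as_utf8 = 'ś'.encode('utf-8')          # stored as raw UTF-8
print(len(as_entity), len(as_utf8))    # 6 bytes vs. 2 bytes
```

Six bytes against two, and the gap grows with every non-Latin character on a page such as a Sanskrit or Greek article.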
Update: by now the Spanish and the German Wikipedias have been converted successfully to UTF-8. Only Dutch, Danish, Swedish and English still use ISO-8859-1.
While the whole Unicode debate is going on, you might find a little tool I wrote useful. Just go to my user page for the source and a link to a "runnable" version. All it does is convert all non-ASCII Unicode characters you type into it into the &#nnnn; numeric character reference format. I didn't know if there was something like this already out there, so I just spent 25 minutes writing my own. -- Aramgutang 06:46, 8 Aug 2004 (UTC)
I have placed a set of Unicode code points for the Greek alphabet at the foot of my User page, for anyone who works on Greek-related articles and shares my inability to memorise them. Adam 03:12, 23 Apr 2004 (UTC)
A few weeks ago I started changing various Greek-language entries (e.g. in the top line of Jesus, I put Greek Ἰησοῦς Χριστός Iēsoûs Khristós) to display the proper accent marks. This displays fine in Mozilla, but when I try to display the same pages in Microsoft Internet Explorer, all I get is little squares instead of Greek letters.
Is there an official Wikipedia policy on which Unicode characters we should and should not use? m.e. 10:58, 24 Jun 2004 (UTC)/ m.e. 08:12, 9 Jul 2004 (UTC)
I suppose someone should jump in and write a policy that says which characters one should and should not use? Where would it go? Who should write it? Would it go through some sort of acceptance test before it reaches 'production'? I'd think it would be a bit contextual; in some (more specialised) contexts you might go for the 'real' characters, and accept that they might not display for everyone.
Also, could we solve this by using the TeX option? Can we use the TeX display mode, normally used for mathematics, to display non-Latin characters? ... TeX mode doesn't seem to work for this: it throws you straight into math mode, and it seems to recognise only a limited subset of TeX commands; is this true? m.e. 09:22, 29 Jun 2004 (UTC)
I think the policy should be: use any Unicode characters you think are right for the article. Writing excellent encyclopedia articles is more important than worrying too much about browser and operating system capabilities. Browsers and operating systems will catch up (some are pretty good already). To cater for people who can't see some characters, the right thing to do is to present the same information in several forms. For example, many articles give pronunciation indications in both IPA and ASCII-IPA. Gdr 19:12, 2004 Jul 3 (UTC) — that's a point, I suppose we should work on the principle that Wikipedia will still be around in 10^n years, and we should write for then as well as for now. m.e. 09:53, 5 Jul 2004 (UTC)
The Resin identification code Unicode symbols don't work (in Firefox, anyway). Does anyone here know how to fix them?
Duk 16:00, 10 Oct 2004 (UTC)