This article is rated C-class on Wikipedia's content assessment scale.
This article may be too technical for most readers to understand. (September 2010)
This article links to one or more target anchors that no longer exist.
First, the linked "Punycode Exploit" should link to the advisory text, not the simple demonstration page. Secondly, this is not an exploit of punycode so much as it is an exploit of the fact that not running domain names through nameprep is asking for spoofing problems.
I would change the paragraph from:
Punycode is easily exploitable, and for an example see Punycode exploit
to:
Note that browsers which fail to run a string through nameprep before using it as a DNS name are vulnerable to spoofing exploits, as in Punycode spoof.
I suspect it should read more like:
The fact that Unicode and Punycode strings result in homographs (strings which are different but visually indistinguishable or almost so) is a matter of some concern. Phishing (a class of social engineering exploits) will make use of homographs like "www.paypa1.com" (note: this contains the digit 1 ("one") rather than the letter l ("ell")).
I'd like to add some links to the discussion of "petnames" (which I've read from the cap-talk mailing list that's maintained by Jonathan Shapiro, creator of EROS).
Unfortunately I, User:JimD, lack the time at the moment to do this properly. So, I'm tacking this in here in the hopes that it will help me remember or that someone here will take the ball and run with it. JimD 22:43, 2005 Feb 22 (UTC)
With the last two words an external link to the exploit.
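To make the homograph concern concrete, here is a minimal sketch using only the Python standard library. The spoofed string (a Cyrillic "а" substituted into "paypal") is a made-up illustration, not taken from the advisory discussed above.

```python
import unicodedata

latin = "paypal"
# Hypothetical spoof: the second letter is U+0430, which renders like a Latin "a".
mixed = "p\u0430ypal"

# The two strings look alike on screen but are different code point sequences.
same_text = (latin == mixed)
second_letter = unicodedata.name(mixed[1])

# Their Punycode forms differ as well, so the ASCII form exposes the difference.
latin_puny = latin.encode("punycode").decode("ascii")
mixed_puny = mixed.encode("punycode").decode("ascii")

print(same_text)
print(second_letter)
print(latin_puny, mixed_puny)
```

The point is only that the visual similarity lives in the Unicode form; the encoded form of the spoof does not match the encoded form of the genuine name.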
The explanation for how Punycode is supposed to work is unintelligible to me, and I don't consider myself particularly dull about technical matters. The main problem is that the author doesn't seem to have a grasp of where to begin and ends up explaining the thing in bits and pieces. Can someone replace this with something more coherent? 82.92.119.11 8 July 2005 17:17 (UTC)
Is Punycode patented? -- 84.61.51.126 13:01, 6 July 2006 (UTC)
How about showing some examples?
From the text it's not really clear how exactly you get back to the name, or even if that is possible. Or how the position of the character is encoded. Since currently it looks to me as if "bücher" "bächer" "bchüer" etc. all get the same bcher-kva code??
Shouldn't "bücüher" be punycoded as "bcher-kvaba", and "bücherü" as "bcher-kvaea"? Moreover, should "ýbücher" be punycoded as "bcher-kvafa" or as "bcher-fakva"? (are both possible?) —Preceding unsigned comment added by 193.145.147.38 ( talk) 18:21, 27 February 2008 (UTC)
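One way to check guesses like these is to ask an existing implementation. The sketch below assumes Python's built-in "punycode" codec is a faithful implementation; the helper names `punycode` and `unpunycode` are made up for this example. It encodes each of the strings mentioned above and decodes the result again.

```python
def punycode(s):
    # Encode a Unicode string with Python's built-in "punycode" codec.
    return s.encode("punycode").decode("ascii")


def unpunycode(s):
    # Decode an ASCII Punycode string back to Unicode.
    return s.encode("ascii").decode("punycode")


words = ["bücher", "bächer", "bchüer", "bücüher", "bücherü", "ýbücher"]
encoded = {w: punycode(w) for w in words}

for word, label in encoded.items():
    # Every word should survive the round trip unchanged.
    print(word, "->", label, "->", unpunycode(label))
```

All six words share the basic part "bcher", but the digits after the last hyphen encode both which character is inserted and where, so each word gets its own code and decoding is unambiguous.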
The heading "Encoding procedure" says:
"The 'bücher' example given above . . ."
There is no "bücher" example "above" in the current form of the article.
This is probably an editorial anomaly in the wake of rearrangement of the article.
Le crayon rouge ne dort jamais.
Doug Kerr 17:56, 11 March 2007 (UTC)
I did not understand this article but had no problem understanding Punycode from reading the RFC. I will reorder and enhance this article by starting with a quick description of the reasons why this particular encoding scheme is used and then following the explanation order in the RFC, which starts out with the decoding, which is far easier to understand than the encoding. Roeschter 19:25, 25 March 2007 (UTC)
'Bootstring' redirects here, even though it doesn't strike me as synonymous with 'Punycode' (Punycode, going by the title of ref1, Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA), seems to be a kind of Bootstring).
Could someone with more knowledge on the matter either create a Bootstring article describing what that is or make it clear that Bootstring indeed is merely a synonym? I'm curious which/what it is.
- pinkgothic ( talk) 12:42, 12 November 2010 (UTC)
Section Separation of ASCII characters says "Since it is a basic character, the ASCII hyphen may still appear in the string before this additional character, but the addition does not cause ambiguity". But precisely because the hyphen may still appear, we have a problem: Suppose you wanted to convert the string "bcher-kva" to a label - wouldn't that be just the same, thus rendering the label "bcher-kva" ambiguous? — Sebastian 07:19, 20 December 2011 (UTC)
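A quick experiment suggests why this is not ambiguous. The sketch assumes Python's built-in "punycode" codec behaves as the RFC describes; the helper names are made up for this example.

```python
def punycode(s):
    # Encode with Python's built-in "punycode" codec.
    return s.encode("punycode").decode("ascii")


def unpunycode(s):
    # Decode an ASCII Punycode string back to Unicode.
    return s.encode("ascii").decode("punycode")


label_for_buecher = punycode("bücher")     # one non-ASCII character
label_for_literal = punycode("bcher-kva")  # pure ASCII, contains a hyphen

print(label_for_buecher)
print(label_for_literal)
print(unpunycode(label_for_buecher), unpunycode(label_for_literal))
```

The literal string "bcher-kva" is copied in full and then followed by a delimiter of its own, so it becomes "bcher-kva-". The decoder always splits at the last hyphen, which keeps the two cases apart.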
The current link to the Punycode RFC didn't work; it was redirected to an IETF home page. <https://www.rfc-editor.org/rfc/rfc3492.txt> is the new link to the Punycode RFC. — Preceding unsigned comment added by CTMacUser ( talk • contribs) 06:37, 14 March 2019 (UTC)
Anyone know 179.98.225.10 ( talk) 14:36, 16 March 2019 (UTC)
Because it is a PUN on "UNIcode" (Unicode) Uni rhymes with puny Firejuggler86 ( talk) 21:31, 4 April 2020 (UTC)
What will happen if the number of states to be skipped will get larger than 1.03e40=35*34*...*2? Unlikely given the application, but there is no answer given inside the article, as far as I've understood. Am I correct? — Preceding unsigned comment added by 212.82.199.28 ( talk) 15:40, 9 December 2019 (UTC)
I must admit that I understand very little of this article, especially because there are almost no examples. What I especially miss is how to encode pure basic ASCII strings in Punycode. It only seems to focus on strings that include at least one non-ASCII character, probably because Punycode was created specifically for handling this subset of Unicode strings. But I would assume that Punycode is also able to encode pure basic ASCII strings? At least, that's what the first sentence in the article claims: "Punycode is a representation of Unicode", which means it can handle all strings including basic ASCII strings which is a subset of all Unicode strings. If that is the case, how does it do that? For instance, what is the Punycode encoding of "London"? My guess is that it would be "London", but I can't find that information in the article; the section "Separation of ASCII characters" seems to say that "London" would be encoded as "London-" ("If any characters were copied, an ASCII hyphen is added to the output"), but I find that quite odd. Also, what is the encoding of the empty string? Is it the empty string? Simple examples are needed, preferably a table that especially lists both very common cases like pure basic ASCII strings ("a", "abc", "London", "A string with spaces" if a space is a basic ASCII character) as well as corner cases like the empty string (""), the string "3", the string "-", and a string consisting of a single non-basic ASCII letter such as "ü".
As a first step, though, I would simply add the example of "London" next to the mentioned example of München. But I still don't know the Punycode encoding of "London". I can't even use an online Punycode converter to figure it out, as those I have tried all seem to get it wrong, as they all add "xn--" in the beginning of encodings of non-ASCII strings, which does not seem to be part of Punycode (they seem to mistakenly mix in Nameprep, which is not part of Punycode, in the process).
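For what it's worth, the "xn--" prefix is the ACE prefix that IDNA's ToASCII step (RFC 3490) puts in front of the Punycode output; it is not part of Punycode itself. The difference can be seen with the two codecs in the Python standard library. This is a small sketch, assuming the built-in "punycode" and "idna" codecs are acceptable stand-ins for the online converters.

```python
# Raw Punycode: no prefix, case of the basic characters is kept.
raw = "München".encode("punycode").decode("ascii")

# IDNA (2003) ToASCII: runs Nameprep (which lowercases), applies Punycode,
# then prepends the ACE prefix "xn--".
idna = "München".encode("idna").decode("ascii")

# A label that is already pure ASCII is passed through by the IDNA codec.
ascii_only = "London".encode("idna").decode("ascii")

print(raw)
print(idna)
print(ascii_only)
```

So an IDNA converter leaves "London" alone, while raw Punycode would append a delimiter to it.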
-- Jhertel ( talk) 16:11, 15 February 2020 (UTC)
As a follow-up, the table below shows what I would like to see the encodings of, as that shows corner cases and gives a quick overview. I have filled out the only one I know the answer to (from the article); the rest are marked ???:
Input | Punycode of input | Description of input |
---|---|---|
 | ??? | The empty string. |
a | ??? | Only basic ASCII characters, one, lowercase. |
A | ??? | Only basic ASCII characters, one, uppercase. |
3 | ??? | Only basic ASCII characters, one, a digit. |
- | ??? | Only basic ASCII characters, one, a hyphen. |
-- | ??? | Only basic ASCII characters, two hyphens. |
abc | ??? | Only basic ASCII characters, more than one, all lowercase. |
London | ??? | Only basic ASCII characters, more than one, one uppercase. |
Lloyd-Atkinson | ??? | Only basic ASCII characters, one hyphen. |
This has spaces | ??? | A string with spaces. |
ü | ??? | No basic ASCII characters, one character. |
αβγ | ??? | No basic ASCII characters, more than one character. |
München | Mnchen-3ya | Mixed string, with one character that is not a basic ASCII character. |
Mnchen-3ya | ??? | Only basic ASCII characters, equal to the Punycode of "München" (effectively encoding "München" twice). |
München-Ost | ??? | Mixed string, with one character that is not basic ASCII, and a hyphen. |
Bahnhof München-Ost | ??? | Mixed string, with one space, one hyphen, and one character that is not basic ASCII. |
I just got the idea to use the Python library codecs with the encoding "punycode", which should be very trustworthy, so I will return soon with the actual punycodes for the table.
-- Jhertel ( talk) 18:07, 15 February 2020 (UTC)
Okay, so here is the table:
Input | Punycode of input | Description of input |
---|---|---|
 |  | The empty string. |
a | a- | Only basic ASCII characters, one, lowercase. |
A | A- | Only basic ASCII characters, one, uppercase. |
3 | 3- | Only basic ASCII characters, one, a digit. |
- | -- | Only basic ASCII characters, one, a hyphen. |
-- | --- | Only basic ASCII characters, two hyphens. |
abc | abc- | Only basic ASCII characters, more than one, all lowercase. |
London | London- | Only basic ASCII characters, more than one, one uppercase. |
Lloyd-Atkinson | Lloyd-Atkinson- | Only basic ASCII characters, one hyphen. |
This has spaces | This has spaces- | A string with spaces. |
ü | tda | No basic ASCII characters, one character. |
αβγ | mxacd | No basic ASCII characters, more than one character. |
München | Mnchen-3ya | Mixed string, with one character that is not a basic ASCII character. |
Mnchen-3ya | Mnchen-3ya- | Only basic ASCII characters, equal to the Punycode of "München" (effectively encoding "München" twice). |
München-Ost | Mnchen-Ost-9db | Mixed string, with one character that is not basic ASCII, and a hyphen. |
Bahnhof München-Ost | Bahnhof Mnchen-Ost-u6b | Mixed string, with one space, one hyphen, and one character that is not basic ASCII. |
It was created by this Python (3.8) script:
# Each entry is (input string, description of the input).
examples = [
    ("", "The empty string."),
    ("a", "Only basic ASCII characters, one, lowercase."),
    ("A", "Only basic ASCII characters, one, uppercase."),
    ("3", "Only basic ASCII characters, one, a digit."),
    ("-", "Only basic ASCII characters, one, a hyphen."),
    ("--", "Only basic ASCII characters, two hyphens."),
    ("abc", "Only basic ASCII characters, more than one, all lowercase."),
    ("London", "Only basic ASCII characters, more than one, one uppercase."),
    ("Lloyd-Atkinson", "Only basic ASCII characters, one hyphen."),
    ("This has spaces", "A string with spaces."),
    ("ü", "No basic ASCII characters, one character."),
    ("αβγ", "No basic ASCII characters, more than one character."),
    ("München", "Mixed string, with one character that is not a basic ASCII character."),
    ("Mnchen-3ya", 'Only basic ASCII characters, equal to the Punycode of "München" (effectively encoding "München" twice).'),
    ("München-Ost", "Mixed string, with one character that is not basic ASCII, and a hyphen."),
    ("Bahnhof München-Ost", "Mixed string, with one space, one hyphen, and one character that is not basic ASCII."),
]


def punycode(s):
    # Encode with the built-in "punycode" codec; the result is pure ASCII.
    return s.encode("punycode").decode("ascii")


def nowrap(s):
    # Wrap in the wiki template that prevents line breaks inside a cell.
    return "{{nowrap|" + s + "}}"


def code(s):
    return "<code>{}</code>".format(s)


def print_row(s, description):
    # One wikitable row: input, its Punycode, and the description.
    print("| {} || {} || {}".format(
        nowrap(code(s)),
        nowrap(code(punycode(s))),
        description))
    print("|-")


def print_table(examples):
    print('{| class="wikitable"')
    print('|-')
    print('! Input !! Punycode of input !! Description of input')
    print('|-')
    for (s, description) in examples:
        print_row(s, description)
    print('|}')


print_table(examples)
-- Jhertel ( talk) 19:04, 15 February 2020 (UTC)
So, technically, this means that https://xn--xn--wikipedia--.org redirects to https://xn--wikipedia-.org and then to https://wikipedia.org because the conversion method still works even if the original domain is LDH? 154.5.234.189 ( talk) 04:55, 17 November 2020 (UTC)
There is no "redirect" at all. The browser will try to use the name exactly as you entered it. An un-cautious browser might display "xn--xn--wikipedia--.org" as "xn--wikipedia-.org" but would always use "xn--xn--wikipedia--.org" when communicating over the network because that's what you typed in. And real browsers won't decode your example even for display because it doesn't pass the smell test; it looks like you're trying to exploit something. Real browsers only decode a domain for display if that decode results in a string that utilizes exactly one non-ASCII writing system. Your example uses zero. Other phishing attempts use more than one. 108.246.204.20 ( talk) 03:01, 15 June 2021 (UTC)
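The purely mechanical part of the question can be tried out directly. The sketch below strips one "xn--" prefix and runs Python's built-in "punycode" codec on the rest; `decode_label` is a made-up helper, and it deliberately skips every safety check a browser or a full IDNA implementation would apply.

```python
ACE_PREFIX = "xn--"


def decode_label(label):
    """Strip one ACE prefix (if present) and Punycode-decode the remainder.

    Illustration only: no Nameprep, no round-trip check, no script checks.
    """
    if not label.lower().startswith(ACE_PREFIX):
        return label
    return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")


once = decode_label("xn--xn--wikipedia--")
twice = decode_label(once)

print(once)
print(twice)
```

So the nesting does decode step by step on paper, but only when something chooses to decode it; as explained in the reply above, the name sent over the network stays exactly what was typed.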
Maybe the section `Encoding of non-ASCII character insertions as code numbers` should have the hard-to-follow text that describes the state changes removed, and instead have a proper state machine as a formal definition, with a state graph and/or just pseudocode? Explaining it in plain English was worth a try, but sadly I feel it didn't really succeed. At least for me it isn't really possible to decipher from the text how the encoding actually works. 2003:EA:7F11:DE00:BC43:6BFF:FE40:1BA5 ( talk) 12:38, 17 December 2020 (UTC)
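As a starting point for such pseudocode, here is a sketch of the decoder written as runnable Python. It follows the decoding procedure and parameter values of RFC 3492 (section 6.2) as I read them; the function and variable names are my own. It leaves out the overflow checks of section 6.4, because Python integers do not overflow, and it does not verify that the copied prefix is pure ASCII.

```python
BASE, TMIN, TMAX = 36, 1, 26
SKEW, DAMP = 38, 700
INITIAL_BIAS, INITIAL_N = 72, 128
DELIMITER = "-"
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"  # digit values 0..35


def digit_value(ch):
    """Map a-z (either case) to 0..25 and 0-9 to 26..35."""
    value = DIGITS.find(ch.lower())
    if value < 0:
        raise ValueError("not a Punycode digit: %r" % ch)
    return value


def adapt(delta, numpoints, first_time):
    """Bias adaptation: re-tunes the digit thresholds after each insertion."""
    delta = delta // DAMP if first_time else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def decode(text):
    # Everything before the last delimiter is copied literally.
    cut = text.rfind(DELIMITER)
    output = list(text[:cut]) if cut > 0 else []
    rest = text[cut + 1:] if cut > 0 else text

    n, i, bias = INITIAL_N, 0, INITIAL_BIAS
    pos = 0
    while pos < len(rest):
        # Read one variable-length integer (a "delta") into i.
        old_i, w, k = i, 1, BASE
        while True:
            if pos >= len(rest):
                raise ValueError("truncated input")
            digit = digit_value(rest[pos])
            pos += 1
            i += digit * w
            t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
            if digit < t:
                break  # a digit below its threshold ends the integer
            w *= BASE - t
            k += BASE
        bias = adapt(i - old_i, len(output) + 1, old_i == 0)
        # The state (n, i) is one counter: i wraps around the insertion
        # positions, and every wrap moves n on to the next code point.
        n += i // (len(output) + 1)
        i %= len(output) + 1
        output.insert(i, chr(n))
        i += 1
    return "".join(output)


print(decode("bcher-kva"))
print(decode("Mnchen-3ya"))
```

The whole "state machine" is the pair (n, i): each decoded delta advances it, and the position where it stops says which code point to insert and where.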