Re: [idn] I-D ACTION:draft-ietf-idn-uri-00.txt
- To: idn working group <idn@ops.ietf.org>
- From: "Adam M. Costello" <amc@cs.berkeley.edu>
- Date: Wed, 10 Jan 2001 00:27:14 +0000
- Delivery-date: Tue, 09 Jan 2001 16:30:16 -0800
- Envelope-to: idn-data@psg.com
- User-Agent: Mutt/1.3.12i
> Title : Internationalized Domain Names in URIs and IRIs
> Author(s) : M. Duerst
> Filename : draft-ietf-idn-uri-00.txt
This is interesting. I'll give some comments after a very quick summary
for those who haven't read it:
URIs will continue to contain only ASCII characters, while IRIs will
allow Unicode characters. IRIs can be converted to equivalent URIs by
converting each non-ASCII character to UTF-8 and then %hh encoding it.
If domain names are eventually allowed to include non-ASCII characters,
they will be represented in URIs using the same method.
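To make the conversion concrete, here is a minimal Python sketch of the IRI-to-URI step as summarized above. The function name iri_to_uri is only illustrative and does not come from the draft, and the sketch assumes every ASCII character is passed through untouched.

```python
def iri_to_uri(iri: str) -> str:
    """Replace each non-ASCII character with the %hh escapes of its UTF-8 bytes.

    Illustrative helper only; ASCII characters are copied through unchanged.
    """
    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            pieces.append(ch)
        else:
            # One %hh escape per UTF-8 byte of the character.
            pieces.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(pieces)


# U+00FC (u with diaeresis) is C3 BC in UTF-8.
print(iri_to_uri("http://b\u00fccher.example/"))
```

For a two-byte character such as U+00FC this produces two escapes, %C3%BC, in place of the single character.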
There is a potential problem with host name comparisons. When URIs
are compared, the comparison between the host parts is supposed to
be case-insensitive. Software comparing IRIs presumably knows about
case-equivalence among non-ASCII characters, but software comparing URIs
might not. That means IRIs might use mixed-case host names, but when
they are converted to URIs the non-ASCII characters must be forced to
lower case (in the host part only, not the path).
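A rough Python sketch of that rule follows. It assumes a simple scheme://host/path layout with no userinfo, and the helper names are hypothetical. Only the non-ASCII characters of the host are lowercased; the path keeps its case.

```python
def percent_encode(text: str) -> str:
    """Percent-encode the UTF-8 bytes of every non-ASCII character."""
    return "".join(
        ch if ord(ch) < 0x80
        else "".join(f"%{b:02X}" for b in ch.encode("utf-8"))
        for ch in text
    )


def iri_to_comparable_uri(iri: str) -> str:
    """Convert an IRI of the assumed form scheme://host/path to a URI.

    Non-ASCII host characters are forced to lower case before encoding,
    so software that only knows ASCII case rules can still compare hosts.
    """
    scheme, sep, rest = iri.partition("://")
    host, slash, path = rest.partition("/")
    # Lowercase only the non-ASCII characters; ASCII case is left alone
    # because URI comparison already ignores it in the host part.
    host = "".join(ch.lower() if ord(ch) >= 0x80 else ch for ch in host)
    return scheme + sep + percent_encode(host) + slash + percent_encode(path)


# Host U+00DC is lowered to U+00FC; path U+00C4 keeps its case.
print(iri_to_comparable_uri("http://B\u00dccher.example/\u00c4"))
```

Two IRIs that differ only in the case of a non-ASCII host character then map to the same URI host, while their paths remain distinguishable.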
The draft tells how URIs can be converted back to IRIs. It would be
nice if the case of the Unicode characters in the host names could
be recovered. Fortunately, the UTF-8 %hh encoding of any non-ASCII
character always begins with the hex digit C, D, E, or F, which is a
letter and so has a case of its own. Therefore I suggest recording the
original case of the non-ASCII character as the case of that first
letter of the %hh encoding.
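Here is one way the suggestion might look in Python. This is a sketch under several assumptions of mine, not something the draft specifies: an upper-case first hex digit marks an originally upper-case character, the remaining hex digits are always written in upper case, escapes of ASCII bytes are left alone, and each character's lower-case form is a single non-ASCII character. All function names are hypothetical.

```python
import re


def encode_host(host: str) -> str:
    """Percent-encode a host, storing each character's case in the first hex digit."""
    out = []
    for ch in host:
        if ord(ch) < 0x80:
            out.append(ch)
            continue
        # Assumes ch.lower() is still a single non-ASCII character.
        escapes = [f"%{b:02X}" for b in ch.lower().encode("utf-8")]
        if not ch.isupper():
            # Lower-case (or caseless) original: lower the first hex digit.
            escapes[0] = "%" + escapes[0][1].lower() + escapes[0][2]
        out.append("".join(escapes))
    return "".join(out)


_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _decode_run(match: "re.Match[str]") -> str:
    pairs = match.group(0).split("%")[1:]
    out, i = [], 0
    while i < len(pairs):
        lead = int(pairs[i], 16)
        if lead < 0x80:
            out.append("%" + pairs[i])  # leave ASCII escapes untouched
            i += 1
            continue
        # Sequence length follows from the UTF-8 lead byte.
        length = 4 if lead >= 0xF0 else 3 if lead >= 0xE0 else 2
        ch = bytes(int(p, 16) for p in pairs[i:i + length]).decode("utf-8")
        if pairs[i][0].isupper():
            ch = ch.upper()  # first hex digit was upper case
        out.append(ch)
        i += length
    return "".join(out)


def decode_host(uri_host: str) -> str:
    """Recover the Unicode host, restoring case from the first hex digit."""
    return _ESCAPE_RUN.sub(_decode_run, uri_host)


print(encode_host("B\u00dccher.example"))  # upper-case U+00DC
print(encode_host("b\u00fccher.example"))  # lower-case U+00FC
```

Both hosts encode to the same bytes (C3 BC), so only the case of the hex letter differs, and decode_host can restore the original spelling from that letter.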
By the way, if an ACE is adopted for IDNs, there are two very different
ways it could be viewed, and the choice has implications for this URI
proposal.
Perhaps domain names really contain non-ASCII characters, and the ACE
is merely a representation that's friendlier to existing software and
protocols. In that case, this URI proposal applies. On the other
hand, perhaps domain names do not really contain non-ASCII characters;
the *ACE* is the real name, and the Unicode is merely a representation
that's friendlier to humans. In that case, this URI proposal would be
moot, because domain names would never contain non-ASCII characters.
I'm not advocating either viewpoint (yet), but it will have to be
decided.
AMC