Re: [idn] An experiment with UTF-8 domain names
- To: idn@ops.ietf.org
- Subject: Re: [idn] An experiment with UTF-8 domain names
- From: "Adam M. Costello" <amc@cs.berkeley.edu>
- Date: Sun, 7 Jan 2001 10:23:30 +0000
- Delivery-date: Sun, 07 Jan 2001 02:25:38 -0800
- Envelope-to: idn-data@psg.com
- User-Agent: Mutt/1.3.12i
"Martin J. Duerst" <duerst@w3.org> wrote:
> If I send a mail in iso-2022-jp or put up a web page in iso-8859-1,
> the IDNs will be in these encodings, not in UTF-8. Everything else,
> including ACE, would be wrong.
That's an interesting problem. People exchanging Japanese email do
indeed use iso-2022-jp, not utf-8. If one of them wants to include a
Japanese domain name in the body of a message (as part of a URL, for
example), it would be a shame to use ACE, because it would be visible to
both the sender and the recipient, unless the MUAs did some kludgy conversions on
message bodies. On the other hand, suppose the URL in question refers
to an image, and the sender wants his non-Japanese friend to see the
image. If we don't use ACE, there may be no charset that both the
sender and the recipient can use and that is capable of representing
the domain name.
This suggests that users should have some way of representing any domain
name using ASCII (as a last resort), regardless of what goes on the
wire.
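The mismatch is easy to demonstrate. Here's a rough sketch in Python
(the all-kanji label is purely illustrative, and the codec names are
Python's, not anything the mail protocols mandate):

    # A hypothetical all-kanji label, purely for illustration.
    label = "日本"

    # The sender's charset (iso-2022-jp) can represent it ...
    label.encode("iso-2022-jp")     # succeeds

    # ... but a recipient limited to, say, iso-8859-1 cannot be shown it:
    try:
        label.encode("iso-8859-1")
    except UnicodeEncodeError:
        print("no shared charset -- fall back to an ASCII form such as ACE")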
This scenario raises a big question about URIs. According to RFC
2396, a URI is a sequence of characters (not bytes) from a very small
repertoire (smaller than printable ASCII); this sequence of characters
can be used to represent a sequence of bytes (using ASCII and %hh),
which in turn could be used to represent characters from a larger
repertoire, but that last mapping is not standardized. In any case,
when a URI is presented to the user, the characters making up the URI
itself are supposed to be shown, not the characters represented by the
octets represented by the URI.
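To make the layering concrete, here's a rough sketch in Python (the
byte values and the two charsets are illustrative; RFC 2396 mandates
none of these interpretations):

    from urllib.parse import unquote_to_bytes

    # "%C3%A9" is six URI *characters*, all from the small URI repertoire.
    uri_chars = "%C3%A9"

    # Those characters represent two octets ...
    octets = unquote_to_bytes(uri_chars)    # b'\xc3\xa9'

    # ... and which characters those octets represent is not standardized:
    print(octets.decode("utf-8"))           # 'é'  (one possible reading)
    print(octets.decode("iso-8859-1"))      # 'Ã©' (another)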
If we want URIs to appear to contain non-ASCII characters, we need to
alter the URI model somehow. Here are three possibilities (there may be
others):
1. Use ACE, and add a rule saying that user agents should convert
to/from ACE when interacting with the user. This has the advantage of
changing nothing at the lower layers, but doesn't immediately answer
the next obvious question: how to allow non-ASCII characters to appear
in the rest of the URI. Perhaps ACE should be generalized to be usable
in other parts of URIs in addition to the domain names. This approach
exposes the ACE to people editing files containing URIs.
2. Change the rule about how URIs are presented to the user, so that %hh
sequences (except those for excluded ASCII characters) are unescaped and
interpreted as UTF-8. When URIs are obtained from the user, they must
be converted to UTF-8 and escaped before being dereferenced. This has
the same advantages and disadvantages as the previous approach. (Both
conversions are sketched in code after this list.)
3. Allow a much larger set of characters in URIs. This approach suffers
from the same problem as in the non-Japanese friend scenario above.
It would also require changing any protocol that embeds URIs without
specifying a sufficiently expressive encoding, most notably HTTP. HTTP
does not say what encoding is used for URIs; it is implicitly ASCII
or an extension of ASCII. HTTP assumes ISO-8859-1 for comments and
unstructured text in message headers, so it would not go without saying
that 8-bit URIs are in UTF-8; that would have to be added to the spec,
and who knows how existing implementations would react to 8-bit URIs.
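For concreteness, here's a rough sketch in Python of the conversions
that approaches 1 and 2 would push into user agents. The ACE shown is
the Punycode-based "xn--" form that Python's idna codec happens to
produce; the ACE eventually chosen for IDN could differ, and the host
name is purely illustrative.

    from urllib.parse import quote, unquote

    # A purely illustrative internationalized host name.
    name = "bücher.example"

    # Approach 1: the wire form is an ACE; user agents convert for display.
    ace = name.encode("idna").decode("ascii")
    print(ace)                                  # xn--bcher-kva.example
    print(ace.encode("ascii").decode("idna"))   # bücher.example, for display

    # Approach 2: the wire form is %hh-escaped UTF-8; user agents unescape
    # for display and re-escape before dereferencing.
    escaped = quote(name, safe=".")
    print(escaped)                              # b%C3%BCcher.example
    print(unquote(escaped))                     # bücher.example, for display

Either way, the conversion happens only at the edge where the URI meets
the user; what travels on the wire stays within today's URI repertoire.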
Allowing URIs to appear to contain non-ASCII characters (using any
approach) is counter to the URI design goal of transcribability (the
business card scenario), but if we value transcribability more than i18n
then we should forget about IDN altogether.
> ACE is not a 7-bit encoding of the UCS. It's a two-step encoding,
> first from the UCS to the legacy host name repertoire, and then from
> there to 7-bit octets (using US-ASCII).
Thank you! I had missed that subtlety, but you're right, host names
are not required to be represented in ASCII; they are just sequences
of characters. An EBCDIC system would apply an EBCDIC-to-ASCII
transformation to any outgoing host names; if ACE mapped Unicode to
bytes rather than characters, the ACE host names leaving the EBCDIC
system would get scrambled.
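A rough sketch in Python of why that matters, with cp037 standing in
for "an EBCDIC system" and the xn-- label being just one possible ACE:

    # A purely illustrative ACE label; any ACE restricted to letters,
    # digits, and hyphens behaves the same way.
    ace_label = "xn--bcher-kva"

    # On an EBCDIC host (cp037 here), the same *characters* are different bytes:
    ebcdic_bytes = ace_label.encode("cp037")
    ascii_bytes = ace_label.encode("ascii")
    print(ebcdic_bytes != ascii_bytes)                                  # True

    # The host's usual EBCDIC-to-ASCII conversion of outgoing names just
    # re-encodes those characters, so the ACE label survives intact:
    print(ebcdic_bytes.decode("cp037").encode("ascii") == ascii_bytes)  # True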
John C Klensin <klensin@jck.com> wrote:
[business-card scenario]
Ouch. IDN is so difficult, maybe it's hopeless.
The unaccented Latin alphabet and Arabic numerals (and a few punctuation
marks) make an excellent set of characters for universal names, because
it's a small set that is recognized by more humans and more machines
than any other. That's surely a big part of why this set was chosen
for domain names and URIs, which are intended to be usable by anyone,
anywhere.
A giant character set is by nature ill-suited to the task of universal
naming. Perhaps we have no choice but to view IDNs as deliberately
non-interoperable things, optimized for particular communities.
As for the business cards, maybe people will just register two names,
one native and one romanized (perhaps translated), with one being a
CNAME for the other. I can't immediately think of anything easier for
the people trying to use the names on the card.
AMC