
Re: [idn] An experiment with UTF-8 domain names



At 01/01/06 09:14 -0500, Keith Moore wrote:

>no.  you're correct in saying that - at least in email - all of the
>necessary standards are in place.  whether the deployed implementations
>support those standards is a completely different question.

They support them to the degree that native users can
send around text in their own language. That support
is very solid; otherwise email wouldn't exist in some
places at all. What is missing is, for example, Japanese
mailers being able to handle Arabic, but that's secondary.


>the primary reason for using ACE is within protocols  - i.e.
>in contexts that most humans do not see.  but we of course
>realize that ACE-encoded IDNs will leak into contexts that
>humans do see, just as native IDNs will leak into protocols
>that cannot deal with UTF8.  and those IDNs will get passed
>to other protocols.

Protocols don't necessarily need to deal with UTF-8.
Mail and Web pages are typical cases. If I send a mail in
iso-2022-jp or put up a web page in iso-8859-1, the IDNs
will be in those encodings, not in UTF-8. Everything else,
including ACE, would be wrong. And that already works.
(Again with the caveat that you can't put an Arabic
domain name into an iso-2022-jp mail. But you can put
it in a web page using the &#dddd; syntax.)
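As an aside, here is a short sketch of what the &#dddd; route looks
like in practice (the host name is invented; Python's
"xmlcharrefreplace" error handler happens to produce exactly this
syntax):

```python
# Sketch: a non-ASCII domain name embedded in an ASCII-only HTML page
# via &#dddd; numeric character references. The host name is invented.
name = "r\u00e9sum\u00e9.example"  # "résumé.example"
escaped = name.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)  # r&#233;sum&#233;.example
```

The page itself can then stay in pure ASCII (or iso-8859-1, or
anything else) while still carrying the full domain name.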


> > >a strategy for minimal disruption is:
> > >
> > >- affect as few components as possible  (since the effort required to
> > >   deal with breakage is on a per-component basis rather than a
> > >   per-line-of-code basis)
> >
> > If it's by component, I doubt that ACE is better than UTF-8.
> > ACE needs a lot of special considerations, scripts that have
> > to be fixed, and so on, in order to work.
>
>we need to look at this more carefully then, and choose an ACE
>which is unlikely to break things that currently deal
>successfully with domain names.  this is of paramount concern.

It's not really a question of which ACE to pick. Scripts
are in many ways at the boundary of the protocols and the
user. That means that a lot of conversion between ACE and
something readable will have to go on. For ACE, all of
this will have to be hand-knitted, because you will
continuously have to parse the data to separate domain
names from other stuff. In many ways, we will get into
a very NAT-like situation. NAT can be dealt with by
knowing exactly what's going on on the wire, for each
and every protocol, in particular to know all the places
where internet addresses turn up. But it has turned out
that this isn't feasible. ACE depends on knowing exactly
where domain names turn up. The only difference is that
the problem is not on the wire, it's at each end, but
it's nevertheless a mess.

For UTF-8, things just go through. If you want to use
another encoding, you don't have to know where the
domain names are, you can convert everything, including
comments,...
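To sketch the contrast in Python (using Punycode via the "idna" codec
as a stand-in for whichever ACE gets chosen; the text and host name
are invented examples):

```python
# UTF-8 route: transcode the whole byte stream. No need to know where
# the domain names are -- prose, comments and IDNs all come along.
text = "Siehe http://m\u00fcller.example/ f\u00fcr Details."
utf8_bytes = text.encode("utf-8")          # from the internal string
assert utf8_bytes.decode("utf-8") == text  # round-trips losslessly

# ACE route: every domain name has to be *found* in the text first,
# then converted label by label.
host = "m\u00fcller.example"
ace = host.encode("idna").decode("ascii")  # stand-in ACE (Punycode)
print(ace)  # xn--mller-kva.example
```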


> > >I think it's the other way around.  people will not give up their
> > >favorite tools en masse in favor of unfamiliar tools that support UTF-8,
> >
> > They don't have to. On unix, the general thing you will have to
> > do is to create aliases or wrapper shell scripts for your editor
> > to either set the locale to work with UTF-8 or to convert the
> > file to the encoding your editor can handle and back.
>
>I suspect that most UNIX users are unwilling to do even this much -
>they'd have to remember to run a different command when editing
>a UTF-8 file than when editing a normal file.

The alternative is to keep the files in the encoding of their
preference, and to insert an additional conversion step in
situations such as loading a zone file.

The important point is that users and administrators will
with close to absolute certainty prefer to edit text they
can read and understand rather than ACEed stuff. And for
this, UTF-8 is much less painful than ACE.


>though perhaps emacs
>would acquire a UTF-8 mode that would automagically detect UTF-8 -
>that's quite doable.

That has already been done. See e.g.
http://packages.debian.org/stable/editors/mule-ucs.html


> > Do you expect editors that can handle ACE transparently to become
> > prominent?
> > Do you think it is sufficient to handle ACE as ASCII?
>
>Depends on what you mean by 'sufficient'.  But a lot of people, if faced
>with the choice between using a new tool (or a new script) and handling
>IDNs in ACE format, would choose the latter.

Not in Japan, nor in any other place in the world where people
can actually read and understand these characters. Or would
you prefer to edit configuration files full of nonsense sequences
of Greek characters rather than define and use an alias or two?
And if you think you would prefer it that way, do you think
that that's what people would prefer in general? Very hard
for me to believe!


>I agree that UTF-8 is quite well designed, but it does break the common
>assumption that all characters are a single octet.

Which is anyway wrong in large parts of the world.


>anything that uses
>'.' in a regular expression to match a single character will fail given
>UTF-8 as input,

Yes. But how many such regular expressions do you expect
for handling domain names? Can you give a reasonable
example?
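Though to show the pitfall itself (a Python sketch): on raw octets,
"." consumes a single byte and so splits a multi-octet UTF-8
character; on decoded text it consumes a whole character.

```python
import re

label = "\u00fc"                 # "ü", two octets in UTF-8
octets = label.encode("utf-8")   # b"\xc3\xbc"

print(re.match(rb".", octets).group())  # b"\xc3" -- half a character
print(re.match(r".", label).group())    # "ü" -- the whole character
```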


>as will anything that tries to parse input by columns.

Good point. But not exactly. The problems appear when
applications that assume one octet = one column position
are combined with applications that assume one
character = one column position.
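A one-line illustration of the mismatch, using a made-up fixed-width
field:

```python
record = "m\u00fcller  "                 # padded to 8 *characters*
assert len(record) == 8                  # character columns
assert len(record.encode("utf-8")) == 9  # octet columns: one extra
```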


Regards,   Martin.