
Re: [idn] UTF-8 / RACE



This message contains replies to D. J. Bernstein, C C Magnus Gustavsson,
and Eric A. Hall.

"D. J. Bernstein" <djb@cr.yp.to> wrote:

> > telnet `ace NATIVECHARACTERS`
> 
> Even if that were a complete solution

It's not.  It's a workaround that people could use while they're waiting
for stuff to get upgraded.

> what do you do about telnet's open command?

Nothing.  I'd quit telnet and start it again as above.

> or its status display?

Ignore it.  This is just a workaround.  The point is that with ACE,
workarounds are possible.  Without ACE, when things choke on UTF-8
names, you're just out of luck until stuff gets upgraded, and some of
that stuff might not be under your control.

> it would have no advantages over simply fixing gethostbyname().

It would have a few minor advantages:  Installing a standalone program
is easier than upgrading libc.  The upgraded gethostbyname() will work
only if my local DNS server doesn't choke on UTF-8 names, whereas the
ace program doesn't depend on anything outside my system.

But I'll admit, for telnet, the advantages are minor.  Let's try mail.
Upgrading gethostbyname() is not going to allow "mail user1@IDN.org
user2@foo.bar" to work, because the mail program still has to get
IDN.org into the header somehow.  Putting UTF-8 directly into the header
violates RFC 822 and RFC 821, and even if it somehow gets to user1
intact, it might not get to user2 intact, and even if it does, user2's
mail program might choke on it.  Without ACE, it's simply not going
to work until I upgrade my mail program, and user2 upgrades his mail
program, and all the intervening SMTP servers get upgraded.  But with
ACE, I have a workaround that doesn't depend on anyone else upgrading
anything: mail user1@`ace IDN.org` user2@foo.bar
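A minimal sketch of what such an "ace" program might look like, assuming
Python.  The function name to_ace is hypothetical, and Python's built-in
"idna" codec (Punycode-based) is used here only as a stand-in for whatever
ACE is eventually chosen; it is not RACE.

```python
#!/usr/bin/env python3
"""Hypothetical `ace` helper: print the ASCII-compatible form of each name."""
import sys


def to_ace(name: str) -> str:
    # Assumption: the "idna" codec stands in for the ACE under discussion.
    # It converts each non-ASCII label to an ASCII-only label and leaves
    # plain ASCII labels unchanged.
    return name.encode("idna").decode("ascii")


def main(argv):
    # Print one converted name per command-line argument, so the output
    # can be substituted into another command with backquotes.
    for name in argv[1:]:
        print(to_ace(name))


if __name__ == "__main__":
    main(sys.argv)
```

Because the conversion is purely local, it involves no DNS server, libc,
or mail software on anyone else's machine.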

> Except, of course, for all the programs that _already_ work with
> UTF-8.

Yes, the advantage of UTF-8 is that some programs already use it
natively, and they might be able to use IDNs by accident.  But there are
also many programs that don't use UTF-8.  I use a terminal emulator and
a text editor and a web browser all written to support Japanese (kterm,
nvi-m17n, w3m), and none of them supports UTF-8.  They could all benefit
immediately from the sort of ACE workarounds described above.

Remember, I'm not arguing against UTF-8; I'm merely arguing in favor of
ACE.  I'm certainly willing to consider an architecture where both are
used.

C C Magnus Gustavsson <mag@lysator.liu.se> wrote:

> The problem of characters which can't be displayed should be solved
> in exactly the same way it is today. Which is: Not standardized. Some
> systems or programs can use a default character. Others, like Emacs,
> could use a backslash ('\') and a coding point number. And so on.

Okay, I'll back off a little on my claim that the ACE should be
displayed when the name contains unsupported characters.  Ideally it
should be a user-configurable option.  I personally would choose the
ACE, so that I could cut and paste it.

> Since the coding point is stored internally even though it can't be
> displayed, there should be no problems with copying and pasting it.

I agree that there "should" be no problems, but there will be.  In my
experience, copying text with the mouse usually gives you exactly what
is displayed; very rarely does the application intercept the operation,
because it's usually being handled entirely by the GUI library.

> I think it's obvious that [ACE] should eventually be phased out.

It's not obvious to me.  Someone (Keith?) mentioned archived mail.
Should those addresses arbitrarily stop working someday?

> Since you're not talking about the Roman alphabet here, I suggest you
> use the term "English alphabet" instead.

<offtopic> I rarely hear the phrase "English alphabet".  When I do, I
think to myself, "There is no English alphabet.  English uses the Roman
alphabet."  The phrase "Roman alphabet" is the more common name for the
letters A-Z.  "English alphabet" sounds arrogant to me, since England
merely inherited the alphabet from the Roman Empire.  The most formally
correct phrase might be "modern Roman alphabet". </offtopic>

"Eric A. Hall" <ehall@ehsco.com> wrote:

> Domain names show up all over the place. lpq listings, netstat,
> etc. Even in a GUI environment where font management was simpler, not
> much of this stuff would convert ACE for output purposes. Opinion.

Agreed.  If netstat gets back UTF-8 from gethostbyaddr() and blindly
outputs it, it will look nice if it happens to be running in a UTF-8
environment, and will produce useless garbage otherwise.  If it gets
back ACE, it will produce garbage in all environments, but potentially
useful garbage (because it can be copied into other applications).
There will be problems of one sort or another until netstat is upgraded
to call a function that returns the domain name in the local charset.
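A rough sketch of such a function, assuming Python and assuming the
resolver hands back ACE.  The name display_name is hypothetical, and the
"idna" codec again stands in for the real ACE.  When the local charset
cannot represent the name, this sketch falls back to the ACE form, which
at least remains copyable.

```python
def display_name(ace_name: str, charset: str) -> bytes:
    """Return the bytes to show for ace_name on a terminal using charset."""
    # Assumption: Python's "idna" codec stands in for the ACE decoder.
    unicode_name = ace_name.encode("ascii").decode("idna")
    try:
        # Preferred case: the local charset can represent every character.
        return unicode_name.encode(charset)
    except UnicodeEncodeError:
        # Fallback: show the ACE form rather than unusable garbage.
        return ace_name.encode("ascii")
```

Whether the fallback should be the ACE or a placeholder character is the
sort of thing that could be a user-configurable option.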

AMC