
Re: [idn] Why we can go directly to UTF-8



> At 00:09 01/05/24 -0400, Keith Moore wrote:
> >It's really quite simple.
> >
> >If we use UTF-8 names, each component in a signal path that handles
> >an IDN has to be upgraded before the application will work with IDNs.
> 
> Each component *may* have to be upgraded. Some may not need an
> upgrade. For many, 8-bit transparency will do just fine.

the reason I say that each component has to be upgraded is that for
every component at least one of the following is likely to be true:

- it needs to be upgraded to keep from breaking the next component 
  by sending it utf-8 when it's not ready to deal with it

- it needs to be upgraded to do nameprep before making a query
  so that the existing DNS servers will work in the face of various
  representations of the IDN

- it needs to be upgraded to be able to correctly compare utf-8
  names in the protocol  with names in a configuration file

- it needs to be upgraded to treat UTF-8 as UTF-8 and not as some
  other locale-specific character set.
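to make the last three points concrete, here is a rough sketch in
Python.  the name "bücher" and the helper rough_prep() are made up for
illustration, and NFKC plus case folding is only a crude stand-in for
nameprep, not the real thing.  it shows two representations of the
same name that differ on the wire, and what a component that assumes
a Latin-1 locale makes of the UTF-8 bytes.

```python
import unicodedata

def rough_prep(label: str) -> str:
    # ASSUMPTION: NFKC + casefold only approximates nameprep;
    # it is used here purely to illustrate the comparison problem.
    return unicodedata.normalize("NFKC", label).casefold()

# two ways a user or a configuration file might spell the same name
composed = "b\u00fccher"      # u-umlaut as one code point
decomposed = "bu\u0308cher"   # plain u followed by a combining diaeresis

wire_a = composed.encode("utf-8")
wire_b = decomposed.encode("utf-8")

# a component that compares raw bytes sees two different names
bytes_equal = wire_a == wire_b

# a component that prepares both names first sees one name
prepped_equal = rough_prep(composed) == rough_prep(decomposed)

# a component that treats the bytes as Latin-1 instead of UTF-8
misread = wire_a.decode("latin-1")

print(bytes_equal, prepped_equal, misread)
```

the byte comparison fails, the prepared comparison succeeds, and the
Latin-1 reading turns the name into something the user never typed.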

> >For email, this means every UA, MTA, message store, mail filter,
> >mailing list, etc that uses the addresses in the header or
> >envelope of a message.  For the web, this means every web browser,
> >proxy, cache, and origin server that makes use of domain names
> >in the request or response (header or payload).
> 
> 'every' is too general. It's just those in the relevant paths.

that's why I said "every component in the signal path".
of course, an IDN solution that works only for a few signal
paths isn't terribly useful.

> For the average web case, it's just the browser and the server.
> Proxies can be changed if necessary.

the browser and every server that the user wants to contact which
uses IDNs, and possibly the web proxies. and the dns servers and 
their proxies also.

> >For both cases,
> >it means that every DNS query library, resolver, cache, and server
> >involved in the lookups supports UTF-8 also (unless you believe
> >that the existing ones will already support UTF-8 without protocol
> >extensions, which is far from a given).  There's little incentive
> >to upgrade because so many other components need to be upgraded
> >before you can get reliable operation.
> 
> I very much doubt your last sentence. Companies that are interested
> in being found with an idn will obviously upgrade their DNS and
> web servers (if necessary). Users interested in using idns will
> upgrade their browsers (if necessary). 

or the user will upgrade his browser and not see any improvement,
because few companies support it, and because his service provider
or LAN has a proxy that prevents it from working.  the service 
might even be worse than before because the user will see IDNs
in print and other media and try to type them in, and they will fail
because one component or another hasn't been upgraded. and the user 
will say "why bother?"

and the web is easier to upgrade than most applications.

the user will upgrade his email client and try to use IDNs in his
address and find that lots of folks cannot reply to them, and that
sometimes his mail doesn't get delivered or gets damaged in transit.
and his recipients' postmasters will rightly blame the sender of mail 
containing IDNs because that mail violates the protocol specifications. 
and once again the user will say "why bother?"  


> If people will be using
> idns as frequently as we all think, the missing bits will be
> filled in quite quickly.

OTOH, if people find that IDNs don't work reliably, and if they cause 
things to break, IDNs will get a bad reputation and users will
learn to avoid them.

> >If we use ASCII compatible names, each component in a signal path
> >that handles a domain name can upgrade independently, and things
> >will keep working - they just won't display the name as nicely if
> >they're not updated.    And only the components that interface with
> >users need to be upgraded before the users see a benefit.
> 
> This is an end-to-end problem, and the end is the user, not some
> system. A system that gives (people like you) the impression that
> it works, but displays ASCII garbage, is a total failure.

if you believe that then there is no possible successful solution.

an IDN will display as garbage on a system that doesn't support IDNs.
it doesn't matter whether it's in ACE or UTF-8, it still looks like 
garbage.

but ACE garbage doesn't mess up my display formatting; UTF-8 garbage
will.  ACE garbage doesn't cause parse errors in existing software;
UTF-8 garbage will.  ACE garbage can be looked up in DNS even if
the software can't display it; UTF-8 garbage is less likely to work.
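a small sketch of the parsing point, in Python.  the letter-digit-
hyphen check below is my guess at the kind of hostname syntax test
that existing software applies, not any particular program's code,
and Python's built-in "idna" codec is just a convenient ACE to
demonstrate with; this group has not settled on an encoding, and
"bücher.example" is a made-up name.

```python
import re

# ASSUMPTION: a typical legacy check that each label is letters,
# digits and hyphens, not starting or ending with a hyphen.
LDH_LABEL = re.compile(rb"^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$")

def legacy_accepts(name: bytes) -> bool:
    """Return True if every dot-separated label passes the LDH check."""
    return all(LDH_LABEL.match(label) for label in name.split(b"."))

idn = "b\u00fccher.example"

ace_form = idn.encode("idna")    # ASCII-compatible encoding of the name
utf8_form = idn.encode("utf-8")  # the same name as raw UTF-8 octets

print(ace_form, legacy_accepts(ace_form))
print(utf8_form, legacy_accepts(utf8_form))
```

the ACE form passes through the old check untouched, even though the
software has no idea what the name means.  the UTF-8 form is rejected
before it ever gets near a DNS query.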

> *Nobody* who has a choice between let's say 'toshiba.co.jp' and
> some garbage like 'xyz--ttnhpur83g4prhoaunh3.co.jp' (rather
> than something like TOUSIBA.co.jp, imagine upper-case as kanji)
> will ever want to use the latter. It's completely useless for
> humans.

probably true.  I suspect that at first people will use IDNs 
sparingly, regardless of what representation is chosen. 
 
> >It's easier to get real IDN support into the various components
> >using ASCII compatible names because fewer components need to be
> >upgraded.  And the incentives for adoption are greater with ASCII
> >names because the benefit of upgrading will be seen sooner.
> 
> No, what will happen is that the problems will be seen sooner,
> and people will complain. 

perhaps.  but the logical extension of this argument is that we
should avoid deploying IDNs as long as possible since that way
we can put off seeing the problems for as long as possible.

> Using UTF-8 will help to make sure
> that clients are upgraded before idns are used

how do you figure that?  the clients have to be upgraded in either
case before the user sees a benefit.  and there's no incentive
to upgrade the clients to support IDNs before they can be used.
I suppose the vendors might just update their clients to support
IDNs in the next release cycle,  and eventually most computers
will be running a release that supports IDNs.  But that could
take a decade or more.

meanwhile, we don't want use of IDNs to cause things to break.
nor do we want to get a bunch of software widely deployed
after many years' time and only then realise that it has
serious bugs or incompatibilities.  (some vendors might try to
do this, believing it would give them a competitive advantage)

> >Users won't care about whether the applications protocols represent
> >IDNs in ACE or UTF-8.  But they will care about whether their
> >applications support IDNs.  ACE lets them do so far more quickly.
> 
> Users care very much whether they see things in their script or they
> see some ACE garbage. A half-way solution is not a solution, and ACE
> will expose a half-way solution for a long time. Very ugly.

so will UTF-8, and it's also very ugly.  

humans will see the ugliness either way.  you are correct that
IDNs are being done for the benefit of humans. but the difference 
between the ACE approach and the UTF-8 approach is that the ACE 
approach allows existing software to continue to work reliably
(even if it looks ugly), while the UTF-8 approach causes existing 
software to fail randomly (while also being ugly).

> Of course, most probably to you ASCII garbage looks better than
> Arabic, Chinese, Japanese, Korean, Hebrew, or other scripts.

it probably does.  but for lots of people, the UTF-8 
version won't display in Arabic, Chinese, Japanese, etc. either -
it will display as unprintable garbage rather than the ACE
version which will display as printable garbage.

> So again, I would like to remind you: If users see ACE garbage,
> this is a protocol failure in an end-to-end system, not something
> we can gloss over lightly.

fine, I'll accept that definition for the purpose of argument.
but that means that if users see UTF-8 as garbage, that's an even 
worse protocol failure - since the UTF-8 is less likely than ASCII
to be displayable at all.  (which follows because the ASCII 
characters used in ACE are a strict subset of what UTF-8 can 
encode: anything that can display UTF-8 can display ACE, but not 
the other way around)
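the subset argument can be checked directly.  a rough sketch in
Python: the ACE-like string is the made-up one quoted earlier in this
message, the list of charsets is an arbitrary sample I picked, and the
two kanji are only a stand-in for a name in some non-ASCII script.

```python
# the made-up ACE name from the quoted text above
ace_like = b"xyz--ttnhpur83g4prhoaunh3.co.jp"

# ASSUMPTION: an arbitrary sample of charsets a system might be using
charsets = ["ascii", "utf-8", "latin-1", "shift_jis", "euc_kr"]

# the ACE bytes read the same no matter which charset is assumed
ace_views = {cs: ace_like.decode(cs) for cs in charsets}
ace_is_stable = len(set(ace_views.values())) == 1

# a name in UTF-8 (two kanji followed by ".co.jp")
original = "\u6771\u829d.co.jp"
utf8_name = original.encode("utf-8")

# an ASCII-only system cannot decode it at all
try:
    utf8_name.decode("ascii")
    utf8_fails_ascii = False
except UnicodeDecodeError:
    utf8_fails_ascii = True

# a Latin-1 system decodes it, but into a different string
latin1_view_differs = utf8_name.decode("latin-1") != original

print(ace_is_stable, utf8_fails_ascii, latin1_view_differs)
```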

you have now defined terms in such a way that success is impossible.

you've also defined success criteria in such a way that having
the names look right is more important than having them work 
reliably when used.  which seems pretty strange to me.

Keith