
Re: [idn] UTF-8 as the long-term IDN solution



I've had some further thoughts regarding that model I proposed, some
of which are similar to thoughts that have been posted in this thread.
First, here's a quick recap of the model:

IDNs can be represented in a multitude of ways, but there are two
especially customary representations: ACE and UTF-8.  Old protocols
use ACE on the wire; new protocols use UTF-8 on the wire.  Evolving
protocols need to either stick with ACE or handle the transition
cleanly, perhaps by negotiating capabilities.  DNS must accept both ACE
queries (returning ACE responses) and UTF-8 queries (returning UTF-8
responses).
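
To make the recap concrete, here is a rough sketch in Python of the two
customary forms of one hypothetical name, assuming for illustration that
the ACE in question is the Punycode-based "xn--" encoding that Python's
built-in "idna" codec happens to implement:

    # Two representations of the same hypothetical IDN.  The ACE form here
    # is the Punycode-based "xn--" encoding provided by Python's "idna"
    # codec; whichever ACE is finally chosen, the relationship is the same.
    name = "b\u00fccher.example"        # the IDN as a Unicode string

    utf8 = name.encode("utf-8")         # UTF-8 form: b'b\xc3\xbccher.example'
    ace  = name.encode("idna")          # ACE form:   b'xn--bcher-kva.example'

    # Both forms decode back to the same name.
    assert utf8.decode("utf-8") == ace.decode("idna") == name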

It's the DNS part that I've been thinking about.  The ACE
queries/responses are obviously needed for backward compatibility, and
they require no changes to the infrastructure.  But how are the UTF-8
queries going to work?

I can think of two approaches:  Either segregate UTF-8 queries into a
new kind of request (using a protocol extension or a new class or new
resource records or whatever), or allow UTF-8 in regular DNS queries.

If UTF-8 queries use a new type of request, then they can be used
only if both the client's resolver and the local DNS server have been
upgraded to support the new query format.  If you're upgrading the
resolver, you might as well include an ACE encoder/decoder, because
ACE queries will always work, even before the local DNS server has
been upgraded.  And once you have an ACE encoder/decoder, why bother
implementing the new query format?  Applications will never know or care
which format the resolver is using.  Even if every other protocol and
application migrates to UTF-8 and forgets how to encode and decode ACE,
the resolvers and DNS servers could still be speaking ACE to each other
and the other protocols/applications would be oblivious.
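
As a rough sketch of that last point, a resolver front end could accept
Unicode/UTF-8 names from applications and always put ACE on the wire, so
the application never learns which form was used.  The resolve() name
below is hypothetical, and the ACE conversion again leans on Python's
"idna" codec:

    import socket

    def resolve(name):
        """Look up an IDN given as a Unicode string; the caller never
        sees which form goes on the wire."""
        ace = name.encode("idna").decode("ascii")   # convert to ACE
        return socket.gethostbyname(ace)            # ordinary ASCII query

    # The application just passes the Unicode/UTF-8 form, e.g.:
    #   resolve("b\u00fccher.example")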

The other approach, allowing UTF-8 in regular DNS queries, has both
advantages and disadvantages.  The main advantage is that some existing
applications will work, to some extent, by accident.  In particular, if an
application uses UTF-8 natively, and its resolver is 8-bit clean, and
its DNS servers are 8-bit clean, and the UTF-8 encoding of an IDN fits
in 63 bytes (which won't always happen--UTF-8 is about 50% larger than
ACE for some scripts), then a lookup of the IDN will work by accident.
Of course, if this application talks to other applications that don't
use UTF-8 natively, there may be other problems, but let's focus on
DNS for the moment.  One disadvantage of this approach is an increased
opportunity for spoofing:  If some applications accidentally send
UTF-8 in DNS queries, then surely other applications will accidentally
send other 8-bit encodings, some of which are coincidentally valid
UTF-8.  But it could be argued that any existing application that looks
up an 8-bit domain name is asking for trouble and gets what it deserves,
except that applications using UTF-8 will sometimes be pardoned.
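
To illustrate the length comparison, here is a rough sketch (again using
Python's "idna" codec for the ACE form; the Greek sample label and the
helper names are hypothetical):

    LABEL_LIMIT = 63   # maximum length of one DNS label, in bytes

    def label_sizes(label):
        """Return (utf8_bytes, ace_bytes) for a single label."""
        return len(label.encode("utf-8")), len(label.encode("idna"))

    # Eight Greek letters: 16 bytes of UTF-8, a somewhat shorter ACE form.
    utf8_len, ace_len = label_sizes(
        "\u03b5\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac")

    print(utf8_len <= LABEL_LIMIT, ace_len <= LABEL_LIMIT)   # both fit here
    # For longer non-ASCII labels the UTF-8 form can exceed 63 bytes while
    # the ACE form still fits, which is the case worried about above.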

Does anyone find this accidentally-works phenomenon to be a compelling
reason to allow UTF-8 in regular DNS queries?

It looks to me like allowing UTF-8 in regular DNS queries may be more
trouble than it's worth, and creating a new query format just for UTF-8
is certainly more trouble than it's worth.  If someday there is a
pressing need for a new query format for some other reason, we might as
well make it use UTF-8, but UTF-8 itself is not a sufficient reason to
introduce a new format now.

Please remember that even if DNS sticks with ACE for now, that doesn't
impede any other application/protocol from using UTF-8.  It's quite
orthogonal.

AMC