
[idn] Transparent vs. ACE representations (was We are quibbling about WHAT?)



-----BEGIN PGP SIGNED MESSAGE-----

Dave Crocker wrote:
> Content-Type: text/plain; charset=us-ascii; format=flowed
> ...
>
> At 11:55 AM 7/30/2001, David Hopwood wrote:
> >The debate between ACE and UTF-8 has nothing to do with encoding
> >efficiency. The benefit of UTF-8 is in having DNS names in the same
> >encoding as the surrounding text, rather than having to treat them as a
> >special case.
> 
> "surrounding text"???  you mean like this message?

Yes, like that message: the domain name "www.brandenburg.com" and the
email address "dcrocker@brandenburg.com" were in the same encoding as the
surrounding text. That's why they looked right when I viewed the message,
rather than looking like gibberish.

But, you may say, that encoding wasn't UTF-8. Doesn't matter. Suppose for
the sake of argument that the message had contained characters outside the
US-ASCII repertoire (in a domain name or elsewhere). In that case, your
mail client would have automatically chosen an encoding that is capable
of representing those characters. That still might not be UTF-8, but the
important point is that a text message or file can always be losslessly
transcoded into any UTF, including UTF-8 [*]. So any domain names in
the message (whether they are pasted, typed, come from a .signature
file, or whatever) will both "look right" and be losslessly convertible
to the format needed for a lookup. They can also
be pasted into other applications, and will still look right - even in
applications that don't and shouldn't need to know anything about IDNs.

[*] The transcoder may or may not be NFC-normalizing; that doesn't matter
    because nameprep(NFC(X)) = NFC(nameprep(X)) = nameprep(X) for all X.
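
To make the transcoding claim concrete, here is a minimal Python sketch.
The message text and the name "bücher.example" are made up for
illustration, and ISO-8859-1 is only an assumed choice of source charset.
The second half checks the normalization identity using NFKC alone as a
stand-in; real nameprep does more than NFKC (case mapping, prohibited
characters), so this illustrates the footnote rather than proving it.

```python
import unicodedata

def transcode(payload: bytes, src: str, dst: str = "utf-8") -> bytes:
    """Decode from the tagged charset and re-encode in the target encoding."""
    return payload.decode(src).encode(dst)

# Illustrative text; the domain name is hypothetical.
message = "See http://bücher.example/ for details."

# Assume the mail client chose ISO-8859-1 because it covers the text.
latin1_bytes = message.encode("iso-8859-1")
utf8_bytes = transcode(latin1_bytes, "iso-8859-1")

# Decoding the UTF-8 form yields exactly the original characters.
round_trip = utf8_bytes.decode("utf-8")

# A decomposed spelling of the same name: "u" + COMBINING DIAERESIS.
decomposed = "bu\u0308cher.example"

# NFKC stands in for the normalization step of nameprep here.
nfkc = unicodedata.normalize("NFKC", decomposed)
nfkc_after_nfc = unicodedata.normalize(
    "NFKC", unicodedata.normalize("NFC", decomposed)
)
nfc_after_nfkc = unicodedata.normalize("NFC", nfkc)
```

Whether or not the transcoder applies NFC first, the NFKC result is the
same string, which is the property the footnote relies on.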


Although most of the discussion on this list has been categorising the
possible architectures as "UTF-8-only", "UTF-8 + ACE", and "ACE-only",
this misses the point slightly, and I should have been clearer about
that in my previous message. Call a representation of a domain name
within some surrounding text (or other protocol elements) "transparent"
if it is a character string in which each character stands for the same
character in the domain name. I.e.:

                charset
  octet string  <----->  character string     =     domain name

as opposed to:

                charset                      ACE
  octet string  <----->  character string  <----->  domain name

A more fundamental categorisation of IDN architectures is then
"transparent-only" vs. "transparent + ACE" vs. "ACE-only".

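The two diagrams can be sketched in Python as follows. This is only an
illustration under stated assumptions: the name "bücher.example" is
hypothetical, and Python's built-in "idna" codec (which implements the
later IDNA 2003 / Punycode scheme) is used as a stand-in for whichever
ACE is eventually chosen.

```python
name = "bücher.example"  # hypothetical domain name

def via_charset(domain: str, charset: str) -> str:
    """Transparent path: octets <-> characters, and the characters ARE the name."""
    octets = domain.encode(charset)
    return octets.decode(charset)

def via_ace(domain: str) -> str:
    """ACE path: the character string seen in the text is not the name itself."""
    # The "idna" codec is a stand-in ACE, not the one debated on this list.
    return domain.encode("idna").decode("ascii")

# Transparent: any capable charset gives back the name, character for character.
transparent = [via_charset(name, cs) for cs in ("utf-8", "iso-8859-1")]

# ACE: an extra decoding layer is needed to get from the text to the name.
ace_text = via_ace(name)
recovered = ace_text.encode("ascii").decode("idna")
```

In the transparent case every entry of `transparent` equals `name`; in
the ACE case `ace_text` differs from `name` and only the additional
decode step recovers it.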

The representation of LDH domain names has always been transparent: even
if you transcode a message containing domain names into, say, EBCDIC,
each character still stands for itself in the EBCDIC encoding.
Similarly, in proposals that support transparent non-ASCII domain
names, there will be no difficulty with transcoding a message into
UTF-16, UTF-EBCDIC, or anything else that is capable of representing
all the characters in that message.

[I use EBCDIC and UTF-EBCDIC as examples, to make the point that it's not
the fact that the encoding is a superset of US-ASCII that is relied on;
only that it is capable of representing all the characters in the name
(up to Unicode compatibility equivalence), and all the characters in the
surrounding message text. Of course, using an encoding that is a superset
of US-ASCII is very convenient in practice, and a UTF will always be able
to represent the necessary characters - hence the attractiveness of UTF-8.]
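
A short Python sketch of the EBCDIC point, assuming the "cp037" codec
(one EBCDIC code page that ships with Python) as a representative EBCDIC
encoding. UTF-EBCDIC has no standard-library codec, so UTF-16 is used for
the non-ASCII case. The name "bücher.example" is made up; the helper
`can_represent` is hypothetical.

```python
def can_represent(text: str, charset: str) -> bool:
    """True if every character of text is encodable in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

ldh_name = "www.brandenburg.com"
idn_name = "bücher.example"  # hypothetical non-ASCII name

# EBCDIC is not a superset of US-ASCII: the octets differ completely...
ebcdic_octets = ldh_name.encode("cp037")
ascii_octets = ldh_name.encode("ascii")

# ...yet each character still stands for itself after decoding.
ldh_round_trip = ebcdic_octets.decode("cp037")

# What matters is repertoire coverage, not ASCII compatibility.
idn_in_utf16 = can_represent(idn_name, "utf-16")
idn_in_ascii = can_represent(idn_name, "ascii")
idn_round_trip = idn_name.encode("utf-16").decode("utf-16")
```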


The thesis behind the ACE-only proposals seems to be that most existing
protocols, formats, and implementations assume that domain names only
use US-ASCII characters, and so using a non-ASCII representation will
break them. In the vast majority of cases, that simply isn't true: most
things don't assume that the representation is ASCII, but they *do* very
often assume that it is transparent. The main flaw in the argument that
ACE is just another encoding is that it ignores how important
transparency is as a simplification.

> sorry, no.  DNS strings occur in many situations.  Most do not, today,
> use UTF-8,

I've been searching through all RFCs to try to produce a list of all
the IETF-standardized protocols where domain names are encoded (including
email addresses and URIs/URLs), and actually quite a large proportion of
them use UTF-8 already for the rest of the protocol. There are very, very
few protocols that require use of a different 8-bit character set.
Overall, I don't see anything that would interfere with straightforwardly
adopting UTF-8 as the preferred encoding for domain names, with ACE just
used for a handful of problematic cases such as SMTP.

In the case of protocols where messages are tagged with their charsets,
receiving software might have to transcode a message from whatever it is
tagged as into a UTF before further processing. However, that is what a
Unicode-based program would normally do anyway, and after that, the entire
message (or file, or whatever) can be treated uniformly. ACE domain names,
OTOH, will always have to be treated as a special case. So transparent
solutions have an advantage over ACE-only solutions even when the encoding
is not initially UTF-8.
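
As a rough illustration of the difference, consider a program that just
searches message text for a domain name. The sketch below is hypothetical
throughout: the message body, the name "bücher.example", and the helper
names are invented, and Python's "idna" codec again stands in for an ACE.

```python
def decode_tagged(payload: bytes, charset: str) -> str:
    """Generic step: transcode from the tagged charset into Unicode."""
    return payload.decode(charset)

def mentions(text: str, domain: str) -> bool:
    """Plain substring search; knows nothing about IDNs."""
    return domain in text

def mentions_ace_aware(text: str, domain: str) -> bool:
    """Special case: try to ACE-decode every token before comparing."""
    for token in text.split():
        try:
            if token.encode("ascii").decode("idna") == domain:
                return True
        except UnicodeError:
            continue  # token is not ASCII or not a decodable name
    return False

domain = "bücher.example"
body = f"Mirror at {domain} is up."

# Transparent names: one generic decode, then uniform processing.
tagged = [(body.encode(cs), cs) for cs in ("iso-8859-1", "utf-8", "utf-16")]
transparent_hits = [mentions(decode_tagged(p, cs), domain) for p, cs in tagged]

# ACE names: the generic search misses; only IDN-aware code finds the name.
ace_body = f"Mirror at {domain.encode('idna').decode('ascii')} is up."
ace_hit_plain = mentions(ace_body, domain)
ace_hit_aware = mentions_ace_aware(ace_body, domain)
```

The generic search succeeds for every tagged charset, while the ACE form
is found only by the code path written specifically for it.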

If anything, maximum name lengths are a bigger compatibility problem than
encodings, and in that respect there is little difference between UTF-8
and the proposed ACEs (some languages win, some lose, but either UTF-8 or
AMC-ACE-Z would be adequate in practice).

> As to predictions of the future, well, they are always wrong.

That's obvious nonsense. It is a perfectly safe prediction that the
default encoding (often the only encoding) used in almost all Internet
protocols in the near future will be UTF-8. In a sense there is nothing
to predict; the decision has already been made in RFC 2277 / BCP 18, and
the process of making the necessary modifications to existing standards
is already well underway.

- -- 
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBO3HjuTkCAxeYt5gVAQGH5Af/U8UodcH1fleoT/ug7jbS8S6aZixQrEEa
oAAPjxUMGyxthjqwcbPVluWLuyWGiJ79q9JFIx1ub6VJ8QvEa1lAXU/Z1KGIOTmD
9w2TOZD6qt+NYfFS5PNW/GsgjEcOMKOaSKPGHP2q21V8fw5CuJGx5PzPWDKpANUd
GRHLxeclUXabQfIgyQL6X9WSj9QyRIClb3bmZEqhRazhwxRTa6hTPZVbRFCW0X02
LdmawzjXwIiPOf/hbW+TNRjn9FyUf7R5B5M/56rBAIYJmhP8Er6cQJ048fkXsrhH
3RdkusyVIjK4kZXKC89rIGsLeXyZzXPv7DrvBovz8EyD48IwJW+RbQ==
=Sp8e
-----END PGP SIGNATURE-----