
Re: [idn] Transparent vs. ACE representations



-----BEGIN PGP SIGNED MESSAGE-----

Dave Crocker wrote:
> At 02:13 AM 8/9/2001, David Hopwood wrote:
> >But, you may say, that encoding wasn't UTF-8. Doesn't matter.
> 
> well, actually it does.
> 
> >Suppose for the sake of argument that the message had contained characters
> >outside the US-ASCII repertoire (in a domain name or elsewhere).
> 
> If the text switches from one character set to another,

When did I say anything about switching charsets within a text? The example
you chose didn't support your argument because the message only contained
characters from the US-ASCII repertoire (and so it obviously wouldn't show
up differences between any IDN proposals). I'm talking about a message that
includes some abstract characters with Unicode code points > 0x7F.

> then no, it is unlikely the processing software will handle that well.
> If you mean that the text is all in the UTF-8, then, no, it isn't.

If I had meant that, I would have said that.

> Mine is in ASCII. Hence your suggestion means we would have mixed Ascii, UTF-8.

Mixed ASCII and UTF-8 is simply UTF-8, since every US-ASCII character has the
same single-byte encoding in UTF-8. That is one reason why it would have been
much clearer for you to use a different example.
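
To make the subset relationship concrete, here is a minimal sketch in Python
(chosen only for illustration; the sample strings and the helper name
is_same_in_utf8 are made up). It checks that US-ASCII text has the same bytes
in both encodings, and that ASCII bytes followed by UTF-8 bytes decode cleanly
as UTF-8.

```python
def is_same_in_utf8(text: str) -> bool:
    """True if text is pure US-ASCII and has identical bytes in UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains a character outside the US-ASCII repertoire.
        return False


# Hypothetical sample data: an ASCII run followed by a UTF-8 run.
ascii_part = "Mine is in ASCII. ".encode("ascii")
utf8_part = "na\u00efve caf\u00e9".encode("utf-8")

# The concatenation is itself well-formed UTF-8, so one decoder handles it.
mixed = ascii_part + utf8_part
decoded = mixed.decode("utf-8")
```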

More generally, the sending MUA can choose any single encoding for a MIME text
part, but a properly internationalized MUA should always choose an encoding
that is able to represent all the characters that are used in the text. [It's
possible to use multiparts to switch charsets in a somewhat clunky fashion,
but practically no user agents actually do that. If a compliant receiving MUA
did come across such a message, it would interpret each part according to its
tagged charset, though, so there would be no problem.]

For example, if the message contains only characters from the ISO-8859-1
repertoire, then a compliant MUA might send it as "text/plain; charset=iso-8859-1".
In that case a transparent representation means that any domain names, email
addresses, URIs, etc. will also be ISO-8859-1. The receiving internationalized
MUA transcodes the message into a Unicode encoding (a UTF such as UTF-8 or
UTF-16), so that if the user clicks on a link or cuts/copies text from the
message, it will be encoded correctly. In no case does an MUA (or any other
application) have to do any additional work in order to display transparently
represented domain names properly.
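
The receiving side of that example can be sketched roughly as follows, in
Python for illustration only. The function name transcode_to_utf8 and the
domain name in the sample body are hypothetical; a real MUA would take the
charset from the part's Content-Type header rather than from a literal.

```python
def transcode_to_utf8(body: bytes, charset: str) -> bytes:
    """Decode a text part using its tagged charset, then re-encode as UTF-8."""
    return body.decode(charset).encode("utf-8")


# Hypothetical body of a "text/plain; charset=iso-8859-1" part. The domain
# name is transparent: it is encoded in the same charset as the rest.
wire_body = "See http://caf\u00e9.example/ for details".encode("iso-8859-1")

# After transcoding, the whole text, domain name included, is UTF-8.
# No step specific to domain names was needed.
utf8_body = transcode_to_utf8(wire_body, "iso-8859-1")
```

The point of the sketch is that the domain name gets no special treatment: the
generic charset handling that the MUA needs anyway also covers it.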

> The real point is that you appear to be assuming a much more simplified and
> consistent processing environment for characters, across applications, than
> actually exists.

I'm assuming that users who want internationalization support will use
software that provides that support. All the proposals assume that.

> >Call a domain name representation within
> >some surrounding text (or other protocol elements) "transparent", if the
> >representation is of a character string such that each character in the
> >string stands for the same character in the domain name.
> 
> By that definition, UTF-8 is not transparent.

Again you're missing the point. Transparency is a property of "a domain name
representation within some surrounding text"; it's not dependent on which
charset is used to represent that text. For any charset X, a domain name
encoded using X is transparent if it appears in a message or protocol that
also uses X. That property never applies to a name encoded as ACE, because
ACE will never be used as a general-purpose charset.
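
As a rough illustration of that definition, the Python sketch below tests
whether a name appears in a message encoded in the message's own charset. The
helper is_transparent, the sample domain, and the byte-substring check are all
simplifications made up for this sketch. Python's built-in "idna" codec is
used only as a stand-in for an ACE; it is an assumption here and is not any
particular proposal under discussion.

```python
def is_transparent(message: bytes, charset: str, domain: str) -> bool:
    """True if the domain occurs in the message encoded in the message's charset."""
    try:
        return domain.encode(charset) in message
    except UnicodeEncodeError:
        # The charset cannot represent the name at all.
        return False


domain = "b\u00fccher.example"  # hypothetical name with one non-ASCII character

# Transparent: the name is encoded in charset X inside text that also uses X.
latin1_msg = f"See {domain} for details".encode("iso-8859-1")
utf8_msg = f"See {domain} for details".encode("utf-8")

# ACE stand-in: the name is carried as ASCII letters that are not the
# characters of the name itself, so it is not transparent in any charset.
ace_name = domain.encode("idna")
ace_msg = b"See " + ace_name + b" for details"
```

With these definitions, is_transparent(latin1_msg, "iso-8859-1", domain) and
is_transparent(utf8_msg, "utf-8", domain) both hold, while
is_transparent(ace_msg, "ascii", domain) does not.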

[...]
> >The thesis behind the ACE-only proposals seems to be that most existing
> >protocols, formats, and implementations assume that domain names only
> >use US-ASCII characters, and so using a non-ASCII representation will
> >break them. In the vast majority of cases, that simply isn't true
> 
> The premise has been present in the design of Internet protocols from the
> beginning.  Much has changed since then, but not everything.
> 
> The premise that all software will correctly process UTF-8 has been
> demonstrated to be false.

My argument doesn't depend on all software being able to correctly process
UTF-8. No software correctly processes ACE at the moment. Remember that
the point of IDNs is to be able to enter and display non-ASCII characters
correctly; if that doesn't work, then a more meaningful ASCII alias for the
domain might as well have been used instead.

> > > As to predictions of the future, well, they are always wrong.
> >
> >That's obvious nonsense. It is a perfectly safe prediction that the
> >default encoding (often the only encoding) used in almost all Internet
> >protocols in the near future will be UTF-8.
> 
> I do not know how many such confident predictions of the future you have
> suffered through, but many of us have watched them made repeatedly over the
> last 25 years and, I am sorry to say, they are usually incorrect.  Not
> always, no.  But frequently.

Congratulations; you've gone from "always" to "usually", then "frequently",
all in one paragraph. "frequently" is much better, although I would say
"sometimes". There is also a reporting bias to be taken into account: cases
where the prediction was wrong are more interesting, so they get cited more
often. Confident predictions of the near future that turn out to be absolutely
correct are rarely commented on.

In any case, my statement stands: "It is a perfectly safe prediction that
the default encoding (often the only encoding) used in almost all Internet
protocols in the near future will be UTF-8." Note that the effect of BCP 18
is to block progression on the standards track of new protocols and protocol
updates that don't allow UTF-8, unless an exemption is granted.

- -- 
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBO3MYgzkCAxeYt5gVAQG3QAf+IsBl2Nju0VOaTByWUMndLcsMbM+AZhIB
q29heZ9jRP82Eodcqhep9o0E6DKEAtWAnw5e7RztMdYuQJO+FHnteuk6rV4hAO3N
wFpm1QyrdXqOo4RJMlspryRYk5kHnpO5AxF3EHuFIBjQNoq/SfZ8j019A7XUVk6z
99AQu6GL1qog0ByOkr38lOiSeTEadpinfDZ23ujXMvngHc3NDINAszoN27izc79S
WU3rzpmOkMCu5e+hr2KLK4XkU5EMoJensqtwFghNL3fOZZun4Dy7Tb3gbBOXYDK4
2LWCwjuYC1qcnQBEuT3/lyb8J41GdnCkKYu/hMvpY7W1udeLVykXzg==
=qIJ/
-----END PGP SIGNATURE-----