
Re: [idn] What's wrong with skwan-utf8?



> We could also remember that e-mail introduced something like ACE,
> it is called quoted-printable (and BASE64). It was introduced for the
> same reason: to preserve 7-bits in e-mail so that applications
> should not break. And now look at the result:
> - yes, it probably did not break some applications.

it seems to have broken very few applications, as compared with (say)
the SMTP extensions mechanism, which was designed to deal with
SMTP's traditional inability to handle 8bit mail by explicit negotiation...
e.g. there were MTAs that couldn't deal with EHLO being sent instead of 
HELO.
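the 7-bit-preserving transformation being discussed can be illustrated
with Python's standard quopri module (a sketch for illustration only;
the sample text is made up):

```python
import quopri

# Quoted-printable keeps a message within 7-bit ASCII by escaping
# each 8-bit byte as =XX (two hex digits).
latin1_text = "héllo wörld".encode("latin-1")
encoded = quopri.encodestring(latin1_text)
print(encoded)  # escaped form, safe for 7-bit transport

# The transformation is reversible, so nothing is lost in transit.
decoded = quopri.decodestring(encoded)
assert decoded == latin1_text
```

the point, of course, is that the escaping is invisible to old software
but must be undone by new software before display.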

> - yes, it works in the old SMTP protocol.
> - but it also introduced a lot of problems:
>   - still after so many years with MIME many applications do not
>     yet handle it or display characters in a user friendly way (using
>     local native characters).

true.  but what this demonstrates is that applications do not 
necessarily get updated quickly - for instance, a great many folks 
are still using ucb mail.  this would seem to support the notion 
that we need to preserve backward compatibility (by using ACE)
rather than break it (by using some non-ASCII-compatible encoding).

>   - and quite often it fails somewhat in some applications so that
>     we can see some of the quoted-printable text or the characters
>     get messed up.

true, and for the same reasons.  a significant number of folks are 
still using 10+-year old MUAs.
 
>   - and what a mess for software developers. identifying what parts
>     of a text is quoted-printable, identifying the encoded character set
>     and decoding it. And doing the reverse when sending.
>     If all text (both headers and text bodies) in e-mail had been in
>     UTF-8, the world would have been so very much easier.

this wasn't even an option at the time MIME was developed - 10646 was 
still very much in flux, and UTF-8 had not been invented yet (or if it
had, it was still quite obscure).

and even had 10646 and UTF-8 been available, it would still have been
necessary to distinguish text from application data, and perhaps even
to support multiple character sets for the encoding of legacy text files.
and it would have still been necessary to encode text for transport
within SMTP and other protocols.  we might have been able to avoid 
the charset parameter for text/plain, but not the rest of it.
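the header-decoding chore described above (find the encoded parts,
identify the charset, decode, reassemble) is what RFC 2047 specifies;
a minimal sketch using Python's standard email.header module (the
sample header value is made up):

```python
from email.header import decode_header, make_header

# An RFC 2047 encoded-word has the form =?charset?encoding?text?=
raw = "Subject line with =?iso-8859-1?q?na=EFve?= text"

# Steps 1-2: locate the encoded-words and identify their charsets.
parts = decode_header(raw)

# Step 3: decode each part and reassemble human-readable text.
readable = str(make_header(parts))
print(readable)
```

every MUA that wants to display such headers correctly has to carry
some equivalent of this machinery.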

> With an ACE in DNS, we will have e-mail with ACE-encoded domain names and
> of course we want to use non-ASCII in the user name part too. So
> an e-mail address could look like:
> ax--ergh45d6.ax--fddgf@yf--sdff.hello.yf--sdfh.com
> and it could be in a header with comments in quoted-printable.
> What a mess to decode. How many applications will break on that?
> How many will fail to get it right? How many will display the mess
> to the user?

if experience with 2047 is any guide, lots of applications will fail to 
get it right.   but how many more applications will "get it right" if
we encode this in UTF-8 instead?   aside from the unpredictable effects
on parsers that (correctly according to the specifications) expect
email headers to be pure-ASCII, and the need to apply nameprep to IDNs
(however they are encoded) before putting them in protocols, the set of 
deployed MUAs which today, without modification, could correctly display 
such headers is a tiny fraction of the installed base.  when I receive 
mail that has 8bit characters in the message header, the characters
tend to be in iso8859/1 or some other 256-element character set rather
than in UTF-8.
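(the "ax--"/"yf--" prefixes quoted above are placeholders; as an
illustration only, here is how the ACE later standardized for IDNA -
punycode with the "xn--" prefix - behaves, via Python's built-in idna
codec, with a made-up domain:)

```python
# Sketch: an ACE turns a non-ASCII domain name into pure ASCII.
# Uses Python's built-in "idna" codec (nameprep + punycode, the
# "xn--" ACE eventually standardized); the domain is made up.
unicode_domain = "bücher.example"
ace = unicode_domain.encode("idna")
print(ace)  # pure-ASCII form, safe for legacy resolvers

# Decoding restores the original labels for display.
assert ace.decode("idna") == unicode_domain
```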

just because lots of platforms now support UTF-8 does not mean that
existing MUAs are displaying characters as if they were UTF-8.

> If we used one single character encoding form (like UTF-8) I am sure
> many more characters would end up correct and be displayed
> correctly than with an ACE/quoted-printable scheme. 

I strongly disagree.  very few existing MUAs interpret unencoded mail
headers as if they were UTF-8.

> If all protocols
> use the same encoding, only one encode/decode between the local
> character set and the standard encoding needs to be done, and it can
> be shared by all software.
> 
> So I fully agree with Martin that using an ACE might not be the best
> solution - it might break more applications than using only UTF-8
> might do.

I think you've successfully argued that ACE might break some applications
in minor ways, if by "break" you mean that some applications will fail to
display them adequately.  I don't think there's much disagreement on that.

but I don't think you've given any support to the idea that ACE will break 
more applications than UTF-8.

Keith