Re: [idn] IDNs in email message bodies




James Seng wrote:

> What the ppt is focusing on are the issues when IDN names appears on
> RFC821 SMTP command, RFC822 headers (different depending whether it is
> From, or Subject) or bodies. How can we work with ACE wrt to
> encoding+TES used already in the body etc.

Whether an IDN appears in a header or in the body is irrelevant. We can't
change the data.

First of all, users will write message text with whatever characters their
charset offers. They may switch charsets if they need characters the
current one lacks, but for the most part people will use whatever
characters (and encodings) they already have. Somebody advertising a
Norwegian domain name will write and encode it with iso-8859-1 or
iso-8859-16 or whatever. Similarly, a Japanese IDN will likely be written
and encoded in iso-2022-jp (or an alternative, maybe Unicode, if they need
more characters). Either way, the domain names in email addresses and URLs
will be encoded in native characters. Users will not be converting IDNs to
ACE and then writing ACE strings into email messages. Sorry.

Second, it is not possible to rewrite message data, since that invalidates
message signatures (cf. the Eudora bug, where spell-checking a message
after it was signed invalidated the signature). That means IDNs can't be
screwed with: they must be preserved as encoded by the sender.
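
To make the signature point concrete, here is a minimal sketch. It assumes
a plain SHA-256 digest as a stand-in for whatever a real signing scheme
hashes, and it uses Python's built-in "idna" codec (IDNA 2003 / Punycode)
as a stand-in for ACE. The domain name and message text are made up.

```python
import hashlib


def digest(body: bytes) -> str:
    """Stand-in for the hash a signature scheme would compute over the body."""
    return hashlib.sha256(body).hexdigest()


# Hypothetical message body, encoded the way the sender's mailer would.
domain = "blåbær.example"
original = f"Visit http://{domain}/ today".encode("iso-8859-1")
signed_digest = digest(original)

# An intermediary "helpfully" rewrites the native domain name into ACE.
rewritten = original.replace(domain.encode("iso-8859-1"),
                             domain.encode("idna"))

# The bytes changed, so the digest no longer matches what was signed.
signature_still_valid = digest(rewritten) == signed_digest
```

Any rewrite of the body bytes has this effect, whatever the encoding
syntax is, which is why the conversion has to happen at lookup time.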

If we accept both of those preconditions, then we also come to the
conclusion that there isn't a whole lot of difference we can make here,
other than to request that the client application (or the resolver) use
ACE conversion when they are presented with an IDN. If some additional
decoding is required -- such as converting an iso-2022-jp sequence into
UTF-8 first -- then that's something that needs to be pointed out. But we
have to expect that users will type IDNs in their native form when they
generate a message which also contains plain text from some local
charset.
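
A rough sketch of that client-side step: decode the name from the
message's charset, then ACE-encode it just before the lookup. The function
name idn_to_ace is hypothetical, and Python's "idna" codec (IDNA 2003 /
Punycode) is only assumed here as one possible ACE; the message bytes are
left untouched.

```python
def idn_to_ace(raw: bytes, charset: str) -> str:
    """Decode a domain name from the message charset, then ACE-encode it.

    Only the copy handed to the resolver is converted; the message
    itself keeps the sender's original bytes.
    """
    text = raw.decode(charset)                  # e.g. iso-2022-jp -> Unicode
    return text.encode("idna").decode("ascii")  # Unicode -> ACE labels


# Domain names as they would sit in message bodies, in the sender's charset.
japanese_raw = "日本語.jp".encode("iso2022_jp")
norwegian_raw = "blåbær.example".encode("iso-8859-1")

japanese_ace = idn_to_ace(japanese_raw, "iso2022_jp")
norwegian_ace = idn_to_ace(norwegian_raw, "iso-8859-1")
```

The charset argument would come from the message's Content-Type header;
without it the client cannot know how to interpret the bytes.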

Other areas of concern -- like passing Han IDNs when those characters
aren't available (all languages are problematic for charset=us-ascii) --
have the same problem and the same answer, really. We can encourage
implementations (and users) to default to UTF-8, and then leave it up to
them to do the proper conversion whenever a DNS lookup is issued. If they
can't or won't type the IDN in its native format then they can type it in
as an ACE string. Chances are they will try to copy-and-paste, and
hopefully a mailer or a browser's input box will be smart enough to deal
with it.
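
As an illustration of that fallback, a mailer composing a message could
try the native form first and only drop to ACE when the chosen charset
cannot represent the name. This is a sketch under assumptions: the
function name render_domain is hypothetical, and the "idna" codec again
stands in for ACE.

```python
def render_domain(domain: str, charset: str) -> bytes:
    """Write a domain name in the message charset, falling back to ACE."""
    try:
        # Preferred: the native characters, in the charset already in use.
        return domain.encode(charset)
    except UnicodeEncodeError:
        # The charset (e.g. us-ascii) lacks these characters: emit ACE.
        return domain.encode("idna")


in_utf8 = render_domain("日本語.jp", "utf-8")      # native form survives
in_ascii = render_domain("日本語.jp", "us-ascii")  # falls back to ACE
```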

But what we cannot do is create a new encoding syntax and expect that it
will be used only for domain names. As pointed out above, it won't happen,
because changing data breaks messages, and because users are only going to
type in the charset that they have/know.

-- 
Eric A. Hall                                        http://www.ehsco.com/
Internet Core Protocols          http://www.oreilly.com/catalog/coreprot/