Re: [idn] Some comments
At 15.07 +0100 01-01-13, Dan wrote:
>- There were some comments on 8-bit text in Subject and how it quite
>often was mangled beyond recognition.
>This does show some of the problems that an ACE can result in.
The errors for 8-bit text in email can be classified into three groups:
- Information loss
- Charset weirdness
- ASCII encoding that is not decoded
The main problem is the information loss or charset weirdness (unknown
charset for 8-bit data) that happens when people send mail that is
wrongly tagged or encoded. This is the major problem simply because the
receiver cannot reconstruct the mangled data, regardless of what his
software does.
Non-decoded data is not as problematic, because no information has been
lost during encoding and transport.
An ACE encoding falls in the category of "non-decoded" data, not
"information loss", while UTF-8 on the wire falls in the category of
information loss if it gets mangled.
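To make the three groups concrete, here is a small Python sketch; the sample
word, the charsets and the mis-labelling are just assumptions made up for the
illustration:

    # Illustration of the three error groups for 8-bit text in email.
    from email.header import Header, decode_header

    original = "blåbär"                            # typed in an ISO 8859-1 locale
    latin1 = original.encode("iso-8859-1")

    # 1. Information loss: a 7-bit-only hop strips the high bit.
    stripped = bytes(b & 0x7F for b in latin1)
    print(stripped.decode("ascii"))                # "blebdr": the original letters are gone

    # 2. Charset weirdness: untagged 8-bit data read with the wrong charset.
    print(latin1.decode("koi8-r"))                 # "blЕbДr": unrecoverable without
                                                   # knowing the real charset

    # 3. ASCII encoding that is not decoded: an RFC 2047 encoded-word shown raw.
    encoded = Header(original, "iso-8859-1").encode()
    print(encoded)                                 # "=?iso-8859-1?q?bl=E5b=E4r?=", ugly
    text, charset = decode_header(encoded)[0]      # but fully recoverable:
    print(text.decode(charset))                    # "blåbär"

The first two cases are lossy no matter what the receiver does; the third is
merely undecoded.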
>To get non-ASCII into the Subject line you today need quite complex
>handling where some parts are encoded and some are not, with length
>constraints on the encoded form and a mixture of character sets.
There are length constraints on what you send over the wire. What I
see above is a layering violation. You have to keep things apart, and
remember that with the introduction of MIME, SMTP is used as a
transport mechanism, and whatever you want to send with the help of
SMTP has to be encoded correctly.
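As an illustration of that wire-level layer, here is a small sketch (the
Swedish subject text is just an assumption) showing how a non-ASCII Subject is
carried as RFC 2047 encoded-words, split to respect the line-length limits of
the message format:

    from email.header import Header

    subject = "Väldigt långt ämne med svenska tecken, " * 3
    wire_form = Header(subject, "iso-8859-1", header_name="Subject").encode()
    print(wire_form)
    # The value is split into several "=?iso-8859-1?q?...?=" encoded-words,
    # each short enough to fit the line-length limits of the message format.
    # The user never sees this form; it exists only at the transport layer.

That is the kind of encoding the software mentioned below has to produce and
parse, and it belongs in the MIME layer, not in the user interface.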
>It is even worse with the From: line. Here only some parts may be
>encoded. I have written software to handle the MIME encoding of header
>lines. If all header lines had been in UTF-8 (or UCS-2) it would have been
>much, much simpler and much less error prone.
It all depends on what you mean by "simpler". Of course an n:m
mapping is harder than an n:1 mapping (i.e. if every email had to be in
Unicode, people would only have to know how to map between their local
characters and Unicode).
That is a different question, though, from whether ACE or UTF-8 should
be used.
Keep the layers separated, please!
>- There has been talk about URIs and how they will use non-ASCII and
>whether Unix users are willing to edit ACE or use a new tool.
>As a person needing more than ASCII (currently using ISO 8859-1), typing
>in URLs using %-escapes or editing domain names in ACE is unacceptable.
>If I edit domain names in a zone file to be loaded by my DNS server,
>I will edit them using my local character set. I will expect my DNS server
>to convert them into whatever format is used in the DNS protocol.
>If I can get an editor which supports ISO 8859-1 and UTF-8 in a
>friendly way, I might accept editing my zone files using UTF-8.
>And when I enter URLs I enter them using ISO 8859-1. I am not going to
>%-encode them. And I cannot really see any reason to have them
>%-encoded when being transmitted using HTTP. HTTP is fully capable of
>sending 8-bit characters. It is OK for them to be converted into UTF-8 for
>transmission, but anything displayed to me must not be in UTF-8 or %-encoding,
>if the letters can be displayed using my local character set.
Just too many things are mixed up here for this paragraph to make sense:
- You edit the zone file using your local charset
- The content of the zone file has to be converted to Unicode
- The content of the zone file has to be nameprepped
- The content of the zone file has to be encoded
- You enter URLs in your local charset in your browser
- The browser converts them into Unicode
- The browser does nameprep
- The nameprepped data might be %-encoded for HTTP
- The domain part has to be encoded
Note that only where we have "has to be encoded" above do we have the
problem of UTF-8 versus ACE.
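To make the steps above concrete, here is a minimal sketch. It uses the
nameprep plus Punycode ACE that Python's built-in "idna" codec implements,
i.e. what was eventually standardized; which ACE we end up with is of course
exactly what is under discussion, and the hostname is just an assumption:

    from urllib.parse import quote

    raw = b"r\xe4ksm\xf6rg\xe5s.example"           # as typed/stored in ISO 8859-1
    unicode_name = raw.decode("iso-8859-1")        # local charset -> Unicode
    ace_name = unicode_name.encode("idna")         # nameprep, then ACE encode
    print(ace_name)                                # b'xn--rksmrgs-5wao1o.example'

    # The path/query part of a URL is a different layer again: there the
    # browser %-encodes the UTF-8 form instead of using an ACE.
    print(quote("/smörgåsbord".encode("utf-8")))   # '/sm%C3%B6rg%C3%A5sbord'

Note how the domain part and the rest of the URL are handled by different
mechanisms; mixing them up is what creates the confusion in the quoted
paragraph.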
Now you, Dan, also think that nameprep/equivalence can be done in the
DNS server, so that the DNS query packets can include whatever octets
one wants. My position is that this is simply out of scope, and not an
acceptable solution, because the domain name has to be "reused" in the
actual protocol as well, in the example above HTTP. So NOT doing the
full nameprep/encoding in the client is just wrong, and an architectural
mistake.
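A small sketch of what I mean by "reused" (hostname again an assumption): the
client has to produce one canonical wire form and use it both for the DNS
lookup and inside the HTTP request it sends afterwards:

    host_typed = "räksmörgås.example"                       # what the user typed
    host_wire = host_typed.encode("idna").decode("ascii")   # nameprep + ACE, in the client

    # The same string is used for the DNS lookup (e.g. gethostbyname(host_wire))
    # and in the Host header of the request that follows:
    request = "GET / HTTP/1.1\r\nHost: " + host_wire + "\r\nConnection: close\r\n\r\n"
    print(request)

If the DNS server did the equivalence matching instead, the client would still
have to know the canonical form to put into the Host header, so nothing would
be gained.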
Also, for the nth time: for UTF-8 to work we need to change a large
number of protocols, intermediaries/middleware boxes, firewalls etc.,
because the change to the protocols is quite large. Not as bad as a
change to UCS-2 or some other "more" binary encoding, but still a
large change.
With an ACE encoding, we don't have to change the protocols at all.
Do people understand this?
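To illustrate why: the ACE form passes the letter-digit-hyphen check that
deployed software already applies to host names, while the raw non-ASCII form
of the same (made-up) name does not:

    import re

    # The classic LDH rule applied per label by lots of existing software.
    LDH = re.compile(r"^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$")

    name = "räksmörgås.example"
    ace = name.encode("idna").decode("ascii")      # pure ASCII, fits today's rules

    print(all(LDH.match(label) for label in ace.split(".")))    # True
    print(all(LDH.match(label) for label in name.split(".")))   # False

Every place that applies such a check (protocols, middleware, firewalls) would
have to be touched before UTF-8 on the wire works end to end.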
Also, I saw someone saying "the rest of us like UTF-8" (or something
like that). I would say that there is definitely no consensus for use
of UTF-8 instead of an ACE. Quite the contrary.
>In short: during user interaction all data should use the local character set.
Yes.
>As a local character set cannot always display all characters in a text,
>there should be ONE standard way to encode and display the characters not
>supported by the local character set. I do not want many like today,
>with %-encoding, ACE, quoted-printable and base64.
>With every new encoding it gets more and more complex to handle text.
>Just look at a URL: Is the host part in ACE encoding or %-encoding?
>Or UTF-8? If a URL is embedded in text, what format is it in then?
>The same character set as the text? Or is it in ACE? How is the tool
>that is displaying the text to know that a URL is embedded, and
>decode just that part so that the URL too can be displayed
>in a user-friendly manner?
All good questions, and more evidence that the IDN problem is not
only a problem with DNS, but also a problem for the various
applications that use domain names.
paf