[idn] Some comments
Having no time to comment daily on the flow of messages on the list,
I will here give some comments on things I have read this week:
- There were some comments on 8-bit text in the Subject line and how it
quite often was mangled beyond recognition.
This illustrates some of the problems that an ACE can lead to.
To get non-ASCII into the Subject line you today need quite complex
handling, where some parts are encoded and some are not, with length
constraints on the encoded form and a mixture of character sets.
It is even worse with the From: line, where only some parts may be
encoded. I have written software to handle the MIME encoding of header
lines. If all header lines had simply been in UTF-8 (or UCS-2) it would
have been much, much simpler and much less error-prone.
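To show what that handling looks like in practice, here is a minimal
sketch (in Python, purely as an illustration, not the software referred
to above) of the encoded-word scheme compared with a plain UTF-8 header:

  # A minimal sketch of what the RFC 2047 encoded-word scheme requires
  # for a non-ASCII Subject line.
  from email.header import Header, decode_header

  subject = "Blåbärssylt och räksmörgås"   # ISO 8859-1 repertoire

  # Sending side: the text is split into "encoded-words", each with its
  # own charset label, transfer encoding (Q or B) and length limit.
  encoded = Header(subject, charset="iso-8859-1").encode()
  print(encoded)   # e.g. =?iso-8859-1?q?Bl=E5b=E4rssylt_och_r=E4ksm=F6rg=E5s?=

  # Receiving side: every header must be scanned for encoded-words; each
  # word may use a different charset and a different transfer encoding.
  for part, charset in decode_header(encoded):
      text = part.decode(charset) if isinstance(part, bytes) else part
      print(charset, text)

  # If headers were plain UTF-8, both directions would be a single
  # encode()/decode() with no splitting, length limits or charset mixing.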
- There have been several references to a "Latin alphabet". There are
many alphabets that can be called a "Latin alphabet", for example the
classical Latin alphabet (no G, J or W), the English alphabet or the
Swedish alphabet. While I can guess it is mostly a reference to the
26 letters of the English alphabet, the English alphabet is not the only
Latin alphabet. Please be more clear.
- There has been talk about URIs, how they will use non-ASCII, and
whether Unix users are willing to edit ACE or use a new tool.
For a person needing more than ASCII (I currently use ISO 8859-1), typing
in URLs using %-escapes or editing domain names in ACE is unacceptable.
If I edit domain names in a zone file to be loaded by my DNS server,
I will edit them using my local character set. I will expect my DNS server
to convert them into whatever format is used in the DNS protocol.
If I can get an editor which supports ISO 8859-1 and UTF-8 in a
friendly way, I might accept editing my zone files in UTF-8.
And when I enter URLs I enter them using ISO 8859-1. I am not going to
%-encode them. And I cannot really see any reason to have them
%-encoded when being transmitted using HTTP. HTTP is fully capable of
sending 8-bit characters. It is OK for them to be converted into UTF-8 for
transmission, but anything displayed to me must not be in UTF-8 or
%-encoding, if the letters can be displayed using my local character set.
In short: during user interaction all data should use the local character set.
As a local character set cannot always display all characters in a text,
there should be ONE standard way to encode and display the characters not
supported by the local character set. I do not want many ways, as we have
today with %-encoding, ACE, quoted-printable and base64.
With every new encoding it gets more and more complex to handle text.
Just look at a URL: Is the host part in ACE encoding or %-encoding?
Or UTF-8? If a URL is embedded in text, what format is it in then?
The same character set as the text? Or is it in ACE? How is the tool
that is displaying the text supposed to know that a URL is embedded and
decode just that part, so that the URL too can be displayed in a
user-friendly manner? (A small sketch of the problem follows below.)
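To make the multiplicity concrete, here is a small sketch in Python.
The ACE form shown is the Punycode-based "xn--" encoding produced by
Python's idna codec; treat the exact ACE scheme as an assumption, the
point is only how many different forms the same name can take:

  # A small sketch of how many forms one host name can take.
  from urllib.parse import quote

  host = "blåbär.example"

  utf8_form = host.encode("utf-8")                # raw UTF-8 octets
  pct_form = quote(host)                          # %-encoding of the UTF-8 octets
  ace_form = host.encode("idna").decode("ascii")  # ASCII-compatible encoding

  print(utf8_form)   # b'bl\xc3\xa5b\xc3\xa4r.example'
  print(pct_form)    # bl%C3%A5b%C3%A4r.example
  print(ace_form)    # an "xn--..." ASCII form of the same name

  # A tool that finds this name embedded in 8-bit text has to guess which
  # of these forms it is looking at before it can show the user-friendly
  # spelling in the local character set.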
Dan