[idn] If I used IDNA or UTF-8 at my site
Some more thoughts about the ACE/nameprep versus UTF-8 debate.
I have lately tried to think less about backward compatibility and more
about the effects of using one or the other in the real world.
One can think of three possibilities: ACE+nameprep, UTF-8 and both.
There are existing areas that are somewhat similar and can be used
for comparison:
- MIME: In MIME non-ASCII can be encoded using quoted-printable and BASE64.
If ASCII is enough, why even allow 8-bit transfer of documents?
In e-mail today you can send ISO 8859-1 without any ASCII downcoding
by having 8-bit MIME body parts.
- URLs: In URLs you encode non-ASCII using a %-escape mechanism.
But people are sending URLs in binary format today and there
are drafts defining how non-ASCII should be handled without %-encoding.
Why is this? If ACE is enough for the DNS protocol, %-encoding
should be enough for protocols sending URLs.
The above two examples show that despite it being enough to always use the
ASCII character set, people do not stop there. Instead they want to send
the data without having to encode it.
What is the effect on the IDN possibilities?
Here I have looked at how my system is handling DNS and how I expect many
do handle DNS.
In my system we have a few DNS servers who all clients talk to. These DNS
servers then do the talking to other DNS servers around the world.
Simplified it works like this:
client <-> local DNS server <-> the rest of the world
On my system we use ISO 8859-1 as the local character set.
Now what would happen if I used IDNA (ACE with nameprep in the client)?
- I do not want to change my system vendor's resolver code, so
all old clients would still use it and the old API.
They would all see ACE.
- As I do not want my clients to worry about ACE a new resolver
library is needed. This resolver would do the translation between
ISO 8859-1 (my local character set) and ACE.
A client would get a host name in ISO 8859-1 from the resolver except
in the case where it cannot be represented as ISO 8859-1, it would then
return ACE.
As a general resolver for this the library would probably work
like: ACE -> UCS -> ISO 8859-1 and in reverse.
- The users would copy host names between applications and files.
This will result in ISO 8859-1 encoded host names ending up in
applications only supporting the old DNS API. This results in them
sending ISO 8859-1 hostnames over the DNS protocol to my local
DNS server.
Also, on printed paper and in files, host names will be in ISO 8859-1.
Users will enter host names from printed matter using ISO 8859-1, once
again resulting in ISO 8859-1 ending up over the DNS protocol.
- Due to the above problem I need to fix my local DNS server to
catch the ISO 8859-1 encoded host names and convert them to ACE.
So my local DNS server needs to be changed.
- And as people will enter e-mail addresses they get on printed matter
there will be ISO 8859-1 host names going into e-mail.
So I will have to fix my e-mail MTA (sendmail).
- I will fail to have non-ASCII e-mail addresses in SOA records
as IDNA uses nameprep+ACE to encode the labels and that forbids
characters I need in the user name and also mangles case.
(Or does IDNA want to handle the SOA label for the e-mail address
differently from all others?)
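As a rough sketch of the ACE -> UCS -> ISO 8859-1 resolver translation described in the list above, the following may help. It assumes Python's built-in "idna" codec (Punycode-based) as a stand-in for whichever ACE is finally chosen, and the function names are hypothetical, not part of any existing resolver API.

```python
# Sketch only: Python's "idna" codec (nameprep + Punycode) stands in for
# the ACE under discussion; function names are made up for illustration.

def ace_to_local(ace_name: bytes) -> bytes:
    """ACE -> UCS -> ISO 8859-1; return ACE unchanged if not representable."""
    ucs = ace_name.decode("idna")          # ACE -> UCS
    try:
        return ucs.encode("iso-8859-1")    # UCS -> local character set
    except UnicodeEncodeError:
        return ace_name                    # fall back to ACE


def local_to_ace(local_name: bytes) -> bytes:
    """ISO 8859-1 -> UCS -> ACE (the reverse direction)."""
    ucs = local_name.decode("iso-8859-1")  # local character set -> UCS
    return ucs.encode("idna")              # UCS -> ACE (applies nameprep)
```

A name with Cyrillic labels, for example, cannot be represented in ISO 8859-1, so `ace_to_local` would hand the ACE form back to the client.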
If I instead used only UTF-8, what happens?
- The local DNS server needs to be upgraded to handle UTF-8.
- I do not want to change my system vendor's resolver code, so
all old clients would still use it and the old API.
They would not get any host name for IP -> name lookups when
the host name is non-ASCII. To avoid breaking e-mail, which
is the most important, I will need to use an ASCII-only name
on mail servers.
Some other old clients may have to be disabled or worked around.
Probably all important things will work.
- As I do not want my clients to worry about UTF-8 a new resolver
library is needed. This resolver would do the translation between
ISO 8859-1 and UTF-8.
A client would get a host name in ISO 8859-1 from the resolver except
in the case where it cannot be represented as ISO 8859-1, it would
then have to be represented in some way for the user. Leaving it
as UTF-8 will often be ok.
- The users would copy host names between applications and files.
This will result in ISO 8859-1 encoded host names ending up in
applications only supporting the old DNS API. This results in them
sending ISO 8859-1 hostnames over the DNS protocol to my local
DNS server.
Also, on printed paper and in files, host names will be in ISO 8859-1.
Users will enter host names from printed matter using ISO 8859-1, once
again resulting in ISO 8859-1 ending up over the DNS protocol.
- Due to the above problem I need to fix my local DNS server to
catch the ISO 8859-1 encoded host names and convert them to UTF-8.
So my local DNS server needs to handle ISO 8859-1 also.
- And as people will enter e-mail addresses they get on printed matter
there will be ISO 8859-1 host names going into e-mail.
So I will have to fix my e-mail MTA (sendmail).
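For the UTF-8 case, the server-side catch of stray ISO 8859-1 names could be sketched roughly as below. This is a heuristic under stated assumptions: the function name is hypothetical, and the rule "not valid UTF-8 means ISO 8859-1" only holds because ISO 8859-1 is the single local character set at my site.

```python
# Sketch only: guess the encoding of a query name arriving at the local
# DNS server. Assumes ISO 8859-1 is the only legacy charset in use.

def normalize_query_name(raw: bytes) -> bytes:
    """Return the name as UTF-8, treating invalid UTF-8 as ISO 8859-1."""
    try:
        raw.decode("utf-8")    # plain ASCII and valid UTF-8 pass unchanged
        return raw
    except UnicodeDecodeError:
        # Not valid UTF-8: assume an old client sent ISO 8859-1 bytes.
        return raw.decode("iso-8859-1").encode("utf-8")
```

The guess can be wrong: some ISO 8859-1 byte sequences happen to be valid UTF-8 as well and would pass through untranslated, so this reduces failed lookups rather than eliminating them.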
If I instead use UTF-8 with nameprep+ACE for backward compatibility I get:
- The local DNS server needs to be upgraded to handle UTF-8 and ACE.
The server will send UTF-8 to new clients and ACE to old.
- I do not want to change my system vendor's resolver code, so
all old clients would still use it and the old API.
They will all get ACE in answers.
- As I do not want my clients to worry about UTF-8 a new resolver
library is needed. This resolver would do the translation between
ISO 8859-1 and UTF-8.
A client would get a host name in ISO 8859-1 from the resolver except
in the case where it cannot be represented as ISO 8859-1, it would
then have to be represented in some way for the user. Encoding it using
ACE is fine, it will then work if pasted into an old client.
- The users would copy host names between applications and files.
This will result in ISO 8859-1 encoded host names ending up in
applications only supporting the old DNS API. This results in them
sending ISO 8859-1 hostnames over the DNS protocol to my local
DNS server.
Also, on printed paper and in files, host names will be in ISO 8859-1.
Users will enter host names from printed matter using ISO 8859-1, once
again resulting in ISO 8859-1 ending up over the DNS protocol.
- Due to the above problem I need to fix my local DNS server to
catch the ISO 8859-1 encoded host names and convert them to UTF-8.
So my local DNS server needs to handle ISO 8859-1 also.
- And as people will enter e-mail addresses they get on printed matter
there will be ISO 8859-1 host names going into e-mail.
So I will have to fix my e-mail MTA (sendmail).
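The "UTF-8 to new clients, ACE to old" behaviour of the server in this combined case might look something like the sketch below. How the server would recognise a new client is not settled here, so the `client_supports_utf8` flag is purely an assumption, as is the use of Python's "idna" codec as the ACE.

```python
# Sketch only: pick the wire form of an answer per client.
# client_supports_utf8 is a hypothetical flag; the source does not say
# how a server would tell new clients from old ones.

def answer_name(ucs_name: str, client_supports_utf8: bool) -> bytes:
    """Encode a host name as UTF-8 for new clients, ACE for old ones."""
    if client_supports_utf8:
        return ucs_name.encode("utf-8")
    # Old clients only understand ASCII, so they get nameprep+ACE.
    return ucs_name.encode("idna")
```

The same ACE fallback could serve the new resolver library when a name cannot be shown in ISO 8859-1, since the result still works if pasted into an old client.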
In summary:
- All possibilities result in a new resolver library and in
a modified DNS server.
- The IDNA resolver needs to do ACE <-> UCS <-> ISO 8859-1 translations,
the UTF-8 resolver needs to do UTF-8 <-> ISO 8859-1 translations,
and the UTF-8+ACE resolver does UTF-8 <-> ISO 8859-1 and UTF-8 -> ACE
translations.
I.e. they will contain more or less the same code. The strict UTF-8 only
resolver is the simplest.
The IDNA resolver will need the most CPU time as it always translates to/from
ACE.
- All need a modified DNS server unless you want a lot of failed
DNS lookups because of wrong character set being used in old
clients.
So, it looks difficult to tell which is best.
UTF-8 (UCS) being one of the major directions for internationalisation today,
and the fact that by using UTF-8 all labels and character data in DNS
can be internationalised, speak in favor of a UTF-8 solution.
I have still not seen a good enough analysis of whether ACE for backward
compatibility gives fewer problems than a pure UTF-8 solution does.
As indicated above there are problems with a pure UTF-8 solution, but
I cannot say if they can be solved simply enough to avoid ACE. Not allowing
ACE will result in much quicker fixing of old protocols and
applications. It took a long time before enough e-mail software
could handle MIME so that I could use non-ASCII in e-mail.
I can live with not having a non-ASCII e-mail address until enough
e-mail software is fixed for non-ASCII.
Maybe it all boils down to:
do we once again go for "encode all into ASCII, and then later extend
to full 8-bit", or instead go directly to "8-bit"?
Dan