
[idn] If I used IDNA or UTF-8 at my site




Some more thoughts about the ACE/nameprep versus UTF-8 debate.
Lately I have tried to think less about backward compatibility and more
about the effects of using one or the other in the real world.
One can think of three possibilities: ACE+nameprep, UTF-8, and both.
There are existing areas that are somewhat similar and can serve
as comparisons:
- MIME: In MIME, non-ASCII can be encoded using quoted-printable or BASE64.
        If ASCII is enough, why even allow 8-bit transfer of documents?
        Yet in e-mail today you can send ISO 8859-1 without any ASCII
        downcoding by using 8-bit MIME body parts.
        
- URLs: In URLs you encode non-ASCII using a %-escape mechanism.
        But people are sending URLs in binary form today, and there are
        drafts defining how non-ASCII should be handled without %-encoding.
        Why is this? If ACE is enough for the DNS protocol, %-encoding
        should be enough for protocols carrying URLs.
        
The two examples above show that even though the ASCII character set
would always be enough, people do not stop there. Instead they want to
send the data without having to encode it first.

What is the effect on the IDN possibilities?
Here I have looked at how my system handles DNS, and how I expect many
others handle it.
In my system we have a few DNS servers who all clients talk to. These DNS
servers then do the talking to other DNS servers around the world.
Simplified it works like this:
client <-> local DNS server <-> the rest of the world

On my system we use ISO 8859-1 as the local character set.
Now, what would happen if I used IDNA (ACE with nameprep in the client)?
- I do not want to change my system vendor's resolver code, so all
  old clients would still use it and the old API.
  They would all see ACE.
- As I do not want my clients to worry about ACE, a new resolver
  library is needed. This resolver would do the translation between
  ISO 8859-1 (my local character set) and ACE.
  A client would get a host name in ISO 8859-1 from the resolver, except
  when it cannot be represented in ISO 8859-1; it would then
  return ACE.
  As a general resolver, the library would probably work
  like: ACE -> UCS -> ISO 8859-1, and in reverse.
- The users will copy host names between applications and files.
  This will result in ISO 8859-1 encoded host names ending up in
  applications supporting only the old DNS API, which in turn will
  send ISO 8859-1 host names over the DNS protocol to my local
  DNS server.
  Also, on printed paper and in files, host names will be in ISO 8859-1.
  Users will enter host names from printed matter using ISO 8859-1, once
  again resulting in ISO 8859-1 ending up on the DNS protocol.
- Due to the above problem I need to fix my local DNS server to
  catch the ISO 8859-1 encoded host names and convert them to ACE.
  So my local DNS server needs to be changed.
- And as people will enter e-mail addresses they get from printed matter,
  there will be ISO 8859-1 host names going into e-mail.
  So I will have to fix my e-mail MTA (sendmail).
- I will fail to have non-ASCII e-mail addresses in SOA records,
  as IDNA uses nameprep+ACE to encode the labels, which forbids
  characters I need in the user name and also mangles case.
  (Or does IDNA want to handle the SOA label for the e-mail address
   differently from all other labels?)
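The ACE <-> UCS <-> ISO 8859-1 chain such a resolver library would
implement can be sketched in Python. One assumption to note: Python's
built-in "idna" codec implements nameprep plus the ACE that was
eventually standardised (Punycode), so it can only stand in for the ACE
drafts being debated here, and the host name is made up:

```python
# Sketch of the ACE <-> UCS <-> ISO 8859-1 chain a new resolver library
# would implement. Python's "idna" codec does nameprep + Punycode ACE.
def local_to_ace(name_8859: bytes) -> bytes:
    ucs = name_8859.decode("iso-8859-1")   # ISO 8859-1 -> UCS
    return ucs.encode("idna")              # UCS -> nameprep + ACE

def ace_to_local(ace: bytes) -> bytes:
    ucs = ace.decode("idna")               # ACE -> UCS
    return ucs.encode("iso-8859-1")        # UCS -> ISO 8859-1

ace = local_to_ace("räksmörgås.example".encode("iso-8859-1"))
print(ace)                  # b'xn--rksmrgs-5wao1o.example'
print(ace_to_local(ace))    # b'r\xe4ksm\xf6rg\xe5s.example'
```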

If I instead used only UTF-8, what happens?
- The local DNS server needs to be upgraded to handle UTF-8.
- I do not want to change my system vendor's resolver code, so all
  old clients would still use it and the old API.
  They would not get any host name back from IP -> name lookups when
  the host name is non-ASCII. To avoid breaking e-mail, which
  is the most important, I would need to use ASCII-only names
  on mail servers.
  Some other old clients may have to be disabled or worked around.
  Probably all the important things will work.
- As I do not want my clients to worry about UTF-8, a new resolver
  library is needed. This resolver would do the translation between
  ISO 8859-1 and UTF-8.
  A client would get a host name in ISO 8859-1 from the resolver, except
  when it cannot be represented in ISO 8859-1; it would then
  have to be represented in some other way for the user. Leaving it
  as UTF-8 will often be OK.
- The users will copy host names between applications and files.
  This will result in ISO 8859-1 encoded host names ending up in
  applications supporting only the old DNS API, which in turn will
  send ISO 8859-1 host names over the DNS protocol to my local
  DNS server.
  Also, on printed paper and in files, host names will be in ISO 8859-1.
  Users will enter host names from printed matter using ISO 8859-1, once
  again resulting in ISO 8859-1 ending up on the DNS protocol.
- Due to the above problem I need to fix my local DNS server to
  catch the ISO 8859-1 encoded host names and convert them to UTF-8.
  So my local DNS server needs to handle ISO 8859-1 as well.
- And as people will enter e-mail addresses they get from printed matter,
  there will be ISO 8859-1 host names going into e-mail.
  So I will have to fix my e-mail MTA (sendmail).
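The two conversions this scenario needs can be sketched as follows. The
sniffing heuristic (treat anything that is not well-formed UTF-8 as
ISO 8859-1) is my own assumption, not something from any draft; it works
because hardly any ISO 8859-1 string with non-ASCII letters happens to
be well-formed UTF-8:

```python
# In the local DNS server: upconvert query names that arrive as raw
# ISO 8859-1 octets. Valid UTF-8 (including plain ASCII) passes through.
def server_to_utf8(qname: bytes) -> bytes:
    try:
        qname.decode("utf-8")
        return qname                       # ASCII or UTF-8 already
    except UnicodeDecodeError:
        return qname.decode("iso-8859-1").encode("utf-8")

# In the new resolver library: hand the client ISO 8859-1 when the name
# fits, otherwise leave the raw UTF-8 alone.
def utf8_to_local(name_utf8: bytes) -> bytes:
    try:
        return name_utf8.decode("utf-8").encode("iso-8859-1")
    except UnicodeEncodeError:
        return name_utf8                   # not representable in 8859-1

print(server_to_utf8(b"bl\xe5.example"))     # b'bl\xc3\xa5.example'
print(utf8_to_local(b"bl\xc3\xa5.example"))  # b'bl\xe5.example'
```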

If I instead use UTF-8 with nameprep+ACE for backward compatibility, I get:
- The local DNS server needs to be upgraded to handle UTF-8 and ACE.
  The server will send UTF-8 to new clients and ACE to old ones.
- I do not want to change my system vendor's resolver code, so all
  old clients would still use it and the old API.
  They will all get ACE in answers.
- As I do not want my clients to worry about UTF-8, a new resolver
  library is needed. This resolver would do the translation between
  ISO 8859-1 and UTF-8.
  A client would get a host name in ISO 8859-1 from the resolver, except
  when it cannot be represented in ISO 8859-1; it would then have to be
  represented in some other way for the user. Encoding it as ACE is fine;
  it will then work if pasted into an old client.
- The users will copy host names between applications and files.
  This will result in ISO 8859-1 encoded host names ending up in
  applications supporting only the old DNS API, which in turn will
  send ISO 8859-1 host names over the DNS protocol to my local
  DNS server.
  Also, on printed paper and in files, host names will be in ISO 8859-1.
  Users will enter host names from printed matter using ISO 8859-1, once
  again resulting in ISO 8859-1 ending up on the DNS protocol.
- Due to the above problem I need to fix my local DNS server to
  catch the ISO 8859-1 encoded host names and convert them to UTF-8.
  So my local DNS server needs to handle ISO 8859-1 as well.
- And as people will enter e-mail addresses they get from printed matter,
  there will be ISO 8859-1 host names going into e-mail.
  So I will have to fix my e-mail MTA (sendmail).
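The client-side translation of this hybrid resolver could look like the
sketch below, again with Python's "idna" codec standing in for whatever
ACE is finally chosen, and a made-up host name:

```python
# Sketch of the UTF-8+ACE resolver: prefer ISO 8859-1 for the client,
# and fall back to ACE, which is pure ASCII and therefore survives
# being pasted into an old client.
def utf8_to_client(name_utf8: bytes) -> bytes:
    ucs = name_utf8.decode("utf-8")
    try:
        return ucs.encode("iso-8859-1")    # fits the local charset
    except UnicodeEncodeError:
        return ucs.encode("idna")          # ACE fallback, ASCII-safe

print(utf8_to_client("blå.example".encode("utf-8")))    # b'bl\xe5.example'
print(utf8_to_client("œuvre.example".encode("utf-8")))  # an xn--... name
```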
  
In summary:
- All possibilities result in a new resolver library and in
  a modified DNS server.
- The IDNA resolver needs to do ACE <-> UCS <-> ISO 8859-1 translations,
  the UTF-8 resolver needs to do UTF-8 <-> ISO 8859-1 translations,
  and the UTF-8+ACE resolver does UTF-8 <-> ISO 8859-1 and UTF-8 -> ACE
  translations.
  I.e. they will contain more or less the same code. The strict
  UTF-8-only one is the simplest.
  The IDNA resolver will need the most CPU time, as it always translates
  to/from ACE.
- All need a modified DNS server, unless you want a lot of failed
  DNS lookups because of the wrong character set being used by old
  clients.

So, it looks difficult to tell which is best.
That UTF-8 (UCS) is one of the major directions for internationalisation
today, and that by using UTF-8 all labels and character data in DNS
can be internationalised, speaks in favor of a UTF-8 solution.
I have still not seen a good enough analysis of whether ACE for backward
compatibility gives fewer problems than a pure UTF-8 solution does.
As indicated above there are problems with a pure UTF-8 solution, but
I cannot say whether they can be solved simply enough to avoid ACE.
Not allowing ACE will result in much quicker fixing of old protocols and
applications. It took a long time before enough e-mail software
could handle MIME so that I could use non-ASCII in e-mail.
I can live with not having a non-ASCII e-mail address until enough
e-mail software is fixed for non-ASCII.

Maybe it all boils down to this:
do we once again go for "encode everything into ASCII, and later extend
to full 8-bit", or do we go directly to "8-bit"?


    Dan