
[idn] Some new ideas in my updated draft



I have given some additional thought to the discussions on the use
of UTF-8 and what happens if it gets into other protocols. Like
some others, I very much dislike ASCII-only solutions and think
they result in software never getting fixed. But I am not
sure which is best: to maybe break software and get it fixed, or
to have it nearly forever never getting fixed.

Anyway, it may be that we must have ASCII-only compatibility
handling to make things work quickly.

Attached is a new version of my draft for international DNS.
I have added several new things, some about how to make it
work with old ASCII-only DNS and other software. There is more
that I did not have time to add or work through during the weekend.
But you are welcome to read through it and give comments. We
could also have some discussion on my design and thoughts about handling
ASCII compatibility. (This goes beyond the requirements, but may give
new ideas for them.)

Regards,

   Dan
Internet Draft                                     Dan Oscarsson
draft-oscarsson-idn-i18ndns.txt                    Telia ProSoft
Updates: RFC 2181, 1035, 1034, 2535                February 2000
Expires: August 2000

         Internationalisation of the Domain Name Service

Status of this memo

   This document is an Internet-Draft and is in full conformance with all
   provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering Task
   Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html.


Abstract

   There is a very strong world-wide desire to use characters other than
   ASCII in the DNS, especially in domain names. Domain names have become
   the equivalent of business or product names for many services on the
   Internet, so there is a need to make them usable by people whose native
   scripts are not representable by ASCII.

   This document updates the Domain Name System standard (DNS) [RFC1035] and
   specifies how international characters are handled. It is completely
   compatible with the current DNS (RFC 1034, 1035, 2181, 2535 etc.).



1. Introduction

   There is an immediate need to use international (non-ASCII) characters
   in DNS. This means that a larger extension of the DNS protocol is not
   an option, as it would take too long. Instead the current ASCII-only
   handling needs to be extended to non-ASCII in a way that can be used
   without updating current software.

   The basic handling of character data in DNS has several properties
   that need to be preserved:
   - The DNS itself places only one restriction on the particular labels
     that can be used to identify resource records. That one restriction
     relates to the length of the label and the full name. The length of
     any one label is limited to between 1 and 63 octets. A full domain
     name is limited to 255 octets (including the separators).
     [RFC2181]
   - Any binary string whatever can be used as the label of any
     resource record. Similarly, any binary string can serve as the value
     of any record that includes a domain name as some or all of its value
     (SOA, NS, MX, PTR, CNAME, and any others that may be added).
     Implementations of the DNS protocols must not place any restrictions
     on the labels that can be used. In particular, DNS servers must not
     refuse to serve a zone because it contains labels that might not be
     acceptable to some DNS client programs.
     [RFC2181]
   - Names must be compared with case-insensitivity.
     [RFC1035]
   - The original case should be preserved when possible as data is entered
     into the system. This also implies that responses should preserve case
     when possible. [RFC1035]
     Some of the reasons for this are:
       + Domain names are used for many purposes.
       + One is domain names where company names or trademarks could be used.
         Very commonly companies and trademarks use a combination of
         upper and lower case to enhance the image of the name.
         Many of them would prefer that when you, for example, look up the
         domain name for an IP address, the correct case is returned.
       + Another is the e-mail address defined in the SOA record.
         While many systems now do a case-insensitive comparison on the
         user name part of the e-mail address, there may still be those that
         don't.
         Here too, e-mail addresses can be made more readable by mixing
         upper and lower case.
       + If you look up a host name from an IP address you may want to use the
         host name to compare with other data. Many services under Unix
         do this, and many of them are not case-insensitive. So they
         need the correct case returned.
       + There may be other uses of domain names that require them to be
         unchanged.
   - The characters in the ASCII character set must still be encoded
     as ASCII.

   This document specifies the needed update of the DNS protocol, user
   interface issues and the effect on other protocols.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].


2. The DNS Protocol

   The DNS protocol is used when communicating between DNS servers and
   other DNS servers or DNS clients. User interface issues like the format
   of zone files or how to enter or display domain names are not part
   of the protocol.

   The update of the protocol defined here can be used immediately as
   it is fully compatible with the DNS of today.

2.1 Internationalisation aware software

   Internationalisation aware DNS software (i18n aware) is software that
   follows the rules for handling international text as defined here. Only
   i18n aware software will have all requirements fulfilled.

   Referring to section 4.1.1 in RFC1035 and section 6.1 in RFC2535, the
   DNS query/response format header is updated by allocating the last
   un-allocated bit in the header. This bit is defined to be zero in
   old servers and resolvers. For a description of all fields see the
   sections in the above RFCs.

                                           1  1  1  1  1  1
             0  1  2  3  4  5  6  7  8  9  0  1  2  3  4  5
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |                      ID                       |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |QR|   Opcode  |AA|TC|RD|RA|IN|AD|CD|   RCODE   |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |                    QDCOUNT                    |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |                    ANCOUNT                    |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |                    NSCOUNT                    |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
            |                    ARCOUNT                    |
            +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

   I18n aware software identifies itself in a query or a response by
   setting the IN bit in the DNS query/response format header.
   As this bit is defined to be zero in old servers and resolvers they
   identify themselves as non-i18n aware.

   I18n aware software MUST set the IN bit in both queries and responses.
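
   As an illustration only, the following minimal Python sketch shows
   how the IN bit could be set and tested in a raw DNS message. The
   mask value 0x0040 is derived from the header layout above (bit 9
   counted from the most significant bit of the flags word); the
   function names are hypothetical and not part of any existing
   library.

```python
import struct

# Mask for the IN bit: bit 9 counted from the most significant bit of
# the 16-bit flags word that follows the ID field (an assumption read
# off the header diagram in this section).
IN_BIT = 0x0040


def set_in_bit(message: bytes) -> bytes:
    """Return a copy of a DNS message with the IN bit set in its header."""
    ident, flags = struct.unpack_from("!HH", message, 0)
    return struct.pack("!HH", ident, flags | IN_BIT) + message[4:]


def is_i18n_aware(message: bytes) -> bool:
    """Report whether the sender of a DNS message set the IN bit."""
    _, flags = struct.unpack_from("!HH", message, 0)
    return bool(flags & IN_BIT)
```

   A server could call is_i18n_aware() on each incoming query to decide
   how to format the response, and set_in_bit() on everything it sends.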

   Note: EDNS [RFC2671] is not used because:
      - It should work with the current (that is now) DNS software.
      - There should be no additional requests needed to be sent
        for i18n aware software.



2.2 Character data

   Character data needs to be able to represent as many as possible of
   the characters in the world while remaining compatible with ASCII.
   It must also be well defined so that it can easily be compared
   in both case-sensitive and case-insensitive matching, and should be
   compact as only 63 octets are available without an extension of the
   protocol.

   Therefore character data MUST:
   - Be ISO 10646 (UCS) [ISO10646].
   - Be normalised using form C as defined in Unicode technical
     report #15 [UTR15].
   - Be encoded using UTF-8 [RFC2279].
   
   Case-insensitive matching MUST:
   - Be done by folding the case to lower case using the CaseFolding.txt
     mapping as defined in Unicode technical report #21 [UTR21] and
     then comparing the data.
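
   The rules above can be sketched in Python as follows. This is a
   minimal sketch, not a definitive implementation: it assumes that
   Python's str.casefold(), which applies the full case folding of the
   Unicode version shipped with the interpreter, is an acceptable
   stand-in for the CaseFolding.txt mapping. The function names are
   hypothetical.

```python
import unicodedata

MAX_LABEL_OCTETS = 63


def to_wire_label(label: str) -> bytes:
    """Normalise to form C and encode as UTF-8, enforcing the 63 octet limit."""
    wire = unicodedata.normalize("NFC", label).encode("utf-8")
    if not 1 <= len(wire) <= MAX_LABEL_OCTETS:
        raise ValueError("label must be 1 to 63 octets in UTF-8")
    return wire


def _fold(text: str) -> str:
    # Normalise, fold case, then normalise again because folding can
    # produce sequences that are no longer in form C.
    folded = unicodedata.normalize("NFC", text).casefold()
    return unicodedata.normalize("NFC", folded)


def labels_match(a: str, b: str) -> bool:
    """Case-insensitive comparison: fold both labels, then compare."""
    return _fold(a) == _fold(b)
```

   With this sketch a label typed with combining marks and the same
   label typed with precomposed characters compare as equal, whatever
   the case of the letters.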

   A non-i18n aware DNS server may not be authoritative for a zone with
   non-ASCII in it, because it cannot do case-insensitive matching
   on non-ASCII.


   Note: Normalisation form KC could have been used instead
   of form C, but form KC is both much more complex to handle and
   does not preserve all semantics of the text. Form KC would make
   some characters match as equal that do not match in form C.
   Problems with different character representations can be fixed
   with a separate recommendation of what characters should be used
   in domain names.

   Note: Case folding to lower case using UTR#21 is not perfect. For
   example, in Turkish I is lower cased to a dotless i, but UTR#21
   does it in the old ASCII way (I -> i). This way we get a well
   defined lower casing that can be used in matching, but it will
   not be correct for all local rules of different languages.
   The Turkish problem can be dealt with by asking users to
   only use a lower case dotless i, when needed.

   Note: Currently ISO 10646 is at level ISO 10646-1:2000 and
   Unicode at version 3.0. Later on more characters will be added
   to those standards and they may include additional characters that
   need new rules for normalisation and lower casing. As long as
   data is normalised and lower cased outside DNS, it will work
   without problems. (lower casing is probably not needed)
   

2.2.1 Down coding
   As a local character set may not support all of the characters of
   UCS used internally in DNS, a way to encode unsupported characters
   into the local character set is needed. That way a domain name can
   be used even if the local character set cannot represent all
   characters in a name.

   This will be done by down coding UTF-8 into the local character set.
   It is done as follows:
     - If a character can be represented in the local character set,
       map it from UCS to the local character set.
     - If a character cannot be represented in the local character set,
       map the UTF-8 octet sequence for the character to a hyphen ("-")
       followed by the hex code of each octet as two characters per octet.
     - If down coding was needed because not all characters could be
       represented in the local character set, all original hyphens
       must be replaced by two hyphens ("--") and the entire string
       MUST end with a single hyphen.

       Examples:
       If we have the name: Ab-<a with ring above>r<greek omega>z
       this is represented in DNS as UTF-8:
          (HEX) 41 62 2d c3 a5 72 cf 89 7a
       If the local character set is ISO 8859-1, the down coded name
       is: Ab--<a with ring above>r-cf89z-.
       If the local character set is ASCII, the down coded name
       is: Ab---c3a5r-cf89z-.
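
   The down coding steps can be sketched in Python as below. This is
   only a sketch under stated assumptions: the function name is
   hypothetical, the local character set is given as a Python codec
   name, and the result is returned as a Python string rather than as
   octets in the local character set. The sample name uses a with ring
   above (U+00E5, UTF-8 octets c3 a5) and Greek small omega (U+03C9,
   UTF-8 octets cf 89).

```python
def down_code(name: str, charset: str) -> str:
    """Down code a name for a local character set, given as a codec name."""

    def representable(ch: str) -> bool:
        try:
            ch.encode(charset)
            return True
        except UnicodeEncodeError:
            return False

    if all(representable(ch) for ch in name):
        return name  # nothing to escape, so the name is left untouched

    parts = []
    for ch in name:
        if ch == "-":
            parts.append("--")  # original hyphens are doubled
        elif representable(ch):
            parts.append(ch)
        else:
            # A hyphen followed by two hex characters per UTF-8 octet.
            parts.append("-" + ch.encode("utf-8").hex())
    return "".join(parts) + "-"  # trailing hyphen marks a down coded name


name = "Ab-\u00e5r\u03c9z"
latin1_form = down_code(name, "iso-8859-1")
ascii_form = down_code(name, "ascii")
```

   A name that is fully representable in the local character set is
   returned unchanged, so plain ASCII names are never altered.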

2.2.2 Up coding
   When character data is entered into i18n aware DNS software, it must
   be up coded from the down coding format into UTF-8. A down coded
   name is identified by a trailing hyphen. When up coding, a name
   that yields invalid UTF-8 sequences should be left as it is, as it
   may be an old name with a trailing hyphen.
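
   A possible up coding routine is sketched below in Python. It rests
   on several assumptions that this draft does not state: hex digits
   are lower case, the number of octets in an escape is taken from the
   UTF-8 lead octet, and a name with a trailing hyphen but no valid
   escape is treated as an old name and returned unchanged. All
   function names are hypothetical.

```python
HEX_DIGITS = set("0123456789abcdef")


def _utf8_length(lead: int) -> int:
    """Length of a non-ASCII UTF-8 sequence, taken from its first octet."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    raise ValueError("not a UTF-8 lead octet")


def _read_hex(text: str, start: int, octets: int) -> bytes:
    """Read exactly `octets` octets written as lower case hex pairs."""
    chunk = text[start:start + 2 * octets]
    if len(chunk) != 2 * octets or not set(chunk) <= HEX_DIGITS:
        raise ValueError("not a hex escape")
    return bytes.fromhex(chunk)


def up_code(name: str) -> str:
    """Up code a down coded name; return the input unchanged if it is not one."""
    if not name.endswith("-"):
        return name
    body, out, i, escapes = name[:-1], [], 0, 0
    try:
        while i < len(body):
            if body[i] != "-":
                out.append(body[i])
                i += 1
            elif body.startswith("--", i):
                out.append("-")  # a doubled hyphen is one original hyphen
                i += 2
            else:
                length = _utf8_length(_read_hex(body, i + 1, 1)[0])
                out.append(_read_hex(body, i + 1, length).decode("utf-8"))
                i += 1 + 2 * length
                escapes += 1
    except ValueError:  # also covers UnicodeDecodeError
        return name  # probably an old name with a trailing hyphen
    return "".join(out) if escapes else name
```

   Note that an old ASCII name which happens to look like a valid
   escape followed by a trailing hyphen cannot be told apart from a
   down coded name by these rules.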


2.3 Rules for character data in queries and responses

   [ Below is my original text here, after it are new thoughts that
     make this unneeded ]
   There is only one area which non-i18n aware software cannot
   handle: case-insensitive matching of i18n data.
   Because of this, the IN bit is defined and character data
   MUST be handled as follows:

   - In all queries all character data that will be used by the DNS server
     to lookup records, MUST be in lower case.
   - A request containing an update of the data in the database of the
     DNS server (for example a DNS update) MUST send data in the
     original case.
   - If the server is i18n aware and the client is not,
     a DNS server MUST NOT send a zone transfer.
   - A DNS server getting a request from an i18n aware client MUST
     return data using original case, just like old software does.
   - An i18n aware DNS server getting a request from a non-i18n aware
     client MUST return all character data that can be used in character
     matching, in lower case.

   The result of the above rules is that old non-i18n aware
   DNS software only gets lower cased character data so that it can
   still perform character data matching. I18n aware software will
   get data as before, preserving case, but can still optimise
   character matching as all normal queries will have their data
   lower cased.

   [ End original text, now follows new thoughts and questions ]

   There are two important areas: how non-i18n aware DNS software
   works with i18n aware software, and how non-ASCII domain names affect
   other software and protocols that are not i18n aware.

   Aspect: case-insensitive matching. This is used when DNS software
   matches data and it may also be used in other software for matching.
   The old text above defines that all non-i18n aware DNS software gets
   everything lower cased so that it can compare data. What happens if
   we do not lower case it? An authoritative DNS server must be i18n aware
   so it can also handle matching. A caching DNS server of the non-i18n
   aware type may cache different case mappings of the same request,
   as it does not recognise the case of some characters, or it might never
   get a match in its cache for a query (if the query has a different case
   than the response and only the response is cached). This may result in
   larger caches in old software and/or more DNS traffic. But will it
   result in invalid responses? I cannot think how. Maybe we do not
   have to lower case data when sending a response to non-i18n aware
   DNS software.
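
   The caching effect described above can be illustrated with a small
   Python sketch. It assumes that an old cache builds its lookup key by
   lower casing ASCII letters only, and it uses str.casefold() as a
   stand-in for i18n aware folding; both key functions are hypothetical.

```python
import unicodedata


def ascii_only_key(qname: bytes) -> bytes:
    """Cache key as old software might compute it: only A-Z is lower cased."""
    return qname.lower()  # bytes.lower() changes ASCII letters only


def i18n_key(qname: bytes) -> bytes:
    """Cache key with normalisation and case folding of all characters."""
    text = unicodedata.normalize("NFC", qname.decode("utf-8"))
    return text.casefold().encode("utf-8")


# The same name in two different cases, as UTF-8 on the wire.
upper = "\u00c5SA.example".encode("utf-8")
lower = "\u00e5sa.example".encode("utf-8")

old_cache = {ascii_only_key(q): "answer" for q in (upper, lower)}
new_cache = {i18n_key(q): "answer" for q in (upper, lower)}
```

   In this sketch the old cache ends up holding two entries for what is
   one name, while the i18n aware cache holds one. The answers are the
   same in both cases, only the cache size and hit rate differ.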

   Aspect: Old non-DNS software and other protocols. Here we have the
   problem that old software may think a UTF-8 encoded name is not
   a valid domain name. For example some SMTP servers may reject the
   e-mail and even fail to return the error to the sender.
   One way to fix most of this is to define that i18n aware DNS servers
   always down code (see above) the UTF-8 into ASCII before sending
   it to non-i18n aware DNS software. The result of this is:
     - Old software gets an ASCII only domain name using only the
       old set of allowed characters.
     - Old software will get an error response if the response would
       include a long i18n domain name that will not fit in the 63 octets
       allowed, after down coding.
     - Both i18n aware DNS servers and client software must handle
       up coding of domain names.
     - If a query comes from an i18n aware client talking to an old
       DNS caching server, the query will be in UTF-8, so the
       i18n aware authoritative DNS server will get a query with UTF-8
       in it from non-i18n aware software. This is probably no problem.
       Just answer with down coded data in the response, and the i18n
       aware client needs to up code the response.
     - Domain names used from old software will work in other protocols
       only allowing ASCII names.
     - We may get old software that is never fixed as it still works.
     - We do not get rid of this ugly, user unfriendly, encode everything
       in ASCII handling that is used so much.
     - What more?
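
   To give a feeling for the 63 octet problem in the list above, the
   following Python sketch computes how long a label becomes when it is
   down coded to ASCII using the rules in section 2.2.1. The function
   names are hypothetical and the sketch only covers ASCII as the
   target character set.

```python
MAX_LABEL_OCTETS = 63


def ascii_down_coded_length(label: str) -> int:
    """Number of octets a label needs after down coding to ASCII."""
    if label.isascii():
        return len(label)  # no down coding needed
    total = 1  # the trailing hyphen
    for ch in label:
        if ch == "-":
            total += 2  # doubled hyphen
        elif ch.isascii():
            total += 1
        else:
            # One hyphen plus two hex characters per UTF-8 octet.
            total += 1 + 2 * len(ch.encode("utf-8"))
    return total


def fits_after_down_coding(label: str) -> bool:
    """True if the down coded label still fits in a single DNS label."""
    return ascii_down_coded_length(label) <= MAX_LABEL_OCTETS


# Twenty characters that need two UTF-8 octets each: 40 octets in
# UTF-8, which fits, but 101 octets after down coding, which does not.
long_label = "\u00e5" * 20
```

   A character that takes two octets in UTF-8 takes five after down
   coding, so a label can exceed the limit well before it reaches 63
   octets of UTF-8.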


   What is best? Down coding to ASCII and not lower casing
   when talking with non-i18n aware software?


2.4 Canonical DNS Name Order
   The canonical DNS name order as defined in section 8.2 of RFC2535 is
   extended so that names are first case folded to lower case, as
   defined above, and then sorted as defined in RFC2535.
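
   A minimal Python sketch of this ordering follows. It assumes that
   the RFC2535 order compares names label by label starting from the
   rightmost label, with each label compared as an unsigned octet
   string, and it uses str.casefold() as a stand-in for the case
   folding defined above. The function names are hypothetical.

```python
import unicodedata


def canonical_key(name: str) -> tuple:
    """Sort key: folded UTF-8 labels, most significant (rightmost) first."""
    labels = name.rstrip(".").split(".")
    folded = [
        unicodedata.normalize("NFC", label).casefold().encode("utf-8")
        for label in labels
    ]
    return tuple(reversed(folded))


def canonical_sort(names):
    """Return the names in canonical order; the input is not modified."""
    return sorted(names, key=canonical_key)


ordered = canonical_sort(["z.example", "Z.a.example", "a.example", "example"])
```

   Python compares bytes objects octet by octet as unsigned values and
   sorts a shorter sequence before a longer one with the same prefix,
   which matches the comparison the sketch assumes.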



3. Characters allowed in domain names

   The DNS protocol does not place any restriction on characters used in
   a domain name. However applications that make use of DNS
   data may have restrictions imposed on what particular values are
   acceptable in their environment. If the client has such restrictions,
   it is solely responsible for validating the data from the DNS to ensure
   that it conforms before it makes any use of that data. [RFC2181]

   For example domains, hosts and e-mail addresses are represented in DNS
   and may have different rules.

   As the whole idea of internationalisation of DNS is to get domain names
   with non-ASCII, the original recommendation in DNS [RFC1035] for
   host/domain names needs to be updated.

   It is recommended that domains, hosts and e-mail addresses all are
   extended to allow all letters, digits and some separators of UCS.
   Careful thought should also be given to different forms of characters
   so that names that differ only in things like double/normal width are
   not allowed.

   This has to be defined in another document.


4. User interface issues

   Locally on a system or in a user interface a different character set
   than the one defined to be used in the DNS protocol may be used.
   Therefore software must map between the local character set and the
   character set of the protocol, so that human beings can understand it.

   This means that a zone file that is edited in a text editor by a person
   before being loaded into a DNS server must be allowed to be in the local
   character set. Software may not assume that the user can edit text
   encoded in UTF-8. A zone file transmitted between DNS software that
   is not handled by a human, can be transmitted using any format.

   When character data is presented to a human or entered by a human,
   software must, as well as possible, present it using the local
   character set and allow it to be entered using the local character set.
   It is the responsibility of the software to convert between the local
   character set and the one used in the protocol, not the human.


4.1 Interfacing with local system in DNS software
   The resolver code used to make queries into the DNS must map between
   the local character set and the character format defined for DNS above.
   When returning data, for example in a gethostbyaddr call, it must
   return character data in the local character set, down coded as
   defined above. And the other way around, when given a query, it must
   map from the local character set into UTF-8. It is the responsibility
   of the resolver code to do the mapping, not the application.
   The software should also allow a way to use UTF-8 in both query
   and response, if the application should want that.

   When a DNS server loads a zone file, it must map from local character
   set to UTF-8. The user cannot be expected to be able to enter data
   in UTF-8.
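
   The mapping from the local character set can be sketched in Python
   as below. This is only an illustration: the function name is
   hypothetical, the local character set is given as a Python codec
   name, and up coding of names already in down coded form is left out.

```python
import unicodedata


def local_to_dns(data: bytes, local_charset: str) -> bytes:
    """Map text in the local character set to form C normalised UTF-8."""
    text = data.decode(local_charset)
    return unicodedata.normalize("NFC", text).encode("utf-8")


# A name as it might be typed into a zone file on an ISO 8859-1 system.
zone_file_name = b"bl\xe5b\xe6r.example"
wire_name = local_to_dns(zone_file_name, "iso-8859-1")
```

   The same function could serve a resolver that is handed a query name
   in the local character set by an application.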



5. Effect on other protocols

   As a domain name may now include non-ASCII, many other protocols
   that include domain names need to be updated. For example
   SMTP, HTTP and URIs.

   In many protocols domain names are used in headers. It is recommended
   that they are updated to be encoded using UCS normalised using form C
   of UTR#15 and encoded using UTF-8. And the same format for
   other character data of the protocols. This way ugly things like
   quoted-printable can be obsoleted.

   We can now expect users to want to have e-mail addresses with
   non-ASCII both before and after the @-sign.

   Software needs to be updated to follow the user interface recommendations
   given above, so that a human will see the characters in their local
   character set, if possible.

6. Security Considerations

   As always with data, if software does not check for data that can
   be a problem, security may be affected. As more characters
   than ASCII are allowed, software only expecting ASCII and with no checks
   may now get security problems.

7. References

   [RFC1034]   Mockapetris, P., "Domain Names - Concepts and Facilities",
               STD 13, RFC 1034, November 1987.

   [RFC1035]   Mockapetris, P., "Domain Names - Implementation and
               Specification", STD 13, RFC 1035, November 1987.

   [RFC2279]   Yergeau, F., "UTF-8, a transformation format of
               ISO 10646", RFC 2279, January 1998.

   [RFC2181]   Elz, R. and R. Bush, "Clarifications to the DNS
               Specification", RFC 2181, July 1997.

   [RFC2535]   Eastlake, D., "Domain Name System Security Extensions",
               RFC 2535, March 1999.

   [RFC2671]   Vixie, P., "Extension Mechanisms for DNS (EDNS0)",
               RFC 2671, August 1999.

   [RFC2119]   Bradner, S., "Key words for use in RFCs to Indicate
               Requirement Levels", RFC 2119, March 1997.

   [ISO10646]  

   [UTR15]     Mark Davis and Martin Duerst, "Unicode Normalization Forms",
               Unicode Technical Report #15,
               <http://www.unicode.org/unicode/reports/tr15/>.

   [Unicode3]  The Unicode Consortium, "The Unicode Standard -- Version
               3.0", ISBN 0-201-61633-5. Described at
               <http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.

   [UnicodeData] The Unicode Character Database, 
                <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.
                 The database is described in
                <ftp://ftp.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html>.



8. Acknowledgements

   Ideas from drafts by Paul Hoffman, Stuart Kwan, James Gilroy and
   Kent Karlsson.

   Magnus Gustavsson for comments on my draft.

   Discussions and comments by the members of the IDN working group.



9. Author's Address

   Dan Oscarsson
   Telia ProSoft AB
   Box 85
   201 20 Malmo
   Sweden

   E-mail: Dan.Oscarsson@trab.se