[idn] 2 new internet-drafts

To: idn@ops.ietf.org
Subject: [idn] 2 new internet-drafts
From: Marc Blanchet <Marc.Blanchet@viagenie.qc.ca>
Date: Wed, 05 Jul 2000 23:04:45 -0400
Cc: phoffman@imc.org
Delivery-date: Wed, 05 Jul 2000 20:06:24 -0700
Envelope-to: idn-data@psg.com

Hi,
	Paul Hoffman and I have written two internet-drafts that we have just
submitted: draft-ietf-idn-nameprep-00.txt and draft-ietf-idn-idne-00.txt.
They work together. We are very open to receiving comments!
	Since they are not big files, I'm including the drafts in this email,
because these days the secretariat seems to take time to process drafts
(probably because of the number of drafts arriving near the deadline).
That way you can take a look at them right now and comment before
Pittsburgh.

Regards, Marc and Paul.

Internet Draft                                   Marc Blanchet
draft-ietf-idn-idne-00.txt                            Viagenie
July 5, 2000                                     Paul  Hoffman
Expires in six months                               IMC & VPNC

          Internationalized domain names using EDNS (IDNE)

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

To view the entire list of Internet-Draft Shadow Directories, see
http://www.ietf.org/shadow.html.


Abstract

The current DNS infrastructure does not provide a way to use
internationalized domain names (IDN). This document describes an
extension mechanism based on EDNS which enables the use of IDN without
causing harm to the current DNS. IDNE enables IDN host names with as
many characters as current ASCII-only host names. It fully supports
UTF-8 and conforms to the IDN requirements.


1. Introduction

Various proposals for IDN have tried to integrate IDN into the current
limited ASCII DNS. However, compatibility issues impose too many
constraints on the architecture. Many of these proposals require
modifications to the applications, to the DNS protocol, or to the
servers. This proposal takes a different approach: it uses the
standardized extension mechanism for DNS (EDNS) and uses UTF-8 as the
mandatory charset. It causes no harm to the current DNS because it uses
the EDNS extension mechanism. The major drawback of this proposal is
that all protocols, applications and DNS servers will have to be
upgraded to support this proposal.

1.1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
shown preceded with an "0b". For example, a nine-bit value might be
shown as "0b101101111".

1.2 IDN summary

Using the terminology in [IDNComp],  this protocol specifies an IDN
architecture of arch-2 (send binary or ACE). The binary format is
bin-1.1 (UTF-8), and the method for distinguishing binary from current
names is bin-2.4 (mark binary with EDNS0). The transition period is not
specified.


2. Functional Description

DNS queries and responses containing IDNE labels have the following
properties:

- The string in the label MUST be pre-processed as described in
[NAMEPREP] before the query or response is prepared.

- The characters in the label MUST be encoded using UTF-8 [RFC2279].

- The entire label MUST be encoded using EDNS [RFC2671].

- The version of the IDN protocol MUST be identified.


3. Encoding

An IDNE label uses the EDNS extended label type prefix (0b01), as
described in [RFC2671]. (A normal label type always begins with 0b00.) A
new extended label type for IDNE is used to identify an IDNE label. This
document uses 0b000010 as the extended label type; however, the label
type will be assigned by IANA and it may not be 0b000010.

        0                   1                   2
bits  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2     . . .
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-//+-+-+-+-+-+-+
       |0 1|    ELT    |     Size      |        IDN label ...        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+-+-+-+-+-+


ELT: The six-bit extended label type to be assigned by the IANA for an
IDN label. In this document, the value 0b000010 is used, although that
might be changed by IANA.

Size: Size (in octets) of the IDN label following.

IDN label: Label, encoded in UTF-8 [RFC2279]. Note that this label might
contain all ASCII characters, and thus can be used for host name labels
that are legal in [STD13].

IDNE labels can be mixed with STD13 labels in a domain name.

The compression scheme in section 4.1.4 of [STD13] is supported as is.
Pointers can refer to either IDN labels or non-IDN labels.
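
As an illustration of the layout above, a receiver might classify and
read one label at a time roughly as follows. This is only a sketch in
Python: the function name read_label is hypothetical, IDNE_ELT uses the
placeholder value 0b000010 from this document rather than an
IANA-assigned one, and compression pointers are deliberately left out.

```python
IDNE_ELT = 0b000010  # placeholder ELT from this document; IANA may assign another


def read_label(msg: bytes, pos: int):
    """Read one label at msg[pos]; return (kind, label_octets, next_pos)."""
    first = msg[pos]
    top = first >> 6                      # two high bits select the label type
    if top == 0b00:                       # STD13 label: the octet is the length
        if first == 0:
            return ("root", b"", pos + 1)
        end = pos + 1 + first
    elif top == 0b01:                     # EDNS extended label
        if first & 0x3F != IDNE_ELT:
            raise ValueError("extended label type is not IDNE")
        size = msg[pos + 1]               # Size octet follows the type octet
        pos += 1                          # skip the extra header octet
        end = pos + 1 + size
    else:
        raise ValueError("pointers and other types are not handled in this sketch")
    if end > len(msg):
        raise ValueError("label runs past the end of the message")
    kind = "idne" if top == 0b01 else "std13"
    return (kind, msg[pos + 1:end], end)


# An IDNE label (4 octets), the STD13 label "com", and the root label.
sample = bytes.fromhex("42046d65cc81" "03636f6d" "00")
```

Calling read_label(sample, 0) yields the IDNE label octets and the
offset of the next label, so a caller can loop until it reads the root.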

3.1 Examples

3.1.1 Basic example

The following example shows the name me.com where the "e" in "me" is
replaced by a <LATIN CAPITAL LETTER E WITH ACUTE>, which has the
codepoint 0x00C9. The decomposition and downcasing specified in
[NAMEPREP] produces the string <LATIN SMALL LETTER E><COMBINING ACUTE
ACCENT>, which is 0x00650301. This is then transformed using
UTF-8[RFC2279] to: 0x65CC81.

Ignoring the other fields of the message, the domain name portion of the
datagram could look like:

        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     20 | 0  1  0  0  0  0  1  0| 0  0  0  0  0  1  0  0|
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     22 |         0x6D (m)      |       0x65 (e)        |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     24 |         0xCC ('(1))   |       0x81 ('(2))     |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     26 |         3             |       0x63 (c)        |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     28 |         0x6F (o)      |       0x6D (m)        |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     30 |         0x00          |                       |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Octet 20 is the EDNS extended label type prefix (0b01) followed by the
      IDN label type (0b000010).
Octet 21 is the size of the label: 4 octets follow.
Octets 22-25 are the "m*" label (where the "*" is
       <LATIN SMALL LETTER E><COMBINING ACUTE ACCENT>)
Octets 26-29 are "com" encoded as a STD13 label
Octet 30 is the root domain
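
The octets of this example can be reproduced with a few lines of
Python. This is a sketch under stated assumptions: the helper names
idne_label and std13_label are hypothetical, the ELT is the placeholder
0b000010 used in this document, and the already-prepared string is taken
directly from the example rather than recomputed.

```python
IDNE_ELT = 0b000010  # placeholder ELT from this document


def idne_label(prepared: str) -> bytes:
    """Encode one prepared label as: 0b01 + ELT, Size octet, UTF-8 octets."""
    data = prepared.encode("utf-8")
    if len(data) > 255:
        raise ValueError("IDNE label longer than 255 octets")
    return bytes([(0b01 << 6) | IDNE_ELT, len(data)]) + data


def std13_label(text: str) -> bytes:
    """Encode one classic label as: length octet, ASCII octets."""
    data = text.encode("ascii")
    if not 0 < len(data) <= 63:
        raise ValueError("STD13 label must be 1 to 63 octets")
    return bytes([len(data)]) + data


# "m" + LATIN SMALL LETTER E + COMBINING ACUTE ACCENT, then "com", then root.
wire = idne_label("me\u0301") + std13_label("com") + b"\x00"
print(wire.hex(" "))
```

The result is 11 octets long, matching offsets 20 through 30 in the
example.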

3.1.2 Example with compression

Using the previous labels, one datagram might contain "www.m*.com" and
"m*.com" (where the "*" is <LATIN SMALL LETTER E><COMBINING ACUTE
ACCENT>).

Ignoring the other fields of the message, the domain name portions of
the datagram could look like:

        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     20 | 0  1  0  0  0  0  1  0| 0  0  0  0  0  1  0  0|
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     22 |         0x6D (m)      |       0x65 (e)        |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     24 |         0xCC ('(1))   |       0x81 ('(2))     |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     26 |         3             |       0x63 (c)        |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     28 |         0x6F (o)      |       0x6D (m)        |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     30 |         0x00          |                       |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
    .    .    .
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     40 |           3           |       0x77 (w)        |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     42 |       0x77 (w)        |       0x77 (w)        |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
     44 | 1  1|                20                       |
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

The domain name "m*.com" is shown at offset 20. The domain name
"www.m*.com" is shown at offset 40; this definition uses a pointer to
concatenate a label for www to the previously defined "m*.com".
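
A reader of such a message has to follow the pointer to rebuild the full
name. The sketch below, in Python, shows one way this might be done. The
function name read_name and the limit of 16 pointer hops are
assumptions made for illustration, and the ELT is the placeholder
0b000010 from this document.

```python
IDNE_ELT = 0b000010  # placeholder ELT from this document


def read_name(msg: bytes, pos: int) -> list:
    """Return the labels of the name starting at pos, following pointers."""
    labels = []
    hops = 0
    while True:
        first = msg[pos]
        top = first >> 6
        if top == 0b11:                       # STD13 compression pointer
            pos = ((first & 0x3F) << 8) | msg[pos + 1]
            hops += 1
            if hops > 16:                     # arbitrary guard against loops
                raise ValueError("too many compression pointers")
        elif first == 0:                      # root label ends the name
            return labels
        elif top == 0b00:                     # STD13 label
            labels.append(msg[pos + 1:pos + 1 + first].decode("ascii"))
            pos += 1 + first
        elif top == 0b01 and first & 0x3F == IDNE_ELT:
            size = msg[pos + 1]               # IDNE label: type, Size, UTF-8
            labels.append(msg[pos + 2:pos + 2 + size].decode("utf-8"))
            pos += 2 + size
        else:
            raise ValueError("unsupported label type")


# Rebuild the two name fields of the example at offsets 20 and 40.
buf = bytearray(46)
buf[20:31] = bytes.fromhex("42046d65cc8103636f6d00")
buf[40:46] = bytes.fromhex("03777777c014")   # "www" + pointer to offset 20
message = bytes(buf)
```

Here read_name(message, 40) returns the three labels of "www.m*.com",
with the last two read through the pointer to offset 20.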


4. Label Size

In IDNE, the maximum length of a label is 255 octets, and the maximum
size for a domain name is 1023 octets. The reason for using these values
is so that IDNE labels can have the same number of characters as the
ASCII-based labels in [STD13]. Because character encoding in UTF-8 is
variable length, the maximum octet length for characters expected in the
foreseeable future (that is, 4 octets for a single character) was used.
Note that this extension allows some IDNE labels to be longer than 63
characters and some IDNE names to be longer than 255 octets.

Software creating DNS queries or responses using IDNE MUST verify,
after IDN preparation and transformation to UTF-8, that no labels are
longer than 255 octets and that no names are longer than 1023 octets. If
there is a user interface associated with the process creating the query
or response, that interface SHOULD give the user an error message.

Software MUST NOT transmit DNS queries or responses which contain labels
that are longer than 255 octets or names that are longer than 1023
octets. Servers MUST NOT accept DNS queries or responses which contain
labels that are longer than 255 octets or names that are longer than
1023 octets, and MUST send the NOTIMPL RCODE error message if such
queries or responses are received.
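
A sender-side check of these limits might look like the following
Python sketch. The function name check_idne_sizes is hypothetical, and
the way the name length is counted here (the sum of the UTF-8 octets of
all labels) is an assumption, since this document does not say whether
label headers are included.

```python
MAX_LABEL_OCTETS = 255
MAX_NAME_OCTETS = 1023


def check_idne_sizes(labels):
    """Return a list of problems; an empty list means the limits are met."""
    problems = []
    total = 0
    for label in labels:
        octets = len(label.encode("utf-8"))   # size after UTF-8 encoding
        total += octets
        if octets > MAX_LABEL_OCTETS:
            problems.append(
                f"label of {octets} octets exceeds {MAX_LABEL_OCTETS}")
    if total > MAX_NAME_OCTETS:               # assumed counting rule, see above
        problems.append(
            f"name of {total} octets exceeds {MAX_NAME_OCTETS}")
    return problems
```

A user interface could show the returned strings as the error message
that this section recommends.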


5. UDP Packet Size

IDNE-capable senders and receivers MUST support UDP packet sizes of 1220
octets, not including IP and UDP headers (note that the minimum MTU for
IPv6 is 1280 [RFC2460]). A sender MUST announce its capability in the
OPT pseudo-RR described in section 4.3 of [RFC2671] by setting the CLASS
field (the sender's UDP payload size) to a value greater than or equal
to 1220.


6. Canonicalization, Prohibited Characters, and Case Folding

The string in the label MUST be pre-processed as described in [NAMEPREP]
before the query or response is prepared. A query or response MUST NOT
contain a label that does not conform to [NAMEPREP].

DNS servers MUST check for prohibited characters in the labels. If a
prohibited character is found in any label of a query, a NOTIMPL RCODE
MUST be returned.


7. Versions of IDNE

The IDN protocol version number MUST be included in the OPT RR RDATA of
EDNS (described in Section 4.4 of [RFC2671]). An OPTION-CODE will be
assigned by IANA for storing the IDNE protocol version number; this
document uses 0x0001 for the OPTION-CODE. The value (that
is, the OPTION-DATA) is the version number coded in 8 bits.

All requesters MUST send this information as part of the OPT RR included
in the EDNS packet.

7.1 This version of IDNE

This document describes version 1 of IDNE. This version is a combination
of the protocol in this document and the rules as described in
[NAMEPREP]. Note that [NAMEPREP] describes a single version of the list
of canonicalization, case folding, and prohibited characters, and that
this document is linked to that single version of [NAMEPREP].

The identifiers for this specification are:
OPTION-CODE =   0x0001  (IDNE protocol version)
OPTION-LENGTH = 0x0001  (1 octet following)
OPTION-DATA =   0x01  (IDNE protocol version 1)
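
Put together with the UDP payload size announcement, an OPT pseudo-RR
carrying these identifiers could be assembled as in the Python sketch
below. The field layout follows [RFC2671]; the function name
build_opt_rr is hypothetical, the OPTION-CODE 0x0001 is the placeholder
used in this document, and the extended RCODE, EDNS version, and Z
fields are simply set to zero here.

```python
import struct

OPT_TYPE = 41                 # RR type of the OPT pseudo-RR
IDNE_OPTION_CODE = 0x0001     # placeholder from this document, not IANA-assigned


def build_opt_rr(payload_size: int = 1220, idne_version: int = 1) -> bytes:
    """Build an OPT pseudo-RR announcing payload size and IDNE version."""
    if payload_size < 1220:
        raise ValueError("IDNE requires a UDP payload size of at least 1220")
    # OPTION-CODE (16 bits), OPTION-LENGTH (16 bits), OPTION-DATA (8 bits)
    rdata = struct.pack("!HHB", IDNE_OPTION_CODE, 1, idne_version)
    # NAME (root), TYPE, CLASS = payload size, TTL = ext. RCODE/flags, RDLEN
    header = struct.pack("!BHHIH", 0, OPT_TYPE, payload_size, 0, len(rdata))
    return header + rdata


opt_rr = build_opt_rr()
print(opt_rr.hex(" "))
```

The last five octets of the output are the option itself: 00 01, 00 01,
and 01.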

7.2 Creating new versions of IDNE

A new version of IDNE is created by a standards-track RFC that
specifies:

- a normative reference to [NAMEPREP] or a successor document to
[NAMEPREP]

- an IDNE version number that is 1 greater than the highest IDNE version
number at the time the RFC is published

If there are any changes to the encoding or interpretation of the
protocol, they must also be specified in the same standards-track RFC.

7.3 Prohibited characters and versions of IDNE

If a server receives a request containing an illegal or unknown
character (as determined by the version number in the request), it MUST
send a NOTIMPL RCODE to the client. For example, if a server that
understands both version 1 and version 2 receives a request that is
marked as version 1, but contains a label that includes a character that
is prohibited in version 1 but allowed in version 2, that server must
still send a NOTIMPL RCODE to the client.
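
The rule can be pictured as a lookup keyed on the version carried in the
request, as in this Python sketch. Everything in it is illustrative: the
tables are tiny samples, "version 2" and its contents are invented for
the example, and refusing unknown versions with the same RCODE is an
assumption that this document does not state. The value 4 is the "Not
Implemented" RCODE from [STD13].

```python
NOERROR = 0
NOTIMP = 4   # "Not Implemented" RCODE value from STD13

# Sample tables only. Version 2 is hypothetical and allows U+005C.
PROHIBITED_BY_VERSION = {
    1: {0x0020, 0x00AD, 0x005C},
    2: {0x0020, 0x00AD},
}


def rcode_for_request(labels, version):
    """Judge labels by the version marked in the request, not the newest one."""
    table = PROHIBITED_BY_VERSION.get(version)
    if table is None:
        return NOTIMP                 # assumption: unknown versions are refused
    for label in labels:
        if any(ord(ch) in table for ch in label):
            return NOTIMP
    return NOERROR
```

A request marked version 1 that contains U+005C is refused even though
the same server would accept that label in a version 2 request.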


8. API Specifications

The current API for TCP/IP uses gethostbyname and gethostbyaddr for IPv4
and getipnodebyname and getipnodebyaddr (specified in RFC 2553) for
both IPv4 and IPv6. These function calls return hostent structs, in
which the h_name field is a pointer to char. In this context, receiving
a UTF-8 string means that the application must be aware that UTF-8 can
use more than one octet per character.

A new flag "IDN" (to appear in netdb.h) is defined to be passed in the
flags argument of getipnodebyname and getipnodebyaddr. This flag tells
the resolver to request an IDNE-encoded name. No new return code is
defined, since the error codes in RFC 2553 remain meaningful in the IDNE
context.

Developers who have not yet converted their code to IPv6 but still want
to enable IDNs with this API can define macros that map the getipnodeby*
functions to the IPv4 gethostby* ones, accepting the "IDN" flag, and then
process differently based on the presence of the flag.


9. Transition and Deployment

Deployment of this proposal means updating clients and servers, as well
as applications and protocols, and therefore a transition strategy is
proposed. Because many DNS servers do not yet handle IDNE and may take
years or decades to do so, an ASCII-compatible encoding (ACE) format for
IDN names is also needed as a transition to an all-IDNE DNS. Note that
IDNE and an ACE are not related, and do not interact in the DNS. If the
IETF chooses to have an ACE mechanism in use at the same time as IDNE,
it would be wise to choose an ACE that allows as many characters as
possible in the name parts and full names.

IDNE allows names with as many characters as current names. This means
that it is possible to create names in IDNE that are longer than those
that can be created in the ACE protocols that have been described so
far. Although not prohibited, it is unwise to create a name that can be
legally represented in IDNE but not in the ACE, or a name that can be
legally represented in the ACE but not in IDNE.

The IETF should periodically evaluate the benefits and problems
associated with having three different formats for names (STD13, IDNE,
and ACE). If at some point it is decided that the problems outweigh the
benefits, the IETF can state a time when one or more of the services
should not be used on the Internet.


10. Root Server Considerations

Because this specification uses EDNS, root servers should be prepared to
receive EDNS requests. This specification handles IDN top-level domains
in exactly the same fashion as it does every other domain.
Considerations about IDN top-level domains are outside of this work, but
the first IDN top-level domains would require all root servers to be
ready for IDNE requests.


11. IANA Considerations

[[ TBD. This section will have two parts. The first will request an EDNS
option code. The second will specify how IDNE version numbers are
allocated (namely, standards-track RFC only). ]]


12. Security Considerations

Because IDNE uses EDNS, it inherits the same security considerations as
EDNS.


13. References

[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name
Proposals", draft-ietf-idn-compare.

[NAMEPREP] Paul Hoffman & Marc Blanchet, "Preparation of
Internationalized Host Names", draft-ietf-idn-nameprep.

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.

[RFC2279] Francois Yergeau, "UTF-8, a transformation format of ISO
10646", January 1998, RFC 2279.

[RFC2460] Steve Deering & Bob Hinden, "Internet Protocol, Version 6 (IPv6)
Specification", December 1998, RFC 2460.

[RFC2671] Paul Vixie, "Extension Mechanisms for DNS (EDNS0)", August
1999, RFC 2671.

[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035).


A. Authors' Addresses

Marc Blanchet
Viagenie
2875 boul. Laurier, bureau 300
Sainte-Foy, QC  G1V 2M2 Canada
Marc.Blanchet@viagenie.qc.ca

Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA  95060 USA
phoffman@imc.org

Internet Draft                                          Paul Hoffman
draft-ietf-idn-nameprep-00.txt                            IMC & VPNC
July 3, 2000                                           Marc Blanchet
Expires in six months                                       ViaGenie

             Preparation of Internationalized Host Names

Status of this memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

To view the list of Internet-Draft Shadow Directories, see
http://www.ietf.org/shadow.html.


Abstract

This document describes how to prepare internationalized host names for
transmission on the wire. The steps include excluding characters that
are prohibited from appearing in internationalized host names, changing
all characters that have case properties to be lowercase, and
normalizing the characters. Further, this document lists the prohibited
characters.


1. Introduction

When expanding today's DNS to include internationalized host names,
those new names will be handled in many parts of the DNS. The IDN
Working Group's requirements document [IDNReq] describes a framework for
domain name handling as well as requirements for the new names. The IDN
Working Group's comparison document [IDNComp] gives a framework for how
various parts of the IDN solution work together.

A user can enter a domain name into an application program in a myriad
of fashions. Depending on the input method, the characters entered in
the domain name may or may not be those that are allowed in
internationalized host names. Thus, there must be a way to canonicalize
the user's input before the name is resolved in the DNS.

It is a design goal of this document to allow users to enter host names
in applications and have the highest chance of getting the name correct.
This means that the user should not be limited to entering exactly
the characters that might have been used, but should instead be able to
enter characters that unambiguously canonicalize to characters in the
desired host name. At the same time, this process must not introduce any
chance that two host names could be represented by two distinct strings
of characters that look identical to typical users. It is also a design
goal to have all preprocessing of IDN done before going on the wire, so
that no transformation is done in the DNS server space.

This document describes the steps needed to convert a name part from one
that is entered by the user to one that can be used in the DNS.

1.1 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].

Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 [ISO10646] names. For example, the
letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER
A". In the lists of prohibited characters, the "U+" is left off to make
the lists easier to read.

1.2 IDN summary

Using the terminology in [IDNComp], this document specifies all of the
prohibited characters and the canonicalization for an IDN solution.
Specifically, it covers the following sections from [IDNComp]:

prohib-1: Identical and near-identical characters
prohib-2: Separators
prohib-3: Non-displaying and non-spacing characters
prohib-4: Private use characters
prohib-5: Punctuation
prohib-6: Symbols
canon-1.2: Normalization Form KC
canon-2.1: Case folding in ASCII
canon-2.2: Case folding in non-ASCII

Note that this document does not cover:
canon-1.1: Normalization Form C
canon-2.3: Han folding

1.3 Open issues

This is the first draft of this document. Although there has been much
discussion on the WG mailing list about the topics here, there has not
yet been much agreement on some issues. Now that there is a document to
talk about, that discussion can be more focussed.

1.3.1 Where to do name preparation

Section 2.1 says to do name preparation in the resolver. An argument can
be made for doing name preparation in the application, before the
application service interface. An advantage of that proposal is that
resolvers would not need to do any name preparation. A disadvantage is
that applications would have to be updated each time the IDN protocol is
updated, such as if new characters are added to the repertoire of
allowed characters. It seems likely that resolvers are more easily
updated than all the individual applications that use internationalized
host names.

1.3.2 Choosing between normalization form C and KC

Much of the discussion of normalization on the WG mailing list assumed
that normalization form C would be used. Near the time that this
document was written, people started considering form KC instead of C.
This document uses form KC, but the reasons for doing so could be
contentious.

1.3.3 Does the prohibition catch all bad characters?

The mailing list discussed doing prohibition in two steps: a
short list of prohibited characters before case folding in order to
prevent uppercase characters that have no lowercase equivalents from
getting through, and then a full check on the output of normalization.
In this draft, all checking is done before case folding, based on the
(possibly wrong) assumption that none of the prohibited characters will
re-appear after the case folding and normalization. If that assumption
turns out to be wrong, a check for just those problematic characters can
be added after normalization, or a full check against the prohibited
characters can be added.


2. Preparation Overview

This section describes where name preparation happens and the steps that
name preparation software must take.

2.1 Where name preparation happens

Part of the chart in section 1.4 of [IDNReq] looks like this:

+---------------+
| Application   |
+---------------+
      |  Application service interface
      |  For ex. GethostbyXXXX interface
+---------------+
| Resolver      |
+---------------+
      |     <-----   DNS service interface
+-------------------------------------------+
 
In this specification, the name preparation is done in the resolver,
before the DNS service interface. That is, it is acceptable for software
in the application service interface (such as a "GetHostByName" API) to
pass the resolver a name that has not been prepared. However, the
resolver MUST prepare the name as described in this specification before
passing it to the DNS service interface.

2.2 Name preparation steps

The steps for preparing names are:

1) Input from the application service interface -- This can be done in
many ways and is not specified in this document

2) Look for prohibited input -- Check for any characters that are not
allowed in the input. If any are found, return an error to the
application service interface. This step is necessary to prevent errors
in the following two steps. This step fulfills prohib-1, prohib-2,
prohib-3, prohib-4, prohib-5, and prohib-6 from [IDNComp].

3) Fold case -- Change all uppercase characters into lowercase
characters. Design note: this step could just as easily have been
"change all lowercase characters into uppercase characters". However,
the upper-to-lower folding was chosen because most users of the Internet
today enter host names in lowercase. This step fulfills canon-2.1 and
canon-2.2 from [IDNComp].

4) Canonicalize -- Normalize the characters. This step fulfills canon-1.2
from [IDNComp].

5) Resolution of the prepared name -- This must be specified in a
different IDN document.

The above steps MUST be performed in the order given in order to comply
with this specification.
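
Steps 2 through 4 can be sketched in Python as shown below. This is an
illustration under stated assumptions, not a conforming implementation:
the function and exception names are hypothetical, PROHIBITED_SAMPLE
holds only a handful of the characters listed in section 3, str.lower()
stands in for the case folding step, and the unicodedata module uses
whatever Unicode version ships with the interpreter rather than the one
this document targets.

```python
import unicodedata

# A few entries from section 3 only; the real list is much longer.
PROHIBITED_SAMPLE = {0x0020, 0x002E, 0x005C, 0x00AD}


class ProhibitedCharacter(ValueError):
    """Raised when the input contains a character from the prohibited list."""


def prepare_name_part(text: str) -> str:
    # Step 2: look for prohibited input before anything else is done.
    for ch in text:
        if ord(ch) in PROHIBITED_SAMPLE:
            raise ProhibitedCharacter(f"U+{ord(ch):04X} is not allowed")
    # Step 3: fold case (uppercase to lowercase).
    folded = text.lower()
    # Step 4: canonicalize using Normalization Form KC.
    return unicodedata.normalize("NFKC", folded)
```

The order of the three statements mirrors the order that this section
requires, and an error in step 2 is reported to the caller before any
folding or normalization takes place.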


3. Prohibited Input

Before the text can be processed, it must be checked for prohibited
characters. There is a variety of prohibited characters, as described in
this section.

Note that one of the goals of IDN is to allow the widest possible set of
host names as long as those host names do not cause other problems, such
as possible ambiguity. Specifically, experience with current DNS names
has shown that there is a desire for host names that include personal
names, company names, and spoken phrases. A goal of this section is to
prohibit as few characters that might be used in these contexts as
possible while making sure that characters that might easily cause
confusion or ambiguity are prohibited.

Note that every character listed in this section MUST NOT be transmitted
on the DNS service interface. Although the checking is being performed
before case folding and canonicalization, those steps cannot result in
any of these characters if these characters are not in the input stream.
[[[NOTE: THIS STATEMENT NEEDS TO BE CHECKED ALGORITHMICALLY.]]] If a DNS
server receives a request containing a prohibited character, then the
IDN protocol MUST return an error message.
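
One way the algorithmic check mentioned above might be carried out is
sketched here in Python. The function name is hypothetical, the
prohibited set is a four-character sample rather than the full list, and
str.lower() plus NFKC from the interpreter's own Unicode data stand in
for the folding and normalization steps, so the results are indicative
only.

```python
import unicodedata

# Sample of prohibited characters: SPACE, FULL STOP, SOLIDUS, QUESTION MARK.
PROHIBITED_SAMPLE = {0x0020, 0x002E, 0x002F, 0x003F}


def reappearing_prohibited(candidates, prohibited):
    """List inputs that pass the check but yield prohibited output."""
    found = []
    for cp in candidates:
        if cp in prohibited or 0xD800 <= cp <= 0xDFFF:
            continue                      # already rejected, or a surrogate
        out = unicodedata.normalize("NFKC", chr(cp).lower())
        bad = sorted({ord(c) for c in out if ord(c) in prohibited})
        if bad:
            found.append((cp, bad))       # cp slips through and produces bad
    return found


leaks = reappearing_prohibited(range(0x0000, 0x10000), PROHIBITED_SAMPLE)
for cp, bad in leaks[:5]:
    print(f"U+{cp:04X} ->", " ".join(f"U+{b:04X}" for b in bad))
```

With this sample set, U+00A8 DIAERESIS is reported because its
compatibility decomposition begins with U+0020 SPACE. Running the same
scan against the complete lists would show whether a second check after
normalization is needed.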


Note that some characters listed in one section would also appear in
other sections. Each character is only listed once.

3.1 prohib-1: Identical and near-identical characters

Many characters in [ISO10646] are identical or nearly identical to other
characters. These were often included for compatibility with other
character sets.

The characters prohibited because they are identical or nearly identical
to allowed characters are:

00AD        SOFT HYPHEN
00D7        MULTIPLICATION SIGN
01C3        LATIN LETTER RETROFLEX CLICK
02B0-02FF   [SPACING MODIFIER LETTERS]
066D        ARABIC FIVE POINTED STAR
1806        MONGOLIAN TODO SOFT HYPHEN
2010        HYPHEN
2011        NON-BREAKING HYPHEN
2012        FIGURE DASH
2013        EN DASH
2014        EM DASH
2160-217F   [ROMAN NUMERALS]
FB1D-FB4F   [HEBREW PRESENTATION FORMS]
FB50-FDFF   [ARABIC PRESENTATION FORMS A]
FE20-FE2F   [COMBINING HALF MARKS]
FE30-FE4F   [CJK COMPATIBILITY FORMS]
FE50-FE6F   [SMALL FORM VARIANTS]
FE70-FEFC   [ARABIC PRESENTATION FORMS B]
FF00-FFEF   [HALFWIDTH AND FULLWIDTH FORMS]
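
Lists in this format, with single code points and ranges, can be turned
into a lookup table mechanically. The Python sketch below parses a few
of the lines above; the names ENTRY, parse_prohibited, and is_prohibited
are hypothetical, and only four entries are included to keep the example
short.

```python
import re

LIST_TEXT = """\
00AD        SOFT HYPHEN
02B0-02FF   [SPACING MODIFIER LETTERS]
2010        HYPHEN
2160-217F   [ROMAN NUMERALS]
"""

# One hexadecimal code point, an optional "-last", then the name.
ENTRY = re.compile(r"^([0-9A-F]{4,6})(?:-([0-9A-F]{4,6}))?\s+(.*)$")


def parse_prohibited(text):
    """Return (first, last, name) tuples for every list line in text."""
    ranges = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if not match:
            continue                      # skip blank or unrelated lines
        first = int(match.group(1), 16)
        last = int(match.group(2), 16) if match.group(2) else first
        ranges.append((first, last, match.group(3)))
    return ranges


def is_prohibited(cp, ranges):
    """True if code point cp falls inside any parsed entry."""
    return any(first <= cp <= last for first, last, _ in ranges)


RANGES = parse_prohibited(LIST_TEXT)
```

For example, is_prohibited(0x2165, RANGES) is true because U+2165 lies
inside the Roman numerals range.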

3.2 prohib-2: Separators

Horizontal and vertical spacing characters would make it unclear where a
host name begins and ends. The prohibited spacing characters are:

0020        SPACE
00A0        NO-BREAK SPACE
1680        OGHAM SPACE MARK
2000-200B   [SPACES]
2028        LINE SEPARATOR
2029        PARAGRAPH SEPARATOR
202F        NARROW NO-BREAK SPACE
3000        IDEOGRAPHIC SPACE

Allowing periods and period-like characters as characters within a name
part would also cause similar confusion. The prohibited periods,
characters that look like periods, and characters that canonicalize to a
period or to a period-like character are:

002E        FULL STOP
06D4        ARABIC FULL STOP
2024        ONE DOT LEADER
2025        TWO DOT LEADER
2026        HORIZONTAL ELLIPSIS
2488        DIGIT ONE FULL STOP
2489        DIGIT TWO FULL STOP
248A        DIGIT THREE FULL STOP
248B        DIGIT FOUR FULL STOP
248C        DIGIT FIVE FULL STOP
248D        DIGIT SIX FULL STOP
248E        DIGIT SEVEN FULL STOP
248F        DIGIT EIGHT FULL STOP
2490        DIGIT NINE FULL STOP
2491        NUMBER TEN FULL STOP
2492        NUMBER ELEVEN FULL STOP
2493        NUMBER TWELVE FULL STOP
2494        NUMBER THIRTEEN FULL STOP
2495        NUMBER FOURTEEN FULL STOP
2496        NUMBER FIFTEEN FULL STOP
2497        NUMBER SIXTEEN FULL STOP
2498        NUMBER SEVENTEEN FULL STOP
2499        NUMBER EIGHTEEN FULL STOP
249A        NUMBER NINETEEN FULL STOP
249B        NUMBER TWENTY FULL STOP
33C2        SQUARE AM
33C7        SQUARE CO
33D8        SQUARE PM

3.3 prohib-3: Non-displaying and non-spacing characters

The ISO 10646 character set contains many characters that cannot be
seen. These include control characters, non-breaking spaces, formatting
characters, and tagging characters. These characters would certainly
cause confusion if allowed in host names.

0000-001F   [CONTROL CHARACTERS]
007F        DELETE
0080-009F   [CONTROL CHARACTERS]
070F        SYRIAC ABBREVIATION MARK
180B        MONGOLIAN FREE VARIATION SELECTOR ONE
180C        MONGOLIAN FREE VARIATION SELECTOR TWO
180D        MONGOLIAN FREE VARIATION SELECTOR THREE
180E        MONGOLIAN VOWEL SEPARATOR
200C        ZERO WIDTH NON-JOINER
200D        ZERO WIDTH JOINER
200E        LEFT-TO-RIGHT MARK
200F        RIGHT-TO-LEFT MARK
202A        LEFT-TO-RIGHT EMBEDDING
202B        RIGHT-TO-LEFT EMBEDDING
202C        POP DIRECTIONAL FORMATTING
202D        LEFT-TO-RIGHT OVERRIDE
202E        RIGHT-TO-LEFT OVERRIDE
206A        INHIBIT SYMMETRIC SWAPPING
206B        ACTIVATE SYMMETRIC SWAPPING
206C        INHIBIT ARABIC FORM SHAPING
206D        ACTIVATE ARABIC FORM SHAPING
206E        NATIONAL DIGIT SHAPES
206F        NOMINAL DIGIT SHAPES
FEFF        ZERO WIDTH NO-BREAK SPACE
FFF9        INTERLINEAR ANNOTATION ANCHOR
FFFA        INTERLINEAR ANNOTATION SEPARATOR
FFFB        INTERLINEAR ANNOTATION TERMINATOR
FFFC        OBJECT REPLACEMENT CHARACTER
FFFD        REPLACEMENT CHARACTER

3.4 prohib-4: Private use characters

Because private-use characters do not have defined meanings, they are
prohibited. The private-use characters are:

E000-F8FF   [PRIVATE USE, PLANE 0]

3.5 prohib-5: Punctuation

The following characters are reserved or delimiters in URLs [RFC2396]
and [RFC2732]:

" # $ % & + , . / : ; < = > ? @ [ ]

3.5.1 Characters from URLs

The following punctuation characters are prohibited because they are
reserved or delimiters in URLs.

0022        QUOTATION MARK
0023        NUMBER SIGN
0024        DOLLAR SIGN
0025        PERCENT SIGN
0026        AMPERSAND
002B        PLUS SIGN
002C        COMMA
002E        FULL STOP
002F        SOLIDUS
003A        COLON
003B        SEMICOLON
003C        LESS-THAN SIGN
003D        EQUALS SIGN
003E        GREATER-THAN SIGN
003F        QUESTION MARK
0040        COMMERCIAL AT
005B        LEFT SQUARE BRACKET
005D        RIGHT SQUARE BRACKET

3.5.2 Characters that canonicalize to characters from URLs

The following punctuation characters are prohibited because their
normalization contains one or more of the characters from section 3.5.1.

037E        GREEK QUESTION MARK
2048        QUESTION EXCLAMATION MARK
2049        EXCLAMATION QUESTION MARK
207A        SUPERSCRIPT PLUS SIGN
207C        SUPERSCRIPT EQUALS SIGN
208A        SUBSCRIPT PLUS SIGN
208C        SUBSCRIPT EQUALS SIGN
2100        ACCOUNT OF
2101        ADDRESSED TO THE SUBJECT
2105        CARE OF
2106        CADA UNA

3.5.3 Characters that look like characters from URLs

The following are prohibited because they look indistinguishable from
the characters listed in section 3.5.1.

037E        GREEK QUESTION MARK
0589        ARMENIAN FULL STOP
060C        ARABIC COMMA
061B        ARABIC SEMICOLON
066A        ARABIC PERCENT SIGN
201A        SINGLE LOW-9 QUOTATION MARK
2030        PER MILLE SIGN
2031        PER TEN THOUSAND SIGN
2033        DOUBLE PRIME
2039        SINGLE LEFT-POINTING ANGLE QUOTATION MARK
203A        SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203D        INTERROBANG
2044        FRACTION SLASH
3001        IDEOGRAPHIC COMMA
3002        IDEOGRAPHIC FULL STOP
3003        DITTO MARK
3008        LEFT ANGLE BRACKET
3009        RIGHT ANGLE BRACKET
3014        LEFT TORTOISE SHELL BRACKET
3015        RIGHT TORTOISE SHELL BRACKET
301A        LEFT WHITE SQUARE BRACKET
301B        RIGHT WHITE SQUARE BRACKET

3.5.4 Other punctuation

The following punctuation character is prohibited because it is unlikely
to be used in names and may be confusing to users or to character-entry
processes:

005C        REVERSE SOLIDUS

3.6 prohib-6: Symbols

[UniData] has non-normative categories for symbols. The four symbol
categories are:

Symbol, Currency: Currency symbols could appear in company names and
spoken phrases, so they are not prohibited.

Symbol, Modifier: Stand-alone modifiers might appear in personal names,
company names, and spoken phrases, so they are not prohibited.

Symbol, Math: It is very unlikely that there are any significant
personal names, company names, or spoken phrases that contain
mathematical symbols. Further, many of these symbols are the same or
similar to other punctuation, thereby leading to ambiguity. For this
reason, math-specific symbols are prohibited. These prohibited math
symbols are:

00AC        NOT SIGN
00B1        PLUS-MINUS SIGN
2200-22FF   [MATHEMATICAL OPERATORS]

Further, the following characters canonicalize to characters in the
above math list, and therefore are also prohibited:

00BC        VULGAR FRACTION ONE QUARTER
00BD        VULGAR FRACTION ONE HALF
00BE        VULGAR FRACTION THREE QUARTERS
207B        SUPERSCRIPT MINUS
208B        SUBSCRIPT MINUS
2153        VULGAR FRACTION ONE THIRD
2154        VULGAR FRACTION TWO THIRDS
2155        VULGAR FRACTION ONE FIFTH
2156        VULGAR FRACTION TWO FIFTHS
2157        VULGAR FRACTION THREE FIFTHS
2158        VULGAR FRACTION FOUR FIFTHS
2159        VULGAR FRACTION ONE SIXTH
215A        VULGAR FRACTION FIVE SIXTHS
215B        VULGAR FRACTION ONE EIGHTH
215C        VULGAR FRACTION THREE EIGHTHS
215D        VULGAR FRACTION FIVE EIGHTHS
215E        VULGAR FRACTION SEVEN EIGHTHS
215F        FRACTION NUMERATOR ONE
33A7        SQUARE M OVER S
33A8        SQUARE M OVER S SQUARED
33AE        SQUARE RAD OVER S
33AF        SQUARE RAD OVER S SQUARED
33C6        SQUARE C OVER KG

Symbol, Other: This category covers a multitude of symbols, few of which
would ever appear in personal names, company names, or spoken phrases.
The rest of the prohibited symbols are:

2190-21FF   [ARROWS]
2300-23FF   [MISCELLANEOUS TECHNICAL]
2400-243F   [CONTROL PICTURES]
2440-245F   [OPTICAL CHARACTER RECOGNITION]
2500-257F   [BOX DRAWING]
2580-259F   [BLOCK ELEMENTS]
25A0-25FF   [GEOMETRIC SHAPES]
2600-267F   [MISCELLANEOUS SYMBOLS]
2700-27BF   [DINGBATS]
2800-287F   [BRAILLE PATTERNS]

3.7 Additional prohibited characters

3.7.1 Unassigned characters

All characters not yet assigned in [ISO10646] are prohibited. Although
this may at first seem trivial, it is extremely important because
characters that may be assigned in the future might have properties that
would cause them to be prohibited or might have case-folding properties.
As is the case for all prohibited characters, if a DNS server receives a
request containing an unassigned character, then the IDN protocol MUST
return an error message.
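
A check for unassigned characters might look like the following Python
sketch. It relies on the general category "Cn" reported by Python's
unicodedata module, which reflects whatever Unicode version ships with
the interpreter rather than the version of [ISO10646] an IDN table would
pin. The function name has_unassigned is an assumption made for this
example.

```python
import unicodedata

def has_unassigned(label: str) -> bool:
    """Return True if any character in label has general category Cn.

    Cn covers code points with no assignment in the Unicode data that
    this Python build carries; an IDN implementation would consult its
    own versioned table instead.
    """
    return any(unicodedata.category(ch) == "Cn" for ch in label)
```

A caller would reject the request when has_unassigned() returns True,
rather than passing the name on for case folding.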

3.7.2 Surrogate characters

So far, all proposals for binary encodings of internationalized name
parts have specified UTF-8 as the encoding format. In such an encoding,
surrogate characters MUST NOT be used. Therefore, for UTF-8 encodings,
the following are prohibited:

D800-DFFF   [SURROGATE CHARACTERS]
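
As an illustration only, the surrogate check reduces to a range test on
each code point. The helper names has_surrogate and encode_label below
are hypothetical. Python's own UTF-8 encoder also refuses lone
surrogates by default, but the explicit test keeps the rule visible.

```python
def has_surrogate(label: str) -> bool:
    """Return True if label contains a code point in D800-DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in label)

def encode_label(label: str) -> bytes:
    """Encode a label as UTF-8, rejecting surrogate code points first."""
    if has_surrogate(label):
        raise ValueError("surrogate characters are prohibited")
    return label.encode("utf-8")
```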

3.7.3 Uppercase characters with no lowercase mappings

There are many uppercase characters in [ISO10646] which do not have
lowercase equivalents in [UniData]. Therefore, they are prohibited on
input because they would get through the case mapping step while still
being in uppercase.

The characters that are prohibited on input because they are uppercase
but have no lowercase mappings are:

03D2        GREEK UPSILON WITH HOOK SYMBOL
03D3        GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
03D4        GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
04C0        CYRILLIC LETTER PALOCHKA
10A0-10C5   [GEORGIAN CAPITAL LETTERS]

Note that many characters in the range U+2100 to U+213A, the letterlike
symbols, also are uppercase but have no lowercase mappings. However,
they are not listed here because the entire range is already prohibited
in section 3.6.
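
One way to search for such characters is sketched below in Python. It
is an approximation under stated assumptions: it uses the general
category "Lu" and str.lower() from the running interpreter, not the
[UniData] file of this draft's era. Later Unicode versions have added
lowercase mappings for some characters, so the output need not match
the list above. The function name uppercase_without_lowercase is
invented for this example.

```python
import unicodedata

def uppercase_without_lowercase(start: int, end: int) -> list:
    """Return code points in [start, end] that are category Lu
    and are left unchanged by lowercasing."""
    found = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        if unicodedata.category(ch) == "Lu" and ch.lower() == ch:
            found.append(cp)
    return found

# Scan the Greek block, which contains U+03D2 through U+03D4.
for cp in uppercase_without_lowercase(0x0370, 0x03FF):
    print(f"{cp:04X} {unicodedata.name(chr(cp))}")
```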

3.7.4 Radicals and Ideographic Description

Some Han characters can be informally defined in terms of ideographic
descriptions. However, ideographic descriptions allow several different
character sequences to describe the same character, and those sequences
do not canonicalize to a single form. Thus, the radicals for ideographic
description and the ideographic description characters themselves are
prohibited. These characters are:

2E80-2EFF   [CJK RADICALS SUPPLEMENT]
2F00-2FDF   [KANGXI RADICALS]
2FF0-2FFF   [IDEOGRAPHIC DESCRIPTION CHARACTERS]

3.8 Summary of prohibited characters

The following is a collected list from the previous sections.

0000-001F   [CONTROL CHARACTERS]
0020        SPACE
0022        QUOTATION MARK
0023        NUMBER SIGN
0024        DOLLAR SIGN
0025        PERCENT SIGN
0026        AMPERSAND
002B        PLUS SIGN
002C        COMMA
002E        FULL STOP
002F        SOLIDUS
003A        COLON
003B        SEMICOLON
003C        LESS-THAN SIGN
003D        EQUALS SIGN
003E        GREATER-THAN SIGN
003F        QUESTION MARK
0040        COMMERCIAL AT
005B        LEFT SQUARE BRACKET
005C        REVERSE SOLIDUS
005D        RIGHT SQUARE BRACKET
007F        DELETE
0080-009F   [CONTROL CHARACTERS]
00A0        NO-BREAK SPACE
00AC        NOT SIGN
00AD        SOFT HYPHEN
00B1        PLUS-MINUS SIGN
00BC        VULGAR FRACTION ONE QUARTER
00BD        VULGAR FRACTION ONE HALF
00BE        VULGAR FRACTION THREE QUARTERS
00D7        MULTIPLICATION SIGN
01C3        LATIN LETTER RETROFLEX CLICK
02B0-02FF   [SPACING MODIFIER LETTERS]
037E        GREEK QUESTION MARK
03D2        GREEK UPSILON WITH HOOK SYMBOL
03D3        GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
03D4        GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
04C0        CYRILLIC LETTER PALOCHKA
0589        ARMENIAN FULL STOP
060C        ARABIC COMMA
061B        ARABIC SEMICOLON
066A        ARABIC PERCENT SIGN
066D        ARABIC FIVE POINTED STAR
06D4        ARABIC FULL STOP
070F        SYRIAC ABBREVIATION MARK
10A0-10C5   [GEORGIAN CAPITAL LETTERS]
1680        OGHAM SPACE MARK
1806        MONGOLIAN TODO SOFT HYPHEN
180B        MONGOLIAN FREE VARIATION SELECTOR ONE
180C        MONGOLIAN FREE VARIATION SELECTOR TWO
180D        MONGOLIAN FREE VARIATION SELECTOR THREE
180E        MONGOLIAN VOWEL SEPARATOR
2000-200B   [SPACES]
200C        ZERO WIDTH NON-JOINER
200D        ZERO WIDTH JOINER
200E        LEFT-TO-RIGHT MARK
200F        RIGHT-TO-LEFT MARK
2010        HYPHEN
2011        NON-BREAKING HYPHEN
2012        FIGURE DASH
2013        EN DASH
2014        EM DASH
201A        SINGLE LOW-9 QUOTATION MARK
2024        ONE DOT LEADER
2025        TWO DOT LEADER
2026        HORIZONTAL ELLIPSIS
2028        LINE SEPARATOR
2029        PARAGRAPH SEPARATOR
202A        LEFT-TO-RIGHT EMBEDDING
202B        RIGHT-TO-LEFT EMBEDDING
202C        POP DIRECTIONAL FORMATTING
202D        LEFT-TO-RIGHT OVERRIDE
202E        RIGHT-TO-LEFT OVERRIDE
202F        NARROW NO-BREAK SPACE
2030        PER MILLE SIGN
2031        PER TEN THOUSAND SIGN
2033        DOUBLE PRIME
2039        SINGLE LEFT-POINTING ANGLE QUOTATION MARK
203A        SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203D        INTERROBANG
2044        FRACTION SLASH
2048        QUESTION EXCLAMATION MARK
2049        EXCLAMATION QUESTION MARK
206A        INHIBIT SYMMETRIC SWAPPING
206B        ACTIVATE SYMMETRIC SWAPPING
206C        INHIBIT ARABIC FORM SHAPING
206D        ACTIVATE ARABIC FORM SHAPING
206E        NATIONAL DIGIT SHAPES
206F        NOMINAL DIGIT SHAPES
207A        SUPERSCRIPT PLUS SIGN
207B        SUPERSCRIPT MINUS
207C        SUPERSCRIPT EQUALS SIGN
208A        SUBSCRIPT PLUS SIGN
208B        SUBSCRIPT MINUS
208C        SUBSCRIPT EQUALS SIGN
2100        ACCOUNT OF
2101        ADDRESSED TO THE SUBJECT
2105        CARE OF
2106        CADA UNA
2153        VULGAR FRACTION ONE THIRD
2154        VULGAR FRACTION TWO THIRDS
2155        VULGAR FRACTION ONE FIFTH
2156        VULGAR FRACTION TWO FIFTHS
2157        VULGAR FRACTION THREE FIFTHS
2158        VULGAR FRACTION FOUR FIFTHS
2159        VULGAR FRACTION ONE SIXTH
215A        VULGAR FRACTION FIVE SIXTHS
215B        VULGAR FRACTION ONE EIGHTH
215C        VULGAR FRACTION THREE EIGHTHS
215D        VULGAR FRACTION FIVE EIGHTHS
215E        VULGAR FRACTION SEVEN EIGHTHS
215F        FRACTION NUMERATOR ONE
2160-217F   [ROMAN NUMERALS]
2190-21FF   [ARROWS]
2200-22FF   [MATHEMATICAL OPERATORS]
2300-23FF   [MISCELLANEOUS TECHNICAL]
2400-243F   [CONTROL PICTURES]
2440-245F   [OPTICAL CHARACTER RECOGNITION]
2488        DIGIT ONE FULL STOP
2489        DIGIT TWO FULL STOP
248A        DIGIT THREE FULL STOP
248B        DIGIT FOUR FULL STOP
248C        DIGIT FIVE FULL STOP
248D        DIGIT SIX FULL STOP
248E        DIGIT SEVEN FULL STOP
248F        DIGIT EIGHT FULL STOP
2490        DIGIT NINE FULL STOP
2491        NUMBER TEN FULL STOP
2492        NUMBER ELEVEN FULL STOP
2493        NUMBER TWELVE FULL STOP
2494        NUMBER THIRTEEN FULL STOP
2495        NUMBER FOURTEEN FULL STOP
2496        NUMBER FIFTEEN FULL STOP
2497        NUMBER SIXTEEN FULL STOP
2498        NUMBER SEVENTEEN FULL STOP
2499        NUMBER EIGHTEEN FULL STOP
249A        NUMBER NINETEEN FULL STOP
249B        NUMBER TWENTY FULL STOP
2500-257F   [BOX DRAWING]
2580-259F   [BLOCK ELEMENTS]
25A0-25FF   [GEOMETRIC SHAPES]
2600-267F   [MISCELLANEOUS SYMBOLS]
2700-27BF   [DINGBATS]
2800-287F   [BRAILLE PATTERNS]
2E80-2EFF   [CJK RADICALS SUPPLEMENT]
2F00-2FDF   [KANGXI RADICALS]
2FF0-2FFF   [IDEOGRAPHIC DESCRIPTION CHARACTERS]
3000        IDEOGRAPHIC SPACE
3001        IDEOGRAPHIC COMMA
3002        IDEOGRAPHIC FULL STOP
3003        DITTO MARK
3008        LEFT ANGLE BRACKET
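3009        RIGHT ANGLE BRACKET
3014        LEFT TORTOISE SHELL BRACKET
3015        RIGHT TORTOISE SHELL BRACKET
301A        LEFT WHITE SQUARE BRACKET
301B        RIGHT WHITE SQUARE BRACKET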
33A7        SQUARE M OVER S
33A8        SQUARE M OVER S SQUARED
33AE        SQUARE RAD OVER S
33AF        SQUARE RAD OVER S SQUARED
33C2        SQUARE AM
33C6        SQUARE C OVER KG
33C7        SQUARE CO
33D8        SQUARE PM
D800-DFFF   [SURROGATE CHARACTERS]
E000-F8FF   [PRIVATE USE, PLANE 0]
FB1D-FB4F   [HEBREW PRESENTATION FORMS]
FB50-FDFF   [ARABIC PRESENTATION FORMS A]
FE20-FE2F   [COMBINING HALF MARKS]
FE30-FE4F   [CJK COMPATIBILITY FORMS]
FE50-FE6F   [SMALL FORM VARIANTS]
FE70-FEFC   [ARABIC PRESENTATION FORMS B]
FEFF        ZERO WIDTH NO-BREAK SPACE
FF00-FFEF   [HALFWIDTH AND FULLWIDTH FORMS]
FFF9        INTERLINEAR ANNOTATION ANCHOR
FFFA        INTERLINEAR ANNOTATION SEPARATOR
FFFB        INTERLINEAR ANNOTATION TERMINATOR
FFFC        OBJECT REPLACEMENT CHARACTER
FFFD        REPLACEMENT CHARACTER
Unassigned characters
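
A list like this lends itself to a sorted range table with binary
search. The Python sketch below is a hedged illustration: the table
PROHIBITED_RANGES holds only a small excerpt of the entries above, with
adjacent entries merged into single ranges, and the names is_prohibited
and first_prohibited are assumptions of this example, not part of any
IDN specification.

```python
from bisect import bisect_right

# Partial excerpt of the summary list as inclusive (first, last) ranges,
# sorted by first code point. A real table would carry every entry.
PROHIBITED_RANGES = [
    (0x0000, 0x001F),  # control characters
    (0x0020, 0x0020),  # SPACE
    (0x0022, 0x0026),  # QUOTATION MARK through AMPERSAND
    (0x002B, 0x002C),  # PLUS SIGN, COMMA
    (0x002E, 0x002F),  # FULL STOP, SOLIDUS
    (0x003A, 0x0040),  # COLON through COMMERCIAL AT
    (0x005B, 0x005D),  # LEFT SQUARE BRACKET through RIGHT SQUARE BRACKET
    (0x007F, 0x009F),  # DELETE and control characters
    (0x2190, 0x245F),  # ARROWS through OPTICAL CHARACTER RECOGNITION
    (0xD800, 0xDFFF),  # surrogate characters
    (0xE000, 0xF8FF),  # private use, plane 0
]
_STARTS = [first for first, _ in PROHIBITED_RANGES]

def is_prohibited(ch: str) -> bool:
    """Return True if ch falls inside one of the table's ranges."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1   # last range starting at or before cp
    return i >= 0 and cp <= PROHIBITED_RANGES[i][1]

def first_prohibited(label: str):
    """Return the index of the first prohibited character, or None."""
    for i, ch in enumerate(label):
        if is_prohibited(ch):
            return i
    return None
```

The unassigned-character rule is not covered by this table and would
need a separate check against the character repertoire.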


4. Case Folding

After it has been verified that the input text has none of the
characters prohibited for case folding, the case-folding step itself is
straightforward. For each character in the input, if there is a
lowercase mapping for that character in [UniData], the input character
is changed to the mapped lowercase letter.
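
The step can be sketched in Python as follows. The sketch assumes the
semicolon-separated layout of the UnicodeData file, in which the first
field is the code point and the fourteenth field is the lowercase
mapping. The three sample lines stand in for the full file, and the
names load_lowercase_map and case_fold are invented for this example.

```python
# Three sample lines in UnicodeData format; a real implementation
# would read the whole file.
SAMPLE_UNICODE_DATA = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301;;;;N;LATIN CAPITAL LETTER E ACUTE;;;00E9;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""

def load_lowercase_map(unicode_data_text: str) -> dict:
    """Build {code point: lowercase code point} from UnicodeData lines."""
    mapping = {}
    for line in unicode_data_text.splitlines():
        fields = line.split(";")
        # Field 13 (zero-based) holds the lowercase mapping, if any.
        if len(fields) < 15 or not fields[13]:
            continue
        mapping[int(fields[0], 16)] = int(fields[13], 16)
    return mapping

def case_fold(label: str, lower_map: dict) -> str:
    """Replace each character that has a lowercase mapping; keep the rest."""
    return "".join(chr(lower_map.get(ord(ch), ord(ch))) for ch in label)

lower_map = load_lowercase_map(SAMPLE_UNICODE_DATA)
print(case_fold("A\u00c9a", lower_map))
```

Characters without a lowercase mapping pass through unchanged, which is
why section 3.7.3 prohibits uppercase characters that lack one.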


5. Canonicalization

After case folding, the input string is normalized using form KC, as
described in [UTR15].
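
As a hedged illustration, form KC normalization is available in Python
through unicodedata.normalize. The wrapper name canonicalize is an
assumption of this example, and the Unicode data behind it is that of
the running interpreter, not necessarily the version an IDN table would
reference.

```python
import unicodedata

def canonicalize(label: str) -> str:
    """Normalize a case-folded label using form KC."""
    return unicodedata.normalize("NFKC", label)

# A base letter plus combining acute accent composes to one character.
print(canonicalize("e\u0301") == "\u00e9")
# A compatibility ligature decomposes to its plain letters.
print(canonicalize("\ufb01"))
```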

6. IDN Table Revisions

A table consisting of all characters allowed and prohibited and the
rules for case folding and canonicalization will be created based on the
content of [UniData] and on the content of this document. This table
will be the authority for implementations to follow and will be
normatively referenced by this document. Such a table will allow the
IDN protocol to be versioned independently of revisions to Unicode
and ISO 10646, because revisions of IDN and their deployment may
not be in sync with revisions to Unicode and ISO 10646.

In a future draft of this document, IANA will be asked to keep this
table, with an initial version number of 1. Each new version of the
table will have a new, higher version number.


7. Security Considerations

Much of the security of the Internet relies on the DNS. Thus, any change
to the characteristics of the DNS can change the security of much of the
Internet.

Host names are used by users to connect to Internet servers. The
security of the Internet would be compromised if a user entering a
single internationalized name could be connected to different servers
based on different interpretations of the internationalized host name.


8. References

[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name
Proposals", draft-ietf-idn-compare.

[IDNReq] James Seng, "Requirements of Internationalized Domain Names",
draft-ietf-idn-requirement.

[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part
1: Architecture and Basic Multilingual Plane.  Five amendments and a
technical corrigendum have been published up to now. UTF-16 is described
in Annex Q, published as Amendment 1. 17 other amendments are currently
at various stages of standardization. [[[ THIS REFERENCE NEEDS TO BE
UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]

[Normalize] "Character Normalization in IETF Protocols",
draft-duerst-i18n-norm-03.

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.

[RFC2396] Tim Berners-Lee, et al., "Uniform Resource Identifiers (URI):
Generic Syntax", August 1998, RFC 2396.

[RFC2732] Robert Hinden, et al., "Format for Literal IPv6 Addresses in
URL's", December 1999, RFC 2732.

[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035).

[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at
<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.

[UniData] The Unicode Consortium. UnicodeData File.
<ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.

[UTR15] Mark Davis and Martin Duerst. Unicode Normalization Forms.
Unicode Technical Report #15.
<http://www.unicode.org/unicode/reports/tr15/>.


A. Acknowledgements

Many people from the IETF IDN Working Group and the Unicode Technical
Committee contributed ideas that went into the first draft of this
document. Mark Davis was particularly helpful in some of the early
ideas.


B. Changes From Previous Versions of this Draft

This is the -00 version, so there are no changes.


C. IANA Considerations

There are no specific IANA considerations in this draft, but there will
be in a future draft of this document.


D. Author Contact Information

Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA  95060 USA
paul.hoffman@imc.org and paul.hoffman@vpnc.org

Marc Blanchet
Viagenie inc.
2875 boul. Laurier, bur. 300
Ste-Foy, Quebec, Canada, G1V 2M2
Marc.Blanchet@viagenie.qc.ca


Marc Blanchet
Viagénie inc.
tel: 418-656-9254
http://www.viagenie.qc.ca

----------------------------------------------------------
Normos (http://www.normos.org): Internet standards portal:
IETF RFC, drafts, IANA, W3C, ATMForum, ISO, ... all in one place.
