[idn] 2 new internet-drafts
Hi,
Paul Hoffman and I have written two internet-drafts that we just submitted
to the internet-drafts:
draft-ietf-idn-nameprep-00.txt and draft-ietf-idn-idne-00.txt. They work
together. We are very open to receiving comments!
Since they are not big files, I'm including the drafts in this email,
because these days the secretariat seems to take time to process the
drafts (probably because of the number of drafts coming in near the deadline).
That way you can take a look at them right now and comment before Pittsburgh.
Regards, Marc and Paul.
Internet Draft Marc Blanchet
draft-ietf-idn-idne-00.txt Viagenie
July 5, 2000 Paul Hoffman
Expires in six months IMC & VPNC
Internationalized domain names using EDNS (IDNE)
Status of this Memo
This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."
To view the entire list of Internet-Draft Shadow Directories, see
http://www.ietf.org/shadow.html.
Abstract
The current DNS infrastructure does not provide a way to use
internationalized domain names (IDN). This document describes an
extension mechanism based on EDNS which enables the use of IDN without
causing harm to the current DNS. IDNE enables IDN host names with as
many characters as current ASCII-only host names. It fully supports
UTF-8 and conforms to the IDN requirements.
1. Introduction
Various proposals for IDN have tried to integrate IDN into the current
limited ASCII DNS. However, the compatibility issues impose too many
constraints on the architecture. Many of these proposals require
modifications to the applications or to the DNS protocol or to the
servers. This proposal takes a different approach: it uses the
standardized extension mechanism for DNS (EDNS) and uses UTF-8 as the
mandatory charset. It causes no harm to the current DNS because it uses
the EDNS extension mechanism. The major drawback of this proposal is
that all protocols, applications and DNS servers will have to be
upgraded to support this proposal.
1.1 Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].
Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values are
shown preceded with an "0b". For example, a nine-bit value might be
shown as "0b101101111".
1.2 IDN summary
Using the terminology in [IDNComp], this protocol specifies an IDN
architecture of arch-2 (send binary or ACE). The binary format is
bin-1.1 (UTF-8), and the method for distinguishing binary from current
names is bin-2.4 (mark binary with EDNS0). The transition period is not
specified.
2. Functional Description
DNS queries and responses containing IDNE labels have the following
properties:
- The string in the label MUST be pre-processed as described in
[NAMEPREP] before the query or response is prepared.
- The characters in the label MUST be encoded using UTF-8 [RFC2279].
- The entire label MUST be encoded using EDNS [RFC2671].
- The version of the IDN protocol MUST be identified.
3. Encoding
An IDNE label uses the EDNS extended label type prefix (0b01), as
described in [RFC2671]. (A normal label type always begins with 0b00.) A
new extended label type for IDNE is used to identify an IDNE label. This
document uses 0b000010 as the extended label type; however, the label
type will be assigned by IANA and it may not be 0b000010.
        0                   1                   2
   bits 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 . . .
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-//+-+-+-+-+-+-+
       |0 1|    ELT    |     Size      |        IDN label ...        |
       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+//-+-+-+-+-+-+-+
ELT: The six-bit extended label type to be assigned by the IANA for an
IDN label. In this document, the value 0b000010 is used, although that
might be changed by IANA.
Size: Size (in octets) of the IDN label following.
IDN label: Label, encoded in UTF-8 [RFC2279]. Note that this label might
contain all ASCII characters, and thus can be used for host name labels
that are legal in [STD13].
IDNE labels can be mixed with STD13 labels in a domain name.
The compression scheme in section 4.1.4 of [STD13] is supported as is.
Pointers can refer to either IDN labels or non-IDN labels.
3.1 Examples
3.1.1 Basic example
The following example shows the label me.com where the "e" in "me" is
replaced by a <LATIN CAPITAL LETTER E WITH ACUTE>, which has the
codepoint 0x00C9. The decomposition and downcasing specified in
[NAMEPREP] produces the string <LATIN SMALL LETTER E><COMBINING ACUTE
ACCENT>, which is 0x00650301. This is then transformed using
UTF-8 [RFC2279] to: 0x65CC81.
Ignoring the other fields of the message, the domain name portion of the
datagram could look like:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
20 | 0 1 0 0 0 0 1 0| 0 0 0 0 0 1 0 0|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
22 | 0x6D (m) | 0x65 (e) |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
24 | 0xCC ('(1)) | 0x81 ('(2)) |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
26 | 3 | 0x63 (c) |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
28 | 0x6F (o) | 0x6D (m) |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
30 | 0x00 | |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
Octet 20 indicates the EDNS extended label type (0b01) using the IDN
label type (0b000010).
Octet 21 gives the size of the label: 4 octets follow.
Octets 22-25 are the "m*" label (where the "*" is
<LATIN SMALL LETTER E><COMBINING ACUTE ACCENT>).
Octets 26-29 are "com" encoded as a STD13 label.
Octet 30 is the root domain.
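As a rough illustration of the layout described above, the following Python sketch builds the domain name octets for this basic example. The helper names are hypothetical and the ELT constant is only the placeholder value used in this document (the value assigned by IANA may differ), so this is a sketch under those assumptions rather than a reference encoder.

```python
# Hypothetical helpers; these names are not defined by the draft.
IDNE_ELT = 0b000010  # placeholder extended label type used in this document


def idne_label(text: str) -> bytes:
    """Encode one label as: 0b01 + ELT octet, size octet, UTF-8 octets."""
    data = text.encode("utf-8")
    if len(data) > 255:
        raise ValueError("IDNE label longer than 255 octets")
    return bytes([(0b01 << 6) | IDNE_ELT, len(data)]) + data


def std13_label(text: str) -> bytes:
    """Encode one classic STD13 label: length octet, ASCII octets."""
    data = text.encode("ascii")
    if len(data) > 63:
        raise ValueError("STD13 label longer than 63 octets")
    return bytes([len(data)]) + data


# "m" + LATIN SMALL LETTER E + COMBINING ACUTE ACCENT, then "com", then root.
name = idne_label("me\u0301") + std13_label("com") + b"\x00"
print(name.hex(" "))
```

The IDNE label and the STD13 label are simply concatenated, which matches the statement that the two label kinds can be mixed in one domain name.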
3.1.2 Example with compression
Using the previous labels, one datagram might contain "www.m*.com" and
"m*.com" (where the "*" is <LATIN SMALL LETTER E><COMBINING ACUTE
ACCENT>).
Ignoring the other fields of the message, the domain name portions of
the datagram could look like:
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
20 | 0 1 0 0 0 0 1 0| 0 0 0 0 0 1 0 0|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
22 | 0x6D (m) | 0x65 (e) |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
24 | 0xCC ('(1)) | 0x81 ('(2)) |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
26 | 3 | 0x63 (c) |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
28 | 0x6F (o) | 0x6D (m) |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
30 | 0x00 | |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
. . .
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
40 | 3 | 0x77 (w) |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
42 | 0x77 (w) | 0x77 (w) |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
44 | 1 1| 20 |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
The domain name "m*.com" is shown at offset 20. The domain name
"www.m*.com" is shown at offset 40; this definition uses a pointer to
concatenate a label for www to the previously defined "m*.com".
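The pointer at offset 44 can be sketched as follows, assuming the usual STD13 pointer layout (two high bits set, followed by a 14-bit offset). The function name is hypothetical.

```python
def compression_pointer(offset: int) -> bytes:
    """Two-octet STD13 pointer: top bits 0b11, then a 14-bit offset."""
    if not 0 <= offset < (1 << 14):
        raise ValueError("offset does not fit in 14 bits")
    return bytes([0b11000000 | (offset >> 8), offset & 0xFF])


# "www" as a STD13 label, followed by a pointer to "m*.com" at offset 20.
www_name = bytes([3]) + b"www" + compression_pointer(20)
print(www_name.hex(" "))
```

Because the pointer only carries an offset, it does not need to know whether the label it points at is an IDNE label or a STD13 label.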
4. Label Size
In IDNE, the maximum length of a label is 255 octets, and the maximum
size for a domain name is 1023 octets. The reason for using these values
is so that IDNE labels can have the same number of characters as the
ASCII-based labels in [STD13]. Because character encoding in UTF-8 is
variable length, the maximum octet length for characters expected in the
foreseeable future (that is, 4 octets for a single character) was used.
Note that this extension allows some IDNE labels to be longer than 63
characters and some IDNE names to be longer than 255 octets.
Software creating DNS queries or responses using IDNE MUST verify,
after IDN preparation and transformation to UTF-8, that no labels are
longer than 255 octets and that no names are longer than 1023 octets. If
there is a user interface associated with the process creating the query
or response, that interface SHOULD give the user an error message.
Software MUST NOT transmit DNS queries or responses which contain labels
that are longer than 255 octets or names that are longer than 1023
octets. Servers MUST NOT accept DNS queries or responses which contain
labels that are longer than 255 octets or names that are longer than
1023 octets, and MUST send the NOTIMPL RCODE error message if such
queries or responses are received.
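The limits appear to follow from allowing four octets per character: 63 x 4 = 252 fits within 255, and 255 x 4 = 1020 fits within 1023. A sender-side check could look like the minimal sketch below. How the total name length is counted is an assumption here (UTF-8 octets plus a two-octet IDNE header per label, plus the root octet); the text above does not spell this out.

```python
MAX_LABEL_OCTETS = 255
MAX_NAME_OCTETS = 1023


def check_idne_sizes(labels: list[str]) -> None:
    """Raise ValueError if a prepared label or the whole name is too long.

    Assumption: the name length is the UTF-8 octets of every label plus
    a two-octet IDNE header per label and one octet for the root.
    """
    total = 1  # root label
    for label in labels:
        size = len(label.encode("utf-8"))
        if size > MAX_LABEL_OCTETS:
            raise ValueError(f"label is {size} octets, limit is {MAX_LABEL_OCTETS}")
        total += 2 + size
    if total > MAX_NAME_OCTETS:
        raise ValueError(f"name is {total} octets, limit is {MAX_NAME_OCTETS}")
```

A user interface built on such a check would catch the ValueError and show the error message that this section recommends.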
5. UDP Packet Size
IDNE-capable senders and receivers MUST support UDP packet sizes of 1220
octets, not including IP and UDP headers (note that the minimum MTU for
IPv6 is 1280 [RFC2460]). A sender MUST announce its capability in the
OPT pseudo-RR described in section 4.3 of [RFC2671] by setting the CLASS
field (the sender's UDP payload size) to a value greater than or equal to 1220.
6. Canonicalization, Prohibited Characters, and Case Folding
The string in the label MUST be pre-processed as described in [NAMEPREP]
before the query or response is prepared. A query or response MUST NOT
contain a label that does not conform to [NAMEPREP].
DNS servers MUST check for prohibited characters in the labels. If any label
in a query is found to contain one, a NOTIMPL RCODE MUST be returned.
7. Versions of IDNE
The IDN protocol version number MUST be included in the OPT RR RDATA of
EDNS (described in Section 4.4 of [RFC2671]). An OPTION-CODE will be
assigned by IANA for storing the IDNE protocol version number; this
document uses 0x0001 for the OPTION-CODE. The value (that
is, the OPTION-DATA) is the version number coded in 8 bits.
All requesters MUST send this information as part of the OPT RR included
in the EDNS packet.
7.1 This version of IDNE
This document describes version 1 of IDNE. This version is a combination
of the protocol in this document and the rules as described in
[NAMEPREP]. Note that [NAMEPREP] describes a single version of the list
of canonicalization, case folding, and prohibited characters, and that
this document is linked to that single version of [NAMEPREP].
The identifiers for this specification are:
OPTION-CODE = 0x0001 (IDNE protocol version)
OPTION-LENGTH = 0x0001 (1 octet following)
OPTION-DATA = 0x01 (IDNE protocol version 1)
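For illustration, the identifiers above could be packed into an OPT pseudo-RR as in the following sketch. It assumes the OPT RR layout of [RFC2671] (root owner name, TYPE 41, CLASS carrying the sender's UDP payload size, TTL carrying the extended RCODE and flags) and uses the placeholder OPTION-CODE from this document. The function name is hypothetical.

```python
import struct

IDNE_OPTION_CODE = 0x0001  # placeholder from this document, pending IANA
IDNE_VERSION = 1


def idne_opt_rr(udp_payload_size: int = 1220) -> bytes:
    """Build an OPT pseudo-RR carrying the IDNE version option."""
    # OPTION-CODE, OPTION-LENGTH (1 octet follows), OPTION-DATA (version).
    rdata = struct.pack("!HHB", IDNE_OPTION_CODE, 1, IDNE_VERSION)
    # Root owner name, TYPE 41 (OPT), CLASS = sender's UDP payload size,
    # TTL = extended RCODE and flags (all zero here), RDLENGTH.
    header = struct.pack("!BHHIH", 0, 41, udp_payload_size, 0, len(rdata))
    return header + rdata


print(idne_opt_rr().hex(" "))
```

The default payload size of 1220 matches the minimum that section 5 requires senders to announce.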
7.2 Creating new versions of IDNE
A new version of IDNE is created by a standards-track RFC that
specifies:
- a normative reference to [NAMEPREP] or a successor document to
[NAMEPREP]
- an IDNE version number that is 1 greater than the highest IDNE version
number at the time the RFC is published
If there are any changes to the encoding or interpretation of the
protocol, they must also be specified in the same standards-track RFC.
7.3 Prohibited characters and versions of IDNE
If a server receives a request containing an illegal or unknown
character (as described in the version number in the request), it MUST
send a NOTIMPL RCODE to the client. For example, if a server that
understands both version 1 and version 2 receives a request that is
marked as version 1, but contains a label that includes a character that
is prohibited in version 1 but allowed in version 2, that server must
still send a NOTIMPL RCODE to the client.
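The rule can be sketched as a lookup keyed by the version number in the request. The per-version tables below are invented purely for illustration (real tables would come from [NAMEPREP] and its successors), and refusing unknown versions is also an assumption.

```python
# Hypothetical per-version tables; real ones would come from [NAMEPREP].
PROHIBITED_BY_VERSION = {
    1: {0x0020, 0x002E, 0x005C},
    2: {0x0020, 0x002E},  # imagine a version 2 that later allowed U+005C
}
NOTIMPL = 4  # RCODE value for "Not Implemented" in STD13


def rcode_for_label(label: str, version: int) -> int:
    """Return 0 (no error) or NOTIMPL, judged only by the request's version."""
    prohibited = PROHIBITED_BY_VERSION.get(version)
    if prohibited is None:
        return NOTIMPL  # assumption: unknown versions are also refused
    if any(ord(ch) in prohibited for ch in label):
        return NOTIMPL
    return 0
```

The server never "upgrades" a request: a label is judged against the table for the version the requester declared, even if the server knows a newer one.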
8. API Specifications
The current API for TCP/IP uses gethostbyname and gethostbyaddr for IPv4
and getipnodebyname and getipnodebyaddr (specified in RFC 2553) for
both IPv4 and IPv6. These function calls return hostent structs, where
the h_name field contains a pointer to a char. In this context,
receiving a UTF-8 string means that the application should know that
UTF-8 can use more than one octet per character.
A new flag "IDN" (to appear in netdb.h) is defined to be passed in the
flags argument of getipnodebyname and getipnodebyaddr. This flag tells
the resolver to request an IDNE-encoded name. No new return code is
defined since the return codes in RFC 2553 are meaningful in the IDNE
context.
If one has not yet converted one's code to IPv6 and still wants to enable
IDNs with this API, one can define macros that map the getipnodeby*
functions to the IPv4 gethostby* ones, including the "IDN" flag, and then
process differently based on the presence of the flag.
9. Transition and Deployment
Deployment of this proposal means updating clients and servers, as well
as applications and protocols, and therefore a transition strategy is
proposed. Because many DNS servers do not yet handle IDNE and may take
years or decades to do so, an ASCII-compatible encoding (ACE) format for
IDN names is also needed as a transition to an all-IDNE DNS. Note that
IDNE and an ACE are not related, and do not interact in the DNS. If the
IETF chooses to have an ACE mechanism in use at the same time as IDNE,
it would be wise to choose an ACE that allows as many characters as
possible in the name parts and full names.
IDNE allows names with as many characters as current names. This means
that it is possible to create names in IDNE that are longer than those
that can be created in the ACE protocols that have been described so
far. Although not prohibited, it is unwise to create a name that can be
legally represented in IDNE but not in the ACE, or a name that can be
legally represented in the ACE but not in IDNE.
The IETF should periodically evaluate the benefits and problems
associated with having three different formats for names (STD13, IDNE,
and ACE). If at some point it is decided that the problems outweigh the
benefits, the IETF can state a time when one or more of the services
should not be used on the Internet.
10. Root Server Considerations
Because this specification uses EDNS, root servers should be prepared to
receive EDNS requests. This specification handles IDN top-level domains
in exactly the same fashion as it does every other domain.
Considerations about IDN top-level domains are outside of this work, but
the first IDN top-level domains would require all root servers to be
ready for IDNE requests.
11. IANA Considerations
[[ TBD. This section will have two parts. The first will request an EDNS
option code. The second will specify how IDNE version numbers are
allocated (namely, standards-track RFC only). ]]
12. Security Considerations
Because IDNE uses EDNS, it inherits the same security considerations as
EDNS.
13. References
[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name
Proposals", draft-ietf-idn-compare.
[NAMEPREP] Paul Hoffman & Marc Blanchet, "Preparation of
Internationalized Host Names", draft-ietf-idn-nameprep.
[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.
[RFC2279] Francois Yergeau, "UTF-8, a transformation format of ISO
10646", January 1998, RFC 2279.
[RFC2460] Steve Deering & Bob Hinden, "Internet Protocol, Version 6 (IPv6)
Specification", December 1998, RFC 2460.
[RFC2671] Paul Vixie, "Extension Mechanisms for DNS (EDNS0)", August
1999, RFC 2671.
[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035).
A. Authors' Addresses
Marc Blanchet
Viagenie
2875 boul. Laurier, bureau 300
Sainte-Foy, QC G1V 2M2 Canada
Marc.Blanchet@viagenie.qc.ca
Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
phoffman@imc.org
Internet Draft Paul Hoffman
draft-ietf-idn-nameprep-00.txt IMC & VPNC
July 3, 2000 Marc Blanchet
Expires in six months ViaGenie
Preparation of Internationalized Host Names
Status of this memo
This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."
To view the list of Internet-Draft Shadow Directories, see
http://www.ietf.org/shadow.html.
Abstract
This document describes how to prepare internationalized host names for
transmission on the wire. The steps include excluding characters that
are prohibited from appearing in internationalized host names, changing
all characters that have case properties to be lowercase, and
normalizing the characters. Further, this document lists the prohibited
characters.
1. Introduction
When expanding today's DNS to include internationalized host names,
those new names will be handled in many parts of the DNS. The IDN
Working Group's requirements document [IDNReq] describes a framework for
domain name handling as well as requirements for the new names. The IDN
Working Group's comparison document [IDNComp] gives a framework for how
various parts of the IDN solution work together.
A user can enter a domain name into an application program in a myriad
of fashions. Depending on the input method, the characters entered in
the domain name may or may not be those that are allowed in
internationalized host names. Thus, there must be a way to canonicalize
the user's input before the name is resolved in the DNS.
It is a design goal of this document to allow users to enter host names
in applications and have the highest chance of getting the name correct.
This means that the user should not be limited to entering exactly
the characters that might have been used, but should instead be able to
enter characters that unambiguously canonicalize to characters in the
desired host name. At the same time, this process must not introduce any
chance that two host names could be represented by two distinct strings
of characters that look identical to typical users. It is also a design
goal to have all preprocessing of IDN done before going on the wire, so
that no transformation is done in the DNS server space.
This document describes the steps needed to convert a name part from one
that is entered by the user to one that can be used in the DNS.
1.1 Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119
[RFC2119].
Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 [ISO10646] names. For example, the
letter "a" may be represented as either "U+0061" or "LATIN SMALL LETTER
A". In the lists of prohibited characters, the "U+" is left off to make
the lists easier to read.
1.2 IDN summary
Using the terminology in [IDNComp], this document specifies all of the
prohibited characters and the canonicalization for an IDN solution.
Specifically, it covers the following sections from [IDNComp]:
prohib-1: Identical and near-identical characters
prohib-2: Separators
prohib-3: Non-displaying and non-spacing characters
prohib-4: Private use characters
prohib-5: Punctuation
prohib-6: Symbols
canon-1.2: Normalization Form KC
canon-2.1: Case folding in ASCII
canon-2.2: Case folding in non-ASCII
Note that this document does not cover:
canon-1.1: Normalization Form C
canon-2.3: Han folding
1.3 Open issues
This is the first draft of this document. Although there has been much
discussion on the WG mailing list about the topics here, there has not
yet been much agreement on some issues. Now that there is a document to
talk about, that discussion can be more focussed.
1.3.1 Where to do name preparation
Section 2.1 says to do name preparation in the resolver. An argument can
be made for doing name preparation in the application, before the
application service interface. An advantage of that proposal is that
resolvers would not need to do any name preparation. A disadvantage is
that applications would have to be updated each time the IDN protocol is
updated, such as if new characters are added to the repertoire of
allowed characters. It seems likely that resolvers are more easily
updated than all the individual applications that use internationalized
host names.
1.3.2 Choosing between normalization form C and KC
Much of the discussion of normalization on the WG mailing list assumed
that normalization form C would be used. Near the time that this
document was written, people started considering form KC instead of C.
This document uses form KC, but the reasons for doing so could be
contentious.
1.3.3 Does the prohibition catch all bad characters?
The mailing list discussed doing prohibition in two steps: a
short list of prohibited characters before case folding in order to
prevent uppercase characters that have no lowercase equivalents from
getting through, and then a full check on the output of normalization.
In this draft, all checking is done before case folding, based on the
(possibly wrong) assumption that none of the prohibited characters will
re-appear after the case folding and normalization. If that assumption
turns out to be wrong, a check for just those problematic characters can
be added after normalization, or a full check against the prohibited
characters can be added.
2. Preparation Overview
This section describes where name preparation happens and the steps that
name preparation software must take.
2.1 Where name preparation happens
Part of the chart in section 1.4 of [IDNReq] looks like this:
   +---------------+
   | Application   |
   +---------------+
           |  Application service interface
           |  For ex. GethostbyXXXX interface
   +---------------+
   |   Resolver    |
   +---------------+
           |  <----- DNS service interface
   +-------------------------------------------+
In this specification, the name preparation is done in the resolver,
before the DNS service interface. That is, it is acceptable for software
in the application service interface (such as a "GetHostByName" API) to
pass the resolver a name that has not been prepared. However, the
resolver MUST prepare the name as described in this specification before
passing it to the DNS service interface.
2.2 Name preparation steps
The steps for preparing names are:
1) Input from the application service interface -- This can be done in
many ways and is not specified in this document
2) Look for prohibited input -- Check for any characters that are not
allowed in the input. If any are found, return an error to the
application service interface. This step is necessary to prevent errors
in the following two steps. This step fulfills prohib-1, prohib-2,
prohib-3, prohib-4, prohib-5, and prohib-6 from [IDNComp].
3) Fold case -- Change all uppercase characters into lowercase
characters. Design note: this step could just as easily have been
"change all lowercase characters into uppercase characters". However,
the upper-to-lower folding was chosen because most users of the Internet
today enter host names in lowercase. This step fulfills canon-2.1 and
canon-2.2 from [IDNComp].
4) Canonicalize -- Normalize the characters. This step fulfills canon-1.2
from [IDNComp].
5) Resolution of the prepared name -- This must be specified in a
different IDN document.
The above steps MUST be performed in the order given in order to comply
with this specification.
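Steps 2 through 4 can be sketched in Python as shown below. This is only an approximation under stated assumptions: the prohibited table is a small illustrative subset of section 3, str.lower() stands in for the case-folding step, and the normalization uses whatever Unicode version ships with the Python interpreter. The function names are hypothetical.

```python
import unicodedata

# Illustrative subset of the prohibited ranges from section 3 (not complete).
PROHIBITED_RANGES = [
    (0x0000, 0x0020),  # control characters and SPACE
    (0x002E, 0x002F),  # FULL STOP, SOLIDUS
    (0xE000, 0xF8FF),  # private use
    (0xFF00, 0xFFEF),  # halfwidth and fullwidth forms
]


def is_prohibited(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PROHIBITED_RANGES)


def prepare_name_part(text: str) -> str:
    """Steps 2-4: reject prohibited input, fold case, then normalize (KC)."""
    for ch in text:
        if is_prohibited(ch):
            raise ValueError(f"prohibited character U+{ord(ch):04X}")
    folded = text.lower()
    return unicodedata.normalize("NFKC", folded)
```

Keeping the three steps in separate statements makes the required ordering explicit: the prohibition check runs on the raw input, before any folding or normalization.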
3. Prohibited Input
Before the text can be processed, it must be checked for prohibited
characters. There is a variety of prohibited characters, as described in
this section.
Note that one of the goals of IDN is to allow the widest possible set of
host names as long as those host names do not cause other problems, such
as possible ambiguity. Specifically, experience with current DNS names
has shown that there is a desire for host names that include personal
names, company names, and spoken phrases. A goal of this section is to
prohibit as few characters that might be used in these contexts as
possible while making sure that characters that might easily cause
confusion or ambiguity are prohibited.
Note that every character listed in this section MUST NOT be transmitted
on the DNS service interface. Although the checking is being performed
before case folding and canonicalization, those steps cannot result in
any of these characters if these characters are not in the input stream.
[[[NOTE: THIS STATEMENT NEEDS TO BE CHECKED ALGORITHMICALLY.]]] If a DNS
server receives a request containing a prohibited character, then the
IDN protocol MUST return an error message.
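One way to check the statement above algorithmically is a brute-force scan: for every code point that is not prohibited, apply case folding and normalization and see whether a prohibited code point appears in the result. The sketch below assumes str.lower() and NFKC from the interpreter's own Unicode tables as stand-ins for the two steps, and takes the prohibited set as a parameter. The function name is hypothetical.

```python
import unicodedata


def find_reappearing(prohibited: set[int], limit: int = 0x10000) -> list[int]:
    """List allowed code points whose folded, normalized form contains a
    prohibited code point."""
    offenders = []
    for cp in range(limit):
        # Skip code points that are already rejected, and surrogates.
        if cp in prohibited or 0xD800 <= cp <= 0xDFFF:
            continue
        result = unicodedata.normalize("NFKC", chr(cp).lower())
        if any(ord(ch) in prohibited for ch in result):
            offenders.append(cp)
    return offenders
```

Any code point this scan reports would either need to be added to the prohibited lists or be caught by a second check after normalization.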
Note that some characters listed in one section would also appear in
other sections. Each character is only listed once.
3.1 prohib-1: Identical and near-identical characters
Many characters in [ISO10646] are identical or nearly identical to other
characters. These were often included for compatibility with other
character sets.
The characters prohibited because they are identical or nearly identical
to allowed characters are:
00AD SOFT HYPHEN
00D7 MULTIPLICATION SIGN
01C3 LATIN LETTER RETROFLEX CLICK
02B0-02FF [SPACING MODIFIER LETTERS]
066D ARABIC FIVE POINTED STAR
1806 MONGOLIAN TODO SOFT HYPHEN
2010 HYPHEN
2011 NON-BREAKING HYPHEN
2012 FIGURE DASH
2013 EN DASH
2014 EM DASH
2160-217F [ROMAN NUMERALS]
FB1D-FB4F [HEBREW PRESENTATION FORMS]
FB50-FDFF [ARABIC PRESENTATION FORMS A]
FE20-FE2F [COMBINING HALF MARKS]
FE30-FE4F [CJK COMPATIBILITY FORMS]
FE50-FE6F [SMALL FORM VARIANTS]
FE70-FEFC [ARABIC PRESENTATION FORMS B]
FF00-FFEF [HALFWIDTH AND FULLWIDTH FORMS]
3.2 prohib-2: Separators
Horizontal and vertical spacing characters would make it unclear where a
host name begins and ends. The prohibited spacing characters are:
0020 SPACE
00A0 NO-BREAK SPACE
1680 OGHAM SPACE MARK
2000-200B [SPACES]
2028 LINE SEPARATOR
2029 PARAGRAPH SEPARATOR
202F NARROW NO-BREAK SPACE
3000 IDEOGRAPHIC SPACE
Allowing periods and period-like characters as characters within a name
part would also cause similar confusion. The prohibited periods,
characters that look like periods, and characters that canonicalize to a
period or to a period-like character are:
002E FULL STOP
06D4 ARABIC FULL STOP
2024 ONE DOT LEADER
2025 TWO DOT LEADER
2026 HORIZONTAL ELLIPSIS
2488 DIGIT ONE FULL STOP
2489 DIGIT TWO FULL STOP
248A DIGIT THREE FULL STOP
248B DIGIT FOUR FULL STOP
248C DIGIT FIVE FULL STOP
248D DIGIT SIX FULL STOP
248E DIGIT SEVEN FULL STOP
248F DIGIT EIGHT FULL STOP
2490 DIGIT NINE FULL STOP
2491 NUMBER TEN FULL STOP
2492 NUMBER ELEVEN FULL STOP
2493 NUMBER TWELVE FULL STOP
2494 NUMBER THIRTEEN FULL STOP
2495 NUMBER FOURTEEN FULL STOP
2496 NUMBER FIFTEEN FULL STOP
2497 NUMBER SIXTEEN FULL STOP
2498 NUMBER SEVENTEEN FULL STOP
2499 NUMBER EIGHTEEN FULL STOP
249A NUMBER NINETEEN FULL STOP
249B NUMBER TWENTY FULL STOP
33C2 SQUARE AM
33C7 SQUARE CO
33D8 SQUARE PM
3.3 prohib-3: Non-displaying and non-spacing characters
There are many characters that cannot be seen in the ISO 10646 character
set. These include control characters, non-breaking spaces, formatting
characters, and tagging characters. These characters would certainly
cause confusion if allowed in host names.
0000-001F [CONTROL CHARACTERS]
007F DELETE
0080-009F [CONTROL CHARACTERS]
070F SYRIAC ABBREVIATION MARK
180B MONGOLIAN FREE VARIATION SELECTOR ONE
180C MONGOLIAN FREE VARIATION SELECTOR TWO
180D MONGOLIAN FREE VARIATION SELECTOR THREE
180E MONGOLIAN VOWEL SEPARATOR
200C ZERO WIDTH NON-JOINER
200D ZERO WIDTH JOINER
200E LEFT-TO-RIGHT MARK
200F RIGHT-TO-LEFT MARK
202A LEFT-TO-RIGHT EMBEDDING
202B RIGHT-TO-LEFT EMBEDDING
202C POP DIRECTIONAL FORMATTING
202D LEFT-TO-RIGHT OVERRIDE
202E RIGHT-TO-LEFT OVERRIDE
206A INHIBIT SYMMETRIC SWAPPING
206B ACTIVATE SYMMETRIC SWAPPING
206C INHIBIT ARABIC FORM SHAPING
206D ACTIVATE ARABIC FORM SHAPING
206E NATIONAL DIGIT SHAPES
206F NOMINAL DIGIT SHAPES
FEFF ZERO WIDTH NO-BREAK SPACE
FFF9 INTERLINEAR ANNOTATION ANCHOR
FFFA INTERLINEAR ANNOTATION SEPARATOR
FFFB INTERLINEAR ANNOTATION TERMINATOR
FFFC OBJECT REPLACEMENT CHARACTER
FFFD REPLACEMENT CHARACTER
3.4 prohib-4: Private use characters
Because private-use characters do not have defined meanings, they are
prohibited. The private-use characters are:
E000-F8FF [PRIVATE USE, PLANE 0]
3.5 prohib-5: Punctuation
The following characters are reserved or delimiters in URLs [RFC2396]
and [RFC2732]:
" # $ % & + , . / : ; < = > ? @ [ ]
3.5.1 Characters from URLs
The following punctuation characters are prohibited because they are
reserved or delimiters in URLs.
0022 QUOTATION MARK
0023 NUMBER SIGN
0024 DOLLAR SIGN
0025 PERCENT SIGN
0026 AMPERSAND
002B PLUS SIGN
002C COMMA
002E FULL STOP
002F SOLIDUS
003A COLON
003B SEMICOLON
003C LESS-THAN SIGN
003D EQUALS SIGN
003E GREATER-THAN SIGN
003F QUESTION MARK
0040 COMMERCIAL AT
005B LEFT SQUARE BRACKET
005D RIGHT SQUARE BRACKET
3.5.2 Characters that canonicalize to characters from URLs
The following punctuation characters are prohibited because their
normalization contains one or more of the characters from section 3.5.1.
037E GREEK QUESTION MARK
2048 QUESTION EXCLAMATION MARK
2049 EXCLAMATION QUESTION MARK
207A SUPERSCRIPT PLUS SIGN
207C SUPERSCRIPT EQUALS SIGN
208A SUBSCRIPT PLUS SIGN
208C SUBSCRIPT EQUALS SIGN
2100 ACCOUNT OF
2101 ADDRESSED TO THE SUBJECT
2105 CARE OF
2106 CADA UNA
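The claim for this list can be spot-checked with a short script. The sketch below assumes normalization form KC as implemented by Python's unicodedata module, which may reflect a later Unicode version than the one this document was written against.

```python
import unicodedata

# The reserved and delimiter characters from section 3.5.1.
URL_CHARS = set('"#$%&+,./:;<=>?@[]')
CANDIDATES = [0x037E, 0x2048, 0x2049, 0x207A, 0x207C, 0x208A, 0x208C,
              0x2100, 0x2101, 0x2105, 0x2106]


def normalizes_to_url_char(cp: int) -> bool:
    """True if the KC normalization of cp contains a character from 3.5.1."""
    normalized = unicodedata.normalize("NFKC", chr(cp))
    return any(ch in URL_CHARS for ch in normalized)


for cp in CANDIDATES:
    print(f"{cp:04X}", repr(unicodedata.normalize("NFKC", chr(cp))))
```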
3.5.3 Characters that look like characters from URLs
The following are prohibited because they look indistinguishable from
the characters listed in section 3.5.1.
037E GREEK QUESTION MARK
0589 ARMENIAN FULL STOP
060C ARABIC COMMA
061B ARABIC SEMICOLON
066A ARABIC PERCENT SIGN
201A SINGLE LOW-9 QUOTATION MARK
2030 PER MILLE SIGN
2031 PER TEN THOUSAND SIGN
2033 DOUBLE PRIME
2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203D INTERROBANG
2044 FRACTION SLASH
3001 IDEOGRAPHIC COMMA
3002 IDEOGRAPHIC FULL STOP
3003 DITTO MARK
3008 LEFT ANGLE BRACKET
3009 RIGHT ANGLE BRACKET
3014 LEFT TORTOISE SHELL BRACKET
3015 RIGHT TORTOISE SHELL BRACKET
301A LEFT WHITE SQUARE BRACKET
301B RIGHT WHITE SQUARE BRACKET
3.5.4 Other punctuation
The following punctuation character is prohibited because it is unlikely
to be used in names and may be confusing to users or to character-entry
processes:
005C REVERSE SOLIDUS
3.6 prohib-6: Symbols
[UniData] has non-normative categories for symbols. The four symbol
categories are:
Symbol, Currency: Currency symbols could appear in company names and
spoken phrases, so they are not prohibited.
Symbol, Modifier: Stand-alone modifiers might appear in personal names,
company names, and spoken phrases, so they are not prohibited.
Symbol, Math: It is very unlikely that there are any significant
personal names, company names, or spoken phrases that contain
mathematical symbols. Further, many of these symbols are the same or
similar to other punctuation, thereby leading to ambiguity. For this
reason, math-specific symbols are prohibited. These prohibited math
symbols are:
00AC NOT SIGN
00B1 PLUS-MINUS SIGN
2200-22FF [MATHEMATICAL OPERATORS]
Further, the following characters canonicalize to characters in the
above math list, and therefore are also prohibited:
00BC VULGAR FRACTION ONE QUARTER
00BD VULGAR FRACTION ONE HALF
00BE VULGAR FRACTION THREE QUARTERS
207B SUPERSCRIPT MINUS
208B SUBSCRIPT MINUS
2153 VULGAR FRACTION ONE THIRD
2154 VULGAR FRACTION TWO THIRDS
2155 VULGAR FRACTION ONE FIFTH
2156 VULGAR FRACTION TWO FIFTHS
2157 VULGAR FRACTION THREE FIFTHS
2158 VULGAR FRACTION FOUR FIFTHS
2159 VULGAR FRACTION ONE SIXTH
215A VULGAR FRACTION FIVE SIXTHS
215B VULGAR FRACTION ONE EIGHTH
215C VULGAR FRACTION THREE EIGHTHS
215D VULGAR FRACTION FIVE EIGHTHS
215E VULGAR FRACTION SEVEN EIGHTHS
215F FRACTION NUMERATOR ONE
33A7 SQUARE M OVER S
33A8 SQUARE M OVER S SQUARED
33AE SQUARE RAD OVER S
33AF SQUARE RAD OVER S SQUARED
33C6 SQUARE C OVER KG
Symbol, Other: This category covers a multitude of symbols, few of which
would ever appear in personal names, company names, and spoken phrases.
The rest of the prohibited symbols are:
2190-21FF [ARROWS]
2300-23FF [MISCELLANEOUS TECHNICAL]
2400-243F [CONTROL PICTURES]
2440-245F [OPTICAL CHARACTER RECOGNITION]
2500-257F [BOX DRAWING]
2580-259F [BLOCK ELEMENTS]
25A0-25FF [GEOMETRIC SHAPES]
2600-267F [MISCELLANEOUS SYMBOLS]
2700-27BF [DINGBATS]
2800-287F [BRAILLE PATTERNS]
3.7 Additional prohibited characters
3.7.1 Unassigned characters
All characters not yet assigned in [ISO10646] are prohibited. Although
this may at first seem trivial, it is extremely important because
characters that may be assigned in the future might have properties that
would cause them to be prohibited or might have case-folding properties.
As is the case for all prohibited characters, if a DNS server receives a
request containing an unassigned character, then the IDN protocol MUST
return an error message.
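A minimal sketch of such a check follows. It assumes Python's
unicodedata module, where unassigned code points (and noncharacters)
report general category "Cn". The result depends on the Unicode version
bundled with the interpreter, not on the table this draft envisions, and
the function names are hypothetical.

```python
import unicodedata

def find_unassigned(label):
    """Return the code points in label that the local Unicode database reports as Cn."""
    return [ord(ch) for ch in label if unicodedata.category(ch) == "Cn"]

def check_label(label):
    """Return label unchanged, or raise ValueError if it contains an unassigned character."""
    bad = find_unassigned(label)
    if bad:
        raise ValueError("unassigned: " + ", ".join("U+%04X" % cp for cp in bad))
    return label
```

A server-side implementation would map the ValueError to whatever error
response the IDN protocol defines.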
3.7.2 Surrogate characters
So far, all proposals for binary encodings of internationalized name
parts have specified UTF-8 as the encoding format. In such an encoding,
surrogate characters MUST NOT be used. Therefore, for UTF-8 encodings,
the following are prohibited:
D800-DFFF [SURROGATE CHARACTERS]
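As an illustration only, the range test can be sketched in a few lines
of Python. The function names are hypothetical. Note that Python's
strict UTF-8 encoder already refuses lone surrogates, so the explicit
check mainly produces a clearer error.

```python
def contains_surrogate(label):
    """True if any code point of label lies in D800-DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in label)

def encode_label(label):
    """Encode label as UTF-8, rejecting surrogate code points first."""
    if contains_surrogate(label):
        raise ValueError("surrogate code points are prohibited in UTF-8 name parts")
    return label.encode("utf-8")
```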
3.7.3 Uppercase characters with no lowercase mappings
There are many uppercase characters in [ISO10646] which do not have
lowercase equivalents in [UniData]. Therefore, they are prohibited on
input because they would get through the case mapping step while still
being in uppercase.
The characters that are prohibited on input because they are uppercase
but have no lowercase mappings are:
03D2 GREEK UPSILON WITH HOOK SYMBOL
03D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
03D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
04C0 CYRILLIC LETTER PALOCHKA
10A0-10C5 [GEORGIAN CAPITAL LETTERS]
Note that many characters in the range U+2100 to U+213A, the letterlike
symbols, also are uppercase but have no lowercase mappings. However,
they are not listed here because the entire range is already prohibited
in section 3.6.
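The condition described in this section can be approximated with the
sketch below. It assumes Python's unicodedata module and str.lower(),
which reflect a much newer Unicode version than this draft. Later
versions added lowercase mappings for some of the characters listed
above, so the results will not match the list exactly. The function name
is hypothetical.

```python
import unicodedata

def is_unmapped_uppercase(ch):
    """True if ch is an uppercase letter (Lu) that lowercasing leaves unchanged."""
    return unicodedata.category(ch) == "Lu" and ch.lower() == ch

def find_unmapped_uppercase(label):
    """Return the code points in label that would survive case mapping in uppercase."""
    return [ord(ch) for ch in label if is_unmapped_uppercase(ch)]
```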
3.7.4 Radicals and Ideographic Description
Some Han characters can be informally defined in terms of ideographic
descriptions. However, ideographic descriptions can lead to multiple
character streams leading to the same character in a fashion that does
not canonicalize. Thus, the radicals for ideographic description and the
ideographic description characters themselves are prohibited. These
characters are:
2E80-2EFF [CJK RADICALS SUPPLEMENT]
2F00-2FDF [KANGXI RADICALS]
2FF0-2FFF [IDEOGRAPHIC DESCRIPTION CHARACTERS]
3.8 Summary of prohibited characters
The following is a collected list from the previous sections.
0000-001F [CONTROL CHARACTERS]
0020 SPACE
0022 QUOTATION MARK
0023 NUMBER SIGN
0024 DOLLAR SIGN
0025 PERCENT SIGN
0026 AMPERSAND
002B PLUS SIGN
002C COMMA
002E FULL STOP
002F SOLIDUS
003A COLON
003B SEMICOLON
003C LESS-THAN SIGN
003D EQUALS SIGN
003E GREATER-THAN SIGN
003F QUESTION MARK
0040 COMMERCIAL AT
005B LEFT SQUARE BRACKET
005C REVERSE SOLIDUS
005D RIGHT SQUARE BRACKET
007F DELETE
0080-009F [CONTROL CHARACTERS]
00A0 NO-BREAK SPACE
00AC NOT SIGN
00AD SOFT HYPHEN
00B1 PLUS-MINUS SIGN
00BC VULGAR FRACTION ONE QUARTER
00BD VULGAR FRACTION ONE HALF
00BE VULGAR FRACTION THREE QUARTERS
00D7 MULTIPLICATION SIGN
01C3 LATIN LETTER RETROFLEX CLICK
02B0-02FF [SPACING MODIFIER LETTERS]
037E GREEK QUESTION MARK
03D2 GREEK UPSILON WITH HOOK SYMBOL
03D3 GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
03D4 GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
04C0 CYRILLIC LETTER PALOCHKA
0589 ARMENIAN FULL STOP
060C ARABIC COMMA
061B ARABIC SEMICOLON
066A ARABIC PERCENT SIGN
066D ARABIC FIVE POINTED STAR
06D4 ARABIC FULL STOP
070F SYRIAC ABBREVIATION MARK
10A0-10C5 [GEORGIAN CAPITAL LETTERS]
1680 OGHAM SPACE MARK
1806 MONGOLIAN TODO SOFT HYPHEN
180B MONGOLIAN FREE VARIATION SELECTOR ONE
180C MONGOLIAN FREE VARIATION SELECTOR TWO
180D MONGOLIAN FREE VARIATION SELECTOR THREE
180E MONGOLIAN VOWEL SEPARATOR
2000-200B [SPACES]
200C ZERO WIDTH NON-JOINER
200D ZERO WIDTH JOINER
200E LEFT-TO-RIGHT MARK
200F RIGHT-TO-LEFT MARK
2010 HYPHEN
2011 NON-BREAKING HYPHEN
2012 FIGURE DASH
2013 EN DASH
2014 EM DASH
201A SINGLE LOW-9 QUOTATION MARK
2024 ONE DOT LEADER
2025 TWO DOT LEADER
2026 HORIZONTAL ELLIPSIS
2028 LINE SEPARATOR
2029 PARAGRAPH SEPARATOR
202A LEFT-TO-RIGHT EMBEDDING
202B RIGHT-TO-LEFT EMBEDDING
202C POP DIRECTIONAL FORMATTING
202D LEFT-TO-RIGHT OVERRIDE
202E RIGHT-TO-LEFT OVERRIDE
202F NARROW NO-BREAK SPACE
2030 PER MILLE SIGN
2031 PER TEN THOUSAND SIGN
2033 DOUBLE PRIME
2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
203A SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203D INTERROBANG
2044 FRACTION SLASH
2048 QUESTION EXCLAMATION MARK
2049 EXCLAMATION QUESTION MARK
206A INHIBIT SYMMETRIC SWAPPING
206B ACTIVATE SYMMETRIC SWAPPING
206C INHIBIT ARABIC FORM SHAPING
206D ACTIVATE ARABIC FORM SHAPING
206E NATIONAL DIGIT SHAPES
206F NOMINAL DIGIT SHAPES
207A SUPERSCRIPT PLUS SIGN
207B SUPERSCRIPT MINUS
207C SUPERSCRIPT EQUALS SIGN
208A SUBSCRIPT PLUS SIGN
208B SUBSCRIPT MINUS
208C SUBSCRIPT EQUALS SIGN
2100 ACCOUNT OF
2101 ADDRESSED TO THE SUBJECT
2105 CARE OF
2106 CADA UNA
2153 VULGAR FRACTION ONE THIRD
2154 VULGAR FRACTION TWO THIRDS
2155 VULGAR FRACTION ONE FIFTH
2156 VULGAR FRACTION TWO FIFTHS
2157 VULGAR FRACTION THREE FIFTHS
2158 VULGAR FRACTION FOUR FIFTHS
2159 VULGAR FRACTION ONE SIXTH
215A VULGAR FRACTION FIVE SIXTHS
215B VULGAR FRACTION ONE EIGHTH
215C VULGAR FRACTION THREE EIGHTHS
215D VULGAR FRACTION FIVE EIGHTHS
215E VULGAR FRACTION SEVEN EIGHTHS
215F FRACTION NUMERATOR ONE
2160-217F [ROMAN NUMERALS]
2190-21FF [ARROWS]
2200-22FF [MATHEMATICAL OPERATORS]
2300-23FF [MISCELLANEOUS TECHNICAL]
2400-243F [CONTROL PICTURES]
2440-245F [OPTICAL CHARACTER RECOGNITION]
2488 DIGIT ONE FULL STOP
2489 DIGIT TWO FULL STOP
248A DIGIT THREE FULL STOP
248B DIGIT FOUR FULL STOP
248C DIGIT FIVE FULL STOP
248D DIGIT SIX FULL STOP
248E DIGIT SEVEN FULL STOP
248F DIGIT EIGHT FULL STOP
2490 DIGIT NINE FULL STOP
2491 NUMBER TEN FULL STOP
2492 NUMBER ELEVEN FULL STOP
2493 NUMBER TWELVE FULL STOP
2494 NUMBER THIRTEEN FULL STOP
2495 NUMBER FOURTEEN FULL STOP
2496 NUMBER FIFTEEN FULL STOP
2497 NUMBER SIXTEEN FULL STOP
2498 NUMBER SEVENTEEN FULL STOP
2499 NUMBER EIGHTEEN FULL STOP
249A NUMBER NINETEEN FULL STOP
249B NUMBER TWENTY FULL STOP
2500-257F [BOX DRAWING]
2580-259F [BLOCK ELEMENTS]
25A0-25FF [GEOMETRIC SHAPES]
2600-267F [MISCELLANEOUS SYMBOLS]
2700-27BF [DINGBATS]
2800-287F [BRAILLE PATTERNS]
2E80-2EFF [CJK RADICALS SUPPLEMENT]
2F00-2FDF [KANGXI RADICALS]
2FF0-2FFF [IDEOGRAPHIC DESCRIPTION CHARACTERS]
3000 IDEOGRAPHIC SPACE
3001 IDEOGRAPHIC COMMA
3002 IDEOGRAPHIC FULL STOP
3003 DITTO MARK
3008 LEFT ANGLE BRACKET
3009 RIGHT ANGLE BRACKET
33A7 SQUARE M OVER S
33A8 SQUARE M OVER S SQUARED
33AE SQUARE RAD OVER S
33AF SQUARE RAD OVER S SQUARED
33C2 SQUARE AM
33C6 SQUARE C OVER KG
33C7 SQUARE CO
33D8 SQUARE PM
D800-DFFF [SURROGATE CHARACTERS]
E000-F8FF [PRIVATE USE, PLANE 0]
FB1D-FB4F [HEBREW PRESENTATION FORMS]
FB50-FDFF [ARABIC PRESENTATION FORMS A]
FE20-FE2F [COMBINING HALF MARKS]
FE30-FE4F [CJK COMPATIBILITY FORMS]
FE50-FE6F [SMALL FORM VARIANTS]
FE70-FEFC [ARABIC PRESENTATION FORMS B]
FEFF ZERO WIDTH NO-BREAK SPACE
FF00-FFEF [HALFWIDTH AND FULLWIDTH FORMS]
FFF9 INTERLINEAR ANNOTATION ANCHOR
FFFA INTERLINEAR ANNOTATION SEPARATOR
FFFB INTERLINEAR ANNOTATION TERMINATOR
FFFC OBJECT REPLACEMENT CHARACTER
FFFD REPLACEMENT CHARACTER
Unassigned characters
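One way an implementation might consult such a list is a sorted table of
inclusive ranges searched by bisection. The sketch below is an
assumption about implementation, not part of this specification. The
table holds only a small subset of the entries above for brevity, and
the names PROHIBITED_RANGES, is_prohibited, and first_prohibited are
hypothetical.

```python
from bisect import bisect_right

# Illustrative subset of the summary list; inclusive (first, last) pairs,
# sorted by first code point and non-overlapping.
PROHIBITED_RANGES = [
    (0x0000, 0x001F),  # [CONTROL CHARACTERS]
    (0x0020, 0x0020),  # SPACE
    (0x002E, 0x002E),  # FULL STOP
    (0x007F, 0x007F),  # DELETE
    (0x0080, 0x009F),  # [CONTROL CHARACTERS]
    (0x2190, 0x21FF),  # [ARROWS]
    (0x2200, 0x22FF),  # [MATHEMATICAL OPERATORS]
    (0x2E80, 0x2EFF),  # [CJK RADICALS SUPPLEMENT]
    (0xD800, 0xDFFF),  # [SURROGATE CHARACTERS]
    (0xE000, 0xF8FF),  # [PRIVATE USE, PLANE 0]
]
_STARTS = [first for first, _ in PROHIBITED_RANGES]

def is_prohibited(cp):
    """True if code point cp falls inside one of the table's ranges."""
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= PROHIBITED_RANGES[i][1]

def first_prohibited(label):
    """Return the first prohibited code point in label, or None if there is none."""
    for ch in label:
        if is_prohibited(ord(ch)):
            return ord(ch)
    return None
```

A complete table would also need the unassigned code points, which
depend on the version of the character repertoire in use.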
4. Case Folding
After it has been verified that the input text has none of the
characters prohibited for case folding, the case-folding step itself is
quite straightforward. For each character in the input, if there is a
lowercase mapping for that character in [UniData], the input character
is changed to the mapped lowercase letter.
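The step can be sketched as follows, assuming the semicolon-separated
UnicodeData file layout in which field 0 is the code point and field 13
is the simple lowercase mapping. The three sample lines stand in for the
full file, and the function names are hypothetical.

```python
# Three sample lines in the UnicodeData file layout (not the full file).
SAMPLE_UNIDATA = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00C9;LATIN CAPITAL LETTER E WITH ACUTE;Lu;0;L;0045 0301;;;;N;LATIN CAPITAL LETTER E ACUTE;;;00E9;
"""

def load_lowercase_map(text):
    """Build {code point: lowercase code point} from UnicodeData-format lines."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split(";")
        # Field 13 is empty when the character has no lowercase mapping.
        if len(fields) > 13 and fields[13]:
            mapping[int(fields[0], 16)] = int(fields[13], 16)
    return mapping

def fold_case(label, mapping):
    """Replace each character that has a lowercase mapping; leave the rest unchanged."""
    return "".join(chr(mapping.get(ord(ch), ord(ch))) for ch in label)
```

Calling fold_case(label, load_lowercase_map(SAMPLE_UNIDATA)) lowercases
only the characters covered by the sample lines. A real implementation
would load the complete file.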
5. Canonicalization
After case folding, the input string is normalized using form KC, as
described in [UTR15].
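For illustration, Python's standard library exposes form KC as "NFKC".
The sketch below assumes that module and the Unicode version it ships
with, which is newer than the one this draft references. The function
name canonicalize is hypothetical.

```python
import unicodedata

def canonicalize(label):
    """Normalize label using Unicode normalization form KC."""
    return unicodedata.normalize("NFKC", label)

# A base letter followed by a combining accent composes into one character,
# and a compatibility ligature is replaced by its plain letters.
composed = canonicalize("e\u0301")
expanded = canonicalize("\ufb01")
```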
6. IDN Table Revisions
A table consisting of all characters allowed and prohibited and the
rules for case folding and canonicalization will be created based on the
content of [UniData] and on the content of this document. This table
will be the authority for implementations to follow and will be
normatively referenced by this document. Such a table will enable the
IDN protocol to have versions independent of the revisions to Unicode
and/or to ISO 10646, because the revision of IDN and its deployment may
not be in sync with revisions to Unicode and ISO 10646.
In a future draft of this document, IANA will be asked to keep this
table, with an initial version number of 1. Each new version of the
table will have a new, higher version number.
7. Security Considerations
Much of the security of the Internet relies on the DNS. Thus, any change
to the characteristics of the DNS can change the security of much of the
Internet.
Host names are used by users to connect to Internet servers. The
security of the Internet would be compromised if a user entering a
single internationalized name could be connected to different servers
based on different interpretations of the internationalized host name.
8. References
[IDNComp] Paul Hoffman, "Comparison of Internationalized Domain Name
Proposals", draft-ietf-idn-compare.
[IDNReq] James Seng, "Requirements of Internationalized Domain Names",
draft-ietf-idn-requirement.
[ISO10646] ISO/IEC 10646-1:1993. International Standard -- Information
technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part
1: Architecture and Basic Multilingual Plane. Five amendments and a
technical corrigendum have been published up to now. UTF-16 is described
in Annex Q, published as Amendment 1. 17 other amendments are currently
at various stages of standardization. [[[ THIS REFERENCE NEEDS TO BE
UPDATED AFTER DETERMINING ACCEPTABLE WORDING ]]]
[Normalize] "Character Normalization in IETF Protocols",
draft-duerst-i18n-norm-03.
[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.
[RFC2396] Tim Berners-Lee, et al., "Uniform Resource Identifiers (URI):
Generic Syntax", August 1998, RFC 2396.
[RFC2732] Robert Hinden, et al., "Format for Literal IPv6 Addresses in
URL's", December 1999, RFC 2732.
[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035).
[Unicode3] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Described at
<http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.
[UniData] The Unicode Consortium. UnicodeData File.
<ftp://ftp.unicode.org/Public/UNIDATA/UnicodeData.txt>.
[UTR15] Mark Davis and Martin Duerst. Unicode Normalization Forms.
Unicode Technical Report #15.
<http://www.unicode.org/unicode/reports/tr15/>.
A. Acknowledgements
Many people from the IETF IDN Working Group and the Unicode Technical
Committee contributed ideas that went into the first draft of this
document. Mark Davis was particularly helpful in some of the early
ideas.
B. Changes From Previous Versions of this Draft
This is the -00 version, so there are no changes.
C. IANA Considerations
There are no specific IANA considerations in this draft, but there will
be in a future draft of this document.
D. Author Contact Information
Paul Hoffman
Internet Mail Consortium and VPN Consortium
127 Segre Place
Santa Cruz, CA 95060 USA
paul.hoffman@imc.org and paul.hoffman@vpnc.org
Marc Blanchet
Viagenie inc.
2875 boul. Laurier, bur. 300
Ste-Foy, Quebec, Canada, G1V 2M2
Marc.Blanchet@viagenie.qc.ca
Marc Blanchet
Viagénie inc.
tel: 418-656-9254
http://www.viagenie.qc.ca