[idn] Internet Draft - Phased Implementation for IDNA
Dear IETF IDN members,
The attached draft "Phased Implementation for IDNA" was submitted.
You are welcome to comment on the draft.
FYI
Kenny Huang
Internet Draft Lee Ming Tseng
<draft-ietf-tseng-piidna-00.txt> Jan Ming Ho
01 Feb 2002 Kenny Huang
expires 01 August 2002
Phased Implementation for Internationalized Domain Names in Applications
Status of this Memo
This document is an Internet-Draft and is in full conformance
with all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet
Engineering Task Force (IETF), its areas, and its working
groups. Note that other groups may also distribute working
documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of
six months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-
Drafts as reference material or to cite them other than as
"work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
A copy of this particular draft is also archived at
http://www.twnic.net.tw
Abstract
This document proposes a phased implementation for IDNA
(Internationalized Domain Names in Applications). DNS infrastructure
is critical for the Internet operation. The implementation of IDNA
shall be carefully considered and examined. Deployment of IDN
infrastructure shall be migrated step by step to ensure the reliability
of the new infrastructure. To fulfill the incremental change requirements,
this document proposes a phased implementation for IDNA.
1 Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED", and
"MAY" in this document are to be interpreted as described in RFC 2119 [7].
A "code point" is an integer value associated with a character in a coded
character set.
"TC" is an abbreviation for Traditional Chinese.
"SC" is an abbreviation for Simplified Chinese.
"CDN" is defined as an acronym of Chinese Domain Name that represents
an internationalized domain name containing at least one Chinese
character. The scope of "Chinese character" follows
ISO/IEC 10646-1:2000(E) [second edition 2000-09-15]: a character
marked "C and G-Hanzi-T" there MUST be treated as a Chinese character.
This definition does not imply that the character is not also used in
other countries that use Han ideographs. [8]
2 Proposed Phased implementation of IDNA
The IDN Working Group has decided to use Unicode as the basis for
IDN services. This document proposes a two-phase implementation of
IDNA, namely a Bootstrapping Phase and a Mature Phase, as described below.
2.1 Bootstrapping Phase
In the bootstrapping phase, the code points listed in Appendix B
shall be prohibited until the user community requests a future
update. Section 3 describes how Appendix B was formed.
+------+
| User |
+------+
^
| Input and display: local interface methods
| (pen, keyboard, glowing phosphorus, ...)
+-------------------|-------------------------------+
| v |
| +-----------------------------+ |
| | NamePrep | |
| | 1 Mapping | |
| | 2 Normalization | |
| | 3 Prohibited Output | |
| +-----------------------------+ |
| ^ |
| | |
| v |
| +-----------------------------+ |
| | Extended Prohibited Output | |
| +-----------------------------+ |
| ^ |
| | |
| v |
| +-----------------------------+ |
| | Punycode[5] | |
| +-----------------------------+ |
| ^ ^ | End system
| | | |
| Call to resolver: | | Application-specific |
| ACE | | protocol: |
| v | predefined by the |
| +----------+ | protocol or defaults |
| | Resolver | | to ACE |
| +----------+ | |
| ^ | |
+-----------------|----------|----------------------+
DNS protocol: | |
ACE | |
v v
+-------------+ +---------------------+
| DNS servers | | Application servers |
+-------------+ +---------------------+
Table 1. IDNA architecture [4] with extended prohibited output module.
2.2 Mature Phase
The phased implementation of IDNA shall remain flexible enough to
allow future revision. Unknown code points will be sent to the extended
prohibited output module. Valid code points, on the other hand, will
never be prohibited. A future version of IDNA simply removes
the prohibition on the code points listed in Appendix B, resulting
in the same IDNA that is currently proposed [4].
+------+
| User |
+------+
^
| Input and display: local interface methods
| (pen, keyboard, glowing phosphorus, ...)
+-------------------|-------------------------------+
| v |
| +-----------------------------+ |
| | NamePrep | |
| | 1 Mapping | |
| | 2 Normalization | |
| | 3 Prohibited Output | |
| +-----------------------------+ |
| ^ |
| | |
| v |
| +-----------------------------+ |
| | Punycode[5] | |
| +-----------------------------+ |
| ^ ^ | End system
| | | |
| Call to resolver: | | Application-specific |
| ACE | | protocol: |
| v | predefined by the |
| +----------+ | protocol or defaults |
| | Resolver | | to ACE |
| +----------+ | |
| ^ | |
+-----------------|----------|----------------------+
DNS protocol: | |
ACE | |
v v
+-------------+ +---------------------+
| DNS servers | | Application servers |
+-------------+ +---------------------+
Table 2. IDNA architecture [4].
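
The difference between the two phases can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the
names is_extended_prohibited and to_ace_label are hypothetical, NFKC
plus lower-casing merely stands in for the NamePrep mapping and
normalization steps, and no ACE prefix is added to the Punycode output.

```python
import unicodedata

# Ranges copied from Appendix B of this draft (inclusive, hexadecimal).
EXTENDED_PROHIBITED = [(0x4E00, 0x9FAF), (0x3400, 0x4DBF)]


def is_extended_prohibited(ch):
    """Return True if the character falls in an Appendix B range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EXTENDED_PROHIBITED)


def to_ace_label(label, bootstrapping=True):
    """Hypothetical helper: prepare one label and encode it with Punycode.

    Assumption: NFKC + lower() is a rough stand-in for NamePrep and is
    NOT a full NamePrep implementation.
    """
    prepared = unicodedata.normalize("NFKC", label).lower()
    if bootstrapping:
        # Extended Prohibited Output module: present only in this phase.
        for ch in prepared:
            if is_extended_prohibited(ch):
                raise ValueError(
                    "extended prohibited code point U+%04X" % ord(ch))
    # Python's built-in "punycode" codec performs the Punycode step.
    return prepared.encode("punycode").decode("ascii")
```

With bootstrapping=True a label containing U+4E2D is rejected with a
ValueError; with bootstrapping=False the same label is passed straight
to the Punycode step, which mirrors the removal of the extra module in
the mature phase.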
3 Extended Prohibited Output
This section specifies how the extended prohibition table
(Appendix B) is used. The code points listed in Appendix B
are proposed by the authors. Appendix B covers part of the Han
code points, which may be used in Japan, Korea, Taiwan and
China.
The subsections below describe why the code points in Appendix B
were selected. Implementations of this specification MUST be based
on the lists in Appendix B, not on the descriptions in this section.
3.1 Equivalent matching
Some character sets, such as the Han code points, raise the issue of
equivalent matching. Han characters are used in many countries in Asia.
Within a single written language, two Han characters are said to be
variants of each other if they have the same meaning and the same
pronunciation. In other words, they are expected to match as
equivalent characters. However, a variant relation can be either
context sensitive or context free. [1][2]
It is also true that some variant relations that hold in one country
do not exist in other countries. Because the set of Han ideographs is
open, it is still growing today. Matters are further complicated
because the number of Han characters, and hence of variants, differs
between versions of Unicode. The number of unified Han characters is
21,204 in Unicode 2.0, 27,786 in Unicode 3.0, and 70,207 in Unicode 3.1
[6]. The larger the Unicode repertoire, the larger the set of
associated variants.
We are aware of some dictionaries of variants. However, at the time of
writing, no organization had undertaken an international effort to
standardize variants based on Unicode.
We also recognize that one does not have to consider the existence of
variants if names are nothing but identifiers. But, if a name itself
is a product with commercial value as is the case in domain name
services, then the ambiguity introduced by the variants into delegation
and resolution processes must be minimized. A domain name service
which is unable to minimize these ambiguities will cause serious
consumer protection problems.
One possible solution to the Han variant problem is to standardize
a variant relation that is context free and holds for all nations
or regions, with respect to a given subset of Han characters. The
purpose is not to provide a complete solution to the Han variant
problem, given that the set of Chinese characters is open. Instead,
the purpose is to define a maximal set of equivalent variants such that
ambiguity in a name service can be minimized at a reasonable cost by
a low-level mechanism like IDNA. It is easier, and thus recommended
by the authors, to define the variant relation on a small subset of Han
ideographs, e.g., those in Unicode 2.0. In that case, Han characters
beyond this code range should be forbidden in a domain name. Note that
Han characters outside of Unicode 2.0 are not commonly used in
daily life. Working from a more recent version of Unicode is also
possible if it can be justified. Han variants can be standardized
by other standards bodies, e.g., the Unicode Consortium.
Note that Han variants are a relation between characters. This differs
from the equivalence of the words "color" and "colour", which is a
relation between strings of characters.
As mentioned earlier, once a variant relation is defined on a closed
subset of Han ideographs, character-level equivalence matching
can be implemented in IDNA. On the other hand, intelligent matching
algorithms can also be developed at higher layers to match
context-sensitive and localized Han variants [15].
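
Character-level equivalence matching of this kind can be sketched as
a fold to one representative per variant class, followed by an
ordinary string comparison. The sketch below is illustrative only:
the two-entry VARIANT_TO_CANONICAL table is a hand-picked assumption,
not a standardized variant relation, and the function names are
hypothetical.

```python
# Illustrative, hand-picked pairs only (assumption): no standardized
# context-free variant table exists at the time of this draft.
VARIANT_TO_CANONICAL = {
    "\u570b": "\u56fd",  # U+570B folded to U+56FD
    "\u81fa": "\u53f0",  # U+81FA folded to U+53F0
}


def fold_variants(label):
    """Replace each character by the representative of its variant class."""
    return "".join(VARIANT_TO_CANONICAL.get(ch, ch) for ch in label)


def labels_equivalent(a, b):
    """Context-free, character-by-character equivalence of two labels."""
    return fold_variants(a) == fold_variants(b)
```

Because the fold works one character at a time, labels_equivalent
treats U+570B U+5BB6 and U+56FD U+5BB6 as the same label, while
"color" and "colour" remain different: that pair is a relation
between strings, which such a table cannot express.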
The severity of an inconsistent matching rule differs among
language communities. The requirements and importance
of equivalent CDN were also addressed by the Chinese Domain Name
Consortium (CDNC) and JET (Joint Engineering Team, formed by JPNIC,
KRNIC, CNNIC, TWNIC). CDN requirements are listed in Appendix A. Until
a set of consistent matching rules is standardized, it is recommended
that these controversial code points be temporarily prohibited in the
bootstrapping phase.
3.2 Visual difficulty
Some code points are visually impossible to differentiate and
could lead to many user entry errors. Such code points can
cause unpredictable results when queried.
The issue of visual difficulty may exist in many scripts, but
its impact on each language group should be evaluated
individually.
3.3 Solutions incompleteness
It is generally accepted that the IDNA solution does not solve the
CDN problems listed in Appendix A. Although the WG considered
some possible solutions to the CDN problem, those solutions did not
meet the IETF's requirements. Thus, this document proposes prohibiting
the Han characters listed in Appendix B until a solution that is
acceptable to the IETF can be found, or until it is clear that no
such solution is possible.
4. Security Considerations [3]
Any additional function in the architecture implies additional
opportunities for compromising the mechanism. Another security issue is
that a user who enters a name containing code points from the extended
prohibited table will experience a failure in the bootstrapping phase.
Current applications may assume that the characters allowed in host
names will always be the same as they are in RFC 1034 [16] and
RFC 1035 [17]. The NamePrep [3] infrastructure vastly increases the
number of characters available in host names. Every program that uses
"special" characters in conjunction with host names may be vulnerable
to attack based on the new characters allowed by the NamePrep [3]
specification.
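
As an illustration of the assumption such applications make, the
following sketch checks whether a host name uses only the preferred
name syntax of RFC 1034 (letters, digits, and hyphens). The function
names are hypothetical; a program that relies on a check like this
will reject, or mishandle, names that contain the additional
characters.

```python
import re

# Preferred name syntax from RFC 1034: a letter first, a letter or
# digit last, letters/digits/hyphens in between, at most 63 characters.
_LDH_LABEL = re.compile(r"[A-Za-z](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label):
    """Return True if one label matches the RFC 1034 preferred syntax."""
    return _LDH_LABEL.fullmatch(label) is not None


def is_ldh_hostname(name):
    """Return True if every label of the host name is an LDH label."""
    if name.endswith("."):
        name = name[:-1]  # tolerate a single trailing root dot
    return all(is_ldh_label(label) for label in name.split("."))
```

An ACE-encoded label passes this check because it consists only of
letters, digits, and hyphens, whereas the original non-ASCII form of
the same name does not.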
5 Other Considerations for Appendix B
5.1 Other scripts requirements
Other scripts (e.g., Arabic and Hebrew) may have the same
issues as described in the subsections of section 3. Appendix B
may include, but is not limited to, Han code points. To expedite IDN
deployment, the "Go fast and prohibit only the code points you
understand" model is recommended; thus Appendix B encompasses only
the major Han code points in this version.
However, Appendix B can be extended if users of other scripts
propose additional code points.
5.2 Issues for prohibiting Han code points
The Han code points are used in many countries and territories, such
as Japan, Korea, China, Taiwan, Hong Kong, Macao, and Singapore.
In addition to Han code points, Kana is used in Japan and Hangul is
used in Korea.
This proposal will temporarily prevent users, especially in the above
areas, from using CDN in the bootstrapping phase. CDN service can only
be activated in the mature phase. The proposal will therefore delay
CDN services; on the other hand, it creates a good
opportunity to pursue a more complete CDN solution.
6. Acknowledgement:
Many people from the JET (Joint Engineering Team), CDNC (Chinese
Domain Name Consortium) and IETF IDN Working Group contributed ideas
that went into this document, including:
Paul Hoffman
John Klensin
Fred Baker
Vincent Chen
Hua Lin Qian
Yang Woo Ko
Yoshiro Yoneya
Kazunori Konishi
Ching Chun Hsieh
Scott Bradner
7. Author Contact Information:
Li Ming Tseng, Prof
National Central University, TWNIC
Email: tsenglm@cc.ncu.edu.tw
Tel: +886-3-490-4421
Jan Ming Ho, Prof
Academia Sinica, TWNIC
Email: hoho@iis.sinica.edu.tw
Tel: +886-2-2788-3799 x 1803
Kenny Huang
AsiaInfra, Academia Sinica, TWNIC
Email: huangk@alum.sinica.edu
Tel: +886-2-2658-6510
8. References:
[1] A Complete Set of Simplified Chinese Characters, published
in 1986 by the Committee of National Language and Chinese
Character of China.
[2] Dictionary of Chinese Character Variants, compiled by Mandarin
Promotion Council of Taiwan. Version 2 was published in Aug 2001
on the Web site http://140.111.1.40/
[3] Paul Hoffman, Marc Blanchet, "Stringprep Profile for
Internationalized Host Names", 2002-Jan-09,
draft-ietf-idn-nameprep-07.txt
[4] Patrik Faltstrom, Paul Hoffman, "Internationalizing Domain Names
In Applications (IDNA)", 2002-Jan-07, draft-ietf-idn-idna-06.txt
[5] Adam Costello, "Punycode version 0.3.3", 2002-Jan-06,
draft-ietf-idn-punycode-00
[6] The Unicode Consortium, "The Unicode Standard",
http://www.unicode.org/unicode/standard/standard.html.
[7] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.
[8] ISO/IEC 10646-1:2000(E). International Standard - Information
technology -- Universal Multiple-Octet Coded Character Set (UCS)
[9] H. Alvestrand, "IETF Policy on Character Sets and Languages",
1998-Jan, RFC 2277
[10] F. Yergeau, "UTF-8, a transformation format of ISO 10646",
1998-Jan, RFC 2279
[11] P. Vixie, "Extension Mechanisms for DNS (EDNS0)",1999-Aug,
RFC 2671
[12] CJKV Information Processing, ISBN 1-56592-224-7
[13] Unicode Normalization Forms, Mark Davis and Martin Duerst,
Unicode Technical Report 15 [UTR15].
[14] Case Mappings, Mark Davis, Unicode Technical Report 21 [UTR21].
[15] John C. Klensin, "A Search-based access model for the DNS",
2001-Nov-16, draft-klensin-dns-search-02d.txt
[16] Paul Mockapetris, "Domain names - concepts and facilities",
1987-Nov, RFC1034
[17] Paul Mockapetris, "Domain names - implementation and
specification", 1987-Nov, RFC1035
Appendix A CDN Requirements:
The original list of CDN requirements was derived from the
consensus of the 7th JET meeting, held on Nov 19th, 2001 in Beijing.
The requirements for traditional and simplified Chinese domain names
include:
(1) Traditional/Simplified CDN solution MUST be consistent for all
CDN users, including but not limited to end users and
administrators.
(2) The need to do multiple registrations and delegation for an
equivalent CDN MUST be minimized. There MUST be only one
registration for equivalent S-CDN. The delegation(s) for an
equivalent CDN MUST be consistent.
(3) Equivalent S-CDN MUST be treated as equivalent in IDN comparison.
(4) There SHOULD be a consistent mechanism to validate CDN. The
validation algorithm of CDN MAY be revised.
(5) Applications that support CDN MAY display the equivalent S-CDN
to users depending on the priority order of user preference
followed by default original form and then lastly ACE fallback.
(6) Implementation of IDN that supports CDN MUST preserve the
original form of CDN.
(7) IDN requirements MUST accommodate CDN user requirements.
Appendix B. Extended Prohibited Code Point List
----- Start Extended Prohibited Table -----
4E00-9FAF
3400-4DBF
----- End Extended Prohibited Table -----
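
An implementation might load the table above directly from its text
form. The following is a minimal sketch, assuming each data line is a
hexadecimal range START-END (or a single code point) and that lines
beginning with "-----" are markers; the function name
parse_prohibited_table is hypothetical.

```python
TABLE_TEXT = """\
----- Start Extended Prohibited Table -----
4E00-9FAF
3400-4DBF
----- End Extended Prohibited Table -----
"""


def parse_prohibited_table(text):
    """Parse 'START-END' hex lines into sorted (start, end) int pairs."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-----"):
            continue  # skip blank lines and the start/end markers
        start, _, end = line.partition("-")
        # A line without '-' is treated as a single code point.
        ranges.append((int(start, 16), int(end or start, 16)))
    return sorted(ranges)


RANGES = parse_prohibited_table(TABLE_TEXT)
# Total number of code points covered by the inclusive ranges.
TOTAL = sum(end - start + 1 for start, end in RANGES)
```

Keeping the table as data rather than as hard-coded logic means that
extending or emptying Appendix B in a later version requires no
change to the matching code.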