Re: Layer 2 and "idn identities" (was: Re: [idn] what are the IDN identifiers?)

To: klensin@jck.com
Subject: Re: Layer 2 and "idn identities" (was: Re: [idn] what are the IDN identifiers?)
From: liana Ye <liana.ydisg@juno.com>
Date: Sat, 1 Dec 2001 16:02:03 -0600
Cc: klensin@jck.com, idn@ops.ietf.org


On Fri, 30 Nov 2001 16:48:27 -0600 liana Ye <liana.ydisg@juno.com>
writes:
> I like the way you have summarized, and it is easier 
> for me to address the real issue, and have a chance to 
> post my thinking.
> 
> For the following issue in your post:
> 
> One way to look at the above is that the DNS just doesn't have
> enough information available during matching.  The matching
> algorithms don't have access to language information, country
> information, or other things than could be used to sort out
> similarities and variants.  And the DNS does exact matches -- no
> ambiguities permitted.  If the needed information isn't there,
> no matching tricks or "preparation" is going to help -- there is
> no place in the DNS or either magic of "do what I mean"
> capabilities either.
> 
> Discussion:
>   The country information has been in TLD already, it can be 
> addressed in Layer 3. 
>   The language information is not coded in anyway except local
> standard - that is the input processor.  
>  The script information is implied by code blocks from UCS. 
> 
> There is no way to put country information back in DNS label.
> There is a way to put language tag onto a label by using 
> zh--china.com or mo--mongolian.com.  
> There is a way to extract script imformation from UCS block #
> by UCS codepoint itself.
> 
> Problems:
>   Language infor is different from script infor.  Script infor can 
> not
> separate C,J,Ks with UCS codepoints, the only way to separate
> them is stick with language infor and combine with codepoints 
> to tell the difference. 
> 
>  For example, without input language information as UCS codepoints: 
>   kana+CJK is Japanese, using Japanese rules;
>   Hangul+CJK is Korean, using Korean rules;
>   CJK only defaults to Chinese rules and subject to TC/SC 
> equivalence
> examination and label comparison. 
> 
> With input buffer protocol, the language infor. is easy and can be 
> saved.  
>  
> If we are agree with this part, then I can continue.  Because this 
> is the language tag I am proposing.  The tag can be saved as 
> zh--china.com mo--mongolian.com to going into DNS for 
> comparison.
> 

Since nobody has disputed this, I take it that we agree with
the above discussion.   I'd like to refer to my I-D
draft-liana-idn-map-00.txt for more discussion in this direction.

Liana

Internet Draft                                     Liana Ye
draft-Liana-idn-map-00.txt                          Y&D ISG
Sept. 11, 2001
Expires in six months (Mar. 2002)                         
		   
	 IDN Code Exchange Mapping Structure 
		
Status of this memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed
	 at http://www.ietf.org/shadow.html.

Abstract

The client side of IDN [IDN] has to accommodate users of different scripts,
with many existing national and international standards and different
clients and local servers. The server side of IDN is the proven, stable,
US-ASCII-only DNS system. This document describes IDN-map, a tabulated
exchange structure for national standards that is based on the
international Unicode standard.

Contents
1. Introduction
2. IDN Standards Code Exchange Table
  2.1 Structure of IDN Code Exchange Table
  2.2 Access of IDN Code Exchange Table
3. Version control and Language tags of IDN Code Exchange Table
  3.1 Language Tags
  3.2 Language Tag File Format
  3.3 Identification of a Tag of an Input String
4. Interface with IDN Code Exchange Map
  4.1 Language Specific Modules
  4.2 Script Specific Canonicalization 
  4.3 Language Specific Normalization and Presentation
  4.4 Language Tagged IDN Label Conversions
  4.5 Uniform Idn-label Protocol 
5. Preferred Embodiments of IDN Code Exchange Map

1. Introduction

  For years, users ranging from international travelers, to middle school
students on the Tibetan Plateau, to librarians in Washington D.C. have
wished for direct access to the Internet from their familiar desktops in
their native languages, and the Internet community has been trying to bring
that service to users in many locations around the world. Some servers have
successfully demonstrated the concept for such a service. For example,
http://www.3721.com provides Pinyin [Pinyin] based mnemonic registration
for Chinese users and allows them to click through from a Chinese URL [URL]
window on their screen. This document suggests a client-side structure,
supported by cooperating servers, to provide such direct and speedy
universal URL access for all users on the Internet.

 1.1 Context

Symbols of natural languages are open sets for CJK [CJK] as well as for
English [ALPHBET]. For example, Chinese continuously discovers characters,
"Zi", to add to its character set, which already exceeds 100,000
characters. In the United States, many European symbols appear in American
names, which makes the symbol set larger than the original 26 letters of
English. Combinations of symbols are called "words" in English, "ci" in
Chinese, and "strings" in terms of domain names. This document focuses on a
mapping structure for symbols, called IDN-map. Symbols are referred to by
their UCS [UCS] "Code Points", and IDN-map specifies the relationships among
the various national symbol standards in terms of code points, to support
accurate, speedy combination of symbols for Internet domain name
identification.

Because the UCS character set is multi-script and serves multi-language
users, IDN-map has to address, besides the issue of equally speedy access,
three additional issues that follow from the nature of an open symbol set.
The first issue is allowing more mixed-script use once there is enough
experience in dealing with existing mixed-script use. The second
issue is allowing new symbols to be added to the table in the future.
The third issue is letting deprecated local standards drop out without
affecting the international structure and IDN-map's life expectancy.

IDN-map needs two key mechanisms to accommodate the above issues in addition
to the current [nameprep] proposal. The first key mechanism is a traffic
signal, called the "Language Tag" [RFC 3066], since users speak different
languages as defined in [ISO 639]. These languages are expressed
with symbols specified in UCS [ISO 10646], as well as ASCII [ASCII], GB [GB],
BIG5 [BIG5], JIS [JIS], KSC [KSC], and ISCII [ISCII]. The users dictate which
symbols are used and from where in the UCS they come. Legitimate use
exhibits very high locality, and the region used is here called the
"Script Range" of a specific language tag.  A script range may include more
than one UCS code block, which permits the deployment of IDN in multiple
stages and allows a script range to be expanded in the future for
mixed-script use.

The second key mechanism of IDN-map is a two-level symbol switching
mechanism, called language-tagged ASCII-compatible character encoding, or
T-ACE for short. The T, for the tag part, is the switch between different
spoken languages, which may imply various national and international
standards including ASCII. The ACE part is the switch among symbols within
the same script range. The ACE part of the switch is a massive one for the
Chinese tag: it ranges from 2,000 symbols for student readers of "People's
Daily" to 50,000 and above for a librarian, with many variants in between,
not including Japanese, Korean and other spoken languages. To provide a
switching system for such varied use of symbols, each switch in the system
needs to be labeled for a human. It needs to be a mnemonic switch, and it
needs to be scalable for different user groups too. The proposed ACE is a
mnemonic encoding scheme called StepCode [StepCode]. With T-ACE in a
multiple-standard tabulation, simple uniform keyboard control of a domain
name identifier becomes possible.

 1.2 Author's Disclaimer

The author is not associated in any way, either as a member or as a
consultant, with any of the above-mentioned standards or standards bodies,
or with any other commercially operated entities, and cannot be responsible
for any consequence arising from either the inclusion or the exclusion of
any names mentioned herein.

 1.3 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
and "MAY" in this document are to be interpreted as described in
[RFC2119].

Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 names. For example, the letter
"a" may be represented as either "U+0061" or "LATIN SMALL LETTER A".
Examples also use octet notation from national code exchange standards
to represent a Unicode character, such as "5167".

 1.4 IDN summary

IDN-Map is a basic international code exchange table to support 
interoperability across various existing clients and local servers 
on the Internet. It accommodates existing user requirements, engineering 
feasibility, DNS stability and security, and provides a bridge from 
existing user platforms to new applications based on the table of Unicode 
standard. 

2. IDN Standards Code Exchange Table

The character set in UCS is a superset of many national code exchange
standards and also contains many symbols outside those standards. A vast
number of existing applications built on such national code exchange
standards are highly crafted to serve large groups of language-specific
users [UNAME]. While these existing local standards are not compatible with
each other, they are compatible with ASCII: any of their symbols can be
expressed as an alphanumeric string of ASCII characters. Through such an
alphanumeric string, a mapping between a symbol in a local standard and a
code point in UCS is easily achievable.

2.1 Structure of IDN Code Exchange Table

Due to the IDN name preparation requirement [IDN req], many of the symbols
used in common names need to be normalized and canonicalized [nameprep]
before they can be used as IDN identifiers. Thus the IDN Code Exchange Table
has two columns to satisfy this primary requirement, and a third column
holds the corresponding T-ACE identifier of each UCS IDN identifier for the
primary language users of those identifiers. The three columns are called
Unicode-full-section, Unicode-primary-fold, and ACE-primary-tagged, or
U-s, U-p, and A-p for short, as in the following example:

U-s	U-p	A-p
U+0041  U+0061	a      (Latin Letter A case folding)
U+2fc2  U+2ee5	yv2    (Han character fish for Chinese case folding)

The three columns define a primary IDN code exchange table, referred to
as the "IDN Primary Map" hereafter. When users of more than one spoken
language share the same UCS codepoints, one or more secondary languages
are added to the primary map. For example, if the Japanese Kanji "Fish",
which corresponds to the same UCS code point U+2fc2, is added to the above
map, then:

U-s     U-p     A-p  U-j     A-j
U+0041  U+0061	a                  (Latin Letter A case folding)
U+2fc2  U+2ee5	yv2  U+2fc2  uo    (Han character fish)

The U-j column could equally be U-k, for Unicode tagged as Korean, in which
case a Hangul code point may appear there. Alternatively, Korean can be
added to the secondary map as two additional columns.
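
As a minimal sketch of the example rows above, the primary and secondary
columns could be held in a table keyed by the U-s code point. The Python
names used here (IDN_MAP, lookup) are illustrative assumptions and are not
defined by this draft; only the two example rows come from the text.

```python
# Hypothetical in-memory form of the two example rows above.
# Keys are U-s code points; a column that is empty in the table is absent.
IDN_MAP = {
    0x0041: {"U-p": 0x0061, "A-p": "a"},
    0x2FC2: {"U-p": 0x2EE5, "A-p": "yv2", "U-j": 0x2FC2, "A-j": "uo"},
}


def lookup(code_point, column):
    """Return the value in `column` for a U-s code point, or None."""
    row = IDN_MAP.get(code_point)
    return None if row is None else row.get(column)
```

For instance, lookup(0x2FC2, "A-j") gives the Japanese-tagged ACE value of
the second row, while the same call for U+0041 finds no entry.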

2.1.1 IDN-Map that Never Shrinks

It is REQUIRED that an IDN Primary Map contain a column of all permitted
symbols used in IDN names, sorted by UCS code point; this column is called
the "UCS input codepoint".  It is also REQUIRED that an IDN Primary Map
contain a column of corresponding IDN identifier symbols, called
UCS-folded codepoints, and a column of corresponding ASCII symbols permitted
by [STD13] for use in DNS identifiers, called DNS-codepoints. Data items
in an IDN Primary Map MUST NOT be removed and MUST NOT be altered in any way
once the map is deployed.

It is REQUIRED that after a secondary language is added to an IDN primary
map, the items in such an addition MUST NOT be removed and MUST NOT be
altered in any way once it is deployed. The additional columns of a
secondary language are called an IDN secondary map, and each item in a
secondary map MUST correspond to its primary map entry through the
associated UCS input codepoint.
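
The "never shrinks" rule can be illustrated with a small append-only
container. This is only a sketch under assumed names (AppendOnlyMap, add,
add_columns, get); the draft does not prescribe any implementation.

```python
class AppendOnlyMap:
    """Rows and columns may be added; deployed data is never changed."""

    def __init__(self):
        self._rows = {}

    def add(self, code_point, row):
        # A deployed primary row must not be replaced.
        if code_point in self._rows:
            raise ValueError(f"U+{code_point:04X} is already deployed")
        self._rows[code_point] = dict(row)

    def add_columns(self, code_point, columns):
        # Secondary-language columns attach to an existing primary row,
        # but must not overwrite any column that is already deployed.
        row = self._rows[code_point]
        clash = set(row) & set(columns)
        if clash:
            raise ValueError(f"columns already deployed: {sorted(clash)}")
        row.update(columns)

    def get(self, code_point):
        # Return a copy so callers cannot alter deployed data.
        row = self._rows.get(code_point)
        return None if row is None else dict(row)
```

There is deliberately no method for removing or altering an entry.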

2.1.2 Equivalent Symbol Set Mapping

Equivalent symbol sets within a script are common, and it is important to
identify such equivalence in the context of IDN identifiers, where the same
entity may be named with semantically equivalent symbols, especially since
IDN provides far more potential for using symbols from mixed scripts.
IDN-map is a convenient vehicle for carrying equivalent symbol sets by
providing more reference columns to the IDN Primary Map, called the Equites
Map and abbreviated U-e, as follows:

U-e     U-s     U-p     A-p  U-j     A-j
U+0410  U+0041  U+0061	a                  (Latin Letter A case folding)
U+????  U+2fc2  U+2ee5	yv2  U+2fc2  uo    (Han character fish)

or in IDN Primary Map format:
U-s   U-p    A-p
a      a     a
a'     a     a
a"     a     a

Access support to Equites Map is NOT RECOMMENDED for applications discussed 
in this document, since the focus here is for the ease of the largest 
common denominator code exchange.

2.2 Access of IDN Code Exchange Table

Several access methods can be supported with the IDN code exchange map.
They are universal access and local access, where local access MAY be
deprecated in the future once universal access becomes direct global access
for everyone in a particular local area.

2.2.1 Universal Access

The IDN Primary Map offers two types of access: 1) input of Unicode through
a screen selection or URL buffer, returning a DNS codepoint in favor of the
primary language users, which is called "idn-umap"; 2) access through a DNS
codepoint to retrieve its corresponding UCS codepoint for display.

IDN maps sorted by the codepoints in a particular column are called IDN
access maps. The map accessed through primary DNS-compatible codes is called
the IDN Primary Access Map, and its subsets are called IDN Tagged Primary
Access Maps. For example, the UCS CJK section of the IDN Primary Access Map
is called the IDN Chinese Primary Access Map, or "idn-zh-pmap" for short.
It is REQUIRED that IDN Tagged Primary Access Maps do NOT overlap with each
other in terms of UCS codepoints.
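
The two directions of universal access can be sketched with two lookup
tables built from the same rows. The names UMAP, PMAP, to_dns and
to_display are illustrative assumptions, and the choice of the folded code
point (U-p) as the one returned for display is an assumption as well. A
dictionary stands in here for a map sorted on the access column.

```python
# Example rows from Section 2.1: (U-s, U-p, A-p).
ROWS = [
    (0x0041, 0x0061, "a"),
    (0x2FC2, 0x2EE5, "yv2"),
]

# idn-umap direction: UCS input code point -> DNS code point (A-p).
UMAP = {u_s: a_p for u_s, _u_p, a_p in ROWS}

# Primary access map direction: DNS code point -> UCS code point.
# Assumption: the folded code point (U-p) is the one displayed.
PMAP = {a_p: u_p for _u_s, u_p, a_p in ROWS}


def to_dns(code_points):
    """Map a list of UCS input code points to DNS code points."""
    return [UMAP[cp] for cp in code_points]


def to_display(dns_codes):
    """Map a list of DNS code points back to a displayable string."""
    return "".join(chr(PMAP[code]) for code in dns_codes)
```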

There is also the potential for over-fragmenting the IDN Primary Map,
causing unnecessary processing overhead in machine time and user
frustration.  Reasonable studies are REQUIRED in defining Primary Access
Maps so that different language groups can use the same Primary Access
Maps and the Primary Access Maps are not fragmented into excessively
small maps.

The DNS codepoint access map for a secondary language user is called IDN
Tagged Secondary Universal Access Map. Thus a Korean universal access map 
is named as "idn-kr-amap".

IDN Universal Access Maps MUST be updated when IDN primary map is updated.

2.2.2 Local Access

Many existing local display standards supply the basic code points in
client systems and local server systems. They are limited to a highly
efficient set of operations for end users as well as processes for local
servers.  To give end users speedy IDN access as well as compatibility with
existing applications, it is RECOMMENDED that an IDN code exchange table
include the applicable local display standards corresponding to each
applicable codepoint in UCS. Taking the example from Section 2.1:

U-s     U-p     A-p  U-j     A-j
U+0041  U+0061	a                  (Latin Letter A case folding)
U+2fc2  U+2ee5	yv2  U+2fc2  uo    (Han character fish)

after including local code standards, it becomes:

0       2-1     2    2+1   2+2   6-1     6    6+1   (Column number)
U-s     U-p     A-p  G-p   B-p   U-j     A-j  J-j   (Column header)
U+0041  U+0061	a                                  (Case folding)
U+2fc2  U+2ee5	yv2  5167  b3bd  U+2fc2  uo   ???  (Han character fish)

Where G-p: GB standard in primary language of codepoint U+2fc2
      B-p: Big5 standard in primary language of codepoint U+2fc2
      J-j: JIS standard in Japanese language of codepoint U+2fc2 

The column numbers in the first row are identified with a language tag, as
discussed in Section 3.1.  The columns whose numbers contain "+" are local
access maps. They are called idn-zh-lmap-gb, idn-zh-lmap-b5 and
idn-ja-lmap-ji respectively, and each column number is an offset index from
its tagged ACE column number.
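
The offset arithmetic on column numbers can be sketched as follows. The
list COLUMNS simply restates the example header row; the function name
column_at is an illustrative assumption.

```python
# Column order from the example header row; list index = column number,
# so "2-1" is index 1, "2+1" is index 3, "6+1" is index 7, and so on.
COLUMNS = ["U-s", "U-p", "A-p", "G-p", "B-p", "U-j", "A-j", "J-j"]


def column_at(tace_column, offset=0):
    """Name of the column `offset` places from a tag's T-ACE column."""
    index = tace_column + offset
    if not 0 <= index < len(COLUMNS):
        raise IndexError(f"no column {tace_column}{offset:+d}")
    return COLUMNS[index]
```

With the T-ACE column 2 of the primary tag, offsets +1 and +2 select the GB
and Big5 local access columns, and offset -1 selects the folded Unicode
column.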

It is RECOMMENDED that when a local display code standard is no longer
used, for any legitimate reason, it MAY be deprecated from the IDN code
exchange table, and any new application based on the IDN-map SHOULD NOT
depend on local access maps.

2.2.3 Summary of IDN Maps

A list of IDN maps using the Column header in example in section 2.2.2, 
where (S) indicates the sorted column with the map naming:

Full maps:
         0       2-1  2    2+1  2+2    6-1  6    6+1   (Column number)
idn-umap U-s(S)  U-p  A-p  G-p  B-p    U-j  A-j  J-j   (UCS Map)

          0-3   0-2   0-1   0           2         6    (Column number)
idn-emap  U-e"  U-e'  U-e  U-s(S)  U-p  A-p  U-j  A-j  (Equites Map)
 

Tagged section maps:
idn-la-pmap U-s	 U-p  A-p(S)                          (Latin section)
idn-zh-pmap U-s	 U-p  A-p(S)  G-p  B-p  U-j  A-j  J-j (Chinese CJK section)
idn-ja-pmap U-s	 U-p  A-p(S)  G-p  B-p  U-j  A-j  J-j (Japanese Kana section)
...
idn-ja-amap U-s	 U-p  A-p  G-p  B-p  U-j  A-j(S)  J-j (Japanese CJK section)
...


Local access maps:
idn-zh-lmap-gb U-s U-p  A-p  G-p(S)  B-p  U-j  A-j  J-j (Chinese GB access)
idn-zh-lmap-b5 U-s U-p  A-p  G-p  B-p(S)  U-j  A-j  J-j (Chinese BIG5 access)
idn-ja-lmap-ji U-s U-p  A-p  G-p  B-p  U-j  A-j  J-j(S) (Japanese JIS access)
...
  
            
2.2.4 Syntax of IDN Maps

The syntax of IDN maps MUST conform in full to the definition specified in
Section 3 of [Version]. In addition, a third field of the values is
specified as the language-tagged, [STD13]-conforming IDN names, or DNS
identifiers.

It is further specified that if any field in a line is empty within a given
language-tagged code block, the field separator ";" MUST still be used to
maintain the alignment of data fields.

It is REQUIRED that each line of the IDN-map be treated in its entirety in
sorting, and its columns MUST be consistent with the column numbers
specified in its full map, idn-umap.

A separate text file, proposed to be named "idntag-xy.txt", specifies
the particular Unicode blocks applicable to a particular language tag
and its data fields or column number definitions. The IDNTAG file is
discussed further in the next section.

3. Version control and Language tags of IDN Code Exchange Table

The UCS character set is an open set: updates may admit new scripts as well
as new individual characters. Certain subsets may also need a longer
preparation time before deployment, and user demand for mixed-script use
may increase in the future. The language tag defined by [ISO639][RFC 3066]
MUST be used as a flag for 1) defining a language group that is ready to be
served, as opposed to an unspecified language group such as mathematical
"language", 2) the script range, in terms of Unicode blocks, that is ready
to be served, and 3) finding the mnemonic ACE corresponding to a UCS
codepoint and vice versa.

3.1 Language Tags

A language tag is defined by [ISO639-2/T] and [RFC 3066], and it MUST be
prepended to a DNS name label and followed by a hyphen "-", in the form
"xx-".  A tag MUST have at least one non-empty Unicode block, R1, as its
associated script range, defined by a triple (start-point, end-point,
column# of T-ACE in IDN-Map), such as (0001, ffff, n), where
start-point <= R1 <= end-point in Unicode code points, and column# MUST be
a positive integer, n, where n-1 is the tagged Unicode folded column, and
n+1, n+2, ..., m are the column numbers of the local display standards of
the language tag.

The first code block of a script range is the primary range of a language
tag. It is REQUIRED that the primary ranges of language tags do not
overlap, so that error checking is feasible and T-ACE values are assigned
consistently. It is also RECOMMENDED to test for operational complexity
before increasing the number of blocks associated with a tag, that is,
before expanding its script range. It is REQUIRED to register a language
tag and its associated script range with IANA whenever it is modified. The
repertoire of registered tags and their script ranges is called the IDNTAG
file hereafter.
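
The script range triple and the non-overlap requirement on primary ranges
can be sketched as below. The type ScriptRange, the function
overlapping_primaries, and any block boundaries passed to them are
illustrative assumptions, not registered values.

```python
from itertools import combinations
from typing import NamedTuple


class ScriptRange(NamedTuple):
    start: int        # first code point of the block
    end: int          # last code point of the block (inclusive)
    tace_column: int  # column number of the T-ACE in the IDN map

    def contains(self, code_point):
        return self.start <= code_point <= self.end


def overlapping_primaries(tags):
    """Return pairs of tag names whose primary (first) ranges overlap.

    `tags` maps a tag name to its list of ScriptRange triples.
    """
    clashes = []
    for (a, ranges_a), (b, ranges_b) in combinations(tags.items(), 2):
        pa, pb = ranges_a[0], ranges_b[0]
        if pa.start <= pb.end and pb.start <= pa.end:
            clashes.append((a, b))
    return clashes
```

Only the first triple of each tag is compared, so a later block of one tag
may still share code points with another tag's primary range.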

3.2 Language Tag File Format

IDNTAG file has a consistent format specified in [Version] Section 3, 
that is:

  one language tag per line
  lines separated by CR/LF
  each field in the line separated by ";"
  each subfield in the line separated by ","
  the third subfield of the first triple field in a line is a constant for
    all primary language tags, for ease of maintenance.

such that the IDNTAG file takes on the form:
tag-name; version#; block-1; block-2; block-3;...

where each block has three subfields, specifying the starting and ending
codepoints of a block in Unicode hexadecimal form, and an integer giving the
number of the T-ACE column in the IDN map.  For example:

tag1;1.0;HHHH,HHHH,2;HHHH,HHHH,6;HHHH,HHHH,5;
tag2;1.0;HHHH,HHHH,2;HHHH,HHHH,5;
...
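
A line in this format could be parsed roughly as follows. The function
name parse_idntag_line is an illustrative assumption, and any concrete
block values fed to it are hypothetical, since the example above gives only
HHHH placeholders.

```python
def parse_idntag_line(line):
    """Split one IDNTAG line into (tag, version, [(start, end, column)])."""
    # Fields are separated by ";"; the trailing ";" leaves an empty field
    # that is dropped here, and the CR/LF line ending is stripped.
    fields = [f.strip() for f in line.strip().split(";") if f.strip()]
    tag, version, *blocks = fields
    ranges = []
    for block in blocks:
        # Subfields are separated by ","; code points are hexadecimal.
        start, end, column = block.split(",")
        ranges.append((int(start, 16), int(end, 16), int(column)))
    return tag, version, ranges
```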
 
3.3 Identification of a Tag of an Input String

An IDN address in URL format may be in any mix of scripts, but all the
characters of an IDN label MUST be in the script range of one
language tag. This conformity ensures correct treatment of an IDN label by
any URL parser and minimizes confusable codepoints among different
scripts. Using mixed scripts in one IDN label is NOT RECOMMENDED for
early deployment of IDN.

3.3.1 IDN Tag File Interface
 
An IDN label can be an arbitrary byte stream in any display code standard
permitted by IDN-Map ([ISO10646] and others to be decided). The interface
to the IDNTAG file takes four parameters and is defined as:
   
   stat = find-tag(input, tag-file, input-std, tag-rec)
	
where find-tag MUST have four parameters: 
input: a string as a byte stream in the input standard,
input-std: one of the code exchange standards permitted in IDN-map, including
      (UCS, USASCII, GB, BIG5, JIS, KSC, ISC ...)
tag-file: the tag definition file specified in Section 3.2.
tag-rec:  a buffer for returning triples as defined in Section 3.1.
stat: the status of the search, one of
      (ERROR, USASCII, UCS, ALPH, CONS, CJK, NO-TAG, LOC), discussed in
      Section 3.3.3.

The find-tag protocol is REQUIRED before each access to IDN-Map.

3.3.2 IDN Language Tag Identification Protocol

The above find-tag protocol is REQUIRED to include the following actions,
performed in the following order:

1) identify the tag prefix of a DNS label and return the tag's triples;
2) identify an ASCII DNS label: if it conforms to [STD13], assign
    USASCII to the tag value and return USASCII as the tag status;
3) assign a tag if the input standard has a known language tag (for
    example, input standard JIS implies language tag "ja") and return the
    tag's triples;
4) default to UCS and check for script range errors.  It is
  RECOMMENDED that at least two of the input Unicode codepoints be
  checked for more accurate tag identification. If the tag values at the
  two check points are inconsistent, the more specific value MUST be
  returned, together with the corresponding tag triples;
5) assign a language tag status to the protocol: when no applicable tag
   is found and no prohibited codepoint is encountered, a NO-TAG value
   MUST be returned.
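
The ordering of the five actions can be sketched as below. This is a
simplified illustration, not the protocol itself: the tag table TAGS, the
mapping STD_TO_TAG, and the regular expression standing in for an [STD13]
check are all assumptions; the function returns a (tag or status, triples)
pair instead of filling a tag-rec buffer; and neither the "more specific
value" rule of action 4 nor the ERROR status for prohibited code points is
modeled.

```python
import re

# Hypothetical IDNTAG contents: tag -> [(start, end, T-ACE column), ...].
TAGS = {
    "zh": [(0x4E00, 0x9FFF, 2)],
    "ja": [(0x3040, 0x30FF, 2), (0x4E00, 0x9FFF, 6)],
}
# Hypothetical mapping from an input standard to the tag it implies.
STD_TO_TAG = {"GB": "zh", "BIG5": "zh", "JIS": "ja"}
# Rough approximation of an [STD13] label: letters, digits, inner hyphens.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def find_tag(label, input_std="UCS"):
    if not label:
        raise ValueError("empty label")
    # 1) an explicit "xx-" tag prefix is identified first
    prefix, hyphen, _rest = label.partition("-")
    if hyphen and prefix in TAGS:
        return prefix, TAGS[prefix]
    # 2) a plain ASCII DNS label
    if LDH_LABEL.fullmatch(label):
        return "USASCII", []
    # 3) the input standard implies a language tag
    if input_std in STD_TO_TAG:
        tag = STD_TO_TAG[input_std]
        return tag, TAGS[tag]
    # 4) default to UCS: check two code points against each script range
    sample = [ord(ch) for ch in label[:2]]
    for tag, triples in TAGS.items():
        if all(any(s <= cp <= e for s, e, _ in triples) for cp in sample):
            return tag, triples
    # 5) no applicable tag was found
    return "NO-TAG", []
```

With this table, a label that mixes kana and a CJK character falls outside
the "zh" range and is identified as "ja", while a label of CJK characters
only is identified as "zh".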

3.3.3 IDN Language Tag Identification Status Protocol

It is RECOMMENDED that tag identification use at least two of the input
codepoints, for higher accuracy, and a two-step classification as well:
one step for the script group, the other for the script within the group.
The first step is to identify the script group.  Scripts may be
treated in three different groups: alphabet, consonant, and syllabic
or character-based systems.  The three groups are reflected by
the following code blocks in UCS, as shown below:

        Alphabet Sys.  Consonant Sys.  Character Sys.

From:	0020            0530            2e80
to:	052f            1bff            d7af

include:Latin           Armenian        CJK
        Greek           Hebrew          Kanji
        Cyrillic        Arabic          Kana
        IPA             Devanagari      Hangul
        Vietnamese      Malayalam       Yi
                        Thai
                        Lao
                        Tibetan
                        ...
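
The first classification step, by script group, follows directly from the
three ranges in the table above. In this sketch the function name
script_group is an assumption, as is returning NO-TAG for a code point
outside all three ranges.

```python
# (from, to, status) taken from the table above.
GROUPS = [
    (0x0020, 0x052F, "ALPH"),  # alphabet systems
    (0x0530, 0x1BFF, "CONS"),  # consonant systems
    (0x2E80, 0xD7AF, "CJK"),   # character systems
]


def script_group(ch):
    """Return the script group status for a single character."""
    cp = ord(ch)
    for start, end, status in GROUPS:
        if start <= cp <= end:
            return status
    return "NO-TAG"  # assumption: outside every listed range
```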

Some cultures, such as Japanese, often use two or more scripts within the
same group, but rarely use another script, especially one from a
different group. The three groups also reflect different
processing considerations.

Scripts in the Alphabet group are frequently used by speakers of different
languages, who may mix names from two or more spoken languages using the
same script.  Also, an alphabet has two semantically equivalent sets of
symbols, uppercase and lowercase letters, which can be folded under
[nameprep] canonicalization. The main treatment issue is mixed symbol use
by different language groups; for example, an Azerbaijani may wish to
switch between Latin and Cyrillic at ease.

The majority of scripts in the Consonant group are one language per script,
and many of the symbols from different scripts look alike but
have unrelated values.  However, when such a look-alike symbol appears in
its own script context, its value is unambiguous.  If the script is
correctly identified, potential symbol confusion is resolved. This group
needs more care in language tag identification than the other script
groups.

Treatment of Character-based scripts is largely a matter of the uniqueness
of the characters' indices. The issue becomes contentious if the T-ACE of
one character collides with the T-ACE of a different character.  Also, due
to the sheer number of symbols, the T-ACE index system has to be easily
mastered and sortable for fast access [StepCode]. The main issue in IDN-Map
is to identify sets of equivalent characters and reduce the number of
applicable IDN identifiers by 1) limiting the applicable IDN input code
points to Plane 0 of the Unicode table, and 2) assigning one IDN identifier
from each semantically equivalent character class suggested by [CJK],
[tsconv].

3.3.4 Summary of IDN Language Tag Status Protocol

The three major script groups have the statuses ALPH, CONS, and CJK, as
mentioned in Sections 3.3.1 and 3.3.3. It is suggested that
language tags that fall into the same script group MAY be treated with
the same language-specific normalization and presentation methods, discussed
later in Section 4.3 of this document, to reduce implementation complexity.

The IDN Language Tag status can also take the following values:

NO-TAG: Unicode input code points without a primary language tag defined;
ERROR: prohibited UCS input code points [nameprep];
LOC: code points of local standards permitted in IDN-map other than Unicode;
USASCII: [STD13]-compliant input string.

4. Interface with IDN Code Exchange Map

A uniform interface to the IDN map is specified for interoperability among
different clients and local server systems, and for feasible upgrades of
the language-specific modules associated with an individual language tag.
These language-tag-specific modules are called "language tagged procedures".

4.1  Language Specific Modules

A spoken language is expressed with specific symbols grouped into a
corresponding script, which may be scattered across different UCS blocks.
Each script has its own methods for manipulating its symbols, for
decomposing a symbol into parts, for selecting a symbol from an equivalent
symbol set, for combining symbols into a string, and for presenting a
string on a screen.  Each language has its own systematic way of treating
its script: some processes can be captured in simple procedures, others
have to be treated on an individual basis, and many variations lie in
between. It is RECOMMENDED that each language be given reasonable study to
classify its script treatment model, with a cost vs. benefit analysis for
selecting a long-term script-specific processing protocol to be embedded
in IDN language-specific modules. It is RECOMMENDED that processing speed
and simplicity of implementation take the highest priority in such a
decision.
 
Two levels of script-specific processing are supported by the IDN-Map
structure. The lowest level is the language-tagged IDN map in favor of the
primary users of a script (Section 2.1), where a simple code equivalence
from input to an IDN identifier can be assigned; this is referred to as
canonicalization in [UTR21], [tsconv], [jpchar], [hangeul]. The second level
is IDN label normalization and presentation.

4.2 Script Specific Canonicalization 

The first level, script-specific canonicalization, has been addressed in
[nameprep], [tsconv], [jpchar], [hangeul], [bidi], [UTR21], [CJK], where
a folding mechanism applied by Domain Name registration services and at the
client site, for the purpose of preventing confusing allocations of CJK
Domain Names or the like, takes much higher priority in domain name
services.

For local-server-based deployment of IDN, a partial solution for recovering
the registered codepoints MAY be achieved by specifying that the
presentation of IDN uses a prefolded form for all names. For example,
"JOES-Pizza" is folded to "joes-pizza" and recovered as "JOES-PIZZA" when
the user so desires.
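
The example above shows why this recovery is only partial: once a name is
folded, the registered mix of cases is gone, and presentation can only pick
one uniform form. A small sketch, in which Python's str.casefold stands in
for [nameprep] folding and the function names are illustrative assumptions:

```python
def fold(label):
    """Stand-in for nameprep case folding (assumption: str.casefold)."""
    return label.casefold()


def present_upper(folded):
    """Present a folded label in one uniform form, here all uppercase."""
    return folded.upper()


registered = "JOES-Pizza"
folded = fold(registered)          # the form stored for comparison
presented = present_upper(folded)  # differs from the registered form
```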

A complete recovery solution would involve a different server
transport of the original registered form; a supporting mechanism
is discussed in [UNAME] and is used in the CJK-specific procedures in
Sections 4.3.1 and 4.3.2.

Uniform interface to IDN map has one procedure with 5 parameters:

   idn-folding(input-list, input-std, tag-rec, output-std, output-list);

where 
 input-list is the normalized and error checked codepoints [bidi][UAX15],
 input-std is the code standard of the normalized input label (Sec.3.3.1),
 tag-rec is the returned tag triples from find-tag protocol(Sec.3.2),
 output-std is the requested code standard, same as input-std,
 output-list is a list of all the codepoints retrieved from IDN Map in 
    output-std;
and
 input-std and output-std are pairs of integers in the form (a,b),
where the first integer, a, is the input-std (Sec.3.3.1) and the second
integer, b, is the offset in columns from the corresponding T-ACE column
number (Sec.2.2.3).
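
A much reduced sketch of the folding lookup is given below. It is not the
five-parameter interface itself: input is assumed to be UCS code points
already, the tag record is reduced to its T-ACE column number, and the
output standard is reduced to a column offset. The table IDN_UMAP restates
the example rows of Section 2.2.2, with None for empty fields and for the
J-j value that the example leaves unspecified.

```python
# Rows keyed by U-s; list index = column number (0..7) as in Section 2.2.2.
IDN_UMAP = {
    0x0041: [0x0041, 0x0061, "a", None, None, None, None, None],
    0x2FC2: [0x2FC2, 0x2EE5, "yv2", "5167", "b3bd", 0x2FC2, "uo", None],
}


def idn_folding(input_list, tace_column, output_offset=0):
    """Look up each input code point in the requested output column."""
    column = tace_column + output_offset
    if not 0 <= column < 8:
        raise LookupError(f"no such column: {column}")
    output_list = []
    for cp in input_list:
        row = IDN_UMAP.get(cp)
        if row is None or row[column] is None:
            raise LookupError(f"U+{cp:04X} has no entry in column {column}")
        output_list.append(row[column])
    return output_list
```

Calling it with T-ACE column 2 and offset 0 yields the primary ACE values,
while offset 1 yields the GB local codes where they exist.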

4.3 Language Specific Normalization and Presentation

The second level of script-specific processing has been addressed in [IDNA],
[icdn], [UAX15], [UAX9] and [bidi], where the procedures are referred to as
normalization procedures and presentation procedures.

Normalization breaks an input string in the input code standard into a
list of UCS codepoints. Presentation combines a list of UCS codepoints
into a string in the output code standard. Presentation may join certain
symbols between UCS codepoints or reorder the UCS codepoints when rendering
them as a string.  Normalization MUST reverse all the renderings
made by its corresponding presentation procedure on a label string when it
breaks the string into a list of UCS code points. When the input is an ACE
string, the similar processes are called "fitting" and "decompose". The
relations are:

 input      Processes      output
  UCS  normalize-->fitting  ACE
                \/
                /\
  ACE  decompose-->present  UCS

For convenience, these procedures are proposed to be named with the exact 
language tag defined in the IDNTAG file, such that a language-tagged 
normalization procedure is named "idn-XY-normalize", where "XY" represents 
the language tag of the associated procedure.  Following the same 
convention, "idn-XY-present", "idn-XY-fitting" and "idn-XY-decompose" 
would be the names of the respective IDN name presentation, fitting and 
DNS name decomposition procedures. For example, "idn-zh-present" is the 
language-tagged IDN label presentation procedure for Chinese.
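
As an illustration of the naming convention, the following C sketch builds 
a procedure name from a language tag and a stage. The function 
idn_proc_name is hypothetical; only the four stage names come from this 
section.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build "idn-XY-stage" into `out`.
 * Returns 0 on success, -1 on an unknown stage or a too-small buffer. */
int idn_proc_name(const char *tag, const char *stage,
                  char *out, size_t outsz)
{
    static const char *const stages[] = {
        "normalize", "present", "fitting", "decompose"
    };
    size_t i;
    int known = 0;
    int n;

    for (i = 0; i < sizeof stages / sizeof stages[0]; i++) {
        if (strcmp(stage, stages[i]) == 0)
            known = 1;
    }
    if (!known)
        return -1;

    n = snprintf(out, outsz, "idn-%s-%s", tag, stage);
    if (n < 0 || (size_t)n >= outsz)   /* error or truncated output */
        return -1;
    return 0;
}
```

Calling idn_proc_name("zh", "present", buf, sizeof buf) fills buf with 
"idn-zh-present", the example given above.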

Two language-specific script treatment procedures are REQUIRED for each 
language tag registered: 1) Normalize and 2) Present. Two additional
T-ACE-specific script treatment procedures, 3) Fitting and 4) Decompose, 
are RECOMMENDED for non-alphabetic languages. It is also RECOMMENDED that a 
NO-TAG general compressive ACE [AMC] be registered with IANA as the 
compress and decompress procedures corresponding to the Fitting and 
Decompose procedures. It is REQUIRED that when a language tag is registered 
with IANA, the associated script-specific procedures be registered at the 
same time. 

4.3.1 Language Tagged Normalization and Input Error Checking

The find-tag interface gives the legal search range for the error checking 
and normalization process, to ensure that all the codepoints in an input 
IDN label are legal IDN codepoints, which SHOULD NOT be rejected by the IDN 
Map. The returned list of UCS codepoints MUST be checked for such errors, 
to prevent illegal IDN codepoints from slipping through and burdening the 
subsequent search in the IDN Map. The normalization protocol is:

   stat=idn-XY-normalize(input, input-std, tag-rec, input-list, err-report)

It is REQUIRED that each language-tagged normalization procedure perform:
1) check for disallowed input-std,
2) check for disallowed codepoints in its script range,
3) normalize input string to IDN-Map allowed input codepoints,
4) return input-list with one UCS codepoint per record,
5) report any errors.
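
Step 2 of the list above can be sketched in C as a range check over the 
codepoint list. This is only an illustration under a simplifying 
assumption: the legal search range given by find-tag is modelled as a 
single contiguous range, and the struct idn_range and function 
idn_first_illegal are hypothetical names, not part of tag-rec as defined 
in Sec.3.2.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the legal search range taken from tag-rec. */
struct idn_range {
    uint32_t lo;   /* first legal UCS codepoint */
    uint32_t hi;   /* last legal UCS codepoint  */
};

/* Scan the codepoint list (one UCS codepoint per record).
 * Returns -1 if every codepoint lies in the range, otherwise the index
 * of the first codepoint outside it, so the caller can report the error. */
long idn_first_illegal(const uint32_t *cps, size_t n, struct idn_range r)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (cps[i] < r.lo || cps[i] > r.hi)
            return (long)i;
    }
    return -1;
}
```

A real procedure would consult the IDN Map for the permitted codepoints; 
the single range here only shows where the check sits in the sequence.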

A similar protocol applies to T-ACE decomposition:

   stat=idn-XY-decompose(input, USASCII, tag-rec, input-list, err-report)

It is RECOMMENDED that each T-ACE decomposition procedure perform:
1) check the zonefile for a cached IDN label,
2) check the input string for non-ASCII octets, which indicate transport 
    corruption,
3) check the label length; if it has reached the maximum, request the 
    original registered IDN label from the registrar,
4) strip the language tag,
5) decompose the input string into IDN-Map permitted UCS code points,
6) return input-list with the ACE for each UCS codepoint, one per record,
7) report any errors.

4.3.2 Language Tagged Presentation and Preserving Character Boundary 

When the idn-folding protocol returns a list of output UCS codepoints, a 
presentation process checks the correctness of the output codepoints and 
combines them into a display string. If the output codepoints contain 
errors, the presentation procedure SHOULD report an error, request that 
the original IDN display codepoints be sent, and make its best effort to 
display the current IDN string. The presentation protocol is:

   stat=idn-XY-present(output-list, output-string, err-report)

It is RECOMMENDED that each language-tagged presentation procedure perform:
1) if a codepoint contains an error, request the original registered IDN 
   label from the original registrar,
2) reverse the renderings made to a string by the normalization procedure, 
3) arrange the string display order/direction,
4) concatenate output-list into the output label and return it,
5) report any errors.

A similar protocol,

   stat=idn-XY-fitting(output-list, output-name, err-report)

inserts the necessary separators for easy decomposition, and makes certain 
that the encoding fits into the limited label space of 63 octets. If the 
encoding is over the maximum label length, it SHOULD record both the input 
string and the T-ACE name in the local zonefile, and compose a DNS 
identifier from the output-list codepoints. 

It is RECOMMENDED that each T-ACE fitting procedure perform:
1) check the total code length, and truncate trailing ACE to fit into the 
   label length limit if required, 
2) when necessary, insert codepoint separators for proper decomposition, 
3) concatenate the ACE from each UCS code point into an output-name,
4) prepend the language tag to output-name,
5) report any errors.
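
Steps 1 and 4 can be sketched in C as follows. This is a minimal sketch 
under stated assumptions: the function idn_fit_label is hypothetical, the 
tag is joined with a single hyphen following the "us-" form used in 
Section 4.4.2, and an over-long result is reported as an error rather 
than truncated, leaving the truncation policy to the caller.

```c
#include <stdio.h>
#include <stddef.h>

#define IDN_MAX_LABEL 63   /* maximum DNS label length in octets */

/* Hypothetical helper: prepend `tag` and a hyphen to the already
 * concatenated ACE string. Returns 0 on success, -1 if the result does
 * not fit in `out` or exceeds the 63-octet label limit. */
int idn_fit_label(const char *tag, const char *ace,
                  char *out, size_t outsz)
{
    int n = snprintf(out, outsz, "%s-%s", tag, ace);

    if (n < 0 || (size_t)n >= outsz)   /* error or truncated output */
        return -1;
    if (n > IDN_MAX_LABEL)             /* too long for a DNS label */
        return -1;
    return 0;
}
```

When the function returns -1 because of the length limit, the procedure 
above would record the input string and the T-ACE name in the local 
zonefile before composing a shorter DNS identifier.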

4.3.3  Special Attention to Mixed Scripts
 
A string mixing CJK and Kana is Japanese; a CJK and Hangul mix is 
Korean. However, an all-CJK character string MUST be presumed to be in the
primary language tag, that is Chinese, and registered as the only IDN name, 
unless the registrant requests a second and a third language to access the 
same IDN name. In this case, there could be more than one DNS label to be 
maintained by the registrant, and the IDN-Map becomes an automatic name 
translation agency.

Tag identification of an arbitrary input string, as proposed in the 
find-tag protocol, is at best a language indicator. A more careful check 
should be made in the normalizing and error checking procedure. For 
example, the Chinese-tagged normalizing procedure, idn-zh-normalize, MUST 
check all input points to be certain that the value returned by the 
find-tag procedure is correct, and alter it when necessary. It SHOULD 
identify a CJK-Kana mix as the Japanese tag, and a CJK-Hangul mix as the 
Korean tag.
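
The mixed-script rule can be sketched in C. The sketch is illustrative 
only: the names are hypothetical, the ranges cover just the basic blocks 
(CJK Unified Ideographs U+4E00..U+9FFF, Hiragana and Katakana 
U+3040..U+30FF, Hangul Syllables U+AC00..U+D7A3), and any other 
codepoint, or a Kana and Hangul mix, is reported as undecided.

```c
#include <stddef.h>
#include <stdint.h>

/* Tags follow the zh/ja/kr spelling used in this document. */
enum idn_lang { IDN_LANG_NONE, IDN_LANG_ZH, IDN_LANG_JA, IDN_LANG_KR };

static int is_han(uint32_t c)    { return c >= 0x4E00 && c <= 0x9FFF; }
static int is_kana(uint32_t c)   { return c >= 0x3040 && c <= 0x30FF; }
static int is_hangul(uint32_t c) { return c >= 0xAC00 && c <= 0xD7A3; }

/* Classify a list of UCS codepoints by the scripts it mixes. */
enum idn_lang idn_guess_lang(const uint32_t *cps, size_t n)
{
    int han = 0, kana = 0, hangul = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (is_han(cps[i]))
            han = 1;
        else if (is_kana(cps[i]))
            kana = 1;
        else if (is_hangul(cps[i]))
            hangul = 1;
        else
            return IDN_LANG_NONE;   /* outside the ranges handled here */
    }
    if (kana && hangul)
        return IDN_LANG_NONE;       /* ambiguous mix, left undecided */
    if (kana)
        return IDN_LANG_JA;         /* CJK-Kana mix is Japanese */
    if (hangul)
        return IDN_LANG_KR;         /* CJK-Hangul mix is Korean */
    if (han)
        return IDN_LANG_ZH;         /* all-CJK presumed Chinese */
    return IDN_LANG_NONE;           /* empty input */
}
```

A string made only of Kana or only of Hangul is also classified as 
Japanese or Korean here; that is a choice of this sketch, since the text 
above only speaks of mixes.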

4.4 Language Tagged IDN Label Conversions

The primary IDN label conversions are from UCS to [STD13] and vice versa. 
Utility support for backward compatibility is also given to a limited set 
of local standards. A uniform IDN interface to applications was agreed on 
at the IETF IDN Working Group session (August 2001, London, England). The 
protocol SHOULD treat any possible input string with the same procedures, 
and divert language-specific requirements to language-tagged procedures at 
fixed points of the IDN label conversions. 

The uniform IDN interface to applications is proposed to be:

    idn-label(input, input-std, tag-file, zonefile, idn-name, output-std);

where
 input: IDN label in input-std,
 input-std: any IDN permitted code standard (Sec.3.3.1),
 tag-file: IANA distributed IDNTAG file (Sec.3.2),
 zonefile: optional local registered domain name file for servers [UNAME], 
           or cache at a client site,
 idn-name: output of converted input in requested output-std,
 output-std: requested output form in any IDN permitted code standard.

In addition a localized zonefile search procedure SHOULD be supplied if a 
zonefile is applicable. 


4.4.1 Code Conversions Supported by IDN-Map 

Idn-label protocol recognizes two code standards: UCS and ASCII by default. 
Any other permitted code standards MUST be specified as parameters. The 
code conversion direction is specified in the following matrix.

 Input-std to output-std implementation matrix:

in\out  U-i     U-p      ACE   ASCII   G      B      J
U-i      -      fold     DNS    -    disp   disp   disp
U-p    record    -       DNS    -    disp   disp   disp
ACE    record   regist   pass  pass  disp   disp   disp
G      record   fold     DNS    -     -     disp   disp
B      record   fold     DNS    -    disp    -     disp
J      record   fold     DNS    -    disp   disp    -
ASCII    -       -        -    pass   -      -      -

where U-i      UCS input
      U-p      UCS folded in primary language
      ACE      T-ACE form
      G,B,J    permitted local code standards
      record   used for registration font or trademark records
      regist   for registration conflict matching 
      fold     canonicalization case folding
      DNS      obtain DNS identifier
      pass     pass by, no process 
      disp     local client backward compatible display
      -        prohibited 
      
From observation of the matrix, it is clear that the conversion is based 
on the input code standard. If the input and output are both ASCII, then 
the output is ASCII without any further delay, which is compatible with 
current DNS operation.  
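
The implementation matrix lends itself to a table lookup. The following C 
sketch encodes the matrix above as a two-dimensional array; the enum and 
function names are hypothetical, and rows and columns are both indexed in 
the column order of the matrix (U-i, U-p, ACE, ASCII, G, B, J).

```c
enum idn_std {
    STD_UI, STD_UP, STD_ACE, STD_ASCII, STD_G, STD_B, STD_J, STD_COUNT
};

enum idn_act {
    ACT_NONE,    /* "-": prohibited                       */
    ACT_FOLD,    /* canonicalization and case folding     */
    ACT_DNS,     /* obtain DNS identifier                 */
    ACT_DISP,    /* local backward compatible display     */
    ACT_RECORD,  /* registration font or trademark record */
    ACT_REGIST,  /* registration conflict matching        */
    ACT_PASS     /* pass by, no processing                */
};

static const enum idn_act idn_matrix[STD_COUNT][STD_COUNT] = {
/* in\out     U-i         U-p         ACE       ASCII     G         B         J */
/* U-i   */ { ACT_NONE,   ACT_FOLD,   ACT_DNS,  ACT_NONE, ACT_DISP, ACT_DISP, ACT_DISP },
/* U-p   */ { ACT_RECORD, ACT_NONE,   ACT_DNS,  ACT_NONE, ACT_DISP, ACT_DISP, ACT_DISP },
/* ACE   */ { ACT_RECORD, ACT_REGIST, ACT_PASS, ACT_PASS, ACT_DISP, ACT_DISP, ACT_DISP },
/* ASCII */ { ACT_NONE,   ACT_NONE,   ACT_NONE, ACT_PASS, ACT_NONE, ACT_NONE, ACT_NONE },
/* G     */ { ACT_RECORD, ACT_FOLD,   ACT_DNS,  ACT_NONE, ACT_NONE, ACT_DISP, ACT_DISP },
/* B     */ { ACT_RECORD, ACT_FOLD,   ACT_DNS,  ACT_NONE, ACT_DISP, ACT_NONE, ACT_DISP },
/* J     */ { ACT_RECORD, ACT_FOLD,   ACT_DNS,  ACT_NONE, ACT_DISP, ACT_DISP, ACT_NONE }
};

/* Look up the action for an (input, output) pair of code standards.
 * Out-of-range values are treated as prohibited. */
enum idn_act idn_action(enum idn_std in, enum idn_std out)
{
    if ((unsigned)in >= STD_COUNT || (unsigned)out >= STD_COUNT)
        return ACT_NONE;
    return idn_matrix[in][out];
}
```

For example, idn_action(STD_ASCII, STD_ASCII) yields ACT_PASS, which is 
the ASCII pass-through case noted above.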

4.4.2 Input and Output Format Request 

Considering that the idn-label protocol may be installed on a client site, 
the input and output request specification may contain errors due to a 
variety of inconsistent site configurations. Smooth handling of such 
errors is an important part of the idn-label protocol. 

 Input-std to output-std default case matrix:

in\out  U-t     U-n      ACE     ASCII
U-t      -       -       ACE      -
U-n*     -       -       ACE      -
ACE     UCS      -        -       -
ASCII    -       -        -      pass

where 
  U-t UCS code with tag identified
  U-n UCS code with NO-TAG identified, *also any input-std error case
  ACE identified ACE format
  ASCII [STD13] with no tag, or with "us-" tag added by zone masters
  -   ignored case
  pass passby without any processing 

It is proposed that the tag "us-" be reserved, as an optional language 
tag, for a name part which consists exclusively of characters that conform 
to the hostname requirements in [STD13]. If the input is an all-ASCII 
label conforming to [STD13], with or without a prepended "us-" tag, and 
the output standard is not specified, or is specified as USASCII, then the 
input name MUST NOT be converted at all. This absolute requirement prevents:
 1) double encoding from a client of user keyboard input and a server 
    provider;
 2) messing up existing registered domain names;
 3) interfering with registered glyphs with more than one
    phonetic standard, such as Hanja and Kanji in CJK script.

If the input string consists only of characters that conform to 
the hostname requirements in [STD13], and has a prefixed language tag, 
and the output standard is NOT USASCII, the RECOMMENDED output defaults 
to UCS folded, column #1, which is the universal base support. This 
recommendation is to provide a friendly presentation when the end user is 
unaware of the configuration.

When there is no tag on a non-ASCII input string, it goes through script 
identification, prohibited-character filtering, canonicalization and 
case folding as defined in [nameprep], and is then treated with the 
find-tag process. 

If its output-std is not specified, or is specified inconsistently, then 
USASCII is assigned as the default output-std for any non-ASCII input.

All other input and output code standards MUST be explicitly specified
for any conversion request to be honoured. 
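
The defaulting rules of this section can be summarized in a small C 
decision function. It is a sketch of one reading of the text, not a 
definitive statement of the protocol: all names are hypothetical, an 
inconsistent output request is treated like an unspecified one, and two 
cases the text leaves open are marked as assumptions in the comments.

```c
enum idn_in  { IN_ASCII_PLAIN,    /* [STD13] label, no tag or "us-" tag */
               IN_ASCII_TAGGED,   /* [STD13] characters with a language tag */
               IN_NON_ASCII };
enum idn_out { OUT_UNSPEC, OUT_USASCII, OUT_UCS, OUT_OTHER };
enum idn_do  { DO_PASS,           /* no conversion at all */
               DO_UCS_FOLDED,     /* default to UCS folded, column #1 */
               DO_ACE,            /* default to USASCII (ACE) output */
               DO_EXPLICIT,       /* honour the explicit request */
               DO_REJECT };       /* prohibited combination */

enum idn_do idn_default_action(enum idn_in in, enum idn_out out)
{
    switch (in) {
    case IN_ASCII_PLAIN:
        if (out == OUT_UNSPEC || out == OUT_USASCII)
            return DO_PASS;            /* MUST NOT be converted */
        return DO_REJECT;              /* assumption: ASCII row of the
                                          Section 4.4.1 matrix is "-" */
    case IN_ASCII_TAGGED:
        if (out == OUT_USASCII)
            return DO_PASS;            /* assumption: ACE to ACE is "pass" */
        if (out == OUT_OTHER)
            return DO_EXPLICIT;
        return DO_UCS_FOLDED;          /* RECOMMENDED default */
    case IN_NON_ASCII:
        if (out == OUT_UNSPEC || out == OUT_USASCII)
            return DO_ACE;             /* USASCII is the default output */
        return DO_EXPLICIT;
    }
    return DO_REJECT;                  /* unknown input kind */
}
```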

4.5. Uniform Idn-label Protocol 

The Idn-label protocol is summarized in a C language format, with some
of the parameters and details omitted.

idn-label(input, input-std, tag-file, zonefile, idn-name, output-std)
{
    flag = find-tag(input, tag-file, input-std, tag-rec);
    tag = get-tag(tag-rec);

 /* Part 1: Name preparation, normalization and error checking */

    switch (flag)
    {
        case ERROR: return (ERROR);
        case USASCII:                              /* input ASCII */
        {
          switch (tag)
          {
            case NIL: return (idn-name = input);   /* ASCII passby */
            case US:  return (idn-name = input);   /* ASCII passby */
            case AMC:                              /* General ACE[AMC] */
                   {idn-amc-decompress;
                    return (idn-name);}            /* Finish */

            case ZH: idn-zh-decompose; break;      /* T-ACE decompose */
            case JA: idn-ja-decompose; break;
            case KR: idn-kr-decompose; break;
                ...
            default: return ("unimplemented tag ERROR");
          }
          if (output-std not permitted)            /* Output request check */
             output-std = UCS;
          break;
        }
        case NO-TAG:                               /* General UCS input */
        {
          switch (flag)
            {
                case ALPH: idn-alph-normalize; break;
                case CONS: idn-cons-normalize; break;
                case CJK:  idn-cjk-normalize;  break;
            }
          if (output-std ERROR)                    /* Output request check */
             output-std = USASCII;
          break;
        }
        default:                                   /* script range found */
        {
          switch (tag)
          {
            case ZH: idn-zh-normalize; break;
            case JA: idn-ja-normalize; break;
            case KR: idn-kr-normalize; break;
                ...
          }
          if (output-std not permitted)            /* Output request check */
             output-std = USASCII;
          break;
        }
    }
    /* Above normalizing protocol:
     stat=idn-XY-normalize(input, input-std, tag-rec, input-list, err-report)*/

    if (error(stat))                               /* Input error checked */
    {
        fprintf(stderr, "%s %s", input, err-report);
        return (ERROR);
    }

  /* Part 2: Canonicalize and Code exchange */

    idn-folding(input-list, input-std, tag, output-std, output-list);

  /* Part 3: Present and Fitting */

    switch (output-std)
    {
      case USASCII:                                /* output ACE */
      {
        if (flag == NO-TAG) { stat=idn-AMC-compress; tag=AMC; }
        switch (tag)
        {
          case ZH:
               stat=idn-zh-fitting(output-list, idn-name, err-report);
               break;
          case KR:
               stat=idn-kr-fitting(output-list, idn-name, err-report);
               break;
          case XY:
               stat=idn-XY-fitting(output-list, idn-name, err-report);
               break;
        }
        concatenate(tag, idn-name);                /* prepend tag to ACE */
        break;
      }
      case UCS:                                    /* output UCS */
      {
        switch (tag)
        {
          case AMC:
            switch (flag)
            {
                case ALPH: idn-alph-present; break;
                case CONS: idn-cons-present; break;
                case CJK:  idn-cjk-present;  break;
            }
            break;
          case KR: idn-kr-present; break;
          case JA: idn-ja-present; break;
          case XY: stat=idn-XY-present(output-list, idn-name, err-report);
                   break;
        }
        break;
      }
      case other-output: {}                        /* output other standard */
    }
}

5. Preferred Embodiment of IDN Code Exchange Map

Three applications are suggested, for client, server and general public.

5.1. Client Application

The Uniform Idn-label Protocol of Section 4.5 is one of the preferred 
embodiments of the IDN-Map discussed, providing a consistent IDN client 
interface across any language installation. Using the Idn-label interface, 
a basic URI cut and paste operation may be implemented:

URL cut and paste, then send:
  Loop for all labels
  {
	Get IDN label from URL buffer, 
	Call Idn-label, receive ACE label, 
	replace IDN label with ACE label
    until end of URL
  } 
  send URL. 
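
The loop above can be sketched in C for the host part of a URL. This is 
an illustration only: idn_convert_host and the callback type are 
hypothetical, the real Idn-label call is replaced by a stand-in that 
merely folds ASCII to lower case, and parsing the rest of the URL is left 
out.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-label converter: reads `len` octets of `label` and
 * writes a NUL-terminated result to `out`. Returns 0 on success. */
typedef int (*idn_label_fn)(const char *label, size_t len,
                            char *out, size_t outsz);

/* Stand-in for Idn-label: lower-cases ASCII and does nothing else. */
int demo_fold_label(const char *label, size_t len, char *out, size_t outsz)
{
    size_t i;

    if (len + 1 > outsz)
        return -1;
    for (i = 0; i < len; i++)
        out[i] = (char)tolower((unsigned char)label[i]);
    out[len] = '\0';
    return 0;
}

/* Convert every dot-separated label of `host` with `conv` and join the
 * results with dots. Returns 0 on success, -1 on any failure. */
int idn_convert_host(const char *host, idn_label_fn conv,
                     char *out, size_t outsz)
{
    size_t used = 0;
    const char *p = host;

    for (;;) {
        char ace[64];                       /* 63 octets plus NUL */
        size_t len = strcspn(p, ".");       /* length of this label */
        size_t alen;

        if (conv(p, len, ace, sizeof ace) != 0)
            return -1;
        alen = strlen(ace);
        if (used + alen + 2 > outsz)        /* label + '.' + NUL */
            return -1;
        memcpy(out + used, ace, alen);
        used += alen;
        if (p[len] == '\0')                 /* end of host reached */
            break;
        out[used++] = '.';
        p += len + 1;                       /* skip past the dot */
    }
    out[used] = '\0';
    return 0;
}
```

With the stand-in converter, "WWW.Example.COM" becomes "www.example.com"; 
an actual client would pass a converter that calls Idn-label and returns 
the ACE label.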

5.2 Server Application

The most important embodiment of IDN-Map is in the IDN Domain Name 
registration process, to check for name conflicts and to search trademarks,
where trademarks in Han characters are common practice. The following 
prototype demonstrates such an embodiment.

IDN registration as an example of a server application: 
1) get wish-name, 
      call Idn-label(wish-name), receive T-ACE-label.
      examine T-ACE-label, if bad go to 1).  
      send T-ACE-label for DNS match, if bad go to 1).
      good, go to 2).
2) call Idn-label(T-ACE-label), receive IDN-name.
     examine IDN-name, if bad, go to 1).
     send IDN-name for IDN match, if bad go to 1). 
     good, go to 3).
3) Register IDN-name, T-ACE-label in zonefile [UNAME].

5.3. Implications of Deployment of IDN-Map

IDN-Map is a feasible tool for many purposes. For example, a third 
application has been suggested: using the IDN-Map as a general input 
encoding exchange module that can be called from any application. If it is 
implemented, then a librarian may use a keyboard with existing input 
software to access a particular CJK character, C, in UCS Plane 0, and 
retrieve a C' from Plane 1, or C" from Plane 2.

A flexible tool always brings its drawbacks with it. On the technical side, 
more scrutiny has to be given to mapping each equivalent symbol to its
equivalent code point, and each T-ACE has to be checked for mnemonic value 
and simple logical assignment to ensure consistency and uniqueness.  It 
also introduces more policy decisions; for example, an all-CJK-character 
trademark registrant may have to register in three languages to ensure 
the legitimacy of the trademark. After all, a useful tool lets its user 
make the decisions.

6. Security Considerations

Much of the security of the Internet relies on the DNS. Thus, any
change to the characteristics of the DNS can change the security of
much of the Internet. IDN-Map makes no changes to the DNS itself.

7. Internationalization Considerations

The international code exchange table will provide convenience for much
international application development.

8. Acknowledgements

The special comments which have contributed to improve this document 
were received from Li Ming Tseng as well as many other people from the 
working group.

9. IANA Considerations

This document requires IANA action for the availability of language tags, 
and for the registration of each tag and its associated language-specific 
processing procedures.

10. References

[AMC] Adam M. Costello, "AMC-ACE-Z," draft-ietf-idn-amc-ace-z, Sept. 2001.

[Alphabet] "Repertoires of characters used to write the indigenous languages 
   of Europe", A CEN Workshop Agreement, Version 2.8, TECHNICAL REPORT, 
   Draft: 1998-12-14. http://www.egt.ie/alphabets/#1.3

[ASCII] American National Standards Institute (formerly United
   States of America Standards Institute), X3.4, 1968, "USA Code for
   Information Interchange". (ANSI X3.4-1968)

[bidi]  Martin Duerst, "Internet Identifiers and Bidirectionality", 
   draft-duerst-iri-bidi-00.txt, July 2001.

[CJK] James SENG, etc. "Han Ideograph (CJK) for Internationalized Domain 
   Names", draft-ietf-idn-cjk-01.txt, Apr 2001.

[GB] China national code exchange standard.

[hangeul] Soobok Lee and GyeongSeog Gim, "Hangeul NAMEPREP recommendation",
   draft-ietf-idn-hangeulchar,  July 2001.

[icdn] Xiang Deng and Yan Fang Wang, "The Implementation of Chinese character 
  in IDN", draft-ietf-idn-icdn-00.txt, July 2001.

[IDN] "IETF Internationalized Domain Names Working Group",
            idn@ops.ietf.org, James Seng, Marc Blanchet

[IDNA] Patrik Faltstrom and Paul Hoffman, "Internationalizing Host 
  Names In Applications", draft-ietf-idn-idna-03.txt, July 2001.  

[IDNReq] Zita Wenzel and James Seng, "Requirements of Internationalized 
   Domain Names", draft-ietf-idn-requirements, May 2001.

[ISCII] Indian Standard Code for Information Exchange

[ISO639][ISO639-2/T] ISO/IEC 639-2 2001 Codes for the Representation of 
	Names of Languages.

[ISO10646]  ISO/IEC 10646-1:2000 (note that an amendment 1 is in
            preparation), ISO/IEC 10646-2 (in preparation), plus
            corrigenda and amendments to these standards.

[JIS]  "Japanese Industrial Standards", Information Technology 
    (Terms/Code/Date elements)-99, ISBN 4-542-12976-4

[jpchar] Yoshiro Yoneya and Yasuhiro Morishita, "Japanese characters  
    in multilingual domain name labels", draft-ietf-idn-jpchar-01, 
    March 2001.

[KSC] Korean national code exchange standard.

[nameprep] Paul Hoffman and Marc Blanchet, "Preparation of 
   Internationalized Host Names", draft-ietf-idn-nameprep, July 2001.

[Pinyin] "Scheme for the Chinese Phonetic Alphabet", Shangwu Publishing
   House, 1979, United Book# 9017.810

[RFC2277]   "IETF Policy on Character Sets and Languages",
            rfc2277.txt, January 1998, H. Alvestrand.

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
	Requirement Levels", March 1997, RFC 2119.

[RFC2231] Email tag

[RFC3066] H. Alvestrand, "Tags for the Identification of Languages", 
    RFC 3066.

[STD13] Paul Mockapetris, "Domain names - implementation and
	specification", November 1987, STD 13 (RFC 1035).

[StepCode] Liana Ye, "StepCode - A Mnemonic Internationalized Domain 
   Name Encoding", draft-ietf-idn-step-01.txt

[tsconv] XiaoDong LEE, etc. "Traditional and Simplified Chinese Conversion",
   draft-ietf-idn-tsconv-00.txt, June 2001.

[UAX9] Mark Davis, "The Bidirectional Algorithm", Unicode Standard Annex #9,     
   March 2001. http://www.unicode.org/unicode/reports/tr9

[UAX15] Mark Davis and Martin Duerst. Unicode Standard Annex #15:
   Unicode Normalization Forms, Version 3.1.0.
   <http://www.unicode.org/unicode/reports/tr15/tr15-21.html>

[UCS] "Universal Multiple-Octet Coded Character Set", ISO/IEC 10646-1:1993, 
   ISBN 0-201-61633-5

[UNAME] Li Ming TSENG, etc. "Internationalized Domain Names and Unique 
   Identifiers/Names", draft-ietf-idn-uname-01.txt, Jul 2001.

[UTR21] Mark Davis. Case Mappings. Unicode Technical Report;21.
    <http://www.unicode.org/unicode/reports/tr21/>.

[UNICODE] The Unicode Consortium, "The Unicode Standard". Described at
            http://www.unicode.org/unicode/standard/versions/.

[UNICODE3] The Unicode Consortium, "The Unicode Standard -- Version
            3.0", ISBN 0-201-61633-5. Same repertoire as ISO/IEC
            10646-1:2000. Described at http://www.unicode.org/unicode/
            standard/versions/Unicode3.0.html.

[URL] Roy Fielding et al., "Uniform Resource Identifiers:
     Generic Syntax", August 1998, RFC 2396; Robert Hinden et. al, "IPv6
     Literal Addresses in URL's", December 1999, RFC 2732. 

[version] Marc Blanchet, "Handling versions of internationalized domain
     names protocols", draft-ietf-idn-version

11. Authors' Contact Information

Liana Ye
Y&D ISG
2607 Read Ave.
Belmont, CA 94002, USA.
(650) 592-7092
liana.ydisg@juno.com

