
[idn] draft-liana-idn-stone is submitted



A New Internet-Draft is available from the on-line Internet-Drafts
directories.
This draft is a work item of the Internationalized Domain Name Working
Group of the IETF.

	Title		: Establishing a Rosetta Stone of Internet
	Author(s)	: Liana Ye
	Filename	: draft-liana-idn-stone-00.txt
	Pages		: 11
	Date		: 28-Sep-01

Spoken language is part of human nature, and an IDN system that users
want is always closely tied to linguistic issues. Many Internet
engineers have wished for a way to represent the symbols of a script
consistently, recognizably, and accessibly, so that it gains the
widest acceptance around the world. For sensible, and thus consistent
and lasting, IDN domain name identifiers, these symbols have to be
indexed by their phonetics, the basic element of linguistics. A joint
effort by the Unicode Consortium, the Library of Congress, the
International Phonetic Association, and the IETF IDN working group is
necessary to codify a transliteration scheme.

Internet Draft                                       Liana Ye
draft-liana-idn-stone-00.txt                          Y&D ISG
Sept. 29, 2001
Expires in six months (March 2002)                         
		   
	    Establishing a Rosetta Stone of Internet 
		
Status of this memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed
	 at http://www.ietf.org/shadow.html.


Abstract

Spoken language is part of human nature, and an IDN system that users
want is always closely tied to linguistic issues. Many Internet
engineers have wished for a way to represent the symbols of a script
consistently, recognizably, and accessibly, so that it gains the
widest acceptance around the world. For sensible, and thus consistent
and lasting, IDN domain name identifiers, these symbols have to be
indexed by their phonetics, the basic element of linguistics. A joint
effort by the Unicode Consortium, the Library of Congress, the
International Phonetic Association, and the IETF IDN working group is
necessary to codify a transliteration scheme.

Table of Contents
1. Introduction
2. Available Resources and Issues to be Addressed
  2.1 Unicode Table
  2.2 ALA-LC Transliteration Table
  2.3 The IPA table
  2.4 CJK Romanization Tables
3. Limits in IDN identifiers
4. Scope of Transliterations

1. Introduction

Blanket treatment of Unicode is technically feasible, but it is
difficult for humans to use and hard to guard against confusion among
similar symbols of different scripts. For example, when a Bopomofo
symbol appears in a Chinese character string, it is very difficult
for a program or a human to tell that the name is not a Japanese
name. This common type of confusion would render an IDN
implementation unusable.

However, the above example becomes trivial to resolve if the IDN
program retains the name registration context, such as the input
language module; there will then be little chance of interpreting
the input as Japanese. The input context can be retained in many
formats. This document suggests a format tied to the users' spoken
language and written in the Latin alphabet and numerals: a Romanized
transliteration of non-Roman scripts, extended from the accumulated
work of libraries, dictionary publishers, and foreign language
educators.

1.1 Context

Worldwide desire to use characters other than plain ASCII in
hostnames is bubbling up and accelerating. Hostnames have become
the equivalent of business or product names for many services
on the Internet, here also referred to as trade names; for some
oriental users they are in fact nationwide trademarks. Several needs
have to be addressed: making hostnames usable by people whose native
scripts are not directly representable in ASCII; letting network
support workers diagnose URIs [RFC2396]; letting an expanded and
diverse name server network sort and manage zone files; serving the
increasing number of non-native readers, who do not use their native
scripts to refer to trade names in daily activities; and minimizing
possible security leaks when internationalized domain names are
implemented in the Domain Name System (DNS). "One aspect of the
challenge is to decide how to represent the names users want in the DNS
in a way that is clear, technically feasible, and ensures that a name
always means the same thing." The problem was raised in [RFC2825]
as users pushed the Internet community to face this rudimentary
issue.  More detailed requirements on internationalizing hostnames are
described in the IDN Working Group's requirements document [IDNReq].

1.2 Reality of Romanization

To provide one DNS symbol set for users of different languages
under the above technical and security considerations, a Romanization
process from non-Roman scripts to US-ASCII is unavoidable. Language
Romanization has been a fact around the globe since Russia
standardized Cyrillic for many eastern European languages in the
1920's, Turkey changed from Arabic to Latin script in 1928, and
China adopted Pinyin as a supplemental phonetic system for Han script
in 1958. Consistent development of transliteration schemes for
non-Roman scripts led to the joint publication of "Romanization
Tables" by the American Library Association and the Library of
Congress in 1997 [Translit 97].

In the past three decades, software implementations of this
process have extended from the user to the QWERTY keyboard, from the
keyboard to text editors of various kinds, from text editors
to mail services, and from mail services to Internet address
resolvers. To unify this fragmented Romanization practice for use
in IDN hostname identifiers, written documentation is overdue. It
should address issues as basic as the one stated by [DeFrancis 1989]:
   "The adaptation of Latin alphabet to represent a great variety
of spoken languages means of course that the value of specific 
symbols varies from language to language.  This is true both of
the European adaptations, which in most cases came about rather 
haphazardly, and of the more recent creations based on more 
carefully thought-out linguistic principles.  So it is that the 
French 'u' has a different value from that in English.  The letter
'j' represents one sound in English 'jam', another in German 'ja'.
The initial sound of English 'sure' is written 'sz', in Polish, 
Czech. The sound represented by English 'ts' is written in 'c'
in Polish, Czech, Hungarian, Serbo-Croatian, and Chinese."

Unifying symbol systems always brings some loss from the original
systems. This is especially so in the fast-growing Internet era,
when the native language of a household can be lost in only one
generation in a localized bilingual environment. To retain the
colorful heritage of the world, a means of easy reference to the
original sound system should be provided as well.

1.3 Author's Disclaimer

This document is the author's suggestion, which grew out of
discussions in the IETF IDN online discussion group. It does not
intend to question the current working goals of any group or
standards body, nor can it force any working group in any direction.

1.4 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
and "MAY" in this document are to be interpreted as described in
[RFC2119].

Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 names. For example, the letter
"a" may be represented as either "U+0061" or "LATIN SMALL LETTER A".
A non-Roman character is also denoted by its Romanized form followed
by its English equivalent word in <>, for example "zhong <heavy>".

1.5 IDN summary

Transliterating Unicode symbols into DNS-compatible identifiers for
IDN is inevitable, both for friendliness to foreign users and for
ease of zone file management. Because resources are already in place,
the additional work to simplify and reconcile some differences is
limited in scope; it would benefit Internet communication in the long
term and may benefit IDN system design immediately.
 

2. Available Resources and Issues to be Addressed

Character standardization has been a continuous effort for as long
as human civilization has had written languages. The Unicode
Consortium, the Library of Congress, the International Phonetic
Association, various national education and standardization bodies,
and dictionary publishers are all part of this effort. As the IDN
Working Group examines the available resources in search of a
feasible technical design for an easily accessible and unambiguous
internationalized domain name system, a systematic, technically
oriented study of current resources appears necessary.

2.1 Unicode Table

Unicode is an ongoing effort for the Internet era. It concentrates
on the graphic features of a character, and some issues cannot be
dealt with efficiently in the current Unicode structure [Unicode 3].
For example, the large set of traditional Han characters that have
equivalent simplified characters is not addressed by the Unicode
Consortium [Unicode].  The phenomenon is more prominent when
multiple scripts are used in the same context.  For example, the
following equivalence set for LATIN SMALL LETTER H is already a
9-to-1 mapping in the current [nameprep] specification:

0048; 0068; Case map

210B; 0068; Additional folding
210C; 0068; Additional folding
210D; 0068; Additional folding

1D407; 0068; Additional folding
1D43B; 0068; Additional folding
1D46F; 0068; Additional folding
1D4D7; 0068; Additional folding
1D573; 0068; Additional folding

The known equivalence sets, such as those addressed in [nameprep] and
the Chinese traditional and simplified character sets [SC Table86]
[Tsconv], are well studied, with standards in place to work from.
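
As an illustration only, the nine-to-one set listed above can be
reproduced with Python's standard unicodedata module. The sketch
assumes that NFKC normalization followed by lowercasing approximates
the case map and additional folding for these particular code points;
it is not the [nameprep] algorithm itself, and the name fold() is
hypothetical.

```python
import unicodedata

# The nine code points listed above that map to U+0068.
H_VARIANTS = [0x0048, 0x210B, 0x210C, 0x210D,
              0x1D407, 0x1D43B, 0x1D46F, 0x1D4D7, 0x1D573]


def fold(ch):
    """Approximate folding: compatibility-normalize, then lowercase."""
    return unicodedata.normalize("NFKC", ch).lower()


# Collect the distinct results of folding all nine variants.
folded = {fold(chr(cp)) for cp in H_VARIANTS}
print(folded)

# ARMENIAN SMALL LETTER HO (U+0570) is left alone by this folding,
# which is the situation the questions below are about.
print(fold("\u0570") == "\u0570")
```

The second print shows why a look-alike from another script is a
separate question: normalization of this kind only folds compatibility
variants of the same letter, not visually similar letters.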

A more serious question has been raised about symbols that look
alike but have unrelated semantics in the context of a domain name;
the many look-alike Chinese characters are analogous cases.
Because IETF draft documents are restricted to plain text files,
let us take simple glyphs from Armenian instead:

Armenian small letters "h", "g", "f", "o", "n", "u" can be considered
identical to Latin letters.  Should these letters be mapped to the
above Latin set or not?  If they are not mapped to Latin letters,
as in the current [nameprep] specification, what will happen when an
Armenian user picks letters from the Latin set? Does this lead to a
correct match in a compressed ASCII Compatible Encoding (ACE)?  If
there is no match, how do we know what went wrong with the ACE?

If the above Armenian letters are mapped to the Latin set, then the
"0048; 0068; Case map" set would grow to a 10-to-1 mapping.  If so,
what about the rest of the letters in the Armenian character set?
Should they be mapped to look-alike characters too?  For example,
the Armenian upper case "n", Armenian capital letter VO, looks like
  Thai Letter KO KAI,
  Lao Letter DO, and
  Georgian Letter GHAN.
If writing style variations are also considered, as they are in the
Latin "0048; 0068; Case map" example above, then the same Armenian
capital letter VO is similar to
  Bopomofo Letter M,
  Hiragana Letter RI, and
  Katakana Letter WA.
How can we sort out the original intent of a registrant?

While the IDN working group may be forced to adopt language tags
[ISO639][IDNmap] to retain the users' language context and to
separate different scripts, questions remain. Which scripts may be
mixed with which others? Is there any desire for such a mix, as is
reported for current American English in [Alphabet]? For example,
are Azerbaijani users free to choose among the Arabic, Cyrillic, and
Latin scripts [Translit 97, P.24][Mercury 2001-7-30]? Can Azerbaijani
mix the three scripts the way Japanese mixes its scripts?  If two
scripts are mixed, where should the equivalent symbol set be defined?
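
The script-mixing questions above can be made concrete with a small
sketch. It assumes, as a rough heuristic only, that the first word of
a character's Unicode name identifies its script; the function names
scripts_in() and is_mixed() are hypothetical and are not part of any
IDN specification.

```python
import unicodedata


def scripts_in(label):
    """Return the script names used by the letters of a label.

    Heuristic: the first word of a character's Unicode name is taken
    as its script (for example LATIN, ARMENIAN, BOPOMOFO, CJK).
    """
    found = set()
    for ch in label:
        if ch.isalpha():  # skip digits and hyphens
            found.add(unicodedata.name(ch).split()[0])
    return found


def is_mixed(label):
    """True when the letters of a label come from more than one script."""
    return len(scripts_in(label)) > 1


# Latin 'h' next to ARMENIAN SMALL LETTER HO (U+0570).
print(scripts_in("h\u0570"))
# A Han character followed by BOPOMOFO LETTER B (U+3105).
print(is_mixed("\u4e2d\u3105"))
```

A registry could use a check of this kind to flag a label for review;
whether a given mix is acceptable remains the policy question asked
above.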

2.2 ALA-LC Transliteration Table

Romanization of non-Roman scripts has been an effective method for
cataloging documents in libraries of the United States. However,
because of practical limitations and the need for historical
consistency in recording this material, the transliteration of some
symbols departs from popular usage in one or more locations. For
example, the Arabic transliteration uses more diacritics than the
transliteration used in popular Arabic teaching texts.  Take the
Arabic Letter Alef from the Unicode table: it is called Alif in
[Translit 97] and transliterated to U+0101, the letter a with a bar
on top, while the same Alef is transliterated to "aa" in the
textbook [Nichlas 86].

In addition, Romanization has to be consistent with the original
phonetics of the spoken language. For example, until recently the
Library of Congress transliterated Chinese material with a phonetic
system based on a regional Chinese dialect and the English
Wade-Giles scheme [PinyinConv]. Are there similar cases that should
be examined before they are built into an IDN system?

Furthermore, transliteration follows rules specific to a particular
script and/or language. When these rules are applied to
transliteration into host names, they have to be simplified. For
example, most white space may be omitted for alphabetic languages,
and the same rule works for most Arabic languages, with a few
exceptions that cannot be ignored.  The same rule does not directly
apply to CJK in the current [Translit 97] treatment. The IDN working
group needs to cooperate with the Library of Congress to specify the
transliteration rules for individual scripts.

2.3 The IPA table

It appears that the scheme followed in [Translit 97] is very close to
the IPA specification. However, each transliteration scheme uses only
a subset of IPA symbols, owing to the characteristics of the
particular language being treated. This may imply that only a limited
number of diacritic marks is needed for each language, similar to the
26-letter limit of the Latin alphabet. For ease of use by ordinary
users of a language, a smaller number of diacritic marks needs to be
defined and allowed to represent different diacritic values in
different languages, so that the French macron, "-", may have a
different value from the one it has in Chinese transliteration.

The number of phonetic and diacritic elements that can be
represented in the domain names of one language is limited by [STD13]
and the recent ICANN decisions on a stable DNS. The symbols available
for transliteration are ASCII [a-z0-9] and the hyphen "-".
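
A minimal sketch of the symbol limit follows. It checks that a
transliterated label uses only [a-z0-9] and the hyphen. The 63-character
bound and the rule that a label neither starts nor ends with a hyphen
are this sketch's reading of common DNS hostname practice, stated here
as assumptions; is_valid_label() is a hypothetical name.

```python
import re

# Assumed rule: 1 to 63 characters from [a-z0-9-], with no hyphen at
# either end. Upper case is rejected only because the text above lists
# lower-case letters; the DNS itself compares names case-insensitively.
LABEL_RE = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def is_valid_label(label):
    """True when the whole label fits the assumed symbol limit."""
    return LABEL_RE.fullmatch(label) is not None


print(is_valid_label("an-gang"))     # only letters and a hyphen
print(is_valid_label("zh\u014dng"))  # contains o with macron (U+014D)
```

Any diacritic value therefore has to be carried by the letters, digits,
and hyphen themselves, which is the constraint the rest of this
document works within.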

2.4 CJK Romanization Tables

CJK [CJK] phonetics are syllable based. A code point in Unicode is
an independent semantic unit, called "zi" in Chinese, "ji" in
Japanese, and "ja" in Korean, and somewhat similar to a "stem" in
English. These syllable-based symbols are the characters of the CJK
script and are traditionally treated as such in computer processing
as well as in the UCS standard. In transliteration schemes, however,
the symbols are often treated in groups of characters, called "ci"
in Chinese, "shi" in Japanese, and "word" or "phrase" in English.

In past computer processing, the treatment of "zi" appeared to be a
field disjoint from the treatment of "ci": "zi" was "only an input"
processing problem, while the processing of "ci" belonged to
computational "linguistics", a sub-field of Artificial Intelligence
considered too complex for flat treatment of any kind. Granted,
these were two separate research fields in the early days of
software development; today it is difficult to separate them, since
many of the techniques for treating a "ci" benefit the treatment of
a "zi".  This does not mean, however, that the techniques used in
CJK character input processing cannot be extended to treat a much
larger character set such as UCS.

The transliteration scheme followed in [Translit 97] also treats the
three CJK languages differently.  For Chinese the transliteration is
character based, while for Korean and Japanese it is word based.
Each approach has its limits when used as an identifier in an IDN
system. Word transliteration is sufficient to identify a
multi-syllabic hostname in the DNS, but insufficient to identify a
glyph in Unicode. It is also difficult to reverse a word into a
sequence of characters, since much of its character content has not
been preserved. CJK character content here refers to the composition
of a character from several radicals, also referred to as CJK glyph
content.  CJK glyph content is the character composition that must
be explicitly preserved for glyph identification, in a way similar
to that described in [UAX15]. There are about 1,000 such characters
and radicals [Mao 87] from the GB [GB] standard; they have code
points in UCS without a definitive guide to their use. A consistent
list of transliterated radicals, for composing the transliterated
glyph content of CJK code points, has yet to be confirmed by a
standards body.

Character-based transliteration easily retains the character
boundary, but it has the same problem in identifying a CJK character
unless the original glyph content is preserved explicitly. The
solution proposed by StepCode [StepCode] provides a progressive
transliteration scheme that preserves the CJK glyph content of each
CJK character in Unicode, and a method to extract a DNS hostname
consistently from such a character encoding to obtain a word-like
DNS identifier.

The progressive character transliteration scheme can give only one
encoding per character per language, so a primary encoding has to be
elected from each set of equivalent transliterations. For example,
the Kanji <business> has two Romaji, "gyo1" and "go1"; one of the two
has to be elected as its primary transliteration.

3. Limits in IDN identifiers 

An IDN identifier is the name of an entity in a script familiar to
the user. IDN identifiers give just one name per entity. An IDN
identifier has just one code per character under one language tag;
there is no second code or second guess for the same identifier. The
final goal is that a single table lookup suffices to convert an IDN
name to a DNS name, with all the ambiguities that an IDN domain name
may imply already eliminated.  If the DNS has no such concept of
"one", defined on a solid symbolic representation, then there is no
"two" or "three" to build on, no matter what keyword searching
systems there may be.  In other words, the IDN encoding system is a
process that eliminates the ambiguities associated with a spoken
language, with a script, and with the registrant's intention.
The IDN system design process peels off all the ambiguities to
arrive at a unique identification of a domain name. The peeling
obviously starts at the outermost layer.

3.1 One Language per domain name

This layer falls off readily, like the dried and cracked skin of an
onion. It is nevertheless necessary to spell it out: one spoken
language, as defined in [ISO639], per domain name. Thus it is
REQUIRED that only one language tag be possible for one host name.

3.2 One Romanized transliteration per language tag

As we all know, many spoken languages may share one script.
Japanese and Korean, which use CJK characters, are not examples of
this: they are different languages as defined in [ISO639]. However,
there are other spoken languages using the same CJK script, such as
Cantonese, which is considered a dialect of Chinese (as [ISO639] has
decided, so it will not have a language tag in IDN [IDNmap]).  Only
one phonetic system per language tag is allowed in IDN, so Cantonese
from China and Kun from Japan may not be considered. It is REQUIRED
that there be only one Romanized transliteration scheme per language
tag.

3.3 One Presentation Direction for All DNS Identifiers

It is also a fact that human scripts may be read in different
directions. Latin is read from left to right, Arabic from right to
left, Mongolian from top to bottom, and Chinese may be read in all
three directions. When Romanized transliteration is used for DNS
identifiers, all names are read from left to right. It is REQUIRED
that transliterated scripts follow the left-to-right reading
convention of Latin script.

3.4 One break per word 

Word boundaries are neither the same nor presented consistently
across languages. Many transliterations use a hyphen, an apostrophe,
a space, or a letter to indicate a break between two phonetic units
or two semantic units. For example, the transliteration of Chinese
has one character per break, while Korean and Japanese have several
characters per break [Translit 97]. A domain name label is a limited
space. While it is possible to include several English words without
any break indicators, many non-Roman scripts need breaks so that the
transliteration can be correctly reversed to the original script.
For example, the name "an'gang" refers to a large steel entity in
China; it has to be interpreted as two characters with a break at
the apostrophe, so that "an-gang" does not become "ang-ang".

A word in the IDN context is a semantic or sound unit that is
independent of its neighboring characters. Two disjoint semantic or
sound parts of a domain name in transliterated form are REQUIRED to
be separated only by a hyphen, so that the original script form is
preserved during transcription of an IDN name.
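
The "an-gang" example can be checked mechanically. The sketch below
enumerates every way to cut an unbroken string into known syllables;
the three-entry syllable set is a toy assumption covering only this
example, and segmentations() is a hypothetical helper.

```python
def segmentations(text, syllables):
    """Return every split of text into items drawn from syllables."""
    if not text:
        return [[]]
    results = []
    for i in range(1, len(text) + 1):
        head = text[:i]
        if head in syllables:
            # Keep this syllable and segment the remainder recursively.
            for rest in segmentations(text[i:], syllables):
                results.append([head] + rest)
    return results


# Toy syllable inventory, only large enough for the example above.
SYLLABLES = {"an", "ang", "gang"}

# Without a break indicator the label has two readings.
print(segmentations("angang", SYLLABLES))
# With the required hyphen the reading is unique.
print("an-gang".split("-"))
```

The first call returns both "an" + "gang" and "ang" + "ang", which is
the ambiguity the hyphen requirement removes.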

3.5 One IDN character per equivalent character set

Characters may be defined as semantically equivalent. For example,
the scripts covered by the Latin "0048; 0068; Case map" in Section
2.1 and by the Chinese traditional to simplified character mapping
in [Tsconv] have been examined carefully and have standing standards
to refer to [Nameprep][SC Table86].

Similar cases exist in other languages too and have to be studied
carefully before an IDN implementation is deployed. For example, the
Kanji "kan <can, tin>" has two semantically equivalent forms, a
traditional Han form and a simplified Kanji form.  In the same
language and script, another Kanji, "so <contest>", has exactly the
same traditional and simplified forms as the Chinese character
"zheng <contest>" of the same semantics. If such cases are not well
studied and treated before an IDN system is deployed, controversial
uses of the same IDN identifier will foreseeably lead to an unusable
system. An inter-language set of semantically equivalent IDN
characters is REQUIRED to be defined, and only one IDN identifier
per such set is permitted.

As for Korean, most Hangul characters correspond to one or more
Hanja characters from the CJK character set. If equivalent character
set mapping is permitted for all scripts, in the same way as Latin
case mapping, decisions have to be made on the pros and cons of such
a mapping for the users of different languages. For example, a
Korean Hanja character may or may not be placed in an equivalence
set with Hangul. Careful evaluation of equivalence sets by the users
of Korean, Indian, Azerbaijani, etc. is necessary as well. It is
REQUIRED that only one IDN identifier be permitted from a set of
semantically equivalent IDN characters within one language tag.

3.6 One explicit phonetic value per character

In addition, an individual character, especially a Han character,
may be pronounced differently in different contexts, with different
semantics. For example, the character "chong <double>" is also
pronounced "zhong <heavy>", but IDN can allow only one of the
pronunciations to be used in an identifier. It is REQUIRED that
there be one transliteration per character per language tag, and it
is RECOMMENDED that the pronunciation most often used to denote a
name be adopted for the IDN identifier.
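
One way to picture this requirement is a lookup table keyed by
language tag and character that refuses a second reading. The sketch
below is an assumption-laden illustration: build_table() is a
hypothetical helper, "zho" is used as the ISO 639-2/T tag for Chinese,
U+91CD is taken to be the character in the example above, and the
choice of "zhong" as its primary reading is made up for the example,
not a recommendation.

```python
def build_table(entries):
    """Build a {(language_tag, character): reading} table.

    Raises ValueError when a second, different reading is offered for
    the same character under the same language tag.
    """
    table = {}
    for tag, ch, reading in entries:
        key = (tag, ch)
        if key in table and table[key] != reading:
            raise ValueError(
                f"{key!r} already has reading {table[key]!r}; "
                f"cannot also accept {reading!r}")
        table[key] = reading
    return table


# Hypothetical election of one primary reading for U+91CD.
table = build_table([("zho", "\u91cd", "zhong")])
print(table[("zho", "\u91cd")])

# Offering the other reading under the same tag is rejected.
try:
    build_table([("zho", "\u91cd", "zhong"), ("zho", "\u91cd", "chong")])
except ValueError as err:
    print("rejected:", err)
```

With such a table, converting a character to its identifier form is a
single dictionary lookup, in line with the one-table-lookup goal
stated at the start of Section 3.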

3.7 One Delimiter Digit per Character 

A character is a code point in UCS. A character in transliterated
form is an alphanumeric string with at least one Latin letter.
When several characters of a particular language appear in sequence
in transliterated form, a numeral serves two functions: diacritic
value and, when applicable, character delimiter. If a Latin letter
has two diacritics, for example one on top of the letter and one
below it, the limit of one digit is still enforced. The way to
observe the limit is either to treat the two diacritics together as
one special mark or to omit one of them.
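
The delimiter role of the digit can be sketched as follows. The code
assumes the simplest case, in which every transliterated character is
one or more Latin letters followed by exactly one digit, as in the
"gyo1" and "go1" forms quoted in Section 2.4; split_characters() is a
hypothetical name, and strings that omit the digit are outside this
sketch.

```python
import re

# One transliterated character: Latin letters, then a single digit.
UNIT_RE = re.compile(r"[a-z]+[0-9]")


def split_characters(text):
    """Split a letter+digit sequence into per-character units."""
    units = UNIT_RE.findall(text)
    # Reject input that is not fully covered by letter+digit units.
    if "".join(units) != text:
        raise ValueError(f"not a sequence of letter+digit units: {text!r}")
    return units


print(split_characters("gyo1go1"))
```

Because each unit ends at its digit, no additional break indicator is
needed between characters written this way.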

4. Scope of Transliterations 

The Internet Rosetta Stone for human script transliteration is to
1) codify CJK character transliteration;
2) codify [Translit 97] transliteration rules for easy word assembly
  and disassembly routines;
3) suggest an implementation of the language tag list, with
  consideration of the following questions:

a) What does the language-tagged user group expect the IDN domain
    name to look like?
b) Is this language-tagged script used interchangeably with another
    script?  Are they used as a mixed script?
c) How does the script concerned deal with foreign concepts?
d) How does the script concerned deal with foreign sounds?
e) Are there any wishes for the script to be mixed with other
    scripts?  Which ones?

4) provide easily accessible documentation of the embedded
  differentiation from a script to its transliteration.  The
  documentation may be similar to the ALA-LC Romanization Tables,
  with the addition of
 universal on-line access,
 language tag links and included scripts,
 phonetics of each symbol in IPA notation,
 sound file links into an existing IPA language sound database,
 the diacritic value of each digit for different languages, and
 hyphenation for embedded word breaks.
It is RECOMMENDED to name this document "idn-mnemonic".

5. Security Considerations

Much of the security of the Internet relies on the DNS. Thus, any
change to the characteristics of the DNS can change the security of
much of the Internet. Transliteration, however, makes no changes to
the DNS itself, so no security changes to the DNS are foreseeable.

6. Internationalization Considerations

The proposal affects all domain name users of non-Latin scripts, and
users of Latin scripts if diacritic marks are to be part of IDN
identifiers.

7. References

[Alphabet] "Repertoires of characters used to write the indigenous languages 
   of Europe", A CEN Workshop Agreement, Version 2.8, TECHNICAL REPORT, 
   Draft: 1998-12-14. http://www.egt.ie/alphabets/#1.3

[ASCII] American National Standards Institute (formerly United States 
   of America Standards Institute), X3.4, 1968, "USA Code for
   Information Interchange". (ANSI X3.4-1968)

[CJK] James SENG et al., "Han Ideograph (CJK) for Internationalized
   Domain Names", draft-ietf-idn-cjk-01.txt, 11th Apr 2001.

[DeFrancis 1989] John DeFrancis, "Visible Speech - The Diverse 
	Oneness of Writing Systems", 1989, ISBN 0-8248-1207-7.

[GB] China National character code exchange standard.

[IDNmap] Liana Ye, "IDN Code Exchange Mapping Structure",
     draft-liana-idn-map-00.txt, Sept. 2001.

[IDNReq] Zita Wenzel and James Seng, "Requirements of Internationalized
	Domain Names", draft-ietf-idn-requirements, May 2001.

[IPA] The International Phonetic Alphabet,  
    http://www2.arts.gla.ac.uk/IPA 1996.

[ISO639][ISO639-2/T] ISO/IEC 639-2 2001 Codes for the Representation of 
	Names of Languages.

[Mao 87] Mao, Yuhang, "Direct Radical-Consonant Coding of Chinese
   Characters", Proceedings 1987 International Conference on Chinese and
   Oriental Language Computing, Chicago, USA, 1987.

[Nameprep] Paul Hoffman and Marc Blanchet, "Preparation of 
   Internationalized Host Names", draft-ietf-idn-nameprep, July 2001.

[Nichlas 86] Nicholas Awde & Putros Samano, "The Arabic Alphabet", 1986,
   ISBN 0-8184-0430-2

[PinyinConv] Library of Congress Pinyin Conversion Project, "New
   Chinese Romanization Guidelines",
   http://lcweb.loc.gov/catdir/pinyin/romcover.html#7

[RFC2026] S. Bradner, "The Internet Standards Process -- Revision 3",
    1996, RFC 2026.

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
	Requirement Levels", March 1997, RFC 2119.

[RFC2396] Tim Berners-Lee, et. al., "Uniform Resource Identifiers (URI):
Generic Syntax", August 1998, RFC 2396.

[RFC2825] L. Daigle, Ed., "A Tangled Web: Issues of I18N, Domain
   Names, and the Other Internet protocols", May 2000, RFC 2825.

[SC Table86] "A Dictionary of Chinese Character Information",
   Scientific Publishing, 1988, ISBN 7-03-000869-3/H.3

[StepCode] Liana Ye, "StepCode - A Mnemonic Internationalized Domain 
   Name Encoding", draft-ietf-idn-step-01.txt. 

[STD13] Paul Mockapetris, "Domain names - implementation and
	specification", November 1987, STD 13 (RFC 1035).

[Translit 97] Barry, Randall K. 1997. ALA-LC romanization tables: 
   transliteration schemes for non-Roman scripts. Washington: Library 
   of Congress Cataloging Distribution  Service. ISBN 0-8444-0940-5

[Translit 97, P.24][Mercury 2001-7-30] ALA-LC romanization tables,
   "Azerbaijani" Arabic to Latin table, Page 24.
   Aida Sultanova, "Azerbaijan Mandates use of Latin alphabet", San
   Jose Mercury News, July 30, 2001.

[Tsconv] XiaoDong LEE et al., "Traditional and Simplified Chinese
   Conversion", draft-ietf-idn-tsconv-00.txt, June 2001.

[UAX15] Mark Davis and Martin Duerst, Unicode Standard Annex #15:
   "Unicode Normalization Forms", Version 3.1.0.
   <http://www.unicode.org/unicode/reports/tr15/tr15-21.html>

[UCS][UNICODE] The Unicode Consortium, "The Unicode Standard". 
     Described at http://www.unicode.org/unicode/standard/versions/.

[UNICODE30] The Unicode Consortium, "The Unicode Standard -- Version
            3.0", ISBN 0-201-61633-5. Same repertoire as ISO/IEC
            10646-1:2000. Described at http://www.unicode.org/unicode/
            standard/versions/Unicode3.0.html.

8. Acknowledgements

The author has benefited from energetic discussions of IDN system
design issues. Among the many comments, particular arguments or
instructions that helped inspire this draft came from James Seng,
Eric Brunner, Mark Davis, Patrik Faltstrom, L.M.Tseng, Soobok Lee,
Martin Duerst, Harald Tveit Alvestrand, Xiaodong Lee, Roozbeh
Pournader, Deven Kalra, Adam M. Costello, Paul Hoffman, Bruce
Thomson, and John C Klensin.

9. IANA Considerations

This document requires IANA action to make script tags available
and to register each tag, and possibly its sub-field for the
phonetic system used.

10. Authors' Contact Information

Liana Ye
Y&D ISG
2607 Read Ave.
Belmont, CA 94002, USA.
(650) 592-7092
liana.ydisg@juno.com