[idn] draft-liana-idn-step-1

To: Internet-Drafts@ietf.org
Subject: [idn] draft-liana-idn-step-1
From: liana.ydisg@juno.com
Date: Sun, 22 Jul 2001 17:52:00 -0700
Cc: idn@ops.ietf.org, liana.ydisg@juno.com

A New Internet-Draft is available from the on-line Internet-Drafts
directories.
This draft is a work item of the Internationalized Domain Name Working
Group of the IETF.

	Title		: StepCode- A Romanized Mnemonic IDN Encoding
	Author(s)	: Liana Ye
	Filename	: draft-liana-idn-step-1.txt
	Pages		: 25
	Date		: 22-July-2001
	
This document describes Romanization of localized internet 
domain names of different languages to US-ASCII [a-z0-9] strings 
in a fashion that is completely compatible with the current DNS.
Two related documents, IDN tags and Mnemonic mapping, will be submitted
shortly.

Internet Draft                                     Liana Ye
draft-Liana-idn-step-01.txt                          Y&D ISG
July 20, 2001
Expires in six months (December 2001)                         
		   
	 StepCode- A Romanized Mnemonic IDN Encoding
		
Status of this memo

This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/ietf/1id-abstracts.txt

     The list of Internet-Draft Shadow Directories can be accessed
	 at http://www.ietf.org/shadow.html.


Abstract

This document describes Romanization of localized internet 
domain names of different languages to US-ASCII [a-z0-9] strings 
in a fashion that is completely compatible with the current DNS.  

1. Introduction

1.1 Context

Worldwide desire to use characters other than plain ASCII in 
hostnames is growing rapidly. Hostnames have become the equivalent 
of business or product names for many services on the Internet, 
and are also referred to here as tradenames. Several needs have to 
be addressed: making hostnames usable by people whose native 
scripts are not directly representable in ASCII; letting network 
support workers diagnose URLs; letting an expanded and diverse 
name server network sort and manage zonefiles; serving the 
increasing number of non-native readers, who do not use their 
native scripts to refer to tradenames in daily activities; and 
minimizing possible security leaks when international domain 
names are implemented in Domain Name Servers (DNS). The 
requirements for internationalizing hostnames are described in 
the IDN WG's requirements document, [IDNReq].

To provide one DNS symbol set for users of different languages 
under the above technical and security considerations, a 
Romanization process from different languages to US-ASCII is 
unavoidable. Language Romanization has been a fact around the globe 
since Russia standardized Cyrillic for many eastern European 
languages in the 1920's, Turkey changed from Arabic to Latin
script in 1928, and China adopted Pinyin as a supplemental
phonetic system for Han script in 1958. In the past three 
decades, software implementation of such a process has extended 
from a user to his qwerty keyboard, from a keyboard to text 
editors of various kinds, from text editors to mail services, 
and from mail services to internet address resolvers.  To unify 
today's fragmented Romanization implementations for use as 
IDN hostname identifiers, written documentation is overdue to 
address issues as basic as those stated by [DeFrancis 1989]:
   "The adaptation of Latin alphabet to represent a great variety
of spoken languages means of course that the value of specific 
symbols varies from language to language.  This is true both of
the European adaptations, which in most cases came about rather 
haphazardly, and of the more recent creations based on more 
carefully thought-out linguistic principles.  So it is that the 
French 'u' has a different value from that in English.  The letter
'j' represents one sound in English 'jam', another in German 'ja'.
The initial sound of English 'sure' is written 'sz', in Polish, 
Czech. The sound represented by English 'ts' is written in 'c'
in Polish, Czech, Hungarian, Serbo-Croatian, and Chinese."

One step beyond the above linguistic issues is sorting and 
searching zonefiles or name servers whose hostname identifiers 
contain different written languages, for potentially very 
large numbers of users online, say 10% of the world's population.  
Hostname identification could become a bottleneck for internet 
traffic if sorting and searching has to be carried out 1. over 
more than one set of partially overlapping or possibly mixed 
symbolic representations; and 2. over compressed or semantically 
randomly ordered zonefiles scattered around the globe.

Historically, character-form scripts such as CJK have had 
inherent sorting and indexing difficulties; merely using a 
dictionary used to be an intellectual activity. Suppose we solved 
such an indexing problem with substantial resources and IDN moved 
to a character-form based system; it is foreseeable that the IDN 
system would still have to support a text based DNS system for a 
long time.  After all, the DNS is a historically successful 
system. To throw such a system away is like asking people to stop 
shopping at supermarkets. 

The Romanized Pinyin system for CJK character indexing has 
provided a feasible but partial solution. The currently used 
complete solution is to go through a software process of both 
searching databases for possible matches (not exact-match DNS 
lookups) and, where necessary, dialogue with the users, and arrive 
at strong candidates for the glyph representation, especially 
where the users were not easily able to enter more direct 
representations of the characters from keyboards. If this 
selection process can be codified in the Latin alphabet, then a 
complete Romanized syllabic system will be a reality, and sorting 
and searching international domain names with one set of symbolic 
representations will be speedy and feasible. 

The representation system for hostnames is due to be unified. In 
fact, writing system unification has already been seen with 
Arabic, Latin and Chinese.  Each of them is used by many different 
spoken language groups.  According to [DeFrancis 1989], human 
scripts can be organized into three groups by their phonetic 
characteristics: 
1. Syllabic systems, for example, Chinese, Japanese, Maya and Yi; 
2. Consonantal systems, i.e. Hebrew, Arabic and Indian languages; 
and 3. Alphabetic systems, including Greek, Latin, Cyrillic, 
Korean and English.  Alphabetic systems can be unified by
embedding some differences under the hat of mnemonic 
representation of language symbols, so that the French 'u' is 
permitted to have a different sound value from the English 'u'.

Mapping a consonantal system to an alphabet symbol set is essentially
embedding some phonetic differences, using a Latin mnemonic hat.
Additionally, there is the question of how to represent the vowels
of the language. Turkey has provided an answer to this question.

As to unifying a syllabic system with an alphabet system, two issues 
need to be addressed.  The first is reversibility from the 
alphabetic system back to the syllabic system, and the second is  
expressibility with the alphabet system of additional information 
included in the syllabic system. 
 
Unification of symbol systems always brings about some loss from 
the original systems, especially in this fast growing internet
era, when the native language of a household can be lost in only 
one generation in a localized bilingual environment. In order to 
retain the colorful heritage of the world, a means of easy
reference to the original system should be implemented.

The proposed solution is called StepCode, for its prioritized
steps in such a Romanization procedure. First, specify the 
phonetic differences to be embedded in the representation, 
where an International Phonetic Alphabet (IPA) description of 
the embedded differences shall be recorded. Second, if the 
Romanized embedding is not sufficient to cover the differences, 
then extend the mapping space to a 26x10 table for secondary 
phonetic elements which can not be embedded under the Latin 
mnemonic hat. Third, if the 26x10 space is not sufficient, then 
linearize the symbol by specifying each of its components. This 
last part may become recursive. This open ended solution not only 
provides a path to unify a large syllabic system using an alphabet 
system, but also ensures that more semantically specific symbols, 
such as trademarks and logos, can be represented online and sorted 
for speedy referencing. Due to its step nature, the representation
can (and should) stop for each symbol as soon as the symbol can 
be identified within its designated context. For example, 
"xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua2shi0.com" is a 
unique expression resulting from two complete iterations of 
applying StepCode to four codepoints of [ISO10646], while one 
complete step would result in "xinzhuqinghua1212", which is most 
likely sufficient for identifying a short tradename. For a longer 
tradename the digits may be truncated, and the method then 
resembles transliteration of a hostname, such that a CJK string 
appears as a normal readable Romanized expression, such as 
"xinzhuqinghua" for the same example.  For applying StepCode to 
hostnames, except for terminology definitions, this document 
limits the discussion to the first two of those three parts. 

The IDN WG's comparison document [IDNComp] describes three potential
main architectures for IDN: arch-1 (just send binary), arch-2 (send
binary or ASCII Compatible Encoding, ACE), and arch-3 (just send ACE).
StepCode is an ACE that can be used with protocols that match arch-2 
or arch-3. 
  
The StepCode protocol has the following features:

- There is exactly one way to convert internationalized host parts
to and from language tagged ACE encoded strings. It permits 
different script tags to access the same glyph in [ISO10646], 
similar to the method used for searching books in a library, such 
that the CJK character set may be accessed by users of different 
languages with different hostnames, where each of the hostnames 
is always a unique expression on the internet.  If an input string 
can not match such a hostname, then it is considered a user input 
error.

- [nameprep] is applicable to UNICODE and other corresponding 
local coding standards.

- [IDN Tag] includes each language tag and its corresponding 
code blocks of UNICODE and other local coding standards. 

- [Mnemonics] includes language tags, local script to Latin 
alphabet symbol mappings, and an IPA phonetic value description 
of each phonetic symbol of a language script. It shall be a 
Rosetta Stone of the internet.

- Host parts have no international glyphs but US-ASCII. The 
StepCode procedure SHOULD be after [nameprep] which has prepared 
the hostname parts in applicable code standards.

- For applicable tags, local display codes of different
code standards with corresponding registered hostnames SHOULD
be retained for inquiries from other IDN hosts, and request for
the "reference to be sent" protocol SHOULD be drafted.
 
- Names using StepCode have lengths proportional to the number
of glyphs in the names themselves plus the language tag.  
However, StepCode for all the non-Latin phonetic glyphs SHOULD
be confined within two octets, since all the current phonetic 
based scripts can be represented within two octets, and its
mnemonic representation SHOULD be preserved. For a relatively long
CJK, Yi and Hangul glyph sequence, say above ten glyphs, the average 
length per glyph is about 3.7 Latin letters.

- This specification allows standard compression or security 
treatment compatible with existing hostnames.

It is important to note that the following sections contain many
normative statements with "MUST" and "MUST NOT". Any implementation
that does not follow these statements exactly is likely to cause
damage to the Internet by creating non-unique representations of
hostnames.

1.2 Author's Disclaimer

This document is intended to gather international co-authorship 
within the IDN WG, in order to propose a script-specific 
Romanization encoding standard as an international tradename 
solution on the internet. Since the majority of UNICODE symbols 
already have Romanized names specified in the UNICODE standard, 
the additional work needed is to select each symbol, excluding 
font or case variations, to be romanized onto the Latin alphabet 
for a DNS encoding standard. The most technically difficult part 
of this proposal is converting a romanized CJK or Hangul string 
back to the codepoints of the display code standard supported by 
its local host; such procedures exist in many public domains. A 
sample procedure in C language for Chinese is provided in 
Appendix D. 

1.3 Terminology

The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
and "MAY" in this document are to be interpreted as described in
 [RFC2119].

Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values
are shown preceded with an "0b". For example, a nine-bit value might
be shown as "0b101101111".

Examples in this document use the notation from the Unicode Standard
[Unicode3] as well as the ISO 10646 names. For example, the letter
"a" may be represented as either "U+0061" or "LATIN SMALL LETTER A".

StepCode converts strings of internationalized characters at a 
client site into strings of US-ASCII that are acceptable as host 
name parts in current DNS host naming usage. The former are called
"pre-converted" and the latter "post-converted". Within a 
pre-converted string, a "glyph" is a symbol represented by one 
codepoint in [ISO10646], and "glyphs" is a string of such symbols.

The "pre-converted" strings at a client site may be represented 
in Unicode, GB code, JIS code, BIG5 and others, which may contain 
font information.  These code forms are referred to as language 
specific "localized codepoints".  

The protocol contains one procedure and calls for a minimum 
number of symbols of a language to be mapped onto the Latin 
alphabet in a mnemonic manner. For languages with a number of 
glyphs too large to map onto the Latin alphabet directly, a 
three layered scheme is RECOMMENDED, and a minimum set of glyphs 
of a script which are often used as parts of other glyphs is 
identified. The glyphs in the smaller set are sometimes called 
radicals, or particles of a CJK character, but neither term 
reflects the nature of the set: its glyphs are among the most 
frequently used glyphs by themselves and are also parts of other 
glyphs. The minimum set of glyphs is called "Pianpang", a Han 
word meaning "a character standing on the side", a common word 
in Chinese. A set of associated definitions in this area is given 
here:
"pang" - a character on the left;
"bian" - a character on the right;
"tou"  - a character on the top;
"di"   - a character on the bottom;
"xin"  - a character in the middle;
"kuang"- a container or a frame character.
Since CJK characters are written from left to right and from top 
down, the "pang" is often the first part of a character and is 
used as the key for searching dictionaries; it is partially 
ordered in UNICODE, so "pang" is also referred to as radicals.

The three layers of glyphs of a large language script are 
Layer one: phonetic glyphs, which can be directly mapped onto
	an alphabetic system under the Latin mnemonic hat;
Layer two: a minimum number of frequently used glyphs 
	which are also used as Pianpangs in other glyphs;
Layer three: the rest of the glyphs in the language script.

The protocol uses US-ASCII to denote the phonetic elements of 
a script and calls for standardizing such a mapping for each 
script tag. The phonetic elements of a glyph are called the 
"spelling" of the glyph, and those of a "Pianpang" are called 
its "stem".

The protocol specifies ASCII Compatible Encoding [ACE] maps for 
major languages and provides a means of embodying such an 
implementation with Chinese script; this is referred to here as a 
"language tagged ACE" process, or "T-ACE". 

1.4 IDN summary

Using the terminology in [IDNComp], StepCode specifies an ACE format 
for arch-2 (send binary or ACE), and arch-3 (just send ACE).

The StepCode length discussed above (1.1 Context) is variable, 
depending on users' choices among many factors. It fits well with 
existing compression and security treatments. 

StepCode calls for standardizing phonetic elements within the user 
language groups specified in [ISO 639], while asking the 
internet industry to enforce the standard and to provide cross 
references from different script tags into the Unicode standard.

2. Host Part Transformation

According to [STD13], host parts must be case-insensitive, start
and end with a letter or digit, and contain only letters, digits, 
and the hyphen character ("-"). This excludes any 
internationalized characters, any font variations, Chinese 
Traditional/Simplified character set variations, as well as many 
other characters in the ASCII character repertoire. Further, 
domain name parts must be 63 octets or shorter in length 
including the language tag. 
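
As a rough illustration of the [STD13] constraints listed above, the
following Python sketch checks a single host part. The function name
is_valid_host_part is hypothetical and is not defined by this draft;
the sketch only restates the rules given in this section.

```python
import re

# Letters, digits and hyphen only; must start and end with a letter or digit.
_HOST_PART = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")

def is_valid_host_part(part):
    """Return True if `part` satisfies the host part rules quoted from [STD13].

    Hypothetical helper: the length limit counts the language tag as well,
    because the tag is part of the stored name part.
    """
    try:
        octets = part.encode("ascii")
    except UnicodeEncodeError:
        return False          # internationalized characters are excluded
    return len(octets) <= 63 and _HOST_PART.fullmatch(part) is not None
```

Under these assumptions a tagged StepCode name part such as
"xxx-xinzhuqinghua1212" passes the check, while a part beginning or
ending with a hyphen, or longer than 63 octets, does not.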

2.1 Name tagging

All post-converted name parts that contain internationalized
characters begin with a language tag, defined either in 
[ISO 639-2/T] or listed in Appendix E of this document, in the 
form of "xxx-", where "xxx" denotes the language or script 
encoded. The tag SHOULD use an [ISO 10646] defined script for the 
phonetic standard implemented.  The language tags listed herein 
are writing systems, as opposed to the spoken languages specified 
in [ISO 639], though they are based on spoken languages. For 
example, "usa-" for US-ASCII is not considered a spoken language 
and so it is not included in [ISO 639].

Since the [ISO639] definitions are based on spoken languages, while 
script based definitions are given in [ISO 10646], a StepCode 
implementation is applied to languages defined in [ISO 10646] with
labels defined in [ISO639]. 

The phonetic symbols implemented in the encoding MUST have 
been included in [ISO 10646]. 

A language tag MUST be registered with IANA with codepoint blocks
of UNICODE associated with the tag, for [nameprep] to recognize, 
to apply ACE process and to attach the tag to the post-converted 
hostname and for a receiving host to reverse its hostname back to 
either UNICODE or its local codepoints. 

A zone administrator MAY still choose to use "usa-" at the 
beginning of a hostname part even if that part does not contain
internationalized characters. Zone administrators MAY create
host part names that begin with "usa-" which means no conversion
is done and display systems SHOULD ignore converting 
internationalized characters back for display.

2.2 Converting an internationalized name to a T-ACE name part

To convert a string of internationalized characters into a 
T-ACE name part, the following steps MUST be performed in the 
exact order of the subsections given here.

2.2.1. Tag checking
If a name part consists exclusively of characters that conform to
the hostname requirements in [STD13] or the string "usa-", 
the name MUST NOT be converted to T-ACE. That is, a name part 
that can be represented without T-ACE MUST NOT be changed. 
This absolute requirement prevents:
	1. double encoding from a client of user keyboard input 
	 and a server provider;
	2. messing up existing registered domain names;
	3. there being two different encodings for a single DNS 
	registered hostname;
	4. interfering with registered glyphs with more than one
	phonetic standard, such as Chinese script.

If any checking of name parts (such as for prohibited characters, 
case-folding, or canonicalization) is to be done, it MUST be done, 
as specified in [nameprep], before the conversion to a T-ACE name 
part. 

Characters outside the first plane of characters (those with
codepoints above U+FFFF) MUST be represented using surrogates, 
as described in the UTF-16 description in [ISO 10646].

The input name string consists of characters from the ISO 10646
character set in big-endian UTF-16 encoding. This is the
pre-converted string.
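
The pre-converted form described above can be sketched in Python as
follows. The helper name to_utf16be_units is hypothetical; the sketch
only shows how a codepoint above U+FFFF becomes a surrogate pair when
a string is held as big-endian UTF-16.

```python
def to_utf16be_units(text):
    """Return the 16-bit code units of `text` in big-endian UTF-16.

    Hypothetical helper for illustration. Codepoints above U+FFFF come
    out as two units (a high and a low surrogate).
    """
    data = text.encode("utf-16-be")   # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]
```

For example, U+65B0 yields the single unit 0x65B0, while U+20000
yields the surrogate pair 0xD840 0xDC00.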

2.2.2. Check the input string for disallowed names

If the input string consists only of characters that conform to 
the hostname requirements in [STD13], or the input string consists 
of a null language tag, the conversion MUST stop with an error.

2.2.3. T-ACE encoding
Find the corresponding tag, T, with [IDN TAG] for an input string. 
If all the codepoints are in the first tag X, then T = X, and it 
	is a valid IDN; 
otherwise, T = dud.

Branch to T, encode the input string with procedure T, conforming 
	to [STD13], and obtain the ACE string, A. 

Prepend the tag, T-, to the ACE string, A, to obtain a T-ACE hostname.
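
The tag lookup and prepending steps above might be sketched as
follows. The table TAG_BLOCKS is an assumption made for illustration
only: the real tag-to-block registry belongs to [IDN TAG], which is
not reproduced here, and the three tags and Unicode block ranges
below are merely plausible placeholders. The function names are
hypothetical as well.

```python
# Hypothetical registry: tag -> list of (first, last) Unicode codepoints.
# The real assignments are to be defined by [IDN TAG]; these are examples.
TAG_BLOCKS = {
    "zho": [(0x4E00, 0x9FFF)],   # CJK Unified Ideographs
    "ell": [(0x0370, 0x03FF)],   # Greek and Coptic
    "rus": [(0x0400, 0x04FF)],   # Cyrillic
}

def find_tag(text):
    """Return the tag whose blocks cover every codepoint, else "dud"."""
    if not text:
        return "dud"
    for tag, blocks in TAG_BLOCKS.items():
        if all(any(lo <= ord(ch) <= hi for lo, hi in blocks) for ch in text):
            return tag
    return "dud"

def make_tace(tag, ace):
    """Prepend "T-" to the ACE string A to form the T-ACE host part."""
    return tag + "-" + ace
```

With this placeholder registry, a string drawn from a single block
receives that block's tag, and a string mixing blocks receives "dud".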


2.3. StepCode Method

StepCode starts with a phonetic representation of a glyph in the 
Latin alphabet.  When this is not sufficient to identify the 
glyph, it supplements the representation with a digit. Because 
alphabet based scripts connect several syllables into one 
semantic unit, or word, a phonetic representation normally 
identifies a word uniquely within the language. A character-form 
based script such as CJK, characterized by one syllable per 
glyph, often can not uniquely identify a character by its 
syllable alone, but a sequence of syllables will often identify a 
string of characters uniquely within the language, much as in 
alphabet languages. StepCode observes this phenomenon, represents 
a phrase of a syllabic language as one semantic unit containing 
more than one syllable, and encourages such a representation of 
a character string. For example, the syllabic string  
"xin zhu qing hua" of four characters is written in the preferred 
form "xinzhuqinghua". 

When the Latin alphabet is not sufficient to represent the sound 
of a glyph, the representation is supplemented with a digit, 
denoting a secondary phonetic characteristic of the glyph, or the 
phrase. Together, the described process forms the first step of 
StepCode encoding, and is the most visible part of the method as 
well.

StepCode steps:
S1.1. Romanize the primary phonetic characteristic of a 
	glyph/phrase;
S1.2. Supplement the secondary phonetic characteristic of the 
	glyph with a digit/digits.

The second step of StepCode is applied to components of each
glyph, Pianpang, in the same way specified in S1.1.
S2.1. Romanize the primary phonetic characteristic of a Pianpang, B;
S2.2. Specify how the next pianpang is related to the current 
	pianpang, B, with a digit;
S2.3. If the pianpang contains another pianpang, X of B, 
	then goto S2.1 of X (and it is S2+1.1(X));
	otherwise goto the next pianpang, B+1.

2.3.1  StepCode phonetic symbol tables

A glyph of alphabetic language has a sound value associated with 
it.  Under this proposal, a set of sounds with a similar value
from different languages SHOULD be associated with a glyph in 
US-ASCII, as shown in Appendix A. 

A glyph of consonantal systems and a phonetic glyph of syllabic 
systems SHOULD be determined for a best fit onto an existing 
set of sound values of US-ASCII. [UNICODE] standard 
has specified a romanized name for each of glyph in the 
standard. The mapping MAY be based on such a romanized name.  

2.3.2. StepCode Conceptual Definition for Digits

With only 26 Latin letters available, many languages possess 
sound elements which can not be included. The excluded sound 
elements are the secondary phonetic elements, and SHOULD be 
assigned to the additional symbols 0-9.

2.3.2.1 Secondary Sound Values in Step one encoding:

Although 26x10 is a two dimensional map, it can be filled 
with more than two phonetic aspects of a script.  With 
increased complexity, the mnemonic value diminishes gradually. 
For simplicity, four phonetic mapping rules SHOULD be
observed: R1. Diacritic mark mapping; R2. Phoneme Mapping; 
R3. Overflow consecutive slot mapping; R4. Priority 
elements mapping.

[R1] Diacritic mark mapping. For some language scripts, 
secondary phonetic elements have to be marked for their 
users. For European scripts, for example, a simple Tone mark
mapping SHOULD be used, where the digits only denote common 
diacritic marks [Macmillan93] as follows. 
 
0	letters with no tone
1	flat/macron (-)
2	rise/acute (/)
3	dip/breve (v)
4	drop/grave (\)
5	throw/circumflex (^)
6	thrill/tilde (~)
7	dieresis (")
8	cedilla	(hook)
9	user assigned

Similar marks SHOULD keep their respective positions in this 
table, for easy reference across script boundaries and for users 
looking for replacement marks. 
A French diacritic mark assignment is in [Appendix B.1]. 
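
A minimal sketch of rule R1, assuming that each mark in the table
above corresponds to the Unicode combining character named in the
comments, and that an unmarked letter is emitted without a digit.
Both are assumptions of this sketch; the actual French assignment is
the one in [Appendix B.1]. The function name r1_encode is
hypothetical.

```python
import unicodedata

# Digit assignments from the R1 table; the combining characters chosen
# to stand for each mark are this sketch's assumption.
DIACRITIC_DIGIT = {
    "\u0304": "1",  # macron
    "\u0301": "2",  # acute
    "\u0306": "3",  # breve
    "\u0300": "4",  # grave
    "\u0302": "5",  # circumflex
    "\u0303": "6",  # tilde
    "\u0308": "7",  # dieresis
    "\u0327": "8",  # cedilla
}

def r1_encode(text):
    """Replace each diacritic mark with its R1 digit after the base letter."""
    out = []
    # NFD splits a precomposed letter into base letter + combining mark(s).
    for ch in unicodedata.normalize("NFD", text.lower()):
        if ch in DIACRITIC_DIGIT:
            out.append(DIACRITIC_DIGIT[ch])
        elif "a" <= ch <= "z":
            out.append(ch)
        else:
            raise ValueError("no R1 mapping for %r" % ch)
    return "".join(out)
```

Under these assumptions, "café" becomes "cafe2" and "garçon" becomes
"garc8on".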

[R2] Phoneme table mapping, where each digit specifies a 
variant of a base phoneme, and a maximum of nine variants may 
be accommodated. This rule has the best mnemonic result across 
different scripts. For example, IPA symbol mapping [Appendix B.2].

[R3] Overflow Symbol mapping, where the symbols SHOULD fill 
in only consecutive slots in opposite directions
in the table for ease of index computation, and where the middle 
section of the table SHOULD be left for user selected 
definitions. This rule is suited to two sets of corresponding 
symbols of the same script, for example Chinese in [Appendix B.3].

[R4] Priority elements mapping, where a set of often used
symbols is selected to be placed in the table. [Appendix B.4]

The above assignment rules may be used in a combination according 
to an order of weights in such an assignment.  Such an order
of weights SHOULD be specified in the form [Rx-Ry-Rz-R4]. 

2.3.2.2. Digits in Step 2 encoding:
 
A unified CJK character is often a composition of several independent 
symbols of the language. It is possible to describe a CJK character  
by representing it with only its parts/Pianpangs.  
Although such a description can identify a character uniquely, it 
is normally accompanied by a number of rules with too many 
exceptions for the majority of users to comprehend. StepCode 
encoding reduces the complexity of the rules by considering a CJK 
character as a simple grid of 1 to 10 units, depending on the 
user's viewpoint.  Naming the 1 to 10 units in a linear fashion 
results in a linear representation of the glyph, or its encoding. 
This is used as a secondary encoding most of the time, while 
sometimes it has to be the primary representation, when the 
correct sound of a character is not available to the users. 
The digits in Step 2 and thereafter, which specify how a pianpang
on the grid of a glyph is related to the next pianpang, are 
called layout digits. 

Layout digits specify the relation to the next pianpang in line.  
The left and right direction are defined by a user's left or 
right hand while sitting in front of a display screen or a 
piece of paper. 

The glyph layout digits are:
	0 - end of a character or a Pianpang
	1 - to its right
	2 - to its underside
	3 - to contain the following
	4 - to divide the following
	5 - to its left
	6 - to its top

	The following selectable digits are to specify additional 
	glyphs of the script and directions of layout.

	7 - to overlay itself with X then to its right;
	8 - to overlay itself with X then to its left;
	9 - to overlay itself with X then to its underside.
	
The pianpang layout scheme trades complexity of a glyph with
code length, such that the complexity can be eliminated when
truncating the code is permitted.
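
To make the layout digits concrete, the following sketch reads a
sequence of Pianpang stems and layout digits and reports the relation
each digit names. The digit meanings are copied from the list above;
the function name describe_layout is hypothetical, and the sketch
assumes stems are written in lowercase letters with one layout digit
after each stem.

```python
import re

# Layout digit meanings, as listed in this section.
LAYOUT = {
    "0": "end of a character or a Pianpang",
    "1": "to its right",
    "2": "to its underside",
    "3": "to contain the following",
    "4": "to divide the following",
    "5": "to its left",
    "6": "to its top",
    "7": "to overlay itself with X then to its right",
    "8": "to overlay itself with X then to its left",
    "9": "to overlay itself with X then to its underside",
}

def describe_layout(code):
    """Return (stem, relation to the next pianpang) pairs for `code`."""
    if not re.fullmatch(r"(?:[a-z]+[0-9])+", code):
        raise ValueError("expected stem+digit pairs, got %r" % code)
    return [(stem, LAYOUT[digit])
            for stem, digit in re.findall(r"([a-z]+)([0-9])", code)]
```

For the fragment "qin1jin0" this gives "qin" followed to its right
by "jin", which ends the character.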

2.3.3. StepCode Format

Format Definition: A StepCode unit is a string of [A-Za-z0-9]
characters without any white space, BLANK, in between. For each 
StepCode unit, data elements indicated by "" MUST be supplied, 
elements indicated by [] are optional, and / separates 
selectable alternatives.

Sx stands for primary sound value or Spelling of xth glyph;
Tx stands for secondary sound value or tone of xth glyph;
Py stands for Stem for yth Pianpang;
Ly stands for Layout relation from y to y+1;
Px.y stands for Stem for xth glyph and its yth Pianpang;
Lx.y stands for Layout relation from xth glyph and its y to y+1.
	 
2.3.3.1. One glyph
	"S""T"[P1][L1][P2][L2]...[Py][0/BLANK]

Example:xin1
	xin1qin1jin0 

2.3.3.2. Glyphs
"S1S2S3...Sx"[T1T2...Tx][P1.1][L1.1][P1.2][L1.2]...[P1.y][0]
		[P2.1][L2.1][P2.2][L2.2]...[P2.y][0]
			... 
		[Px.1][Lx.1][Px.2][Lx.2]...[Px.y][0/BLANK]

Example of glyphs of four:
	xinzhuqinghua
	xinzhuqinghua1212
	xinzhuqinghua1212qin
	xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua
	xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua2shi0

Which of these five equivalent StepCodes is used depends on where 
it is stored, the size and type of the database, as well as whether
there exist similar hostnames it conflicts with. 
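
The format above can be read back mechanically. The following sketch
splits a StepCode unit into its spelling, tone and Pianpang fields.
It assumes the fields are laid out exactly as in the examples of this
section (letters, then tone digits, then stem and layout digit pairs,
with a truncated unit possibly ending in a bare stem); the function
name parse_stepcode is hypothetical.

```python
import re

_UNIT = re.compile(r"([a-z]+)([0-9]*)((?:[a-z]+[0-9])*(?:[a-z]+)?)")

def parse_stepcode(unit):
    """Split a StepCode unit into spelling, tones and per-glyph pianpangs."""
    m = _UNIT.fullmatch(unit.lower())
    if m is None:
        raise ValueError("not a StepCode unit: %r" % unit)
    spelling, tones, rest = m.groups()
    glyphs, current = [], []
    for stem, layout in re.findall(r"([a-z]+)([0-9]?)", rest):
        current.append((stem, layout))
        if layout == "0":          # layout digit 0 closes a glyph
            glyphs.append(current)
            current = []
    if current:                    # truncated unit: last glyph unfinished
        glyphs.append(current)
    return {"spelling": spelling, "tones": tones, "pianpang": glyphs}
```

Applied to the longest of the five examples, this yields the spelling
"xinzhuqinghua", the tones "1212" and four Pianpang groups, one per
glyph; applied to the shortest, only the spelling field is filled.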

2.4. StepCode Encoding Process

Go through [nameprep], checking for prohibited characters, 
case-folding, or canonicalization. 

Either, StepCode may be obtained from Unicode and/or other local
codes to StepCode glyph/phrase conversion tables. 

Or, it is input directly from keyboards, where an input 
processing module to verify correctness of the intended glyphs is
necessary. (See C code in [Appendix D.1])

Prepend the script tag in the form of "xxx-" to the post-converted 
string;  finish. This is the hostname part that 
can be used in DNS registration as well as resolution. 
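
The table-driven path can be sketched as below. Everything in the
table is an assumption made for illustration: this draft does not
name the four codepoints behind its "xinzhuqinghua1212" example, so
the entries here are simply four characters whose Pinyin spellings
and tones would produce that string. The tag "zho" and the function
name stepcode_step1 are likewise hypothetical.

```python
# Hypothetical conversion table: codepoint -> (spelling, tone digit).
# Chosen only so that the result matches the example string in the text.
STEP1_TABLE = {
    0x65B0: ("xin", "1"),
    0x7AF9: ("zhu", "2"),
    0x6E05: ("qing", "1"),
    0x534E: ("hua", "2"),
}

def stepcode_step1(text, tag="zho", with_tones=True):
    """Encode `text` with step one (S1.1, S1.2) and prepend the tag."""
    spellings, tones = [], []
    for ch in text:
        if ord(ch) not in STEP1_TABLE:
            raise ValueError("no table entry for U+%04X" % ord(ch))
        spelling, tone = STEP1_TABLE[ord(ch)]
        spellings.append(spelling)
        tones.append(tone)
    body = "".join(spellings)
    if with_tones:                 # the digits may be truncated
        body += "".join(tones)
    return tag + "-" + body
```

With this table, the four codepoints encode to
"zho-xinzhuqinghua1212", or to "zho-xinzhuqinghua" when the tone
digits are truncated.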

2.5. Converting a StepCode hostname to an internationalized name

The process has three parts, with the script tag left untouched:

P1.If a domain name part has no script tag or has the "usa-" tag, 
	then goto P3;
	Otherwise search for the process named "xxx" that converts 
		StepCode to Unicode or another code, and obtain the 
		corresponding codes. 
	(At this point, only a syllabic system might fail.)
P2.If the corresponding code exists then goto P3;
	Otherwise decompose the post-converted string into a number
	   of individual glyphs 
		specified in the "T" field, or
	   	by syllable recognition; (See [Appendix D.2])
	Search for each glyph;
	If any glyph is not found or is not unique, 
		compose an error message and 
		request the missing glyphs to be supplied 
		   by the sender either in the form
		   of Unicode or 
			other code stream 
			or in a 24x24 bit map stream. 
P3.Display the available glyphs, showing any missing glyph as StepCode;
	If applicable, save the corresponding hostname and display codes. 
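
The syllable recognition mentioned in P2 might look like the sketch
below, which lists every way a spelling string can be split into
known syllables. The syllable set passed in is a tiny placeholder,
not a real inventory, the function name segmentations is
hypothetical, and this is not the procedure of [Appendix D.2]. The
plain recursion is adequate for short host parts but is not
optimized.

```python
def segmentations(spelling, syllables):
    """Return all splits of `spelling` into members of `syllables`."""
    if not spelling:
        return [[]]
    results = []
    for i in range(1, len(spelling) + 1):
        head = spelling[:i]
        if head in syllables:
            for tail in segmentations(spelling[i:], syllables):
                results.append([head] + tail)
    return results
```

An empty result corresponds to "glyph not found", and more than one
result corresponds to "not unique"; in both cases P2 calls for an
error message and a request to the sender.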

3. Security Considerations

Much of the security of the Internet relies on the DNS. Thus, any
change to the characteristics of the DNS can change the security of
much of the Internet. For this reason, StepCode makes no changes 
to the DNS itself.

Hostnames are used by users to connect to Internet servers. The
security of the Internet would be compromised if a user entering a
single internationalized name could be connected to different
servers based on different interpretations of the internationalized
hostname. Thus the restriction of DNS names to a small symbol set is
necessary and effective, where adding any other data format such as 
UTF-8 only opens the security gate to complications.  

4.Internationalization considerations

StepCode is designed so that every internationalized hostname part can
be represented as one and only one DNS-compatible string. If there
are two different ways to obtain the same glyph on a display device,
then they are still two distinct hostnames, with no bearing on 
security issues. If there is any way to follow the steps in this 
document and get two or more different results, it is because of an 
error in the domain name registration process, where one domain name 
register fails to update other domain name register servers about a 
newly registered and well researched hostname. 

5. References

[Appendix A] Example Phonetic symbols to Latin small letter mapping
[Appendix B] Secondary sound values to digits mapping.
[Appendix C] StepCode layout digit specification.
[Appendix D] Example C code implementation on encoding and decoding.
[Appendix E] Example of IDN Language tags.

[ASCII] American National Standards Institute (formerly United
States of America Standards Institute), X3.4, 1968, "USA Code for
Information Interchange". (ANSI X3.4-1968)

[DeFrancis 1989] John DeFrancis, "Visible Speech - The Diverse 
	Oneness of Writing Systems", 1989, ISBN 0-8248-1207-7.

[Dictionary79] Beijing Foreign Language Dept., "A Chinese-English 
	Dictionary", 1979, BK# 9017.810.

[IDNCOMP]   "Comparison of Internationalized Domain Name Proposals",
            draft-ietf-idn-compare-00.txt, June 2000, P. Hoffman.

[IDNReq] Zita Wenzel and James Seng, "Requirements of Internationalized 
	Domain Names", draft-ietf-idn-requirements, May 2001.

[IDN TAG] Draft-Liana-idn-tags, IDN Language tags.

[ISO639][ISO639-2/T] ISO/IEC 639-2 2001 Codes for the Representation of 
	Names of Languages.

[ISO10646]  ISO/IEC 10646-1:2000 (note that an amendment 1 is in
            preparation), ISO/IEC 10646-2 (in preparation), plus
            corrigenda and amendments to these standards.

[Macmillan93] The Macmillan Visual Desk Reference, 1993, 
	ISBN 0-02-531310-x.

[Mnemonics] "Draft-Liana-idn-mnemonics", Language symbols of 
	[ISO10646] to Latin alphabet mappings for unified IDN 
	symbol representation.

[RFC2277]   "IETF Policy on Character Sets and Languages",
            rfc2277.txt, January 1998, H. Alvestrand.

[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
	Requirement Levels", March 1997, RFC 2119.

[STD13] Paul Mockapetris, "Domain names - implementation and
	specification", November 1987, STD 13 (RFC 1035).

[UNICODE] The Unicode Consortium, "The Unicode Standard". Described at
            http://www.unicode.org/unicode/standard/versions/.

[UNICODE30] The Unicode Consortium, "The Unicode Standard -- Version
            3.0", ISBN 0-201-61633-5. Same repertoire as ISO/IEC
            10646-1:2000. Described at http://www.unicode.org/unicode/
            standard/versions/Unicode3.0.html.

[Ye95] Liana Ye, "A Language Oriented Chinese Encoding for 
Multilingual Computing Environments", in "Proceeding of the 1995 
International Conference on Computer Processing of Oriental
Languages", Page 323.

6. Acknowledgements

The author has reused existing IDN draft documents and language as 
much as possible to demonstrate deep respect for the work done by 
members of this working group. Among them, special comments which 
have contributed to improve this document were received from John C 
Klensin, Eric Brunner-Williams and William Davis. Aaron Irvine has 
contributed Esperanto specifications.

7. IANA Considerations

This document requires IANA action to make language tags available, 
and to register each tag and possibly its sub-field for the 
phonetic system used.

8. Authors' Contact Information

Liana Ye
Y&D ISG
2607 Read Ave.
Belmont, CA 94002, USA.
(650) 592-7092
liana.ydisg@juno.com

Aaron Irvine, PhD. 
<aaron.irvine@openwave.com>

Expires January 2002


[Appendix A] Sample Phonetic Symbol to Latin alphabet mapping

The phonetic symbols of Chinese are the Bopomofo, or Zhuyin, symbols 
from U+3105 to U+312c. The sound value mapping is transcribed
from the Zhuyin standard of 1942 and [Dictionary79]. 

Definitions:

-x 	The symbol 'x' occurs at end of a unit.
x / y	Both symbols are applicable.
x U+3105 ' 	A sequence of symbols; where a symbol has no equivalent
	ASCII representation, its Unicode code point is used, with 
	blanks as delimiters.

	Mnemonics	Unicode		IPA description

zho-
	Pinyin		Bopomofo	IPA
	b		U+3105		p
	p		U+3106		p'
	m		U+3107		m
	f		U+3108		f
	d		U+3109		t
	t		U+310a		t'
	n		U+310b		n
	l		U+310c		l
	g		U+310d		k
	k		U+310e		k'
	h		U+310f		x
	j		U+3110		t U+0255
	q		U+3111		t U+0255 '
	x		U+3112		U+0255

	zh		U+3113		t U+0282
	ch		U+3114		t U+0282 '
	sh		U+3115		U+0282
	-i				U+0285

	r		U+3116		U+0290
	z		U+3117		ts
	c		U+3118		ts'
	s		U+3119		s
	-i				U+027f

	y				j
	w				w

	a		U+311a		a
	o		U+311b		o
	e		U+311c		U+0259
	eh		U+311d		U+025b
	ai		U+311e		ai
	ei		U+311f		ei
	ao		U+3120		au
	ou		U+3121		U+0259 u
	an		U+3122		an
	en		U+3123		U+0259 n
	ang		U+3124		a U+014b
	eng		U+3125		U+0259 U+014b
	ong				u U+014b
	er		U+3126		U+0259 r
	i		U+3127		i
	u		U+3128		u
	iu		U+3129		i U+0259 u
	v / u"		U+312a		y
	ng		U+312b		U+014b
	gn		U+312c		gn

	ia				ia
	ie				i U+025b
	iao				iau
	ian				ian
	in				in
	iang				ia U+014b
	ing				i U+014b
	iong				y U+014b
	ua				ua
	uo				u U+0259
	uai				uai
	ui 				uei
	uei				uei
	uan				uan
	un				u U+0259 n
	uen				u U+0259 n
	uang 				ua U+014b
	ve				ys
	van				yan
	vn				yn
	- / ' 	(character spelling separator)
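
As an illustration of how the table above might be consulted, the
following C sketch looks up the Bopomofo code point for the initial
consonant of a Pinyin syllable. The structure and function names are
hypothetical, and only the initials b through s from the table are
included; finals would need a similar table.

```c
#include <assert.h>
#include <string.h>

struct initial_map { const char *pinyin; unsigned bopomofo; };

/* Initials copied from the zho- table above; two-letter entries come
 * first so that "zh" is matched before "z". */
static const struct initial_map initials[] = {
	{"zh", 0x3113}, {"ch", 0x3114}, {"sh", 0x3115},
	{"b", 0x3105}, {"p", 0x3106}, {"m", 0x3107}, {"f", 0x3108},
	{"d", 0x3109}, {"t", 0x310a}, {"n", 0x310b}, {"l", 0x310c},
	{"g", 0x310d}, {"k", 0x310e}, {"h", 0x310f},
	{"j", 0x3110}, {"q", 0x3111}, {"x", 0x3112},
	{"r", 0x3116}, {"z", 0x3117}, {"c", 0x3118}, {"s", 0x3119}
};

/* Return the Bopomofo code point for the initial consonant of a Pinyin
 * syllable, or 0 when the syllable starts with a vowel, y, or w. */
unsigned bopomofo_initial(const char *syllable)
{
	size_t i;

	assert(syllable != NULL);
	for (i = 0; i < sizeof initials / sizeof initials[0]; i++) {
		size_t n = strlen(initials[i].pinyin);
		if (strncmp(syllable, initials[i].pinyin, n) == 0)
			return initials[i].bopomofo;
	}
	return 0;
}
```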


[Appendix B.1] Example on Diacritic mark mapping

French has fewer than eight but more than four diacritic marks, 
so it serves as an example of phonetic mapping [R1].

fre-
0	no tone
1	Silent or Liaison '
2	rise/acute (/)
3	(dip/breve is not used)
4	drop/grave (\)
5	throw/circumflex (^)
6	thrill/tilde (~)
7	dieresis (")
8	(not used for French)
9	Superscript or nasal n


[Appendix B.2] Example on Phoneme Mapping

IPA symbol mapping [R2], where each digit specifies a 
variant of a base phoneme, and four variants are assigned. The 
table allows other variants to be filled in as needed. 
The Unicode code point listed beside a Latin letter is the symbol 
that replaces that letter for the given digit.  

ipa-
0	1		2		3

a 	U+0251 		ae U+00e6	 U+0292
b
c	ch U+02a7
d	
e	U+025b 		.e U+0259	.e: U+025c  		
f
g
h
i				
j	d3 U+02a4
k
l
m
n	ng U+014b
o	U+0252	o: U+0254  		
p
q
r
s	sh U+0283
t	th U+03b8	U+00f0
u 	U+028c   	U+028a  	U+0075  
v
w
x
y
z	zh U+0292

4	unassigned
5	unassigned

6	unassigned
7	unassigned
8	unassigned
9	unassigned

[Appendix B.3] Example on Overflow Consecutive slot Mapping

The Chinese script uses the Overflow and Tone Mark mapping 
architecture [R1-R3], where the table is partitioned to 
select two different glyph sets of the script:

zho-
	0	no tone
	1	flat/macron (-)
	2	rise/acute (/)
	3	dip/breve (v)
	4	drop/grave (\)

	5	classic character drop/grave (\)
	6	classic character dip/breve (v)	
	7	classic character rise/acute (/)
	8	classic character flat/macron (-)	
	9	classic character no tone
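
The partition above can be read as a simple rule: digits 0 to 4 carry 
the tone directly, and digits 5 to 9 select the classic character set 
with the tone given by 9 minus the digit. A minimal C sketch of that 
reading follows; the type and function names are hypothetical.

```c
#include <assert.h>

struct zho_tone { int tone; int classic; };

/* Decode a zho- tone digit '0'..'9' per the table above.
 * tone:    0 = no tone, 1 = flat, 2 = rise, 3 = dip, 4 = drop.
 * classic: 1 when the digit selects the classic character set. */
struct zho_tone decode_zho_digit(char digit)
{
	struct zho_tone t;
	int d = digit - '0';

	assert(d >= 0 && d <= 9);
	t.classic = (d >= 5);
	t.tone = t.classic ? 9 - d : d;	/* 5..9 mirror 4..0 */
	return t;
}
```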

[Appendix B.4] Priority elements mapping for English

DNS name resolvers treat upper case the same as lower case, so
assigning a specific value to upper case letters offers users 
nothing beyond treating them as one of many fonts. The English 
mapping assignment takes [R1-R2-R4], where digit 8 is designated 
for letter-related dingbats.

eng-
0	a-zA-Z
1	flat/macron (-)
2	rise/acute (/)
3	dip/breve (v)
4	drop/grave (\)
5	throw/circumflex (^)
6	thrill/tilde (~)
7	dieresis (")
8	Dingbats 
9	Greek a-zA-Z

	0	8		
	a 	U+2604	/*aero or comet*/
	b	
	c	U+24b8	/*copyright*/
	d	U+25ca	/*diamond*/
	e	U+24d4 	/*electron*/  		
	f	U+2708	/*fly or airplane*/
	g	
	h	U+2624	/*health or Caduceus*/
	i	U+261e  /*index or white right pointing index*/			
	j	
	k	U+2654	/*king*/
	l	U+2661	/*love or white heart suit*/
	m	U+2709	/*mail or envelope*/
	n	U+266b	/*note or Barred eighth note*/
	o  	
	p	U+262e	/*peace symbol*/
	q	U+2655	/*queen*/
	r	U+2602	/*rain or umbrella */
	s	U+263a	/*smile*/
	t	U+231a	/*time or watch*/
	u 	U+2328 	/*utility or keyboard*/  
	v	U+260e	/*voice or phone*/
	w	U+270d	/*writing*/
	x	
	y	U+262f	/* yinyang */
	z	

[Appendix C] The glyph layout digits:

	0 - end of a character or a Pianpang
	1 - to its right
	2 - underneath it
	3 - to contain the following
	4 - to divide the following
	5 - to its left
	6 - on top of it

	The following selectable digits specify an additional 
	glyph of the script and the direction of layout.

	7 - to overlay itself with X then to its right
	8 - to overlay itself with X then to its left
	9 - to overlay itself with X then underneath it
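
A decoder might keep the layout digits in a small lookup table. The C 
sketch below is one possible arrangement; the names layout_describe 
and layout_is_overlay are hypothetical, and the short descriptions 
only paraphrase the list above.

```c
#include <assert.h>

/* Short descriptions of layout digits 0..9, paraphrasing the list above. */
static const char *const layout_text[10] = {
	"end of a character or a Pianpang",
	"to its right",
	"under it",
	"contains the following",
	"divides the following",
	"to its left",
	"on top of it",
	"overlay with X, then to its right",
	"overlay with X, then to its left",
	"overlay with X, then under it"
};

/* Return the description for a layout digit '0'..'9'. */
const char *layout_describe(char digit)
{
	assert(digit >= '0' && digit <= '9');
	return layout_text[digit - '0'];
}

/* Digits 7..9 are the selectable overlay forms. */
int layout_is_overlay(char digit)
{
	return digit >= '7' && digit <= '9';
}
```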

[Appendix D.1] StepCode keyboard input process

/* buff.c  StepCode processor interface   Copyright Y&D ISG, Inc. 1994
 *-----------------------------------------------------------------------*
 *  find_gly  find a glyph online.
 *  find_wd   find a word online.
 */

#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include "steplib.h"

int auto_learn= TRUE;
int udic_large= FALSE;
int udic_database= FALSE;
int odic_expand = FALSE;
int dic_saved = FALSE;
int keyboard_in = TRUE;
int alt_memb = 2;	/* extra members of a poly-code to be recorded */

/* 
 * find_gly  use a StepCode to find the GB code for displaying a glyph.
 */
int find_gly(char *step, char *stepcd, char *infor, char *gb, int *key)
{
	FILE *bufp;
	int linecnt, bytes;
	char line[MAXdatalen], *p;
	char bufname[FILENAMSIZ];
	
	strncpy(stepcd, step, strlen(step)+1);
	if (hit_gly(stepcd, gb)) 
		{ *key=GB; return(A_to_B);}

	strncpy(bufname, BUFFILE, FILENAMSIZ);
	bufp = (FILE *)fopen(bufname, "w+b");
	if( bufp == NULL ) 
	{
		strcpy( message, "Buffer file unavailable.");
		typo(message, word); 
		return(ERROR);
  	}
	search_dic(STEP, 1, stepcd, bufname, &bufp, &linecnt);	
	if (linecnt<=0)
	{
		if(verbose)
		typo("No entry found in GB table. You may create one.", step);
		
		fclose(bufp);
		return(A_to_ZIL);
	}
	fseek( bufp, 0L, SEEK_SET );	/* rewind to the beginning before reading */
	if(fgets(line, MAXdatalen,  bufp)== NULL)
	{	if(verbose)
		fprintf(stderr, "ERROR- buffer file read error.\n");
		fclose(bufp);
		return(ERROR);
	}
	sscanf(line, "%s%d%s%s\n", stepcd, key, gb, infor);
	hash_gly(stepcd, gb);
	fclose(bufp);
	if (linecnt>1)
	{
		return( A_to_N);
	}else {  
		return( A_to_B);
	}
}

int find_wd(char *step, char *stepcd, char *infor, char *gb, int cnt, int *key)
{
	FILE *bufp;
	int linecnt;
	char line[MAXdatalen], *p;
	char bufname[FILENAMSIZ];
	
	strncpy(stepcd, step, strlen(step)+1);
	if ( hit_wd(stepcd, gb))
		{ *key = GB; return(A_to_B);}

	strncpy(bufname, BUFFILE, FILENAMSIZ);
	bufp = (FILE *)fopen(bufname, "w+b");
	if( bufp == NULL ) 
	{
		fprintf( stderr, "Buffer file unavailable.");
		return(ERROR);
  	}
	search_dic(STEP, cnt, stepcd, bufname, &bufp, &linecnt);
	if (linecnt<=0)
	{	if (!auto_learn)
		{  
		   if(verbose)
			typo("Not found.  You may create the word.", step);
		   fclose(bufp);
		   return(A_to_ZIL);
		}else
		{
			neww = learnword(cnt, stepcd, gb);
			/* Do whatever with neww here */
			if(dic_saved) 
				{	
					hash_wd(stepcd, gb);
					dic_saved = FALSE;
				}
			else 
			{
			   typo("The new word has not been saved.", stepcd);
			}   
			fclose(bufp);
			neww = reset_word(neww);
			return(ZIL_to_A);
		}
	}
	fseek( bufp, 0L, SEEK_SET );	/* rewind to the beginning before reading */
	if (fgets(line, MAXdatalen, bufp) == NULL)	/* line itself is never NULL */
	{	
		if (ferror(bufp)!=0 && verbose) 
			fprintf(stderr, "Error during buffer read.\n");
		if (feof(bufp) !=0 && verbose) 
			fprintf(stderr, "Buffer file ended.\n");
		clearerr(bufp);
		fclose(bufp);
		return(A_to_ZIL);
	}
	sscanf(line, "%s%d%s%s\n", stepcd, key, gb, infor);
	hash_wd(stepcd, gb);
	fclose(bufp);
	if (linecnt>1)
	{
		return( A_to_N);
	}else {  
		return (A_to_B);
	}
}



/* --------------------------------------------------------------------
 * Figure out the number of glyphs in a word. The next two routines are 
 * based on PINYIN system.
 */
int one_letter_sound(char *word)
{
	int cnt=0;
	char *w, *v;
	
	w=word;
	while (*w=='m'||*w=='M'||*w=='n'||*w=='N')
			{ ++cnt; ++w;}
	if (cnt>0)
	{
		v = w; --v;
		if((*w=='g'||*w=='G')&& (*v=='n'||*v=='N'))
			++w;	/*ex: mng nnng*/
	}
	if(cnt==0) while (*w=='a'||*w=='A'){ ++cnt; ++w;}
	if(cnt==0) while (*w=='o'||*w=='O'){ ++cnt; ++w;}
	if(cnt==0) while (*w=='e'||*w=='E'){ ++cnt; ++w;}
	if (!isalpha(*w)) 
		return(cnt); /*ex:a aa ooo eee- mmm nmn*/
	else cnt=0;		/*ex: an hhh oong */
	return(cnt);
}

int tell_word(char *word)
{
	char *w, *v;
	int  cnt;
	cnt=0;
	
	if(!isalpha(*word)) return (0);	/* entry does not start with a letter */
	
	for (w=word;isalpha(*w);++w); /*skip Pinyin */
	while (isdigit(*w)) {cnt++; ++w;} /*count the number of tone marks*/

	if (cnt<1)		/*special single letter glyph cases*/
	{
		cnt = one_letter_sound(word);
		if (cnt>=1) return(cnt); /* else do syllable analysis */
	}
	else return(cnt);

	/*
	 * find the number of syllables by vowel rules 
	 * This implementation is accurate even without using apostrophe
	 */
	w=word;		
	while (isalpha(*w)) /*check the Pinyin only*/
	{
		switch (*w)
		{
		case 'a':
		case 'i':
		case 'e':
		case 'o':
		case 'u': v=w; ++w; cnt++; /*one vowel case*/
			switch (*w)
			{
			case 'i':
			case 'e':
			case 'o':
			case 'u': ++w;break; /*two vowels sound*/
			case 'a': ++w;
				if (*v=='u' && *w=='i') break;/*uai*/
				if (*v=='i' && *w=='o') break;/*iao*/							
				else {
					--w;     /*still two vowels*/
					break;
				}
			default: break;
			}
		default:
			/*already out of the compound vowel*/
		break;
	        }		  
		++w;   
	}/*check syllables*/
	return(cnt);
}

/* 
 * --------------------------------------------------------------------
 * Interactive input process procedure
 * --------------------------------------------------------------------
 */
int inputp(char *word, char *gb) 
{
	int  i,  glyphcnt;
	char c, *w;
	int cnt, key, stat;
	char dump[MAXdatalen];
	
	for (;;)
	{
		*word='\0';
		if (fgets(word, MAXlinelen, stdin) == NULL)
			break;		/* end of input */
		if (isspace(*word))
			break; 

		/* Check whether the entry is a glyph string */
		glyphcnt = tell_word(word);
		if (glyphcnt == 0) 
		{
			printf("%s", word);
			continue;
		}
		
		w=word;
		while (isalnum(*w)) ++w;
		*w = '\0';
		if(verbose)
			printf("tell_word figure:  %d glyphs\n", glyphcnt);

		/* Determine whether the entry is known through dictionary
		 * and cache lookup.
		 */
		if(glyphcnt >=2) 
			stat = find_wd(word, stepcd, dump,gb,glyphcnt, &key);
		else stat = find_gly(word, stepcd, dump,gb, &key);

		/* Print out with GB code */
		if (stat != ERROR) font_code(stepcd, gb, &key, stderr);
		if(verbose) printf("%s\n", stepcd);
		fflush(stderr);
	}
	return(0);
}


[Appendix D.2]

/* Disassemble a Chinese stepword into stepglyphs.
 *----------------------------------------------------------------*
 */
int disassemb(int cnt, char *word, char *sts[], int phonsys)
{
	char *w, *hd, *nt, *vh;  /*Stand for head, next, vowel_head*/
	int  i, j, nc, al_flag;
	char *s;

/* initialize*/
	for (i=0;i<(cnt+3);++i)
	   for (j=0, s=sts[i];j<=STEPSIZE;++j, ++s)
		*s='\0';
	hd=w=word;
	i=j=nc=0;

	switch (phonsys)
	{
	case PINYIN: break;
	case ZHUYIN: /* branch to disassemb_zhuyin(); return;*/
	case KANTON: /* branch to disassemb_kanton(); return;*/
	default:
		break;
	}
	 
/* non-consonant or non-vowel single letter glyphs */
	nc=one_letter_sound(word);
	if(nc>0)
	{
		for(i=0;i<nc;i++, w++) sts[i][0]=*w;
		++w;
		if (*w=='g'||*w=='G') /*case of ng*/
		{	sts[i][1]=*w;	return(nc); }

		for (i=0;i<nc;i++)	/* add the tones */
		{
		   if (sts[i][0]=='a'||sts[i][0]=='A') sts[i][1]='1';
		   if (sts[i][0]=='m'||sts[i][0]=='M') sts[i][1]='2';
		}
		/* Cases of O and E are very limited */
		return(nc);
	}

/* delete the ending -r */
	s = word;
	while (isalpha(*s)) s++;
	--s;
	if (*s=='r' && *(s-1)!= 'e')
	{	
		er_flag= TRUE;
		while (isalnum(*s)) { *s=*(s+1); ++s; }	/* shift left over the r */
	}

/* Ending -z and -l are accommodated here */


  /* By Pinyin rules:
   * It only tries to recognize a possible syllable, and pays little
   * attention to correct spelling. A word like 'peo' will pass,
   * but 'leek' will not. This scheme is not a spelling checker, and
   * tolerates foreign vocabulary.
   */
	hd=w=word;
	i=j=nc=al_flag=0;
	while (isalpha(*w))		/*check the Pinyin only*/
	{	while (isalpha(*w)&&!isvowel(*w)) ++w;
		vh=w; ++w; nt=w; nc++;	 /*one vowel case*/
		
			switch (*w)
			{
			case 'i':
			case 'e':
			case 'u': ++w;nt=w;
				break;		/*two vowels case*/

			case 'o': ++w;nt=w;nt++;
				if (*vh=='i'&& *w=='n'&& *nt=='g')
				  { nt++; w=nt;} /* iong case only */
				else nt=w;
				break;
			
			case 'a': ++w;nt=w;	      /* -a? */
				if (!isalpha(*w) ||
				    (isalpha(*w) &&
				     (*w!='o')&&(*w!='i')&&(*w!='n')))
					break;
				++nt;		    /* special cases */
				if((*vh=='u' && *w=='i') ||      /*uai*/
				   (*vh=='i' && *w=='o'))        /*iao*/
				{ if((nc<cnt)&&(!isalpha(*nt)))
					{		  /*two glyphs*/
					   strncpy(sts[i],hd,(++vh)-hd);
					   ++i;nc++; hd=vh;w=nt;
					   break;
					}
				  else { w=nt; break;}	 /* one glyph*/
				}
				if(*nt=='g')nt++;     /*-an+ or -ang+*/
				if (isalpha(*nt)&&(!isvowel(*nt)))
					{w=nt;	break;	}
				if(isvowelna(*nt)){--nt; w=nt;break;}
				if((nc<cnt)&&(!isalpha(*nt)))
				{	   /*uan or iang:two glyphs*/
					strncpy(sts[i],hd,(++vh)-hd);
					++i;nc++; hd=vh;w=nt;
					break;
				}
				if (!isalpha(*nt)) break; /* end of Pinyin*/
					--nt;	       /*-ana or anga*/
					++al_flag;
				break;
			case 'n': nt=w;++nt;
				if(*nt=='g')nt++;
				if (!isalpha(*nt)|| 
				    (isalpha(*nt)&&(!isvowel(*nt))))
				           { w=nt; break;}
				if (isvowelna(*nt)){--nt; w=nt;break;}
				if(*nt=='a')
				{ ++w; if(*vh=='o'&&*w=='a') --nt; /*-o na */
				       if(*vh=='o'&&*w=='g') {};   /*-ong a*/
				       if(*vh=='u'&&*w=='g') --nt; /*-un ga*/
				 else
				 { /* There are two possible ways; the split
				    * below is left commented out:
				    *   strncpy(sts[i],hd,(--nt)-hd);
				    *   ++i;nc++; hd=nt;w=nt+1;
				    * na ga have higher chance */
				    ++al_flag;--nt;
				    
				 }}
				break;
			case 'r': nt=w; ++nt;
				if (isvowel(*nt)){--nt; break;}
				if (isalpha(*nt)||*vh=='e') w=nt;
				break;
			default:nt=w; /*consonants */
			break;
			}/* end of switch*/		

		strncpy(sts[i],hd,nt-hd);  
		++i;hd=nt; w=nt;
	}/* while check syllables*/

	append_suffix(nt, 0, cnt);

/*
 * supply a word ending with the er2p1 glyph. (extended from Pinyin rule)
 */
	if (er_flag)
	{
		strcpy(sts[cnt], "er2p1");
		nc=cnt+1;
	}else nc=cnt;
	if (!al_flag) return(nc);
	
/*
 * supply alternative disassembled stepcodes 
 */
	w=word+1;
	while (al_flag)
	{
		while(isalpha(*w)&& al_flag)
		{ 
		  if (*w=='n')
		  { ++w; nt=w;++nt; 
			  if(*w=='a'||(*w=='g'&&*nt=='a')) /*found */
			  {
			     hd=w-2; --al_flag;
			     while (hd>word && isvowel(*hd)) --hd;
			     if (*hd=='h')
			     {  vh=hd-1;
			        if (isupper(*vh)) *vh=tolower(*vh);
			        if (*vh=='z'||*vh=='c'||*vh=='s') hd=vh;
			     }
			     if (*w=='a')nt=w;  /* else no change*/
			     strncpy(sts[nc], hd, nt-hd);
			     ++nc; w=nt;hd=nt;
		   }      }
		   ++w;
		}
		while (isalpha(*nt)) ++nt;
		strncpy(sts[nc], hd, nt-hd);
		nc++;
	}
	append_suffix(nt, nc-cnt, nc);
	return(nc);
}


[Appendix E] Sample Language tags of [UNICODE] Blocks

Tag	Start	End	Start	End	Start	End
Cyr-	U+0401	U+04cc					(not in [ISO639])
cjk-	U+3105	U+312c	U+3400	U+4dbf	U+4e00	U+9fff	(Unified CJK)
kro-	U+3400	U+3d2d
lat-	U+0030	U+03f5					(includes Greek)
usa-	U+0030	U+0039	U+0061	U+007a	U+002d	U+002d	(not in [ISO639])
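
The Start/End pairs above lend themselves to a range test. The C 
sketch below checks whether a code point falls inside the blocks 
listed for the "usa-" and "cjk-" tags; the structure and function 
names are hypothetical, and the ranges are copied from the table.

```c
#include <assert.h>

struct cp_range { unsigned start, end; };

/* "usa-": digits, lower-case letters, and the hyphen. */
static const struct cp_range usa_ranges[] = {
	{0x0030, 0x0039}, {0x0061, 0x007a}, {0x002d, 0x002d}
};

/* "cjk-": the three blocks listed in the table above. */
static const struct cp_range cjk_ranges[] = {
	{0x3105, 0x312c}, {0x3400, 0x4dbf}, {0x4e00, 0x9fff}
};

/* Return 1 when cp lies inside any of the n ranges. */
static int in_ranges(unsigned cp, const struct cp_range *r, unsigned n)
{
	unsigned i;

	assert(r != 0);
	for (i = 0; i < n; i++)
		if (cp >= r[i].start && cp <= r[i].end)
			return 1;
	return 0;
}

int in_usa_tag(unsigned cp)
{
	return in_ranges(cp, usa_ranges,
			 sizeof usa_ranges / sizeof usa_ranges[0]);
}

int in_cjk_tag(unsigned cp)
{
	return in_ranges(cp, cjk_ranges,
			 sizeof cjk_ranges / sizeof cjk_ranges[0]);
}
```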
