[idn] draft-liana-idn-step-1
A New Internet-Draft is available from the on-line Internet-Drafts
directories.
This draft is a work item of the Internationalized Domain Name Working
Group of the IETF.
Title : StepCode- A Romanized Mnemonic IDN Encoding
Author(s) : Liana Ye
Filename : draft-liana-idn-step-1.txt
Pages : 25
Date : 22-July-2001
This document describes Romanization of localized internet
domain names of different languages to US-ASCII [a-z0-9] strings
in a fashion that is completely compatible with the current DNS.
Two related documents, IDN tags and Mnemonic mapping, will be submitted
shortly.
Internet Draft Liana Ye
draft-Liana-idn-step-01.txt Y&D ISG
July 20, 2001
Expires in six months (December 2001)
StepCode- A Romanized Mnemonic IDN Encoding
Status of this memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed
at http://www.ietf.org/shadow.html.
Abstract
This document describes Romanization of localized internet
domain names of different languages to US-ASCII [a-z0-9] strings
in a fashion that is completely compatible with the current DNS.
1. Introduction
1.1 Context
World-wide desire to use characters other than plain ASCII in
hostnames is bubbling up and accelerating. Hostnames have become
the equivalent of business or product names for many services
on the Internet, and are referred to here as tradenames. Several
needs have to be addressed: the need to make hostnames usable by
people whose native scripts are not directly representable in
ASCII; the need for network support workers to diagnose URLs; the
need for an expanded and diverse name server network to sort and
manage zonefiles; the need of an increasing number of non-native
readers, who do not use their native scripts to refer to
tradenames in daily activities; and the need to minimize possible
security leaks when international domain names are implemented in
Domain Name Servers (DNS). The requirements for internationalizing
hostnames are described in the IDN WG's requirements document,
[IDNReq].
To facilitate one DNS symbol set for users of different languages
under the above technical and security considerations, a
Romanization process from different languages to US-ASCII is
unavoidable.
Language Romanization has been a fact around the globe
since Russia standardized Cyrillic for many eastern European
languages in the 1920's, Turkey changed from Arabic to Latin
script in 1928, and China adopted Pinyin as a supplemental
phonetic system for Han script in 1958. In the past three
decades, software implementation of such a process has extended
from a user to his qwerty keyboard, from a keyboard to text
editors of various kinds, from text editors to mail services,
and from mail services to internet address resolvers. To unify a
fragmented Romanization implementation reality for use as
IDN hostname identifiers, written documentation is overdue to
address issues as basic as those stated by [DeFrancis 1989]:
"The adaptation of Latin alphabet to represent a great variety
of spoken languages means of course that the value of specific
symbols varies from language to language. This is true both of
the European adaptations, which in most cases came about rather
haphazardly, and of the more recent creations based on more
carefully thought-out linguistic principles. So it is that the
French 'u' has a different value from that in English. The letter
'j' represents one sound in English 'jam', another in German 'ja'.
The initial sound of English 'sure' is written 'sz', in Polish,
Czech. The sound represented by English 'ts' is written in 'c'
in Polish, Czech, Hungarian, Serbo-Croatian, and Chinese."
One step further from the above linguistic issues is sorting and
searching zonefiles or name servers of hostname identifiers
containing different written languages for potentially very
large numbers of users online, say 10% of the world's population.
Hostname identification could become a bottleneck for internet
traffic if sorting and searching has to be done 1. in more
than one set of partially overlapping, mixed, or possibly mixed
symbolic representations; and 2. in compressed or semantically
randomly ordered zonefiles scattered around the globe.
Historically, a Character-formed script such as CJK characters has
inherent sorting and indexing difficulties, and it used to be
an intellectual activity just to use a dictionary. Suppose we have
solved such an indexing problem with substantial resources and
IDN goes to a Character-form based system; it is then foreseeable
that the IDN system will have to support a text based DNS system as
well for a long time. After all, the DNS system is a historically
successful system. To throw such a system away is like asking
people to stop shopping at supermarkets.
The Romanized Pinyin system for CJK character indexing has
provided a feasible but partial solution. The currently used
complete solution is to go through a software process of both
searching databases for possible matches (not exact-match DNS
lookups) and, where necessary, dialogue with the users, and arrive
at strong candidates for the glyph representation, especially
where the users were not easily able to enter more direct
representations of the characters from keyboards. If this
selection process can be codified in the Latin alphabet, then a
complete Romanized syllabic system will be a reality, and sorting
and searching international domain names with one set of symbolic
representations will be speedy and feasible.
The representation system for hostnames is due to be unified. In
fact, writing system unification has been seen with Arabic, Latin
and Chinese. Each of them is used by many different spoken
language groups. According to [DeFrancis 1989], human scripts
can be organized into three groups by their phonetic
characteristics:
1. Syllabic systems, for example, Chinese, Japanese, Maya and Yi;
2. Consonantal systems, e.g. Hebrew, Arabic and Indian languages;
and 3. Alphabetic systems, including Greek, Latin, Cyrillic,
Korean and English. Alphabetic systems can be unified by
embedding some differences under the hat of mnemonic
representation of language symbols, so that the French 'u' is
permitted to have a different sound value from the English 'u'.
Mapping a consonantal system to an alphabet symbol set is essentially
embedding some phonetic differences, using a Latin mnemonic hat.
Additionally, there is the question of how to represent the vowels
of the language. Turkey has provided an answer to this question.
As to unifying a syllabic system with an alphabet system, two issues
need to be addressed. The first is reversibility from the
alphabetic system back to the syllabic system, and the second is
expressibility with the alphabet system of additional information
included in the syllabic system.
Unification of symbol systems always brings about some loss from
the original systems, especially in this fast growing internet
era, when the native language of a household can be lost in only one
generation in a localized bilingual environment. In order to
retain the colorful heritage of the world, a means to provide easy
reference to the original system should be implemented.
The proposed solution is called StepCode, for its prioritized
steps in such a Romanization procedure. First, specify the
phonetic differences to be embedded in the representation,
where an International Phonetic Alphabet (IPA) description of
the embedded differences shall be recorded. Second, if the
Romanized embedding is not sufficient to cover the differences,
then extend the mapping space to a 26x10 table for secondary
phonetic elements which can not be embedded under the Latin
mnemonic hat. Third, if the 26x10 space is not sufficient, then
linearize the symbol by specifying each of its components. This
last part may become recursive. This open ended solution not only
provides a path to unify a large syllabic system using an alphabet
system, but also ensures that more semantically specific symbols,
such as trademarks and logos, can be represented online and sorted
for speedy referencing. Due to its stepwise nature, the
representation can (and should) stop for each symbol as soon as
the symbol can be identified within its designated context. For
example,
"xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua2shi0.com" is a
unique expression resulting from two complete iterations of applying
StepCode to four codepoints of [ISO10646], while one complete
step would result in "xinzhuqinghua1212", which is most likely
sufficient for identifying a short tradename. For a longer
tradename the digits may be truncated, so that the method resembles
transliteration of a hostname and a CJK string appears as
a normal readable Romanized expression, such as "xinzhuqinghua" in
the same example. For applying StepCode to hostnames,
except for terminology definitions, this document will limit the
discussion to the first two of those three parts.
The IDN WG's comparison document [IDNComp] describes three potential
main architectures for IDN: arch-1 (just send binary), arch-2 (send
binary or ASCII Compatible Encoding, ACE), and arch-3 (just send ACE).
StepCode is an ACE that can be used with protocols that match arch-2
or arch-3.
The StepCode protocol has the following features:
- There is exactly one way to convert internationalized host parts
to and from language tagged ACE encoded strings. It permits
different script tags to access the same glyph in [ISO10646], similar
to the method used for searching books in a library, such that the CJK
character set may be accessed by users of different languages with
different hostnames, where each of the hostnames is always a unique
expression on the internet. If an input string can not match such
a hostname, then it is considered a user input error.
- [nameprep] is applicable to UNICODE and other corresponding
local coding standards.
- [IDN TAG] includes each language tag and its corresponding
code blocks of UNICODE and other local coding standards.
- [Mnemonics] includes language tags, local scripts to Latin
alphabet symbol mapping, and an IPA phonetic value description of
each phonetic symbol of a language script. It shall be a Rosetta
Stone of the internet.
- Host parts have no international glyphs, only US-ASCII. The
StepCode procedure SHOULD be applied after [nameprep], which has
prepared the hostname parts in applicable code standards.
- For applicable tags, local display codes of different
code standards with corresponding registered hostnames SHOULD
be retained for inquiries from other IDN hosts, and a request for
the "reference to be sent" protocol SHOULD be drafted.
- Names using StepCode have lengths proportional to the number
of glyphs in the names themselves plus the language tag.
However, StepCode for all the non-Latin phonetic glyphs SHOULD
be confined within two octets, since all the current phonetic
based scripts can be represented within two octets and their
mnemonic representation SHOULD be preserved. For a relatively long
CJK, Yi and Hangul glyph sequence, say above ten glyphs, the average
length per glyph is about 3.7 Latin letters.
- This specification allows standard compression or security
treatment compatible with existing hostnames.
It is important to note that the following sections contain many
normative statements with "MUST" and "MUST NOT". Any implementation
that does not follow these statements exactly is likely to cause
damage to the Internet by creating non-unique representations of
hostnames.
1.2 Author's Disclaimer
This document is for collecting an international co-authorship
of the IDN WG, to propose a script-specific Romanization encoding
standard for an international tradename solution on the internet.
Since the majority of UNICODE symbols already have Romanized names
specified in the UNICODE standard, the additional work needed
is to select each symbol, excluding font or case variations, to be
romanized onto the Latin alphabet for a DNS encoding standard. The most
technically difficult part of this proposal is to convert a
romanized CJK and Hangul string back to the codepoints of the
display code standard supported by its local host, where such
procedures exist in many public domains. A sample procedure in C
language for Chinese is provided in Appendix D.
1.3 Terminology
The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
and "MAY" in this document are to be interpreted as described in
[RFC2119].
Hexadecimal values are shown preceded with an "0x". For example,
"0xa1b5" indicates two octets, 0xa1 followed by 0xb5. Binary values
are shown preceded with an "0b". For example, a nine-bit value might
be shown as "0b101101111".
Examples in this document use the notation from the Unicode Standard
[UNICODE30] as well as the ISO 10646 names. For example, the letter
"a" may be represented as either "U+0061" or "LATIN SMALL LETTER A".
StepCode converts strings at a client site with internationalized
characters into strings of US-ASCII that are acceptable as host
name parts in current DNS host naming usage. The former are called
"pre-converted" and the latter are called "post-converted". A
symbol represented by one codepoint in [ISO10646] is called a
"glyph", and a string of such symbols is called "glyphs".
The "pre-converted" strings at a client site may be represented
by Unicode, GB code, JIS code, BIG5 and others which may contain
font information. These code forms are referred to as language
specific "localized codepoints".
The protocol contains one procedure and calls for a minimum
number of symbols of a language to be mapped onto a Latin
alphabet in a mnemonic manner. For languages with a large number
of glyphs, which are impossible to map onto a Latin alphabet
directly, a three layered scheme is RECOMMENDED, and a minimum
set of glyphs of a script which are often used as parts of other
glyphs is identified. The glyphs in the smaller set are sometimes
called radicals, or particles of a CJK character, but
neither term reflects the nature of this set of glyphs, which are
frequently used by themselves and are also parts of other glyphs.
The minimum set of glyphs is called "Pianpang", a Han word meaning
"a character standing on the side", a common word in Chinese. A
set of associated definitions in this area is given here:
"pang" - a character on the left;
"bian" - a character on the right;
"tou" - a character on the top;
"di" - a character on the bottom;
"xin" - a character in the middle;
"kuang"- a container or a frame character.
Since CJK characters are written from left to right and from top
down, often the "pang" is the first part of a character to be used
as the key for searching in dictionaries and is partially ordered
in UNICODE, so "pang" is also referred to as a radical.
The three layers of glyphs of a large language script are
Layer one: phonetic glyphs, which can be directly mapped onto
an alphabetic system under the Latin mnemonic hat;
Layer two: a minimum number of frequently used glyphs
which are also used as Pianpangs in other glyphs;
Layer three: the rest of the glyphs in the language script.
The protocol uses US-ASCII to denote the phonetic elements of
a script and calls for standardizing such a mapping for each
script tag. The phonetic elements of a glyph are called the
"spelling" of the glyph, and those of a "Pianpang" are called its
"stem".
The protocol specifies ASCII Compatible Encoding [ACE] maps for
major languages and provides a means of embodying such an
implementation with Chinese script. This is referred to here as a
"language tagged ACE" process, or "T-ACE".
1.4 IDN summary
Using the terminology in [IDNComp], StepCode specifies an ACE format
for arch-2 (send binary or ACE), and arch-3 (just send ACE).
The characteristic of StepCode length discussed above (1.1 Context)
is a variable depending on users' choice among many factors. It
fits well with existing compression and security treatments.
It calls for standardizing phonetic elements within its user
language groups specified in the [ISO 639], while asking the
internet industry to enforce the standard and providing cross
reference to different script tags into Unicode standard.
2. Host Part Transformation
According to [STD13], host parts must be case-insensitive, start
and end with a letter or digit, and contain only letters, digits,
and the hyphen character ("-"). This excludes any
internationalized characters, any font variations, Chinese
Traditional/Simplified character set variations, as well as many
other characters in the ASCII character repertoire. Further,
domain name parts must be 63 octets or shorter in length
including the language tag.
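The constraints above can be checked mechanically. The following is a
minimal sketch in Python, not part of the draft; the names
is_valid_host_part and MAX_PART_OCTETS are hypothetical and chosen
only for illustration.

```python
import re

# Letters, digits and hyphen only; the first and last character must be
# a letter or a digit (the host part rules summarized above).
_HOST_PART = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")
MAX_PART_OCTETS = 63  # the limit includes the language tag


def is_valid_host_part(part: str) -> bool:
    """Return True if `part` is acceptable as a host part under the rules above."""
    try:
        octets = part.encode("ascii")
    except UnicodeEncodeError:
        return False  # internationalized characters are excluded
    return len(octets) <= MAX_PART_OCTETS and _HOST_PART.fullmatch(part) is not None


def same_host_part(a: str, b: str) -> bool:
    """Host parts are case-insensitive, so compare them after lowercasing."""
    return a.lower() == b.lower()
```

A post-converted StepCode string such as "xinzhuqinghua1212" passes
this check, while a pre-converted string containing any non-ASCII
character does not.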
2.1 Name tagging
All post-converted name parts that contain internationalized
characters begin with a language tag defined either in [ISO 639-2/T]
or listed in Appendix E of this document, in the form of "xxx-",
where "xxx" denotes the language or script encoded. The tag SHOULD
use an [ISO 10646] defined script for the phonetic standard
implemented. The language tags listed herein are writing systems,
as opposed to the spoken languages specified in [ISO 639], though
they are based on spoken languages. For example, "usa-" for
US-ASCII is not considered a spoken language and so it is
not included in [ISO 639].
Since the [ISO639] definitions are based on spoken languages, while
script based definitions are given in [ISO 10646], a StepCode
implementation is applied to the scripts defined in [ISO 10646] with
labels defined in [ISO639].
The phonetic symbols implemented in the encoding MUST have
been included in [ISO 10646].
A language tag MUST be registered with IANA with codepoint blocks
of UNICODE associated with the tag, for [nameprep] to recognize,
to apply ACE process and to attach the tag to the post-converted
hostname and for a receiving host to reverse its hostname back to
either UNICODE or its local codepoints.
A zone administrator MAY still choose to use "usa-" at the
beginning of a hostname part even if that part does not contain
internationalized characters. Zone administrators MAY create
host part names that begin with "usa-" which means no conversion
is done and display systems SHOULD ignore converting
internationalized characters back for display.
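A minimal Python sketch of how a receiver might separate the "xxx-"
tag from the rest of a name part is given here. It is an illustration
only: the function names and the contents of REGISTERED_TAGS are
assumptions (the real list would come from the IANA registration
described above), and "zho" is used simply because it is the
ISO 639-2/T code for Chinese.

```python
# HYPOTHETICAL registry contents, for illustration only.
REGISTERED_TAGS = {"usa", "zho"}


def split_tag(part, registered=REGISTERED_TAGS):
    """Split 'xxx-rest' into (tag, rest); return (None, part) if no registered tag leads."""
    tag, sep, rest = part.partition("-")
    tag = tag.lower()
    if sep and rest and tag in registered:
        return tag, rest
    return None, part


def should_convert_for_display(part, registered=REGISTERED_TAGS):
    """A 'usa-' part, or a part without a tag, is displayed without conversion."""
    tag, _ = split_tag(part, registered)
    return tag is not None and tag != "usa"
```

Checking the tag against a registry, rather than accepting any three
letters before a hyphen, keeps an existing hostname such as
"www-example" from being mistaken for a tagged name.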
2.2 Converting an internationalized name to a T-ACE name part
To convert a string of internationalized characters into a
T-ACE name part, the following steps MUST be performed in the
exact order of the subsections given here.
2.2.1. Tag checking
If a name part consists exclusively of characters that conform to
the hostname requirements in [STD13] or the string "usa-",
the name MUST NOT be converted to T-ACE. That is, a name part
that can be represented without T-ACE MUST NOT be changed.
This absolute requirement prevents:
1. double encoding from a client of user keyboard input
and a server provider;
2. messing up existing registered domain names;
3. there being two different encodings for a single DNS
registered hostname;
4. interfering with registered glyphs with more than one
phonetic standard, such as Chinese script.
If any checking of name parts (such as for prohibited characters,
case-folding, or canonicalization) is to be done, it MUST be done
before doing the conversion to a T-ACE name part, as specified
in [nameprep].
Characters outside the first plane of characters (those with
codepoints above U+FFFF) MUST be represented using surrogates,
as described in the UTF-16 description in [ISO 10646].
The input name string consists of characters from the ISO 10646
character set in big-endian UTF-16 encoding. This is the
pre-converted string.
2.2.2. Check the input string for disallowed names
If the input string consists only of characters that conform to
the hostname requirements in [STD13], or the input string consists
of a null language tag, the conversion MUST stop with an error.
2.2.3. T-ACE encoding
Find the corresponding tag, T, with [IDN TAG] for an input string.
If all the codepoints are in the first tag X, then T = X and it
is a valid IDN;
otherwise, T = dud.
Branch to T, encode the input string with procedure T, conforming
to [STD13], obtain ACE string, A.
Pre-pend the tag, T-, to ACE string, A, to obtain a T-ACE hostname.
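The tag selection and prepending described in 2.2.3 can be sketched in
Python as follows. This is a sketch under stated assumptions, not the
draft's procedure: TAG_BLOCKS stands in for the [IDN TAG] registry
with two hypothetical entries, "the first tag X" is read as the tag of
the first codepoint, the per-tag encoders are supplied by the caller,
and raising an error for T = dud is a choice made here.

```python
# HYPOTHETICAL stand-in for the [IDN TAG] registry: tag -> codepoint blocks.
TAG_BLOCKS = {
    "ell": [(0x0370, 0x03FF)],  # Greek and Coptic block
    "zho": [(0x4E00, 0x9FFF)],  # CJK Unified Ideographs block
}


def _in_blocks(cp, blocks):
    return any(lo <= cp <= hi for lo, hi in blocks)


def find_tag(text):
    """Return tag X if every codepoint lies in the blocks of the first one's tag, else 'dud'."""
    if not text:
        return "dud"
    first = ord(text[0])
    for tag, blocks in TAG_BLOCKS.items():
        if _in_blocks(first, blocks):
            if all(_in_blocks(ord(ch), blocks) for ch in text):
                return tag
            break
    return "dud"


def to_t_ace(text, encoders):
    """Encode `text` with the procedure registered for its tag and prepend 'T-'."""
    tag = find_tag(text)
    if tag == "dud" or tag not in encoders:
        raise ValueError("no single registered tag covers the input string")
    return f"{tag}-{encoders[tag](text)}"
```

For instance, with a toy encoder that maps U+4E2D to "zhong",
to_t_ace("\u4e2d", {"zho": lambda s: "zhong"}) yields "zho-zhong".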
2.3. StepCode Method
StepCode starts with a phonetic representation of a glyph in the
Latin alphabet. When this is not sufficient to identify the glyph,
it supplements the representation with a digit. Because
alphabet based scripts connect several syllables into one
semantic unit, or word, such a representation normally identifies
a word uniquely within the language. In a character-form based
script such as CJK, characterized by one syllable per glyph, a
syllable alone often can not uniquely identify a character, but a
sequence of syllables will often identify a string of characters
uniquely within the language, much as in alphabetic languages.
StepCode observes this phenomenon and represents a
phrase of a syllabic language as one semantic unit containing
more than one syllable, and encourages such a representation of
a character string. For example, the syllabic string
"xin zhu qing hua" of four characters is written in the preferred
form "xinzhuqinghua".
When the Latin alphabet is not sufficient to represent the sound of a
glyph, the representation is supplemented with a digit denoting
a secondary phonetic characteristic of the glyph or the phrase.
Together, the described process forms the first step of StepCode
encoding, and is the most visible part of the method as well.
StepCode steps:
S1.1. Romanize the primary phonetic characteristic of a
glyph/phrase;
S1.2. Supplement the secondary phonetic characteristic of the
glyph with a digit/digits.
The second step of StepCode is applied to components of each
glyph, Pianpang, in the same way specified in S1.1.
S2.1. Romanize the primary phonetic characteristic of a Pianpang, B;
S2.2. Specify how the next pianpang is related to the current
pianpang, B, with a digit;
S2.3. If the pianpang contains another pianpang, X of B,
then goto S2.1 of X (and it is S2+1.1(X));
otherwise goto the next pianpang, B+1.
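As an illustration of steps S1 and S2, the Python sketch below
linearizes a phrase from structured data. The data layout and the
names EXAMPLE and encode_phrase are assumptions of this sketch; the
spellings, tone digits, stems and layout digits are read off the
"xinzhuqinghua" example used elsewhere in this document. The recursion
of S2.3 (a pianpang containing another pianpang) is not covered.

```python
# Each glyph: spelling (S1.1), tone digit (S1.2), and a list of
# (stem, layout digit) pairs for its pianpangs (S2.1, S2.2).
EXAMPLE = [
    {"spelling": "xin",  "tone": 1, "parts": [("qin", 1), ("jin", 0)]},
    {"spelling": "zhu",  "tone": 2, "parts": [("ge", 1), ("ge", 0)]},
    {"spelling": "qing", "tone": 1, "parts": [("shui", 1), ("qing", 0)]},
    {"spelling": "hua",  "tone": 2, "parts": [("hua", 2), ("shi", 0)]},
]


def encode_phrase(glyphs, steps=2):
    """Linearize a phrase; steps=1 stops after S1.2, steps=2 appends the pianpangs."""
    if steps not in (1, 2):
        raise ValueError("this sketch only covers step 1 and step 2")
    code = "".join(g["spelling"] for g in glyphs)    # S1.1 for the whole phrase
    code += "".join(str(g["tone"]) for g in glyphs)  # S1.2, one digit per glyph
    if steps == 2:
        for g in glyphs:                             # S2.1 and S2.2 per glyph
            code += "".join(f"{stem}{layout}" for stem, layout in g["parts"])
    return code
```

With this data, encode_phrase(EXAMPLE, steps=1) gives
"xinzhuqinghua1212" and encode_phrase(EXAMPLE) gives
"xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua2shi0".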
2.3.1 StepCode phonetic symbol tables
A glyph of alphabetic language has a sound value associated with
it. Under this proposal, a set of sounds with a similar value
from different languages SHOULD be associated with a glyph in
US-ASCII, as shown in Appendix A.
A glyph of consonantal systems and a phonetic glyph of syllabic
systems SHOULD be determined for a best fit onto an existing
set of sound values of US-ASCII. The [UNICODE] standard
has specified a romanized name for each glyph in the
standard. The mapping MAY be based on such a romanized name.
2.3.2. StepCode Conceptual Definition for Digits
With a limit of 26 Latin letters, many languages possess a
set of sound elements which can not be included. The
excluded sound elements are the secondary phonetic
elements, and SHOULD be assigned to the additional symbols 0-9.
2.3.2.1 Secondary Sound Values in Step one encoding:
Although 26x10 is a two dimensional map, it can be filled
with more than two phonetic aspects of a script. With
increased complexity, the mnemonic value diminishes gradually.
For simplicity, four phonetic mapping rules SHOULD be
observed: R1. Diacritic mark mapping; R2. Phoneme mapping;
R3. Overflow consecutive slot mapping; R4. Priority
elements mapping.
[R1] Diacritic mark mapping. For some language scripts,
secondary phonetic elements have to be marked for their
users. For example, for European scripts a simple tone mark
mapping SHOULD be used, where the digits only denote common
diacritic marks [Macmillan93], as follows.
0 letters with no tone
1 flat/macron (-)
2 rise/acute (/)
3 dip/breve (v)
4 drop/grave (\)
5 throw/circumflex (^)
6 thrill/tilde (~)
7 dieresis (")
8 cedilla (hook)
9 user assigned
Similar marks SHOULD keep their respective positions, for easy
reference across script boundaries and for users looking for
replacement marks.
A French diacritic mark assignment is in [Appendix B.1].
[R2] Phoneme table mapping, where each digit specifies a
variant of a base phoneme, and a maximum of nine variants may
be accommodated. This rule has the best mnemonic result across
different scripts. For example, IPA symbol mapping [Appendix B.2].
[R3] Overflow symbol mapping, where the symbols SHOULD fill
in only consecutive slots in opposite directions
in the table for ease of index computation, and where the middle
section of the table SHOULD be left for user selected
definitions. This rule is suited to two sets of corresponding
symbols of the same script, for example Chinese in [Appendix B.3].
[R4] Priority elements mapping, selecting a set of often used
symbols to be placed in the table. [Appendix B.4]
The above assignment rules may be used in a combination according
to an order of weights in such an assignment. Such an order
of weights SHOULD be specified in the form [Rx-Ry-Rz-R4].
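Rule R1 can be illustrated with a short Python sketch that pairs each
base letter with the digit of its diacritic mark from the table in
[R1]. The function name r1_pairs is hypothetical, the choice of
Unicode combining characters for each mark is an assumption of this
sketch, and the result is left as (letter, digit) pairs because the
final string layout belongs to the per-language assignment in
[Appendix B.1].

```python
import unicodedata

# Digits from the R1 table, keyed by the Unicode combining mark assumed
# here to correspond to each diacritic.
DIACRITIC_DIGIT = {
    "\u0304": 1,  # macron
    "\u0301": 2,  # acute
    "\u0306": 3,  # breve
    "\u0300": 4,  # grave
    "\u0302": 5,  # circumflex
    "\u0303": 6,  # tilde
    "\u0308": 7,  # dieresis
    "\u0327": 8,  # cedilla
}


def r1_pairs(word):
    """Return (base letter, digit) pairs; digit 0 means no mark on the letter."""
    pairs = []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch):
            # A combining mark applies to the letter just before it.
            if not pairs or pairs[-1][1] != 0 or ch not in DIACRITIC_DIGIT:
                raise ValueError(f"mark U+{ord(ch):04X} is not covered by this sketch")
            pairs[-1] = (pairs[-1][0], DIACRITIC_DIGIT[ch])
        else:
            pairs.append((ch, 0))
    return pairs
```

For example, a letter "c" carrying a cedilla becomes the pair
("c", 8), and an unmarked letter keeps the digit 0.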
2.3.2.2. Digits in Step 2 encoding:
A unified CJK character is often a composition of several independent
symbols of the language. It is possible to describe a CJK character
by representing a character with only its parts/Pianpangs.
Although it can identify a character uniquely, normally it is
accompanied with a number of rules with too many exceptions
for the majority of users to comprehend. StepCode encoding has
reduced the complexity of the rules by considering a CJK
character as a simple grid of 1 to 10 units, depending on the
user's viewpoint. Naming the 1 to 10 units in a linear fashion
results in a linear representation of the glyph, or its encoding.
This is used as a secondary encoding most of the time, while
sometimes it has to be the primary representation, when the
correct sound of a character is not available to the users.
The digits in Step 2 and thereafter, specifying how a pianpang
of a glyph on its grid is related to the next pianpang, are
called layout digits.
Layout digits specify the relation to the next pianpang in line.
The left and right direction are defined by a user's left or
right hand while sitting in front of a display screen or a
piece of paper.
The glyph layout digits are:
0 - end of a character or a Pianpang
1 - to its right
2 - to its underside
3 - to contain the following
4 - to divide the following
5 - to its left
6 - to its top
The following selectable digits are to specify additional
glyphs of the script and directions of layout.
7 - to overlay itself with X then to its right;
8 - to overlay itself with X then to its left;
9 - to overlay itself with X then to its underside.
The pianpang layout scheme trades complexity of a glyph with
code length, such that the complexity can be eliminated when
truncating the code is permitted.
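The layout digits can be read back from a pianpang string with a few
lines of Python. The sketch below is illustrative only; the names
LAYOUT_DIGITS and read_layout are hypothetical, and the descriptions
are copied from the list above.

```python
import re

LAYOUT_DIGITS = {
    0: "end of a character or a Pianpang",
    1: "to its right",
    2: "to its underside",
    3: "to contain the following",
    4: "to divide the following",
    5: "to its left",
    6: "to its top",
    7: "to overlay itself with X then to its right",
    8: "to overlay itself with X then to its left",
    9: "to overlay itself with X then to its underside",
}


def read_layout(code):
    """Turn a string such as 'hua2shi0' into (stem, relation to the next pianpang) pairs."""
    if not re.fullmatch(r"(?:[a-z]+[0-9])+", code):
        raise ValueError("expected one or more stems, each followed by a layout digit")
    return [(stem, LAYOUT_DIGITS[int(d)])
            for stem, d in re.findall(r"([a-z]+)([0-9])", code)]
```

Applied to "hua2shi0", this gives the stem "hua" with the next
pianpang to its underside, followed by "shi" closing the character.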
2.3.3. StepCode Format
Format Definition: A StepCode unit is a string of [A-Za-z0-9]
characters without any white space, BLANK, in between. For each
StepCode unit, data elements shown in "" MUST be supplied,
elements shown in [] are optional, and / separates selectable
alternatives.
Sx stands for primary sound value or Spelling of xth glyph;
Tx stands for secondary sound value or tone of xth glyph;
Py stands for Stem for yth Pianpang;
Ly stands for Layout relation from y to y+1;
Px.y stands for Stem for xth glyph and its yth Pianpang;
Lx.y stands for Layout relation from xth glyph and its y to y+1.
2.3.3.1. One glyph
"S""T"[P1][L1][P2][L2]...[Py][0/BLANK]
Example:xin1
xin1qin1jin0
2.3.3.2. Glyphs
"S1S2S3...Sx"[T1T2...Tx][P1.1][L1.1][P1.2][L1.2]...[P1.y][0]
[P2.1][L2.1][P2.2][L2.2]...[P2.y][0]
...
[Px.1][Lx.1][Px.2][Lx.2]...[Px.y][0/BLANK]
Example of glyphs of four:
xinzhuqinghua
xinzhuqinghua1212
xinzhuqinghua1212qin
xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua
xinzhuqinghua1212qin1jin0ge1ge0shui1qing0hua2shi0
Which of these five equivalent StepCodes is used depends on where
it is stored, the size and type of the database, and whether
there exist similar hostnames it conflicts with.
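The format in 2.3.3 can be taken apart with a regular expression. The
Python sketch below is one possible reading of the format, not a
normative parser: parse_unit is a hypothetical name, the unit is
lowercased first, and a trailing stem without a layout digit (as in
the truncated examples above) is reported with None.

```python
import re

# spelling, optional tone digits, stem+layout pairs, optional bare trailing stem
_UNIT = re.compile(r"([a-z]+)([0-9]*)((?:[a-z]+[0-9])*)([a-z]*)")


def parse_unit(unit):
    """Split a StepCode unit into (spelling, tones, [(stem, layout digit or None), ...])."""
    m = _UNIT.fullmatch(unit.lower())
    if m is None:
        raise ValueError("not a StepCode unit")
    spelling, tones, body, tail = m.groups()
    parts = [(stem, int(d)) for stem, d in re.findall(r"([a-z]+)([0-9])", body)]
    if tail:
        parts.append((tail, None))  # truncated: stem given, layout digit omitted
    return spelling, tones, parts
```

All five example strings above parse to the same spelling,
"xinzhuqinghua", which is what allows a shorter form to stand in for a
longer one when no conflicting hostname exists.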
2.4. StepCode Encoding Process
Go through [nameprep], checking for prohibited characters,
case-folding, or canonicalization.
Either StepCode is obtained from Unicode and/or other local
code to StepCode glyph/phrase conversion tables,
or it is input directly from keyboards, in which case an input
processing module to verify the correctness of the intended glyphs is
necessary. (See C code in [Appendix D.1])
Prepend script tag in the form of "xxx-" to post-converted
string; finish. This is the hostname part that
can be used in DNS registration as well as resolution.
2.5. Converting a StepCode hostname to an internationalized name
The process has three parts, with the script tag untouched:
P1. If a domain name part has no script tag or has the "usa-" tag,
then goto P3;
Otherwise search for the process named "xxx" for StepCode
to Unicode or other code conversion, and obtain the
corresponding codes.
(At this point, only a syllabic system might fail.)
P2. If the corresponding code exists then goto P3;
Otherwise decompose the post-converted string into a number
of individual glyphs
specified in the "T" field, or
by syllable recognition; (See [Appendix D.2])
Search for each glyph;
If any glyph is not found or is not unique,
compose an error message and
request the missing glyphs to be supplied
from the sender either in the form
of Unicode or
other code stream
or in a 24x24 bit map stream.
P3. Display the available glyphs, where a missing glyph is shown with StepCode;
If applicable, save the corresponding hostname and display codes.
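The control flow of P1 to P3 can be sketched in Python as below. This
is a simplified illustration and not the procedure of Appendix D: the
names TABLES and decode_host_part are hypothetical, the per-glyph
decomposition of P2 is omitted, and the four codepoints in the table
are this sketch's guess at the characters behind the "xinzhuqinghua"
example; they are not given in this document.

```python
# HYPOTHETICAL conversion table: tag -> {StepCode phrase: display string}.
# The codepoints are an illustrative guess, not taken from the draft.
TABLES = {"zho": {"xinzhuqinghua1212": "\u65b0\u7af9\u6e05\u534e"}}


def decode_host_part(part, tables=TABLES):
    """Return (text to display, True if glyphs must be requested from the sender)."""
    tag, sep, body = part.partition("-")
    # P1: no script tag, or the "usa-" tag: nothing to convert, go to P3.
    if not sep or tag == "usa" or tag not in tables:
        return part, False
    # P2: the corresponding code exists, go to P3.
    text = tables[tag].get(body)
    if text is not None:
        return text, False
    # P2 failed: the caller should request the missing glyphs;
    # P3 shows the StepCode itself in place of the missing glyphs.
    return body, True
```

A caller would display the first element of the result and, when the
second element is true, issue the request for the missing glyphs
described in P2.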
3. Security Considerations
Much of the security of the Internet relies on the DNS. Thus, any
change to the characteristics of the DNS can change the security of
much of the Internet. Thus, StepCode makes no changes to the DNS
itself.
Hostnames are used by users to connect to Internet servers. The
security of the Internet would be compromised if a user entering a
single internationalized name could be connected to different
servers based on different interpretations of the internationalized
hostname. Thus the restriction of DNS names to a small symbol set is
necessary and effective, where adding any other data format such as
UTF-8 only opens the security gate to complications.
4. Internationalization considerations
StepCode is designed so that every internationalized hostname part can
be represented as one and only one DNS-compatible string. If there
are two different ways to obtain the same glyph on a display device,
then they are still two distinct hostnames, with no bearing on
security issues. If there is any way to follow the steps in this
document and get two or more different results, it is because of an error
in the domain name registration process, where one domain name register
fails to update other domain name register servers about a newly
registered and well researched hostname.
5. References
[Appendix A] Example Phonetic symbols to Latin small letter mapping
[Appendix B] Secondary sound values to digits mapping.
[Appendix C] StepCode layout digit specification.
[Appendix D] Example C code implementation on encoding and decoding.
[Appendix E] Example of IDN Language tags.
[ASCII] American National Standards Institute (formerly United
States of America Standards Institute), X3.4, 1968, "USA Code for
Information Interchange". (ANSI X3.4-1968)
[DeFrancis 1989] John DeFrancis, "Visible Speech - The Diverse
Oneness of Writing Systems", 1989, ISBN 0-8248-1207-7.
[Dictionary79] Beijing Foriegn Language Dept., "A Chinese-English
Dictionary", 1979, BK# 9017.810.
[IDNCOMP] "Comparison of Internationalized Domain Name Proposals",
draft-ietf-idn-compare-00.txt, June 2000, P. Hoffman.
[IDNReq] Zita Wenzel and James Seng, "Requirements of Internationalized
Domain Names", draft-ietf-idn-requirements, May 2001.
[IDN TAG] Draft-Liana-idn-tags, IDN Language tags.
[ISO639][ISO639-2/T] ISO/IEC 639-2 2001 Codes for the Representation of
Names of Languages.
[ISO10646] ISO/IEC 10646-1:2000 (note that an amendment 1 is in
preparation), ISO/IEC 10646-2 (in preparation), plus
corrigenda and amendments to these standards.
[Macmillan93] The Macmillan Visual Desk Reference, 1993,
ISBN 0-02-531310-x.
[Mnemonics] "Draft-Liana-idn-mnemonics", Language symbols of
[ISO10646] to Latin alphabet mappings for unified IDN
symbol representation.
[RFC2277] "IETF Policy on Character Sets and Languages",
rfc2277.txt, January 1998, H. Alvestrand.
[RFC2119] Scott Bradner, "Key words for use in RFCs to Indicate
Requirement Levels", March 1997, RFC 2119.
[STD13] Paul Mockapetris, "Domain names - implementation and
specification", November 1987, STD 13 (RFC 1035).
[UNICODE] The Unicode Consortium, "The Unicode Standard". Described at
http://www.unicode.org/unicode/standard/versions/.
[UNICODE30] The Unicode Consortium, "The Unicode Standard -- Version
3.0", ISBN 0-201-61633-5. Same repertoire as ISO/IEC
10646-1:2000. Described at http://www.unicode.org/unicode/
standard/versions/Unicode3.0.html.
[Ye95] Liana Ye, "A Language Oriented Chinese Encoding for
Multilingual Computing Environments", in "Proceeding of the 1995
International Conference on Computer Processing of Oriental
Languages", Page 323.
6. Acknowledgements
The author has reused existing IDN draft documents and language as
much as possible to demonstrate deep respect for the work done by
members of this working group. Among them, special comments which
have contributed to improve this document were received from John C
Klensin, Eric Brunner-Williams and William Davis. Aaron Irvine has
contributed Esperanto specifications.
7. IANA Considerations
This document requires IANA action to make language tags available,
and to register each tag and possibly its sub-field for the
phonetic system used.
8. Authors' Contact Information
Liana Ye
Y&D ISG
2607 Read Ave.
Belmont, CA 94002, USA.
(650) 592-7092
liana.ydisg@juno.com
Aaron Irvine, PhD.
<aaron.irvine@openwave.com>
Expires January 2002
[Appendix A] Sample Phonetic Symbol to Latin alphabet mapping
The phonetic symbols of Chinese are Bopomofo, or Zhuyin, symbols
from U+3105 to U+312c, where the sound value mapping is transcribed
from the Zhuyin standard of 1942 and [Dictionary79].
Definitions:
-x The symbol 'x' occurs at end of a unit.
x / y Both symbols are applicable.
x U+3105 ' A sequence of symbols. Where there is no equivalent
ASCII representation, a Unicode code point is used, with blanks as
delimiters.
Mnemonics Unicode IPA description
zho-
Pinyin Bopomofo IPA
b U+3105 p
p U+3106 p'
m U+3107 m
f U+3108 f
d U+3109 t
t U+310a t'
n U+310b n
l U+310c l
g U+310d k
k U+310e k'
h U+310f x
j U+3110 t U+0255
q U+3111 t U+0255 '
x U+3112 U+0255
zh U+3113 t U+0282
ch U+3114 t U+0282 '
sh U+3115 U+0282
-i U+0285
r U+3116 U+0290
z U+3117 ts
c U+3118 ts'
s U+3119 s
-i U+027f
y j
w w
a U+311a a
o U+311b o
e U+311c U+0259
eh U+311d U+025b
ai U+311e ai
ei U+311f ei
ao U+3120 au
ou U+3121 U+0259 u
an U+3122 an
en U+3123 U+0259 n
ang U+3124 a U+014b
eng U+3125 U+0259 U+014b
ong u U+014b
er U+3126 U+0259 r
i U+3127 i
u U+3128 u
iu U+3129 i U+0259 u
v / u" U+312a y
ng U+312b U+014b
gn U+312c gn
ia ia
ie i U+025b
iao iau
ian ian
in in
iang ia U+014b
ing i U+014b
iong y U+014b
ua ua
uo u U+0259
uai uai
ui uei
uei uei
uan uan
un u U+0259 n
uen u U+0259 n
uang ua U+014b
ve ys
van yan
vn yn
- / ' (character spelling separator)
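As an illustration of how the table above might be carried in code, the
following sketch stores a small subset of the Pinyin initials with their
Bopomofo code points and looks one up by spelling. The names zho_map,
zho_initials and zho_bopomofo are hypothetical and do not appear in the
StepCode sources; only the letter and code point pairs are taken from
the table.

```c
#include <string.h>

/* One row of the zho- table: Pinyin spelling and Bopomofo code point. */
struct zho_map {
    const char *pinyin;
    unsigned bopomofo;
};

/* Subset of the initials listed in Appendix A (illustrative only). */
static const struct zho_map zho_initials[] = {
    { "b", 0x3105 }, { "p", 0x3106 }, { "m", 0x3107 }, { "f", 0x3108 },
    { "d", 0x3109 }, { "t", 0x310a }, { "n", 0x310b }, { "l", 0x310c },
    { "g", 0x310d }, { "k", 0x310e }, { "h", 0x310f },
    { "zh", 0x3113 }, { "ch", 0x3114 }, { "sh", 0x3115 }
};

/* Returns the Bopomofo code point for a Pinyin initial,
 * or 0 when the spelling is not in the subset above. */
unsigned zho_bopomofo(const char *pinyin)
{
    size_t i;

    for (i = 0; i < sizeof zho_initials / sizeof zho_initials[0]; ++i)
        if (strcmp(zho_initials[i].pinyin, pinyin) == 0)
            return zho_initials[i].bopomofo;
    return 0;
}
```

A linear scan is enough for a table of this size; a full mapping would
also need the finals and the compound spellings listed above.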
[Appendix B.1] Example on Diacritic mark mapping
French has fewer than eight but more than four diacritic marks;
it serves as an example of phonetic mapping [R1].
fre-
0 no tone
1 Silent or Liaison '
2 rise/acute (/)
3 (dip/breve is not used)
4 drop/grave (\)
5 throw/circumflex (^)
6 thrill/tilde (~)
7 dieresis (")
8 (not used for French)
9 Superscript or nasal n
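The fre- digit assignments above lend themselves to a small lookup
array. The sketch below is only an illustration under that assumption;
the function name fre_digit_mark is hypothetical, and the two digits
that the table marks as unused return NULL.

```c
#include <stddef.h>

/* Returns a short name for the mark that a fre- digit selects,
 * or NULL for digits 3 and 8 (not used for French) and for any
 * value outside 0..9. */
const char *fre_digit_mark(int digit)
{
    static const char *const marks[10] = {
        "no tone",                 /* 0 */
        "silent or liaison",       /* 1 */
        "acute",                   /* 2 */
        NULL,                      /* 3: dip/breve is not used */
        "grave",                   /* 4 */
        "circumflex",              /* 5 */
        "tilde",                   /* 6 */
        "dieresis",                /* 7 */
        NULL,                      /* 8: not used for French */
        "superscript or nasal n"   /* 9 */
    };

    if (digit < 0 || digit > 9)
        return NULL;
    return marks[digit];
}
```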
[Appendix B.2] Example on Phoneme Mapping
IPA symbol mapping [R2], where each digit specifies a
variant of a base phoneme, and four variants are assigned. The
table allows other variants to be filled as needed.
The Unicode codepoint next to Latin alphabet column indicates
the replacement of the corresponding codepoint of Latin letter.
ipa-
0 1 2 3
a U+0251 ae U+00e6 U+0292
b
c ch U+02a7
d
e U+025b .e U+0259 .e: U+025c
f
g
h
i
j d3 U+02a4
k
l
m
n ng U+014b
o U+0252 o: U+0254
p
q
r
s sh U+0283
t th U+03b8 U+00f0
u U+028c U+028a U+0075
v
w
x
y
z zh U+0292
4 unassigned
5 unassigned
6 unassigned
7 unassigned
8 unassigned
9 unassigned
[Appendix B.3] Example on Overflow Consecutive slot Mapping
Chinese script using the Overflow and Tone Mark mapping
architecture [R1-R3], where the table is partitioned to
select two different glyph sets of the script:
zho-
0 no tone
1 flat/macron (-)
2 rise/acute (/)
3 dip/breve (v)
4 drop/grave (\)
5 classic character drop/grave (\)
6 classic character dip/breve (v)
7 classic character rise/acute (/)
8 classic character flat/macron (-)
9 classic character no tone
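In the zho- table above, digits 0 to 4 carry the tone directly, while
digits 5 to 9 select the classic character set and mirror the tone
around 9 (5 is drop, 9 is no tone). A decoder for that partition could
be sketched as follows; the names zho_tone and zho_decode_tone are
hypothetical, and the rule "tone = 9 - digit" is read off the table
rather than stated by the draft.

```c
/* Decoded form of one zho- tone digit. */
struct zho_tone {
    int tone;     /* 0 none, 1 flat, 2 rise, 3 dip, 4 drop */
    int classic;  /* 1 when the classic character set is selected */
};

/* Splits a digit 0..9 into tone and glyph-set flag.
 * Returns 0 on success, -1 for a bad digit or a null `out`. */
int zho_decode_tone(int digit, struct zho_tone *out)
{
    if (digit < 0 || digit > 9 || out == 0)
        return -1;
    out->classic = (digit >= 5);
    out->tone = out->classic ? 9 - digit : digit;
    return 0;
}
```

With this reading, digits 3 and 6 both decode to the dip tone and differ
only in the glyph set they select.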
[Appendix B.4] Priority elements mapping for English.
Because a DNS name resolver treats uppercase the same as lowercase,
assigning a specific value to uppercase letters offers users no
additional value; uppercase is just one of many fonts. The English
mapping assignment takes [R1-R2-R4], where digit 8 is designated
for letter-related dingbats.
eng-
0 a-zA-Z
1 flat/macron (-)
2 rise/acute (/)
3 dip/breve (v)
4 drop/grave (\)
5 throw/circumflex (^)
6 thrill/tilde (~)
7 dieresis (")
8 Dingbats
9 Greek a-zA-Z
0 8
a U+2604 /*aero or comet*/
b
c U+24b8 /*copyright*/
d U+25ca /*diamond*/
e U+24d4 /*electron*/
f U+2709 /*fly*/
g
h U+2624 /*health or Caduceus*/
i U+261e /*index or white right pointing index*/
j
k U+2654 /*king*/
l U+2661 /*love or white heart suit*/
m U+2709 /*mail or envelope*/
n U+266b /*note or Barred eighth note*/
o
p U+262e /*peace symbol*/
q U+2655 /*queen*/
r U+2602 /*rain or umbrella */
s U+263a /*smile*/
t U+231a /*time or watch*/
u U+2328 /*utility or keyboard*/
v U+260e /*voice or phone*/
w U+270d /*writing*/
x
y U+262f /* yinyang */
z
[Appendix C] The glyph layout digits:
0 - end of a character or a Pianpang
1 - to its right
2 - to its under
3 - to contain the following
4 - to divide the following
5 - to its left
6 - to its top
The following selectable digits specify additional glyphs of the
script and the direction of layout.
7 - to overlay itself with X then to its right
8 - to overlay itself with X then to its left
9 - to overlay itself with X then to its under
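The layout digits listed above can be given symbolic names so that code
handling a StepCode string does not depend on bare numbers. The sketch
below is one possible arrangement; the enumerator and function names are
hypothetical and are not taken from steplib.h.

```c
/* Symbolic names for the Appendix C layout digits (names assumed). */
enum step_layout {
    LAYOUT_END           = 0,  /* end of a character or a Pianpang */
    LAYOUT_RIGHT         = 1,
    LAYOUT_UNDER         = 2,
    LAYOUT_CONTAIN       = 3,
    LAYOUT_DIVIDE        = 4,
    LAYOUT_LEFT          = 5,
    LAYOUT_TOP           = 6,
    LAYOUT_OVERLAY_RIGHT = 7,
    LAYOUT_OVERLAY_LEFT  = 8,
    LAYOUT_OVERLAY_UNDER = 9
};

/* Converts an ASCII digit to its layout value, or -1 if `c`
 * is not a digit. */
int layout_from_char(char c)
{
    if (c < '0' || c > '9')
        return -1;
    return c - '0';
}

/* Returns 1 for the three selectable overlay digits (7, 8, 9). */
int layout_is_overlay(int layout)
{
    return layout >= LAYOUT_OVERLAY_RIGHT && layout <= LAYOUT_OVERLAY_UNDER;
}
```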
[Appendix D.1] StepCode keyboard input process
/* buff.c StepCode processor interface Copyright Y&D ISG, Inc. 1994
*-----------------------------------------------------------------------*
* find_gly find a glyph online.
* find_wd find a word online.
*/
#include <stdio.h>
#include <string.h> /* strcpy, strncpy, strlen */
#include <ctype.h>
#include "steplib.h"
int auto_learn= TRUE;
int udic_large= FALSE;
int udic_database= FALSE;
int odic_expand = FALSE;
int dic_saved = FALSE;
int keyboard_in = TRUE;
int alt_memb = 2; /* extra members of a poly-code to be recorded */
/*
* find_gly using a StepCode to find the GB code for display a glyph.
*/
int find_gly(step, stepcd, infor, gb, key)
char *step, *stepcd, *infor, *gb;
int *key;
{
FILE *bufp;
int linecnt, bytes;
char line[MAXdatalen], *p;
char bufname[FILENAMSIZ];
strncpy(stepcd, step, strlen(step)+1);
if (hit_gly(stepcd, gb))
{ *key=GB; return(A_to_B);}
strncpy(bufname, BUFFILE, FILENAMSIZ);
bufp = (FILE *)fopen(bufname, "w+b");
if( bufp == NULL )
{
strcpy( message, "Buffer file unavailable.");
typo(message, word);
return(ERROR);
}
search_dic(STEP, 1, stepcd, bufname, &bufp, &linecnt);
if (linecnt<=0)
{
if(verbose)
typo("No entry found in GB table. You may create one.", step);
fclose(bufp);
return(A_to_ZIL);
}
fseek( bufp, 0L, SEEK_SET ); /* rewind to the beginning for reading */
if(fgets(line, MAXdatalen, bufp)== NULL)
{ if(verbose)
fprintf(stderr, "ERROR- buffer file read error.\n");
fclose(bufp);
return(ERROR);
}
sscanf(line, "%s%d%s%s\n", stepcd, key, gb, infor);
hash_gly(stepcd, gb);
fclose(bufp);
if (linecnt>1)
{
return( A_to_N);
}else {
return( A_to_B);
}
}
int find_wd(step, stepcd, infor, gb, cnt, key)
char *step, *stepcd, *infor, *gb;
int cnt, *key;
{
FILE *bufp;
int linecnt;
char line[MAXdatalen], *p;
char bufname[FILENAMSIZ];
strncpy(stepcd, step, strlen(step)+1);
if ( hit_wd(stepcd, gb))
{ *key = GB; return(A_to_B);}
strncpy(bufname, BUFFILE, FILENAMSIZ);
bufp = (FILE *)fopen(bufname, "w+b");
if( bufp == NULL )
{
fprintf( stderr, "Buffer file unavailable.");
return(ERROR);
}
search_dic(STEP, cnt, stepcd, bufname, &bufp, &linecnt);
if (linecnt<=0)
{ if (!auto_learn)
{
if(verbose)
typo("Not found. You may create the word.", step);
fclose(bufp);
return(A_to_ZIL);
}else
{
neww = learnword(cnt, stepcd, gb);
/* Do whatever with neww here */
if(dic_saved)
{
hash_wd(stepcd, gb);
dic_saved = FALSE;
}
else
{
typo("The new word has not been saved.", stepcd);
}
fclose(bufp);
neww = reset_word(neww);
return(ZIL_to_A);
}
}
fseek( bufp, 0L, SEEK_SET ); /* rewind to the beginning for reading */
if(fgets(line, MAXdatalen, bufp) == NULL) /* test fgets, not the array */
{
if (ferror(bufp)!=0 && verbose)
fprintf(stderr, "Error during buffer read.\n");
if (feof(bufp) !=0 && verbose)
fprintf(stderr, "Buffer file ended.\n");
clearerr(bufp);
fclose(bufp);
return(A_to_ZIL);
}
sscanf(line, "%s%d%s%s\n", stepcd, key, gb, infor);
hash_wd(stepcd, gb);
fclose(bufp);
if (linecnt>1)
{
return( A_to_N);
}else {
return (A_to_B);
}
}
/* --------------------------------------------------------------------
* Figure out the number of glyphs in a word. The next two routines are
* based on PINYIN system.
*/
int one_letter_sound(word)
char *word;
{
int cnt=0;
char *w, *v;
w=word;
while (*w=='m'||*w=='M'||*w=='n'||*w=='N')
{ ++cnt; ++w;}
if (cnt>0)
{
v = w; --v;
if((*w=='g'||*w=='G')&& (*v=='n'||*v=='N'))
++w; /*ex: mng nnng*/
}
if(cnt==0) while (*w=='a'||*w=='A'){ ++cnt; ++w;}
if(cnt==0) while (*w=='o'||*w=='O'){ ++cnt; ++w;}
if(cnt==0) while (*w=='e'||*w=='E'){ ++cnt; ++w;}
if (!isalpha(*w))
return(cnt); /*ex:a aa ooo eee- mmm nmn*/
else cnt=0; /*ex: an hhh oong */
return(cnt);
}
int tell_word(word)
char *word;
{
char *w, *v;
int cnt;
cnt=0;
if(!isalpha(*word)) return (0); /* not a Pinyin entry */
for (w=word;isalpha(*w);++w); /*skip Pinyin */
while (isdigit(*w)) {cnt++; ++w;} /*count the number of tone marks*/
if (cnt<1) /*special single letter glyph cases*/
{
cnt = one_letter_sound(word);
if (cnt>=1) return(cnt); /* else do syllable analysis */
}
else return(cnt);
/*
* find the number of syllables by vowel rules
* This implementation is accurate even without using apostrophe
*/
w=word;
while (isalpha(*w)) /*check the Pinyin only*/
{
switch (*w)
{
case 'a':
case 'i':
case 'e':
case 'o':
case 'u': v=w; ++w; cnt++; /*one vowel case*/
switch (*w)
{
case 'i':
case 'e':
case 'o':
case 'u': ++w;break; /*two vowels sound*/
case 'a': ++w;
if (*v=='u' && *w=='i') break;/*uai*/
if (*v=='i' && *w=='o') break;/*iao*/
else {
--w; /*still two vowels*/
break;
}
default: break;
}
default:
/*already out of the compound vowel*/
break;
}
++w;
}/*check syllables*/
return(cnt);
}
/*
* --------------------------------------------------------------------
* Interactive input process procedure
* --------------------------------------------------------------------
*/
int inputp(char *word, char *gb)
{
int i, glyphcnt;
char c, *w;
int cnt, key, stat;
char dump[MAXdatalen];
for (;;)
{
*word='\0';
fgets(word, MAXlinelen, stdin);
if (isspace(*word))
break;
/* Check whether the entry is a glyph string by counting its glyphs */
glyphcnt = tell_word(word);
if (glyphcnt == 0)
{
printf("%s", word); /* echo the unrecognized entry */
fflush(stdin);
continue;
}
w=word;
while (isalnum(*w)) ++w;
*w = '\0';
if(verbose)
printf("tell_word figure: %d glyphs\n", glyphcnt);
/* Determine whether the entry is known through dictionary
* and cache lookup.
*/
if(glyphcnt >=2)
stat = find_wd(word, stepcd, dump,gb,glyphcnt, &key);
else stat = find_gly(word, stepcd, dump,gb, &key);
/* Print out with GB code */
if (stat != ERROR) font_code(stepcd, gb, &key, stderr);
if(verbose) printf("%s\n", stepcd);
fflush(stdin);
fflush(stderr);
}
return(0);
}
[Appendix D.2]
/* Disassemble a Chinese stepword into stepglyphs.
*----------------------------------------------------------------*
*/
int disassemb(cnt, word, sts, phonsys)
int cnt;
char *word;
char *sts[];
int phonsys;
{
char *w, *hd, *nt, *vh; /*Stand for head, next, vowel_head*/
int i, j, nc, al_flag;
char *s;
/* initialize*/
for (i=0;i<(cnt+3);++i)
for (j=0, s=sts[i];j<=STEPSIZE;++j, ++s)
*s='\0';
hd=w=word;
i=j=nc=0;
switch (phonsys)
{
case PINYIN: break;
case ZHUYIN: /* branch to disassemb_zhuyin(); return;*/
case KANTON: /* branch to disassemb_kanton(); return;*/
default:
break;
}
/* non-consonant or non-vowel single letter glyphs */
nc=one_letter_sound(word);
if(nc>0)
{
for(i=0;i<nc;i++, w++) sts[i][0]=*w;
++w;
if (*w=='g'||*w=='G') /*case of ng*/
{ sts[i][1]=*w; return(nc); }
for (i=0;i<nc;i++) /* add the tones */
{
if (sts[i][0]=='a'||sts[i][0]=='A') sts[i][1]='1';
if (sts[i][0]=='m'||sts[i][0]=='M') sts[i][1]='2';
}
/* Cases of O and E are very limited */
return(nc);
}
/* delete the ending -r */
s = word;
while (isalpha(*s)) s++;
--s;
if (*s=='r' && *(s-1)!= 'e')
{
er_flag= TRUE;
while (isalnum(*s)) { *s=*(s+1); ++s; } /* shift left to drop the 'r' */
}
/* Ending -z and -l are accommodated here */
/* By Pinyin rules:
* It only tries to recognize a possible syllable, and pays little
* attention to correct spelling. A word like 'peo' will pass,
* but 'leek' will not. This scheme is not a spelling checker, and
* tolerates foreign vocabulary.
*/
hd=w=word;
i=j=nc=al_flag=0;
while (isalpha(*w)) /*check the Pinyin only*/
{ while (isalpha(*w)&&!isvowel(*w)) ++w;
vh=w; ++w; nt=w; nc++; /*one vowel case*/
switch (*w)
{
case 'i':
case 'e':
case 'u': ++w;nt=w;
break; /*two vowels case*/
case 'o': ++w;nt=w;nt++;
if (*vh=='i'&& *w=='n'&& *nt=='g')
{ nt++; w=nt;} /* iong case only */
else nt=w;
break;
case 'a': ++w;nt=w; /* -a? */
if (!isalpha(*w) ||
(isalpha(*w) &&
(*w!='o')&&(*w!='i')&&(*w!='n')))
break;
++nt; /* special cases */
if((*vh=='u' && *w=='i') || /*uai*/
(*vh=='i' && *w=='o')) /*iao*/
{ if((nc<cnt)&&(!isalpha(*nt)))
{ /*two glyphs*/
strncpy(sts[i],hd,(++vh)-hd);
++i;nc++; hd=vh;w=nt;
break;
}
else { w=nt; break;} /* one glyph*/
}
if(*nt=='g')nt++; /*-an+ or -ang+*/
if (isalpha(*nt)&&(!isvowel(*nt)))
{w=nt; break; }
if(isvowelna(*nt)){--nt; w=nt;break;}
if((nc<cnt)&&(!isalpha(*nt)))
{ /*uan or iang:two glyphs*/
strncpy(sts[i],hd,(++vh)-hd);
++i;nc++; hd=vh;w=nt;
break;
}
if (!isalpha(*nt)) break; /* end of Pinyin*/
--nt; /*-ana or anga*/
++al_flag;
break;
case 'n': nt=w;++nt;
if(*nt=='g')nt++;
if (!isalpha(*nt)||
(isalpha(*nt)&&(!isvowel(*nt))))
{ w=nt; break;}
if (isvowelna(*nt)){--nt; w=nt;break;}
if(*nt=='a')
{ ++w; if(*vh=='o'&&*w=='a') --nt; /*-o na */
if(*vh=='o'&&*w=='g') {}; /*-ong a*/
if(*vh=='u'&&*w=='g') --nt; /*-un ga*/
else
{ /* There are two possible ways
strncpy(sts[i],hd,(--nt)-hd);
++i;nc++; hd=nt;w=nt+1;
/*na ga have higher chance*/
++al_flag;--nt;
}}
break;
case 'r': nt=w; ++nt;
if (isvowel(*nt)){--nt; break;}
if (isalpha(*nt)||*vh=='e') w=nt;
break;
default:nt=w; /*consonants */
break;
}/* end of switch*/
strncpy(sts[i],hd,nt-hd);
++i;hd=nt; w=nt;
}/* while check syllables*/
append_suffix(nt, 0, cnt);
/*
* supply a word ending with er2p1 glyph. (extended from Pinyin rule)
*/
if (er_flag)
{
strcpy(sts[cnt], "er2p1");
nc=cnt+1;
}else nc=cnt;
if (!al_flag) return(nc);
/*
* supply an alternative disassembled stepcodes
*/
w=word+1;
while (al_flag)
{
while(isalpha(*w)&& al_flag)
{
if (*w=='n')
{ ++w; nt=w;++nt;
if(*w=='a'||(*w=='g'&&*nt=='a')) /*found */
{
hd=w-2; --al_flag;
while (hd>word && isvowel(*hd)) --hd;
if (*hd=='h')
{ vh=hd-1;
if (isupper(*vh)) *vh=tolower(*vh);
if (*vh=='z'||*vh=='c'||*vh=='s') hd=vh;
}
if (*w=='a')nt=w; /* else no change*/
strncpy(sts[nc], hd, nt-hd);
++nc; w=nt;hd=nt;
} }
++w;
}
while (isalpha(*nt)) ++nt;
strncpy(sts[nc], hd, nt-hd);
nc++;
}
append_suffix(nt, nc-cnt, nc);
return(nc);
}
[Appendix E] Sample Language tags of [UNICODE] Blocks
Tag   Start   End     Start   End     Start   End
Cyr-  U+0401  U+04cc                                  (not in [ISO639])
cjk-  U+3105  U+312c  U+3400  U+4dbf  U+4e00  U+9fff  (Unified CJK)
kro-  U+3400  U+3d2d
lat-  U+0030  U+03f5                                  (includes Greek)
usa-  U+0030  U+0039  U+0061  U+007a  U+002d  U+002d  (not in [ISO639])
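The tag table above can be read as a list of code point ranges. Because
several ranges overlap (for example lat- covers the usa- digits and
letters, and cjk- covers kro-), a lookup has to pick an order. The
sketch below assumes, purely for illustration, that narrower ranges are
listed first and that the first match wins; the draft does not state
this rule, and the names tag_range and tag_for_codepoint are
hypothetical.

```c
#include <stddef.h>

/* One Start/End pair from the Appendix E table. */
struct tag_range {
    const char *tag;
    unsigned long start, end;
};

/* Ranges copied from the table; the ordering is an assumption. */
static const struct tag_range tag_ranges[] = {
    { "usa-", 0x0030UL, 0x0039UL },
    { "usa-", 0x0061UL, 0x007aUL },
    { "usa-", 0x002dUL, 0x002dUL },
    { "Cyr-", 0x0401UL, 0x04ccUL },
    { "kro-", 0x3400UL, 0x3d2dUL },
    { "cjk-", 0x3105UL, 0x312cUL },
    { "cjk-", 0x3400UL, 0x4dbfUL },
    { "cjk-", 0x4e00UL, 0x9fffUL },
    { "lat-", 0x0030UL, 0x03f5UL }
};

/* Returns the first tag whose range contains `cp`, or NULL. */
const char *tag_for_codepoint(unsigned long cp)
{
    size_t i;

    for (i = 0; i < sizeof tag_ranges / sizeof tag_ranges[0]; ++i)
        if (cp >= tag_ranges[i].start && cp <= tag_ranges[i].end)
            return tag_ranges[i].tag;
    return NULL;
}
```

A different ordering would be needed if the broader tag were meant to
take priority; only the ranges themselves come from the table.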