Re: Layer 2 and "idn identities" (was: Re: [idn] what are the IDN identifiers?)
In a message dated 2001-12-03 2:01:56 Pacific Standard Time,
liana.ydisg@juno.com writes:
> I see that you have not read the I-D yet, and Deng Xiang
> has replied your Chinese vs. Japanese arguement, I
> will wait for your comment on the language-tag issue,
> or anything not up to your standard.
Here are some specific concerns related to items in
draft-Liana-idn-map-00.txt.
| The proposed ACE is a mnemonic encoding scheme,
| and is called StepCode [StepCode].
Hasn't AMC-ACE-Z already been chosen as the standard ACE for IDN? I would be
surprised if the decision were made to use two different ACEs depending on
the language, or script, of the encoded text.
| U-s U-p A-p
| U+0041 U+0061 a (Latin Letter A case folding)
| U+2fc2 U+2ee5 yv2 (Han character fish for Chinese case folding)
Several Chinese speakers and other experts have already, repeatedly, claimed
that SC/TC mapping is NOT a 1-1 operation like Latin case folding. If you
think your users will be satisfied with the 1-1 solution only, go right
ahead, but if this turns out to be inadequate and you need to propose a fix,
get ready to hear a lot of people say "I told you so."
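To make the "not 1-1" point concrete, here is a minimal sketch. The table `T2S` and both helper functions are hypothetical and hand-picked for illustration; they are not taken from the draft or from any real SC/TC mapping data. The sketch shows two distinct traditional characters folding to the same simplified character, so the fold cannot be reversed the way Latin case folding can.

```python
# Illustrative only: a tiny hand-picked table, not a real SC/TC mapping.
T2S = {
    "發": "发",  # traditional "to emit, to issue"
    "髮": "发",  # traditional "hair"
    "頭": "头",  # traditional "head"
}


def fold_to_simplified(text):
    """Replace each listed traditional character with its simplified form."""
    return "".join(T2S.get(ch, ch) for ch in text)


def traditional_candidates(simplified_char):
    """Return every listed traditional character that folds to the given one."""
    return sorted(t for t, s in T2S.items() if s == simplified_char)


# Two different traditional strings collapse to one simplified string.
print(fold_to_simplified("頭髮"), fold_to_simplified("頭發"))
# Going back from the simplified character gives more than one candidate.
print(traditional_candidates("发"))
```

A 1-1 table in the U-s/U-p style quoted above would have to pick a single one of those candidates per simplified character.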
| To facilitate end users for the speed of IDN access as well as
| compatibility with existing applications, it is RECOMMENDED that an IDN
| code exchange table inculdes applicable local display standards
| corresponding with each applicable codepoints in UCS.
Backward mapping tables to convert Unicode to legacy standards, for the
express purpose of allowing end-user software to delay the transition to
Unicode? Does this sound like a solution for the future?
| It is REQURIED to register a language tag with IANA and its
| associated script range whenever it is modified.
There is already a perfectly good update process in place for both ISO 639
and RFC 3066.
| To use mixed scripts in one IDN label is NOT RECOMMEMDED for an
| early deployment of IDN.
This immediately excludes the Japanese, who have every reason to mix
hiragana, katakana, kanji, and romaji.
| Alphabet Sys. Consonant Sys. Character Sys.
|
| From: 0020 0530 2e80
| to: 052f 1bff d7af
|
| include:Latin Armenian CJK
| Greek Hebrew Kanji
| Cyrillic Arabic Kana
| IPA Devanagari Hangul
| Vietnamese Malayalam Yi
| Thai
| Lao
| Tibetan
| ...
Sorry, it's just not that simple. There are plenty of alphabets and
alphabetic characters encoded above U+0530. That's probably why the Unicode
Consortium, while providing a list of blocks of code points like the
following:
# Start Code..End Code; Block Name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
is careful not to imply that ranges of code points are permanently reserved
for *classifications* of scripts like this.
You can tell that the three ranges listed here are arbitrary and bogus even
for the CJK scripts, by noting that Korean jamos (alphabetic) are located in
the "consonant system" block, while the Japanese syllabaries (kana) and
precomposed Korean syllables are in the "character system" block.
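The mismatch can be checked mechanically. The sketch below hard-codes the three ranges from the quoted table (the group labels and the function name `draft_group` are my own shorthand, not the draft's) and looks up a few sample characters with Python's standard `unicodedata` module.

```python
import unicodedata

# Ranges copied from the quoted table; the short labels are shorthand.
DRAFT_GROUPS = [
    (0x0020, 0x052F, "alphabet"),
    (0x0530, 0x1BFF, "consonant"),
    (0x2E80, 0xD7AF, "character"),
]


def draft_group(ch):
    """Return the draft's group label for a character, or None if unlisted."""
    cp = ord(ch)
    for start, end, label in DRAFT_GROUPS:
        if start <= cp <= end:
            return label
    return None


# A Georgian letter, a Hangul jamo, a hiragana letter, a Hangul syllable.
for ch in ["\u10D0", "\u1100", "\u3042", "\uAC00"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", draft_group(ch))
```

The Georgian letter and the jamo both land in the "consonant" range, and the hiragana letter and the precomposed Hangul syllable both land in the "character" range.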
| Some cultures often use more than two scripts within the same group,
| such as Japanese, but rarely using another script especially from a
| different group.
As noted above, the Japanese use four scripts from two different groups.
| The main issue in IDN-Map
| is to identify character equivalent sets, and reduce the number of
| applicable IDN identifiers by 1) limiting the applicable IDN input code
| points to Plane 0 of Unicode table,
Has anyone else so far proposed that supplementary characters be flat-out
prohibited from occurring in IDN identifiers? Why should they be singled out
as a way to "reduce the number of applicable IDN identifiers"?
| It is RECOMMENDED that reasonable studies are given to each language to
| classify script treatment model, and a cost vs. benifit analysis in select
| a long term script specific processing protocol to be embedded in IDN
| language specific modules.
This won't disrupt the schedule of the working group, will it?
| canonicalization
This word has no clear definition and is carefully avoided by Unicode, as Ken
Whistler already explained.
| A string mixed with CJK and Kana is Japanese, CJK and Hangul mix is
| Korean. However, an all CJK character string MUST presumed to be in the
| primary language tag, that is Chinese, and registered as the only IDN name,
| unless the registrant requests a second and a third language to access the
| same IDN name.
Nothing prevents an all-Han string of any arbitrary length from being
Japanese text. The priority given to Chinese here is not likely to be well
received by other groups.
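As a rough sketch of how the quoted rule behaves, the code below guesses a language from the scripts present in a label. The function names and the use of Unicode character names to detect scripts are assumptions made for illustration; the draft does not specify an algorithm.

```python
import unicodedata


def scripts_in(label):
    """Collect coarse script classes found in the label (by character name)."""
    found = set()
    for ch in label:
        name = unicodedata.name(ch, "")
        if name.startswith("CJK UNIFIED IDEOGRAPH"):
            found.add("han")
        elif name.startswith(("HIRAGANA", "KATAKANA")):
            found.add("kana")
        elif name.startswith("HANGUL"):
            found.add("hangul")
    return found


def draft_language_guess(label):
    """Apply the quoted rule: kana means ja, Hangul means ko, all-Han means zh."""
    scripts = scripts_in(label)
    if "kana" in scripts:
        return "ja"
    if "hangul" in scripts:
        return "ko"
    if scripts == {"han"}:
        return "zh"
    return None


print(draft_language_guess("東京タワー"))  # Han plus katakana
print(draft_language_guess("日本"))        # all Han, yet an ordinary Japanese word
```

Under this reading, the all-Han label is tagged as Chinese even though it is also ordinary Japanese text.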
| Also, it
| introduces more policy decisions, for example, an all CJK character
| trademark registrant may have to registrate in three languages to ensure
| the legitimacy of the trademark.
Wait just a minute. Wasn't the whole idea of this language-tagging and
CJK-folding scheme to PREVENT registrants from having to register an IDN
identifier more than once?
| After all, a useful tool is to let its
| user to make decisions.
Some tools are interactive, others are not.
Finally, it is not yet clear to me whether the "idn-zh-" tag prefix is
supposed to be embedded within IDN identifiers or specified separately. But
between this additional label and the use of the less efficient StepCode
instead of ACE-Z, it seems that several bytes out of the precious 63-byte
limit are required as overhead to support this tagging scheme. If I
remember correctly, it is CJK users (Soobok Lee is only the most vocal) who
are most concerned about the space limitation and who want to find (or
invent) the most efficient encoding system possible. Will these other CJK
users agree to this proposal?
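The space cost is easy to estimate if one assumes the tag prefix is embedded in the label, which, as noted, the draft does not make clear. In the sketch below, Python's built-in `punycode` codec is only a stand-in for an ACE; StepCode is not implemented here, and the function names are hypothetical.

```python
MAX_LABEL_BYTES = 63      # DNS label length limit
TAG_PREFIX = "idn-zh-"    # assumption: the tag is embedded in the label itself


def remaining_budget(prefix=TAG_PREFIX):
    """Bytes left for the encoded name once the prefix is spent."""
    return MAX_LABEL_BYTES - len(prefix.encode("ascii"))


def encoded_length(label, prefix=TAG_PREFIX):
    """Total label length: prefix plus the ACE form (punycode as a stand-in)."""
    return len(prefix.encode("ascii")) + len(label.encode("punycode"))


print(remaining_budget())        # bytes available after the tag prefix
print(encoded_length("日本語"))   # prefix plus encoded form of a short name
```

Whatever the ACE, every byte spent on the prefix comes out of the same 63-byte budget.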
-Doug Ewell
Fullerton, California