
Re: Layer 2 and "idn identities" (was: Re: [idn] what are the IDN identifiers?)





On Mon, 3 Dec 2001 16:29:57 EST DougEwell2@cs.com writes:
> In a message dated 2001-12-03 2:01:56 Pacific Standard Time, 
> liana.ydisg@juno.com writes:
> 
> > I see that you have not read the I-D yet, and Deng Xiang 
> > has replied your Chinese vs. Japanese arguement, I 
> > will wait for your comment on the language-tag issue,
> > or anything not up to your standard.
> 
> Here are some specific concerns related to items in 
> draft-Liana-idn-map-00.txt.
> 
> | The proposed ACE is a mnemonic encoding scheme,
> | and is called StepCode [StepCode]. 
> 
> Hasn't AMC-ACE-Z already been chosen as the standard ACE for IDN?  I 
> would be 
> surprised if the decision were made to use two different ACEs 
> depending on 
> the language, or script, of the encoded text.

The current AMC treats all UCS code points the same.  It cannot
resolve look-alikes across different languages.  It does not make
DNS master records readable to administrators.  It does not make
zone files sortable by region or by sensible user groups.  It does
not help a user who does not read Chinese but communicates with
Chinese partners.  But it does compress the data and feed it into
the DNS.
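To make the contrast concrete, here is a rough Python sketch (mine,
not from any I-D); the built-in punycode codec is only a stand-in
for a blanket ACE such as AMC-ACE-Z, since the two encodings are
close relatives:

# Sketch only: Python's punycode codec as a stand-in for a blanket ACE
# such as AMC-ACE-Z.  The result is compact ASCII suitable for DNS, but
# it carries no language information and is opaque to an administrator
# reading a zone file.
label = "\u4e2d\u6587\u57df\u540d"                  # an all-Han label
ace = label.encode("punycode").decode("ascii")
print(ace)                                          # compact but unreadable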

Using StepCode, users can be grouped by language, so sorting the
names makes much more semantic sense for administrators.  StepCode
also gives each character its own identifier, so it can be treated
differently across different languages.
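
As a purely hypothetical picture of that grouping (the zh-- and
ja-- prefixes below are my invented examples, not wording from the
I-D), a plain sort of tag-prefixed owner names already clusters a
zone file by language:

# Hypothetical sketch: owner names carrying a language-tag prefix sort
# into contiguous language groups with nothing more than a plain
# lexicographic sort.  The names themselves are invented examples.
entries = [
    "ja--tokyo-mise.example.",
    "zh--bei-jing-dian.example.",
    "ja--osaka-ya.example.",
    "zh--shang-hai-hang.example.",
]
for owner in sorted(entries):
    print(owner)          # all ja-- names first, then all zh-- names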

But StepCode is produced only when the user wants it, so there may
be users who do not want it.  The AMC should then be used to cover
such cases.  StepCode and AMC complement each other.  This is
discussed in Section 4.5.


> | U-s   U-p A-p
> | U+0041  U+0061    a      (Latin Letter A case folding)
> | U+2fc2  U+2ee5    yv2    (Han character fish for Chinese case 
> folding)
> 
> Several Chinese speakers and other experts have already, repeatedly, 
> claimed 
> that SC/TC mapping is NOT a 1-1 operation like Latin case folding.  
> If you 
> think your users will be satisfied with the 1-1 solution only, go 
> right 
> ahead, but if this turns out to be inadequate and you need to 
> propose a fix, 
> get ready to hear a lot of people say "I told you so."

For this, please see my post on data-centric programming
techniques applicable to the SC/TC problem.


> 
> | To facilitate end users for the speed of IDN access as well as
> | compatibility with existing applications, it is RECOMMENDED that 
> an IDN
> | code exchange table inculdes applicable local display standards
> | corresponding with each applicable codepoints in UCS.
> 
> Backward mapping tables to convert Unicode to legacy standards, for 
> the 
> express purpose of allowing end-user software to delay the 
> transition to 
> Unicode?  Does this sound like a solution for the future?

These legacy standards have to be on servers in order to switch
a large user base to the new IDN.  After you have switched the
users, you can replace user software and hardware at the suppliers'
pace.  That means you can play the price/service game to lure users
into switching.  Without this feature, most users will never want
to change, because change is too expensive for them and they are
happy with what they have: a reliable connection.

If we make these legacy forms legitimate through the exchange map,
there will be no need to implement another code form such as UTF-8.
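
One concrete way to read such an exchange table (this is my own
sketch, assuming GB2312 and Shift_JIS as the "applicable local
display standards"; the I-D does not name them) is a per-code-point
lookup that a server can hand to unconverted client software:

# Sketch of one exchange-table entry.  Python's standard codecs stand in
# for the table; GB2312 and Shift_JIS are assumed examples of "applicable
# local display standards".
def exchange_entry(char):
    entry = {"ucs": "U+%04X" % ord(char)}
    for legacy in ("gb2312", "shift_jis"):
        try:
            entry[legacy] = char.encode(legacy).hex()
        except UnicodeEncodeError:
            entry[legacy] = None      # not representable in that standard
    return entry

print(exchange_entry("\u9c7c"))       # U+9C7C, Han character "fish"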


> | It is REQURIED to register a language tag with IANA and its
> | associated script range whenever it is modified. 
> 
> There is already a perfectly good update process in place for both 
> ISO 639 
> and RFC 3066.

But IDN may not need to implement all of these tags.  Each tag
that is implemented needs script-specific procedures to be deployed.

> 
> | To use mixed scripts in one IDN label is NOT RECOMMEMDED for an
> | early deployment of IDN.
> 
> This immediately outcasts the Japanese, who have every reason to mix 
> 
> hiragana, katakana, kanji, and romaji.

Wrong.  Japanese and Korean are the primary languages to be tested
in practice.  That is why the C, J, and K tags are used in the I-D
to show the feasibility of the implementation.

The recommendation is there as a warning: although C, J, and K are
shown to be workable, no system is installed yet, so an unrealistic
jump is not encouraged, especially while the blanket UCS treatment
by AMC remains available.
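
For illustration only, a registry's single-script check is a small
amount of code; the block ranges below are the familiar Unicode
block boundaries, hand-picked by me rather than taken from the I-D:

# Sketch: report which scripts a label mixes, using a few hand-picked
# Unicode block ranges (Hiragana, Katakana, Han, Hangul syllables).
RANGES = [
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
    (0xAC00, 0xD7A3, "Hangul"),
]

def scripts_in(label):
    found = set()
    for ch in label:
        for lo, hi, name in RANGES:
            if lo <= ord(ch) <= hi:
                found.add(name)
                break
        else:
            found.add("Other")
    return found

print(scripts_in("\u6771\u4eac\u3059\u3057"))   # {'Han', 'Hiragana'}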

> 
> |         Alphabet Sys.  Consonant Sys.  Character Sys.
> |
> | From: 0020            0530            2e80
> | to:   052f            1bff            d7af
> |
> | include:Latin           Armenian        CJK
> |         Greek           Hebrew          Kanji
> |         Cyrillic        Arabic          Kana
> |         IPA             Devanagari      Hangul
> |         Vietnamese      Malayalam       Yi
> |                         Thai
> |                         Lao
> |                         Tibetan
> |                         ...
> 
> Sorry, it's just not that simple.  There are plenty of alphabets and 
> 
> alphabetic characters encoded above U+0530.  That's probably why the 
> Unicode 
> Consortium, while providing a list of blocks of code points like the 
> 
> following:
> 
>     # Start Code..End Code; Block Name
>     0000..007F; Basic Latin
>     0080..00FF; Latin-1 Supplement
>     0100..017F; Latin Extended-A
> 
> is careful not to imply that ranges of code points are permanently 
> reserved 
> for *classifications* of scripts like this.
> 
> You can tell that the three ranges listed here are arbitrary and 
> bogus even 
> for the CJK scripts, by noting that Korean jamos (alphabetic) are 
> located in 
> the "consonant system" block, while the Japanese syllabaries (kana) 
> and 
> precomposed Korean syllables are in the "character system" block.

These are rough groups used to study different cases, so as to
cover the broadest language variations, and the grouping was
proposed by a well-known linguist.  While we do not need to copy
their views (just as I am against copying the UTC's recommendation),
it is necessary to learn the different views proposed by linguists
before I feel confident proposing a reasonable solution.
No specifics are attached to these groups.  The real terms are in
the language tag definition files.  As you may see, an indefinite
number of code blocks can be defined in the data specification
format, Section 3.2, each associated with language-specific
procedures.  That is the reason I proposed IANA registration for
the language tags we support.
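
Read that way, the three quoted ranges amount to nothing more than
a range test per code point; the operative behaviour sits in the
tag definition files.  A sketch of the rough grouping alone:

# Sketch of the rough three-way grouping quoted above (0020-052F,
# 0530-1BFF, 2E80-D7AF).  Study groups only; the operative data would be
# the code blocks and procedures of the tag definition files (Section 3.2).
GROUPS = [
    (0x0020, 0x052F, "Alphabet Sys."),
    (0x0530, 0x1BFF, "Consonant Sys."),
    (0x2E80, 0xD7AF, "Character Sys."),
]

def rough_group(ch):
    cp = ord(ch)
    for lo, hi, name in GROUPS:
        if lo <= cp <= hi:
            return name
    return "ungrouped"

for ch in ("A", "\u05d0", "\u9c7c"):   # Latin A, Hebrew alef, Han "fish"
    print("U+%04X" % ord(ch), rough_group(ch))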


> 
> | Some cultures often use more than two scripts within the same 
> group,
> | such as Japanese, but rarely using another script especially from 
> a
> | different group. 
> 
> As noted above, the Japanese use four scripts from two different 
> groups.
> 
> | The main issue in IDN-Map
> | is to identify character equivalent sets, and reduce the number of
> | applicable IDN identifiers by 1) limiting the applicable IDN input 
> code
> | points to Plane 0 of Unicode table,
> 
> Has anyone else so far proposed that supplementary characters be 
> flat-out 
> prohibited from occurring in IDN identifiers?  Why should they be 
> singled out 
> as a way to "reduce the number of applicable IDN identifiers"?

This was a statement in an early ACE I-D of this group.  Since the
UTC released a new case-folding map, [nameprep] took it without
question, and everyone dropped this issue.

No one proposes to prohibit Plane 1 code points.  I am proposing to
get the equivalence classes working first, before we allow code
points from Plane 1 and above in.  In fact, the more you let in,
the more it supports my case for letting TC/SC in.  And here is the
proof, in the current [nameprep] specification:

0048; 0068; Case map

210B; 0068; Additional folding
210C; 0068; Additional folding
210D; 0068; Additional folding

1D407; 0068; Additional folding
1D43B; 0068; Additional folding
1D46F; 0068; Additional folding
1D4D7; 0068; Additional folding
1D573; 0068; Additional folding

You can see this is a 9-to-1 case folding; how will you recover the
nine original forms?
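
A small sketch built from just those nine quoted mappings makes the
point: folding is a straightforward function, but reversing it is a
one-of-nine guess.

# The nine nameprep foldings quoted above, all mapping to U+0068.
FOLD = {cp: 0x0068 for cp in
        (0x0048, 0x210B, 0x210C, 0x210D,
         0x1D407, 0x1D43B, 0x1D46F, 0x1D4D7, 0x1D573)}

def fold(s):
    return "".join(chr(FOLD.get(ord(c), ord(c))) for c in s)

print(fold("\u210b"))                                   # 'h'
candidates = [cp for cp, tgt in FOLD.items() if tgt == 0x0068]
print(len(candidates))   # 9 possible originals; the folded form keeps no record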

> 
> | It is RECOMMENDED that reasonable studies are given to each 
> language to
> | classify script treatment model, and a cost vs. benifit analysis 
> in select
> | a long term script specific processing protocol to be embedded in 
> IDN
> | language specific modules.
> 
> This won't disrupt the schedule of the working group, will it?

I don't know what the WG schedule is based on.  If it waves the
CJK case away, it met its schedule last year already.  If you mean
the current schedule, you have to ask whether the WG has a clear
picture of IDN or not.  If no one knows how to deal with CJK, any
schedule is meaningless.  That is the reason I do not comment on
WG milestones.

> 
> | canonicalization
> 
> This word has no clear definition and is carefully avoided by 
> Unicode, as Ken 
> Whistler already explained.

I think we are getting somewhere.  We are getting down to the
code points now.  When I no longer need to use all these vague
terms, we are near the solution.


> 
> | A string mixed with CJK and Kana is Japanese, CJK and Hangul mix 
> is
> | Korean. However, an all CJK character string MUST presumed to be 
> in the
> | primary language tag, that is Chinese, and registered as the only 
> IDN name,
> | unless the registrant requests a second and a third language to 
> access the
> | same IDN name.
> 
> Nothing prevents an all-Han string of any arbitrary length from 
> being 
> Japanese text.  The priority given to Chinese here is not likely to 
> be well 
> received by other groups.

Priority is given to Chinese for several reasons:
1) The majority of these characters originated in China, with their
 semantics and phonetics, and are naturally named by and known to
 the people who use them.  Roughly 100,000 - 20,003 = 80,000 of
 them are still waiting to be named.
2) Kanji have more than two readings, and one of them is the
 Chinese reading, so this is not the worst case for Kanji.
3) An all-Kanji label automatically gets two registered names,
 one in Chinese and the other in Japanese.

Japanese gets the Chinese registration for free, while Chinese gets
the work for nothing.  Who do you think is the biggest beneficiary?


> | Also, it
> | introduces more policy decisions, for example, an all CJK 
> character
> | trademark registrant may have to registrate in three languages to 
> ensure
> | the legitimacy of the trademark.
> 
> Wait just a minute.  Wasn't the whole idea of this language-tagging 
> and 
> CJK-folding scheme to PREVENT registrants from having to register an 
> IDN 
> identifier more than once?

This registration is for the different user groups of the same
trade name, like AOL.com and AmericanOnLine.com in DNS, while in
IDN they are the same as <A><O><L>.com.

This is the IDN we have to work with: one match in IDN, one match
in DNS.  If there is more than one DNS access path to one IDN
label, IDN has to block them all at registration unless they are
registered.  That is what the Chinese group has been saying: if we
do not implement TC/SC, there will be an exponential number of DNS
names for the same IDN label.
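
The word "exponential" is meant literally: if each of the n
characters of an all-Han label has both a TC and an SC form, an
unfolded system has to register or block 2^n distinct DNS names for
the one IDN label.  A sketch with three real TC/SC pairs (the label
itself is my invented example):

from itertools import product

# Three real TC/SC pairs; the 3-character label is an invented example.
variants = [
    ("\u9b5a", "\u9c7c"),   # fish
    ("\u9580", "\u95e8"),   # gate
    ("\u8eca", "\u8f66"),   # cart
]
names = ["".join(combo) for combo in product(*variants)]
print(len(names))           # 2**3 = 8 spellings of the same label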

> 
> | After all, a useful tool is to let its
> | user to make decisions.
> 
> Some tools are interactive, others are not.

This depends on which layer of users you have in mind.  I have
several of them.

> 
> Finally, it is not yet clear to me whether the "idn-zh-" tag prefix 

Where does the idn- tag come in?  The zh-- tag shall be on the
same footing as the AMC tag bq-- and treated within the same
interface.  Please look through the idn-map I-D again.  If I did
not express that clearly, then tell me how to improve it.

> is 
> supposed to be embedded within IDN identifiers or specified 
> separately.  But 
> between this additional label and the use of the less efficient 
> StepCode 
> instead of ACE-Z, it seems that several bytes out of the precious 
> 63-byte 
> limit are required as overhead to support this tagging scheme.  If 

Without a tag there is little chance you can process CJK and
similar problems, for example Latin versus Armenian.  The tag takes
the same number of bytes as the bq-- prefix used in AMC.

StepCode is not compressed; it is human readable, even readable by
foreigners.  You can weigh readability against code-length
efficiency, but the judges are the administrators of zone files,
international workers in foreign lands, and the IDN name owners.


> I 
> remember correctly, it is CJK users (Soobok Lee is only the most 
> vocal) who 
> are most concerned about the space limitation and who want to find 
> (or 
> invent) the most efficient encoding system possible.  Will these 
> other CJK 
> users agree to this proposal?
> 
> -Doug Ewell
>  Fullerton, California

Each member joins this list independently.  You have 
to ask them. 

Liana