
Re: Layer 2 and "identifiers"( was: Re: [idn] what are the IDN identifiers?)



Hi, Michel:

I am sorry that I got impatient.  I had just received
a Windows Word crash report from my neighbor in the
middle of answering your email.  It was somehow
related to the Explorer module. Does this sound familiar?

> The only thing that is known at input time is a code point. Upper layers
> may decide to apply heuristic to guess a language but that is clearly
> beyond the scope of input mechanism in modern input processing
> mechanism. And there is not such a thing as the selection of a
> 'language symbol set'.
 
Yes, there is such a selection when one has to
use IMEs for input; that is the upper layers' concern
at this point.  Selecting an input keyboard map is
another type of language selection; this is often the
case with Greek, Indic and other languages.

You seem to imply that "modern input processing"
can ignore this selection process.
Does this sound right?

Modern input processing has to include some heuristics
to guess, especially for Han code points.  This is a layer 3
matter.  When the code points are verified by the user,
they go into layer 2.  In most free text
processing, the language context is thrown away at this
moment, so for the next layer, layer 2, it is as you have said:

> mechanism. And there is not such a thing as the selection of a 
> 'language symbol set'.

If, as an IDN matter and within the limited URL syntax, we
can require layer 3 to keep the language context, it becomes
possible to verify the script type, derived from each code
point's block number, against the saved language context.

For existing client/server systems, which use, say, GB, JIS ...
as code transmission standards, this is easy to
do, as long as we have a required parameter slot
in our IDN interface protocol. This is described in the
idn-map proposal; I think it is called idn-label.

For PC clients, it is a little more trouble, because the
input keyboard information is always ignored, and
there is a large number of keyboard maps.  However, since
the number is large, statistics can be gathered to show
1) how these keyboard maps are accessed,
2) whether they are used mixed within one label.
This study is only a survey to understand the current
keyboard map usage on the Internet.
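
As an illustration only, the two statistics could be tallied
roughly as below.  This is a Python sketch; the record format
(one entry per label, listing the keyboard maps active while it
was typed) and the keyboard map names are assumptions made up
for the example, not real survey data.

```python
from collections import Counter

# Hypothetical survey records: (label typed, keyboard maps used for it).
records = [
    ("example", ["us"]),
    ("logo-cn", ["us", "zh-pinyin"]),
    ("hellas", ["greek"]),
    ("shop", ["us"]),
]

def summarize(records):
    """Return (labels per keyboard map, number of labels typed with mixed maps)."""
    access = Counter()
    mixed = 0
    for _label, maps in records:
        used = set(maps)          # count each map once per label
        access.update(used)
        if len(used) > 1:         # statistic 2: maps mixed within one label
            mixed += 1
    return access, mixed

access, mixed = summarize(records)
```

Here `access` answers question 1) and `mixed` answers question 2).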

We could also ask: is there a need to use them
mixed that we cannot allow now?

For example, would a US firm doing business in China
want to put its logo in a name mixed
with Han characters?  At least, from what I
know, such a need exists.  If we have a Vietnamese
language tag, since Vietnamese is a mixed Latin and Han
script, then this firm can have the name
<Han-logo><Latin-logo>.com  in IDN form and
vi--logo0logo.com as the DNS form.  So the vi--
language tag will become the most expensive
tag to have.

At this moment, I think the language tag should be
renamed idn-tag, or internationalized script
tag, based on [ISO639] and UCS. This means
zh-- is for all Han characters, ja-- for Kanji mixed with
Kana, ko-- for Hanja mixed with Hangul,
la-- for Latin and Greek, and vi-- for Latin and Han.
Do you have suggestions for other groups?
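
To make the tag idea concrete, here is a rough Python sketch of
checking a label against the script groups listed above, using
the block each code point falls in.  The block table is heavily
abbreviated and the function names are hypothetical; a real
implementation would need the full UCS block list, and treating
ASCII digits and the hyphen as allowed under every tag is my
assumption for the example.

```python
# Abbreviated, assumed block table: (first, last, script group).
BLOCKS = [
    (0x0000, 0x024F, "Latin"),   # Basic Latin .. Latin Extended-B
    (0x0370, 0x03FF, "Greek"),   # Greek and Coptic
    (0x1100, 0x11FF, "Hangul"),  # Hangul Jamo
    (0x3040, 0x30FF, "Kana"),    # Hiragana and Katakana
    (0x4E00, 0x9FFF, "Han"),     # CJK Unified Ideographs (Plane 0)
    (0xAC00, 0xD7AF, "Hangul"),  # Hangul Syllables
]

# Script groups per tag, as proposed in the paragraph above.
TAG_SCRIPTS = {
    "zh": {"Han"},
    "ja": {"Han", "Kana"},
    "ko": {"Han", "Hangul"},
    "la": {"Latin", "Greek"},
    "vi": {"Latin", "Han"},
}

def script_of(ch):
    """Return the script group of a character's block, or None if unknown."""
    cp = ord(ch)
    for first, last, script in BLOCKS:
        if first <= cp <= last:
            return script
    return None

def label_matches_tag(tag, label):
    """True if every character of the label belongs to a script the tag allows."""
    allowed = TAG_SCRIPTS[tag]
    for ch in label:
        if ch in "0123456789-":   # assumption: digits and hyphen pass under any tag
            continue
        if script_of(ch) not in allowed:
            return False
    return True
```

With this table, a label mixing Latin letters and a Han character
passes under vi-- but fails under zh--.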

To make things easier for Chinese to handle:
zh-- means 100,000 code points, while the Han code points
used in the mixed scripts can be limited to Plane 0.
Those are details anyway...

This keyboard map usage study can serve
as the basis for our language tag implementation.
I have mentioned a few large scripts which should
be used in the language tag implementation test
phase, because of the existing demand and
problems we know of.  This is only the starting point
in understanding how we should serve
international communication in the future.
As you know better than I do, there are more
than 10,000 languages allowed in the ISO 639-2
3-letter code, and I don't know how many scripts there are
in UCS; someone has to help me here.

How many language tags should IDN support?
Do we have any reasonable guess? 

One  - [nameprep], that is Latin; we know this well.
Two  - [Tsconv], that is Latin and Chinese?
       What about n-1, or 1-n; what about the
       differences among CJK?
Four - [nameprep], [Tsconv], [Jchar], [Jamo]; what
       about Arabic?
Five - add Arabic [bidi]
....

This already gets into a fragmented IDN.  If my use
of language tags implied the current drafts, James
would be ready to declare "No solution" for this group.
So it is right that I disagree with all of the above drafts,
as well as with facets and keyword search, which is
what you are suggesting:

> The only case where there is a reasonable determination of 'language'
> for input is East Asian Input Method Editors (IMEs), and it could be
> reasonable to assume that an application layer could offer some TC/SC
> services before feeding the code points to a DNS service in that case
> for CJK characters (but even that is not a simple case to solve because
> of the contextual ambiguity as mentioned several times in this forum).

This is the reason we have to require layer 3 to settle on the
code points the user picked, within UCS tolerance, and in addition
to save the keyboard context in some way for verification in the
next layer.   Any reasonable input from the UCS range should be
dealt with in the next sublayer, where we can reasonably map
possible conflicts to a semantically equivalent set and get to the
stage of IDN identifiers, ready for ACE to encode.

Since TC/SC has to be globalized, and CJK has to
be mixed and show up anywhere in the world,
layer 3 comparison is not sufficient any more.  It has
to be done in the next sublayer for IDN to be a viable solution
as an add-on to DNS.

So the next layer, layer 2, is in my opinion IDNA
and [nameprep], and layer 1 is the existing DNS.  Now I am
agreeing with John's model.

As a summary:
Layer 3  in  -> user
         out -> UCS code points in equivalent code formats
Layer 2  in  -> from Layer 3
         out -> IDN identifiers and ACE in "LDH" format
Layer 1  in  -> ACE, and others
         out -> the net
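
The three layers can be sketched as a small Python pipeline.
This is only an illustration under stated assumptions: the
function names are hypothetical, and Python's built-in "idna"
codec (nameprep plus the Punycode ACE standardized later in
RFC 3490) stands in for whatever ACE this group finally selects.
TC/SC mapping is not shown.

```python
def layer3(user_input: bytes, charset: str) -> str:
    """Layer 3: turn locally encoded input (GB, JIS, ...) into UCS code points."""
    return user_input.decode(charset)

def layer2(name: str) -> bytes:
    """Layer 2: prepare the name and encode each label as ACE in LDH form.

    Uses Python's "idna" codec as a stand-in ACE (an assumption).
    """
    return name.encode("idna")

def layer1(ace: bytes) -> bytes:
    """Layer 1: existing DNS; here we only check LDH and pass the name on."""
    ldh = set(b"abcdefghijklmnopqrstuvwxyz0123456789-.")
    if not set(ace.lower()) <= ldh:
        raise ValueError("name is not in LDH form")
    return ace

# A name typed on a Latin-1 terminal travels through all three layers.
typed = "b\u00fccher.example".encode("latin-1")
on_the_wire = layer1(layer2(layer3(typed, "latin-1")))
```

In the terminal case only `layer3` would run on the client; in
the PC case `layer3` and `layer2` run together.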

In the terminal case, the terminal is Layer 3 and its server does Layer 2.

In the PC case, Layer 3 and Layer 2 are together, and
your model works, but TC/SC has to be in layer 2
in any case, because this mapping is not the same
as the Layer 3 TC/SC anyway; it has more CJK
agreement built in already.

Liana