Re: Layer 2 and "identifiers"( was: Re: [idn] what are the IDN identifiers?)
Hi, Michel:
I am sorry that I got impatient. I had just received
a Windows Word crash report from my neighbor in the
middle of answering your email. It was somehow
related to the Explore module. Does this sound familiar?
> The only thing that is known at input time is a code point. Upper
> layers may decide to apply heuristic to guess a language but that is
> clearly beyond the scope of input mechanism in modern input processing
> mechanism. And there is not such a thing as the selection of a
> 'language symbol set'.
Yes, there is such a selection when one has to
use IMEs for input; that is the upper layers' concern
at this point. Selecting an input keyboard
map is another type of language selection; this
is often the case with Greek, Indic and other languages.
You are implying that "modern input processing"
can ignore the selection process.
Does this sound right?
Modern input processing has to include some heuristics
to guess, especially for Han code points. This is a layer 3
matter. Once the code points are verified by the user,
they go into layer 2. In most free-text
processing, the language context is thrown away at this
moment, so for the next layer, layer 2, it is as you have said:
> mechanism. And there is not such a thing as the selection of a
> 'language symbol set'.
If, as an IDN matter, within the limited URL syntax, we can require
layer 3 to keep the language context, it becomes possible
to verify the script type from each code point's block number
against the saved language context.
For existing client/server systems, which use, say, GB, JIS ...
as code transmission standards, this is easy to
do, as long as we have a required parameter slot
in our IDN interface protocol. This is described in
the idn-map proposal; I think it is called idn-label.
For PC clients, it is a little more trouble, because the
input keyboard information is always ignored, and
there are a large number of keyboard maps. However, since
the number is large, statistics can be gathered to show
1) how these keyboard maps are accessed,
2) whether they are used mixed within one label.
This study is only a survey to understand the current
keyboard map usage on the Internet.
We could also ask: is there a need to use them
mixed that we cannot allow now?
For example, a US firm doing business in China:
does it want to put its logo in a name mixed
with Han characters? At least, from what I
know, such a need exists. If we have a Vietnamese
language tag, since it is a Latin and Han mixed
script, then this firm can have a name
<Han-logo><Latin-logo>.com in IDN form and
vi--logo0logo.com as the DNS form. So the vi--
language tag will become the most expensive
tag to have.
At this moment, I think the language tag should be
changed to an idn-tag, or internationalized script
tag, based on [ISO639] and UCS. This means
zh-- is for all Han characters, ja-- for Kanji mixed with
Kana, ko-- for Hanja mixed with Hangul,
la-- for Latin and Greek, vi-- for Latin and Han.
Do you have suggestions for other groups?
To make things easier for Chinese to handle:
zh-- means 100,000 code points, while the Han code points
in all the mixed-script tags can be limited to Plane 0.
Those are details anyway...
This keyboard map usage study can serve
as the basis for our language tag implementation.
I have mentioned a few large scripts which should
be used for the language tag implementation test
phase. This is due to the existing demand and
the problems we know of. This is only the starting point
in understanding how we should serve
international communication in the future.
As you know better than I do, there are more
than 10,000 languages allowed in the ISO 639-2
3-letter code, and I don't know how many scripts in
UCS; someone has to help me here.
How many language tags should IDN support?
Do we have any reasonable guess?
One - [nameprep], that is Latin, we know this well.
Two - [Tsconv], that is Latin and Chinese?
what about n-1, or 1-n, what about the
differences among CJK?
Four - [nameprep], [Tsconv], [Jchar], [Jamo]; what
about Arabic?
Five - Add Arabic [bidi]
....
This already gets into a fragmented IDN. If my use
of language tags implied the current drafts, James
would be ready to declare "No solution" for this group.
So it is right that I disagree with all of the above drafts,
as well as facets and keyword search, which is
what you are suggesting:
> The only case where there is a reasonable determination of
> 'language' for input is East Asian Input Method Editors (IMEs), and
> it could be reasonable to assume that an application layer could
> offer some TC/SC services before feeding the code points to a DNS
> service in that case for CJK characters (but even that is not a
> simple case to solve because of the contextual ambiguity as
> mentioned several times in this forum).
This is the reason we have to require layer 3 to settle on
the code points the user picked, within UCS tolerance, and in
addition to save the keyboard context in some way for verifying
in the next layer. Any reasonable input from the UCS range should
be dealt with in the next sublayer, where we can reasonably map
possible conflicts to a semantically equivalent set and get to the
stage of IDN identifiers, ready for ACE to encode.
Since TC/SC has to be globalized, and CJK has to
be mixed and show up anywhere in the world, the
layer 3 comparison is not sufficient any more. It has
to be in the next sublayer for IDN to be a viable solution
as an add-on to DNS.
So the next layer, layer 2, is in my opinion IDNA
and [nameprep], and layer 1 is the existing DNS. Now I am
agreeing with John's model.
As a summary:
Layer 3  in  -> user
         out -> UCS code points in equivalent code formats
Layer 2  in  -> from Layer 3
         out -> IDN identifiers and ACE in "LDH" format
Layer 1  in  -> ACE, and others
         out -> the net
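
The three layers in this summary could be sketched as a small
pipeline. This is only an illustration under assumptions: the
function names layer3, layer2 and layer1 are made up, the "example"
zone is a placeholder, and Python's built-in "idna" codec (nameprep
plus an ACE encoding as later standardized for IDNA) stands in for
whatever ACE this group settles on. It does no TC/SC mapping at all.

```python
import re

# "LDH" means letters, digits and hyphen only.
LDH = re.compile(r"^[A-Za-z0-9-]+$")


def layer3(user_input):
    """Layer 3: hand over the code points the user has confirmed."""
    return user_input


def layer2(label):
    """Layer 2: prepare the label and encode it as an LDH-only ACE label.

    Assumption: the standard library "idna" codec stands in for the
    nameprep and ACE steps.
    """
    ace = label.encode("idna").decode("ascii")
    if not LDH.match(ace):
        raise ValueError("label is not in LDH form: %r" % ace)
    return ace


def layer1(ace_label, zone="example"):
    """Layer 1: the existing DNS only ever sees the ACE form."""
    return ace_label + "." + zone


if __name__ == "__main__":
    print(layer1(layer2(layer3("bücher"))))
```

In the terminal case only layer3 would run on the client; in the PC
case all three calls would sit in the same program.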
In the terminal case, the terminal is Layer 3 and its server does Layer 2.
In the PC case, Layers 3 and 2 are together, and
your model works, but TC/SC has to be in layer 2
in any case, because this mapping is not the same
as the Layer 3 TC/SC anyway; it has more CJK
agreement built in already.
Liana