Re: [idn] opting out of SC/TC equivalence
Hi, Scott,
I think you have raised a good question. However, there is
no such thing as an "end result [that] should be applicable to
_any_ name space." Upper-to-lower case folding only applies to
alphabetic systems, SC/TC folding only applies to character-based
systems, and the consonant systems comprise many small scripts,
each within 128 codepoints. Following your assertion, we would
have to take Latin case folding out of [nameprep] too.
From what I have been studying over the past several weeks,
mixed use of a script is a common situation. English mixes with
Greek, French and German in the States; the Arabic script is used
by several different languages, each a subset of the Unicode
Arabic block; Devanagari transcribes Tamil symbols; and the
current discussion on TC/SC is a well-known case.
We cannot afford to have each spoken language be a processing
entity in [nameprep], since each is normally a subset of a
particular script.
We can instead let a script-based entity be the processing unit,
that is, CJK, Latin, Cyrillic, Arabic and the Indian scripts as a
few script pools. How many there would be is up to political
debate, as it is done in the United Nations. Lately I came across
an article in the "San Jose Mercury News" which says "Azerbaijan
mandates use of Latin alphabet". From the technical side, we can
say how many pools are technically sensible, and let other
languages, such as Tibetan, Vietnamese and Lao, decide which pool
they want to be in, depending on their users as well as on how
their scripts can best be handled technically.
My initial thought on the pools is:
Character based: Chinese, Japanese, Yi
Alphabetic: Latin, Greek, Cyrillic, any IPA-based languages
Consonant: Indian scripts, Arabic, Tibetan
The reason for such pools is simple, non-semantic handling in
[nameprep]:
Character-based scripts do have large folding tables; SC/TC is an
example. I am certain that there will be Kanji/SC folding and
possibly Yi/SC folding too. Will there be Hangul/TC folding or
Vietnamese/TC folding? I have not heard of that, but there is a
historical possibility. SC/TC folding involves about 200
one-to-one foldings (some 3-to-1 too), and 146 block-to-block
foldings affecting over 2000 characters.
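To make the one-to-one part concrete, here is a minimal Python
sketch; the two mappings below are real TC/SC pairs but only stand
in for the full table of roughly 200 entries (and say nothing about
the block-to-block foldings):

# Minimal sketch of a TC -> SC folding step. A real table would carry
# roughly 200 one-to-one entries (plus a few many-to-one cases) and the
# block-to-block foldings discussed above.
TC_TO_SC = {
    "\u570B": "\u56FD",   # TC 'guo' (country) -> SC form
    "\u9F8D": "\u9F99",   # TC 'long' (dragon) -> SC form
}

def fold_tc_to_sc(label):
    """Replace every traditional form in the label by its simplified form."""
    return "".join(TC_TO_SC.get(ch, ch) for ch in label)

assert fold_tc_to_sc("\u4E2D\u570B") == "\u4E2D\u56FD"   # zhong-guo, TC -> SC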
Alphabetic languages all have upper-to-lower case foldings,
including IPA, which is used for newly created scripts in Africa.
They also borrow diacritics from other languages, and letters from
Greek; the best example is American English. This case folding
covers the Unicode range 0020-04FF, four sets of scripts (Latin,
IPA, Greek and Cyrillic) with over 100 uppercase letters.
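A minimal sketch of that kind of pool-restricted case folding,
assuming for illustration that the alphabetic pool is exactly
U+0020..U+04FF:

# Sketch: fold case only for labels that stay inside the alphabetic pool,
# taken here (as an assumption) to be U+0020..U+04FF, which covers the
# Latin, IPA Extensions, Greek and Cyrillic blocks.
ALPHABETIC_POOL = (0x0020, 0x04FF)

def alphabetic_casefold(label):
    lo, hi = ALPHABETIC_POOL
    if all(lo <= ord(ch) <= hi for ch in label):
        return label.lower()      # upper-to-lower folding for this pool
    return label                  # outside the pool: left alone in this sketch

print(alphabetic_casefold("Caf\u00C9"))   # prints the lowercased form, accent kept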
Consonant languages do not have case foldings, but they borrow
symbols from each other. They transcribe, but usually do not
include Latin or CJK in their writing. This group needs codepoint
differentiation to reduce script confusion among Armenian, Lao,
Thai, Georgian, and among a dozen Indian scripts. A codepoint
verification against its legal block (normally within 128
codepoints each) shall be sufficient. (With a political
implication: do they want one pure script or more than one?)
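A minimal sketch of that verification step; the block ranges below
come from the Unicode block charts and are only a partial,
illustrative list:

# Sketch: verify that every codepoint of a label falls inside one declared
# script block (illustrative subset; ranges per the Unicode block charts).
SCRIPT_BLOCKS = {
    "Armenian":   (0x0530, 0x058F),
    "Devanagari": (0x0900, 0x097F),
    "Tamil":      (0x0B80, 0x0BFF),
    "Thai":       (0x0E00, 0x0E7F),
    "Lao":        (0x0E80, 0x0EFF),
    "Georgian":   (0x10A0, 0x10FF),
}

def single_script(label):
    """Return the block name if the whole label sits in one block, else None."""
    for name, (lo, hi) in SCRIPT_BLOCKS.items():
        if all(lo <= ord(ch) <= hi for ch in label):
            return name
    return None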
I would say we should divide the languages into several language
pools, and "the end result should be applicable to _any_ name
space" within each pool.
Now, I am back to SC/TC folding. Depending on the character
encoding used, the folding has four cases:
case 1: transliteration encoding: no folding;
case 2: GB -to- ACE': no folding;
case 3: Big5 -to- ACE": no folding;
case 4: Unicode: block-to-block and one-to-one folding.
The first three are really IDNA output; the last one is the case
we have to consider here.
Suppose TC folds to SC (SC being a smaller set than TC) in the
current IDNA > [nameprep] > ACE scheme; then:
GB > Unicode > [nameprep] > ACE;
Big5 > Unicode > fold to SC > [nameprep] > ACE;
Mixed Unicode > fold to SC > [nameprep] > ACE.
So,
  IDNA?                    [nameprep]?                   ACE?
  GB > Unicode SC
  Big5 > Unicode > SC      Unicode Latin case folding    ACE encoding
  Mixed Unicode > SC
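To make that convergence concrete, here is a minimal Python sketch
using the GB2312 and Big5 codecs and a one-entry TC -> SC table,
purely for illustration:

# Sketch: the GB, Big5 and mixed-Unicode inputs all reduce to the same
# SC Unicode string before [nameprep] ever sees them.
TC_TO_SC = {"\u570B": "\u56FD"}                 # one-entry TC -> SC table (illustration only)

def fold_to_sc(label):
    return "".join(TC_TO_SC.get(ch, ch) for ch in label)

gb_input    = "\u4E2D\u56FD".encode("gb2312")   # an SC label in its GB encoding
big5_input  = "\u4E2D\u570B".encode("big5")     # the TC spelling in its Big5 encoding
mixed_input = "\u4E2D\u570B"                    # the TC spelling already in Unicode

forms = {
    gb_input.decode("gb2312"),                  # GB   > Unicode
    fold_to_sc(big5_input.decode("big5")),      # Big5 > Unicode > fold to SC
    fold_to_sc(mixed_input),                    # Mixed Unicode  > fold to SC
}
assert forms == {"\u4E2D\u56FD"}   # one SC Unicode form is left for [nameprep] and ACE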
If this is the case, then what is IDNA doing? What about the other
three cases IDNA is supposed to handle? What is the purpose of
having SC Unicode go through [nameprep] at all; why not go directly
to ACE?
I do not think the above is a reasonable processing model.
Instead, all of the above should be in [nameprep]. [nameprep]
should be divided into three pool cases (up for debate), each
treated with a somewhat uniform procedure:
alphabetic: case folding;
character based: case folding, allowing GB, Big5, JIS and KSC to
Unicode mapping too;
consonant system: allowing ISCII to Unicode mapping and tag
identification for script-specific processing into ACE.
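A minimal sketch of how such a pool split inside [nameprep] might
look; the pool ranges and the tiny folding table are illustrative
assumptions, not the actual tables:

# Sketch: dispatch a label to one of three pool-specific preparations.
# The ranges and the one-entry folding table are assumptions for
# illustration only.
TC_TO_SC = {"\u570B": "\u56FD"}

def pool_of(label):
    cp = ord(label[0])
    if 0x4E00 <= cp <= 0x9FFF or 0xA000 <= cp <= 0xA4CF:
        return "character"                 # CJK ideographs, Yi syllables/radicals
    if cp <= 0x04FF:
        return "alphabetic"                # Latin, IPA, Greek, Cyrillic
    return "consonant"                     # everything else in this sketch

def pool_prep(label):
    pool = pool_of(label)
    if pool == "alphabetic":
        return label.lower()               # case folding only
    if pool == "character":
        # GB, Big5, JIS and KSC input is assumed to be decoded to Unicode
        # before this point; here the SC/TC-style folding is applied.
        return "".join(TC_TO_SC.get(ch, ch) for ch in label)
    # consonant pool: no folding; block verification and a script tag
    # would be attached here before the label goes on to the ACE step.
    return label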
Liana
On Wed, 15 Aug 2001 06:54:35 -0400 "Hollenbeck, Scott"
<shollenbeck@verisign.com> writes:
> >-----Original Message-----
> >From: tsenglm@cc.ncu.edu.tw [mailto:tsenglm@cc.ncu.edu.tw]
> >Sent: Wednesday, August 15, 2001 2:11 AM
> >To: ben; Adam M. Costello; idn@ops.ietf.org
> >Subject: Re: [idn] opting out of SC/TC equivalence
> >
> >
> >   In Hong Kong and Taiwan, users use the BIG5 code. This code set
> > has no simplified Chinese scripts. In China, the GB code set has no
> > traditional Chinese scripts. So there is no mixed type of GB and
> > BIG5. But you know VeriSign/NSI announced ML.com where any UNICODE
> > can be mixed. That is the key problem.
> >   Any suggestions must consider what to do for .COM in this WG.
>
> If composing labels of "mixed" Unicode code points is believed to be a
> key problem, that's an issue with current draft documents being
> developed and discussed in this WG -- which clearly permit this method
> of composition. If such a composition method doesn't make sense in a
> particular local context, it likely won't be widely used to create
> labels in that local context -- but that doesn't imply that the
> conventions of the local context should be applied everywhere else.
>
> This WG shouldn't attempt to produce solutions particular to a
> specific name space. The end result should be applicable to _any_
> name space.
>
> <Scott/>
>