Re: [idn] An ignorant question about TC <-> SC
From a screen-display point of view, TC and SC are different glyph
sets (but who defines the sets? How are they used by 1/5 of the
world's population? Is the Unicode group the only authoritative one?
In China there are over 600 recorded views on this).
From a linguistic, phoneme-level analysis, TC and SC are identical
according to all authoritative dictionaries and standards, with the
exceptions listed (for dialectal and historical use).
From a higher-level analysis, the semantics are the same according
to all authoritative dictionaries and standards, with the exceptions
listed, historically differentiated (in different dictionaries), and
with frequency of usage in current sectors (tracked by software
implementors).
Even for Latin alone, with 26 letters and 10 digits, uppercase i and
lowercase L look alike ('I' and 'l'), as do 'O' and '0'. That is 4
confusable symbols out of 36, a 1/9 ratio, and not a one-to-one
mapping. This is a much higher ratio than the standard TC/SC
mapping. So yes, all languages' symbols share the same types of
characteristics, just with different magnitudes.
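To make the ratio concrete, here is a toy fold in Python (the fold
rules below are invented purely for illustration; nameprep itself
does not fold 'l' into 'i' or '0' into 'o'):

    # Toy confusable fold for the 36-symbol Latin example above.
    # 'I'/'l' and 'O'/'0' are the two look-alike pairs; the label is
    # lowercased first, as a nameprep-style case mapping would do.
    CONFUSABLE_FOLD = {"l": "i", "0": "o"}

    def fold_latin(label):
        label = label.lower()
        return "".join(CONFUSABLE_FOLD.get(c, c) for c in label)

    # "I0l" and "iol" collide once folded:
    assert fold_latin("I0l") == fold_latin("iol")  # both become "ioi"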
To treat UCS code points on the same basis, that 4,000-to-20,000
figure needs to be doubled, so treating TC/SC in a general way is a
test of correctly handling 8,000 to 56,000 symbols in the long run.
About 2,000 of these have a pair mapping (speaking loosely), though
not all of it is one-to-one (another 17 cases are exceptions); the
remaining roughly 2,000 have no such case-like mapping at all, just
as the Indian languages do not. Those 4,000 are the high-frequency
symbols, and similarly for Kanji and Hanja. These symbols have the
highest chance of colliding in IDN names, and it is critical to map
them in registration and in IDN usage.
With the 2,000 pair mappings taken care of as case mapping, the
4,000 above grow to 6,000 covered symbols, a great drop in
processing complexity. If the other symbols follow the same
equivalence mapping, the numbers get better and more manageable.
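For the pair-mapped portion, the processing really is no harder than
case folding. A minimal sketch, assuming a TC -> SC direction for
the fold (the three pairs below are real TC/SC pairs, but a
production table would need the full authoritative list, including
the exceptions that are not one-to-one):

    # Tiny illustrative TC -> SC fold table; a real one would hold
    # the ~2,000 pairs discussed above.
    TC_TO_SC = {
        "\u570B": "\u56FD",  # guo (country): TC U+570B -> SC U+56FD
        "\u9F8D": "\u9F99",  # long (dragon): TC U+9F8D -> SC U+9F99
        "\u99AC": "\u9A6C",  # ma (horse):    TC U+99AC -> SC U+9A6C
    }

    def fold_tcsc(label):
        # Map each TC code point to its SC partner; SC and unpaired
        # symbols pass through unchanged.
        return "".join(TC_TO_SC.get(c, c) for c in label)

    # At registration time, two labels collide when their folds match:
    assert fold_tcsc("\u570B") == fold_tcsc("\u56FD")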
Liana
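P.S. To make the Serbo-Croatian example in John's note below
concrete: once the language is known, the mapping is mechanical. A
minimal sketch (a partial table; the full Serbian alphabet has 30
letters, three of which map to Latin digraphs):

    # Partial Serbian Cyrillic -> Latin (Gaj) table, for illustration.
    CYR_TO_LAT = {
        "\u0430": "a", "\u0431": "b", "\u0432": "v", "\u0433": "g",
        "\u0434": "d", "\u043C": "m", "\u043E": "o",
        "\u0436": "\u017E",   # zhe -> z with caron
        "\u0459": "lj",       # lje -> Latin digraph
        "\u045A": "nj",       # nje -> Latin digraph
        "\u045F": "d\u017E",  # dzhe -> Latin digraph
    }

    def to_latin(s):
        # Only valid if the string is already KNOWN to be Serbian;
        # the same Cyrillic letters spell Russian words too, and
        # those must not be mapped (John's point below).
        return "".join(CYR_TO_LAT.get(c, c) for c in s)

    assert to_latin("\u0434\u043E\u043C") == "dom"  # "dom" (house)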
On Tue, 23 Oct 2001 08:59:23 -0400 John C Klensin <klensin@jck.com>
writes:
> While reading David's NFC versus NFKC note, I had an odd thought.
> I've been dissatisfied, as have many others, with the notion that
> TC <-> SC mapping is analogous to case mapping in Roman-derived
> alphabets. Arguments about whether that analogy applies have
> helped to make the discussion of what is, to me, a very difficult
> topic even more obscure.
>
> To quote the Unicode standard, "Serbo-Croatian is a single
> language with paired alphabets". This is a definition with which
> native speakers of the language agree (although, when tensions in
> the Balkans are high, I assume some of them are not completely
> happy about it). Would it be constructive to think about Chinese
> as "one language, two alphabets"? If it is, then nameprep or a
> related process ought to be able to map back and forth between
> the Roman-based characters usually used in Croatian contexts and
> the Cyrillic characters usually used in Serbian ones (people do
> this all the time, and certainly expect the two to match).
>
> Of course, the analogy is not exact (these things never are):
> perhaps partially because there are just fewer characters to deal
> with, there are no cases in which there are potential ambiguities
> in the mappings. On the other hand, one problem is more severe
> than in the Chinese case: in the general case, a Serbo-Croatian
> string written in Cyrillic cannot be distinguished, on a
> character string basis, from uses of Cyrillic for other languages
> (e.g., Russian), which should not be mapped and, similarly, a
> string written in Roman-based characters cannot be distinguished,
> on a character string basis, from the Roman-based characters of
> another language (English?) which, again, cannot be mapped.
>
> In either case, the mapping becomes readily plausible if the
> language, in addition to the content of the character string, is
> known, but is hard to think about without causing side-effects in
> other languages if not.
>
> Is that helpful?
> john