Re: [idn] An ignorant question about TC<-> SC
On Wed, 24 Oct 2001 09:30:07 +0800 "James Seng/Personal"
<jseng@pobox.org.sg> writes:
> > From linguistic on phoneme analysis, TC/SC are identical by
> > all authoritive dictionaries and standards with exceptions
> > listed (for dialective and historical use).
>
> It should be "TC/SC Phrase" to be more exact.
>
Why should it be "Phrase"? Chinese standards are always
published in terms of characters, and we are talking about
UCS code points in [nameprep]. Are you referring to
input disambiguation, or are you talking about a dictionary?
If you are talking about stringprep, a string has to be
decomposed into UCS code points for matching anyway,
so the place to start dealing with CJK for IDN identifiers
is still nameprep, not stringprep.
> There is also a problem with "identical by all authoritive
> dictionaries". All dictionaries have (slight/some) differences in
> what they considered identical. The devil is in the level of details.
And we should follow the many standards bodies in drawing
the line, cutting off the details at appropriate places. I might
call this a transformation from analog to digital :-)
>
> > To treat UCS code points on the same base, the 4,000 to 20,000
> > number needs to be doubled, and treat TC/SC in a general way
> > is a test for correctly treat 8,000 to 56,000 symbols for the long
> > run.
>
> There are 70,000+ han ideograph in ISO10646:2001.
>
> -James Seng
>
And there are already over 100,000 Han ideographs in databases.
But how many of them would be used in a common name?
How do we know? How do we design a system to
accommodate all of them?
The conventional way is to regulate them with tables.
All of the CJK communities have published a first 4,000
characters as "required" in their education standards.
Then comes the next 4,000, which are often used in names.
The next 4,000 after that are nice to have in editing
software. This brings the number of characters to
12,000, about the size of the BIG5 standard, and it is a
good indicator of how many characters are really needed
for the IDN application. There will always be unhappy users
who cannot find the character they want. But for the IDN
application we need to consider the 12,000 first and
make the majority of users happy. To cover the
12,000 necessary characters for each user group,
the 21,003-character UCS CJK release is a good base for
the IDN group to consider.
Whether the rest of the CJK characters are supported in
IDN, and how, should remain an open question until the
first 21,003 have been deployed for at least 10 years.
(Well, I throw out that number to mean that there
is little demand for these characters, and if they are
allowed, the tendency is toward chaos even for
Chinese, as was the case in two experiments in the
last 50 years, each lasting 5 - 10 years.)
At the same time, we should consider a mechanism that
lets people use the rest of the code points with
minimal support, since they are less controversial
anyway. Some mechanism like AMC-Z may be good
enough.
Now, I'd like to say a few words about the 12,000
characters, not strings. Unicode has unified CJK
into 21,003 code points from a possible 36,000. Among the
21,003 code points there are 2,000 TC/SC cases,
which may bring the 21,000 down to 18,000 due to
TC/SC in Chinese and Kanji. The process is as simple
as the Latin case map, so why can it not be treated the
same way? From a code mapping point of view, a table of
56,000 entries that includes all of UCS Plane 0 is politically
more correct than supporting only a few blocks, "officially"
worded as "Scripts with case mapping", from
Plane 0, Plane 1 and Plane 3, and leaving the Plane 0 CJK
block of 21,003 to stringprep. It sounds like a language
tag zigzagging through UCS space without a flag: good
for a few, hard for others to follow, and defeating the
purpose of the UCS table that so much work went into.
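
To make the comparison with the Latin case map concrete, here is a
minimal Python sketch. It is only an illustration under stated
assumptions, not text from any nameprep draft: the three-entry table
TC_TO_SC and the names fold_label and same_identifier are
hypothetical, and a real table would hold on the order of 2,000
entries taken from a published standard.

```python
# Hypothetical, tiny TC -> SC table for illustration only.
# A real table would come from a published standard.
TC_TO_SC = {
    "\u570b": "\u56fd",  # 國 -> 国
    "\u9f8d": "\u9f99",  # 龍 -> 龙
    "\u9580": "\u95e8",  # 門 -> 门
}


def fold_label(label: str) -> str:
    """Fold Latin case first, then map each code point through the
    TC -> SC table. Code points without an entry pass through as-is."""
    return "".join(TC_TO_SC.get(ch, ch) for ch in label.casefold())


def same_identifier(a: str, b: str) -> bool:
    """Two labels match if they are equal after folding."""
    return fold_label(a) == fold_label(b)


if __name__ == "__main__":
    print(same_identifier("Example\u570b", "example\u56fd"))
```

The point of the sketch is that both foldings are plain per-code-point
table lookups, so one table can serve both. Whether a one-to-one table
is adequate for every TC/SC pair is exactly what is being debated in
this thread.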
Liana