
Re: [idn] An ignorant question about TC<-> SC





On Wed, 24 Oct 2001 09:30:07 +0800 "James Seng/Personal"
<jseng@pobox.org.sg> writes:
> >  From linguistic on phoneme analysis,  TC/SC are identical by
> > all authoritive dictionaries and standards with exceptions
> > listed (for dialective and historical use).
> 
> It should be "TC/SC Phrase" to be more exact.
>

Why should it be "Phrase"? Chinese standards are always 
published in terms of characters, and we are talking about 
UCS code points in [nameprep].  Are you referring to 
input disambiguation, or are you talking about a dictionary?

If you are talking about stringprep, a string has to be 
decomposed into UCS code points anyway for matching,
so the place to start dealing with CJK for IDN identifiers 
is still nameprep, not stringprep.
 
> There is also a problem with "identical by all authoritive
> dictionaries". All dictionaries have (slight/some) differences in what
> they considered identical. The devil is in the level of details.

And we should follow the many standards bodies in drawing 
the line that cuts off the details at appropriate places.  I 
might call this a transformation from analog to digital :-)

> 
> > To treat UCS code points on the same base, the 4,000 to 20,000
> > number needs to be doubled, and treat TC/SC in a general way
> > is a test for correctly treat 8,000 to 56,000 symbols for the long
> > run.
> 
> There are 70,000+ han ideograph in ISO10646:2001.
> 
> -James Seng
> 

And there are already over 100,000 Han ideographs in databases.  
But how many of them will be used in a common name? 
How do we know?  How do we design a system to 
accommodate all of them? 

The conventional way is to regulate them with tables.  
All of the CJK countries have published a first 4000 
characters as "required" for education standards. Then 
comes the next 4000, which are often used in names.  

Then the next 4000 are nice to have for editing 
software. This brings the number of characters to 
12,000, the BIG5 standard, and it is a good indicator 
of how many characters are really needed for IDN 
applications.  There will always be unhappy users who 
cannot find the character they want.  But for IDN 
applications we need to consider the 12,000 first and 
make the majority of users happy.  To cover the 
12,000 necessary identifiers for each user group, 
the 21,003-character UCS CJK release is a good base 
for the IDN group to consider.  

Whether the rest of the CJK characters are supported in 
IDN, and how, should be an open question until the first 
21,003 have been deployed for at least 10 years. (Well, I 
throw out the number to mean that there is little demand 
to use these characters, and if they are allowed, the 
tendency is toward chaos even for Chinese, as was the case 
in two experiments, each lasting 5 - 10 years, in the last 
50 years.) 

At the same time, we should consider a mechanism that 
lets people use the rest of the code points with minimal 
support, since they are less controversial anyway. Some 
mechanism like AMC-Z may be good enough.   

Now, I'd like to say a few words about the 12,000 
characters, not strings. Unicode has unified the CJK 
characters into 21,003 code points from a possible 36,000. 
Among the 21,003 code points there are 2000 TC/SC cases, 
which may bring the 21,000 down to 18,000 due to 
TC/SC in Chinese and Kanji. The process is as simple 
as the Latin case map, so why can it not be treated the 
same way? From a code mapping point of view, a table of 
56,000 entries that includes all of UCS Plane 0 is 
politically more correct than supporting only a few blocks, 
"officially" worded as "Scripts with case mapping", from 
Plane 0, Plane 1 and Plane 3, while leaving the Plane 0 CJK 
block of 21,003 to stringprep.  It sounds like a language 
tag zigzagging through UCS space without a flag: good 
for a few, hard for others to follow, and defeating the 
hard-worked UCS table at hand.
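
To make the comparison with the Latin case map concrete, here is a 
minimal sketch in Python, not a proposal for the actual nameprep 
table. The names TC_TO_SC, fold and same_identifier are hypothetical, 
the three pairs are only a sample of the roughly 2000 cases mentioned 
above, and str.lower() stands in for the real case-folding step.

```python
# Hypothetical sample of a TC -> SC folding table (illustrative only;
# a real table would carry roughly 2000 entries).
TC_TO_SC = {
    "\u570b": "\u56fd",  # U+570B -> U+56FD
    "\u5b78": "\u5b66",  # U+5B78 -> U+5B66
    "\u9ad4": "\u4f53",  # U+9AD4 -> U+4F53
}


def fold(label: str) -> str:
    """Apply the Latin case map, then the TC/SC map, code point by code point."""
    lowered = label.lower()  # stand-in for the case-mapping step
    return "".join(TC_TO_SC.get(ch, ch) for ch in lowered)


def same_identifier(a: str, b: str) -> bool:
    """Two labels match if they are equal after folding."""
    return fold(a) == fold(b)
```

Under these assumptions, same_identifier("ABC\u9ad4", "abc\u4f53") 
holds in the same way that "ABC" matches "abc": both mappings are 
one table lookup per code point.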

Liana