
Re: [idn] An ignorant question about TC<-> SC





On Wed, 24 Oct 2001 15:17:00 +0800 "James Seng/Personal"
<jseng@pobox.org.sg> writes:
> > On Wed, 24 Oct 2001 09:30:07 +0800 "James Seng/Personal"
> > <jseng@pobox.org.sg> writes:
> > > > From linguistic phoneme analysis, TC/SC are identical according
> > > > to all authoritative dictionaries and standards, with exceptions
> > > > listed (for dialectal and historical use).
> > >
> > > It should be "TC/SC Phrase" to be more exact.
> > >
> >
> > Why should it be Phrase? The standards published by the
> > Chinese are always about characters, and we are talking about
> > code points of UCS in [nameprep].  Are you referring to
> > input disambiguation, or are you talking about a dictionary?
> 
> Because accurate TC/SC (in dictionary) is usually done by "words"
> not "characters".
> 

That is for the readers of the dictionary, not for code tables.
Code tables refer to individual character mappings,
something like the GB or BIG5 tables.  A dictionary table is a
word/string mapping.  I, as well as Tseng and Huang, I
believe, was speaking of code tables.
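
To make the distinction concrete, here is a minimal sketch in Python.
The table entries and function names are illustrative assumptions only,
not excerpts from GB, BIG5, or any published listing.  The first table
maps one character to one character, as a code table does; the second
maps whole strings, as a dictionary does.

```python
# Character-level "code table" (hypothetical excerpt):
# one code point maps to one code point.
TC_TO_SC_CHAR = {"頭": "头", "髮": "发", "發": "发"}

# Word-level "dictionary table" (hypothetical excerpt):
# a whole string maps to a whole string.
SC_TO_TC_WORD = {"头发": "頭髮", "发展": "發展"}


def tc_to_sc(text):
    """Map each TC character to SC; unknown characters pass through."""
    return "".join(TC_TO_SC_CHAR.get(ch, ch) for ch in text)


def sc_to_tc(word):
    """Look up a whole SC word; unknown words pass through unchanged."""
    return SC_TO_TC_WORD.get(word, word)
```

In this sketch the TC to SC direction works one character at a time,
while the SC to TC direction needs the word table because two TC
characters share one SC form.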

> > If you are talking about stringprep, a string has to be
> > decomposed into UCS code points anyway for matching,
> > then the place to start dealing with CJK for IDN identifiers
> > is still in nameprep not in stringprep.
> 
> Huh? When did our discussion go into stringprep?
> 
> > And there are over 100,000 Han ideographs in the database already.
> 
> What "database" are you referring to?

The one started in China several years ago, though I have not
been following it.
 
> 
> There are over 120,000 Han ideographs in existence, including some
> very rare and old characters. Of these, 70,000 are already encoded
> in ISO 10646, and more are waiting for approval in the IRG.
> 
So you have the current state of the collection work.

> > Now, I'd like to say a few words about the 12,000
> > characters, not strings. Unicode has combined the CJK
> > into 21,003 code points  from possible 36,000.  In the
> > 21,003 code points, there are 2000 TC/SC cases,
> [snip]
> 
> If you are not happy with how ISO 10646 is designed, then bring
> it up with your ISO country representative.
> 

True, I am not happy with the result.  If I had worked on
it, I don't think I would have had a better idea either  :-)

> But ISO 10646 is the only character set we have at the moment which
> covers most of the scripts of the world. And no, designing one
> within this WG is definitely out of scope.
> 
> -James Seng
> 
I am not proposing to redesign the table.  I do not know
how you got that impression from my comments.  My
proposal is to make the Latin case mapping mechanism
more general and allow the TC/SC character set to be
treated in a similar way by the CJK community.  We
do not redefine TC/SC; it is up to the [Tsconv] author to provide
the TC/SC listing, and they can decide which subset of
TC/SC uses the case folding feature.

Liana