
Re: [idn] An ignorant question about TC<-> SC



> On Wed, 24 Oct 2001 09:30:07 +0800 "James Seng/Personal"
> <jseng@pobox.org.sg> writes:
> > >  From linguistic phoneme analysis, TC/SC are identical according
> > > to all authoritative dictionaries and standards, with exceptions
> > > listed (for dialectal and historical use).
> >
> > It should be "TC/SC Phrase" to be more exact.
> >
>
> Why should it be Phrase? The standards published by the
> Chinese are always about characters, and we are talking about
> code points of UCS in [nameprep].  Are you referring to
> input disambiguation, or are you talking about a dictionary?

Because accurate TC/SC conversion (in a dictionary) is usually done by
"words", not "characters".
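
To illustrate the point, here is a minimal sketch, not a real converter.
The two tables and the function names are hypothetical and hold only
three characters and two words. It shows one SC character (发) with two
TC candidates (發, 髮), which a per-character table cannot choose
between, while a word table can.

```python
# Hypothetical, deliberately tiny tables for illustration only.
# A real TC/SC dictionary would hold many thousands of entries.
CHAR_MAP = {"头": ["頭"], "发": ["發", "髮"], "展": ["展"]}
WORD_MAP = {"头发": "頭髮", "发展": "發展"}


def candidates_by_char(text):
    """Return the list of TC candidates for each SC character."""
    return [CHAR_MAP.get(ch, [ch]) for ch in text]


def convert_by_word(text):
    """Greedy longest-match conversion using the word table.

    Falls back to the first per-character candidate when no word matches.
    """
    max_len = max(len(word) for word in WORD_MAP)
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible chunk first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in WORD_MAP:
                out.append(WORD_MAP[chunk])
                i += n
                break
        else:
            # No word matched: take the first character-level candidate.
            ch = text[i]
            out.append(CHAR_MAP.get(ch, [ch])[0])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(candidates_by_char("发"))   # one SC character, two TC candidates
    print(convert_by_word("头发"))    # word context selects 髮
    print(convert_by_word("发展"))    # word context selects 發
```

Without word context, the per-character fallback simply takes the first
candidate, which is exactly the guess a character-only table is forced
to make.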

> If you are talking about stringprep, a string has to be
> decomposed into UCS code points anyway for matching,
> then the place to start dealing with CJK for IDN identifiers
> is still in nameprep not in stringprep.

Huh? When did our discussion move to stringprep?

> And there are over 100,000 Han ideographs in the database already.

What "database" are you referring to?

There are over 120,000 Han ideographs in existence, including some very
rare and old characters. Of these, 70,000 are already encoded in ISO 10646,
and more are waiting for approval in the IRG.

> Now, I'd like to say a few words about the 12,000
> characters, not strings. Unicode has combined the CJK
> into 21,003 code points  from possible 36,000.  In the
> 21,003 code points, there are 2000 TC/SC cases,
[snip]

If you are not happy with how ISO 10646 is designed, then bring it up
with your ISO country representative.

But ISO 10646 is the only character set we have at the moment that
covers most of the world's scripts. And no, designing one within this
WG is definitely out of scope.

-James Seng