Re: [idn] An ignorant question about TC<-> SC
On Wed, 24 Oct 2001 15:17:00 +0800 "James Seng/Personal"
<jseng@pobox.org.sg> writes:
> > On Wed, 24 Oct 2001 09:30:07 +0800 "James Seng/Personal"
> > <jseng@pobox.org.sg> writes:
> > > > From linguistic on phoneme analysis, TC/SC are identical by
> > > > all authoritive dictionaries and standards with exceptions
> > > > listed (for dialective and historical use).
> > >
> > > It should be "TC/SC Phrase" to be more exact.
> > >
> >
> > Why it should be Phrase? The standards published by
> > Chinese always are characters and we are talking about
> > code points of UCS in [nameprep]. Are you referring to
> > input disambiguation or are you talking about a dictionary.
>
> Because accurate TC/SC (in dictionary) is usually done by "words"
> not "characters".
>
That is for the readers of the dictionary, not for code tables.
Code tables refer to individual character mappings, something
like the GB or BIG5 tables. A dictionary table is a word/string
mapping. I, as well as Tseng and Huang, I believe, was speaking
of code tables.
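
To make the distinction concrete, here is a minimal sketch in Python.
The two tables are tiny hand-picked examples of mine, not taken from
GB, BIG5, or any published standard, and the names are hypothetical.
A code table maps one character to one character; a dictionary table
maps whole words, which is what SC -> TC needs when one simplified
character stands for more than one traditional character.

```python
# Character-level "code table" (illustrative subset, an assumption):
# each TC code point maps to exactly one SC code point.
TC_TO_SC_CHAR = {"發": "发", "髮": "发", "頭": "头"}

# Word-level "dictionary table" (illustrative subset, an assumption):
# SC "发" becomes "髮" or "發" depending on the word it appears in.
SC_TO_TC_WORD = {"头发": "頭髮", "发展": "發展"}


def map_chars(text, table):
    """Map each character independently; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in text)


def map_word(word, table):
    """Map a whole word at once; unknown words pass through unchanged."""
    return table.get(word, word)
```

With these tables, map_chars("頭髮", TC_TO_SC_CHAR) and
map_chars("發展", TC_TO_SC_CHAR) need no context at all, while going
the other way for "发" only works through map_word.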
> > If you are talking about stringprep, a string has to be
> > decomposed into UCS code points anyway for matching,
> > then the place to start dealing with CJK for IDN identifiers
> > is still in nameprep not in stringprep.
>
> Huh? When is our discussion go into stringprep.
>
> > And there are over 100,000 han ideograph in database already.
>
> What "database" you referring to?
The one started in China several years ago, though I have not
been following it.
>
> There are over 120,000 han ideographs in existence including some very
> rare and old characters. Of which 70,000 already encoded in ISO10646 and
> there are more waiting for approval in the IRG.
>
So you have the current state of the collection work.
> > Now, I'd like to say a few words about the 12,000
> > characters, not strings. Unicode has combined the CJK
> > into 21,003 code points from possible 36,000. In the
> > 21,003 code points, there are 2000 TC/SC cases,
> [snip]
>
> If you are going to not happy with how ISO10646 is designed, then bring
> it up to your ISO country representive.
>
True, I am not happy with the result. Had I worked on it, I
don't think I would have had a better idea either :-)
> But ISO 10646 is the only character set we have at the moment which
> covers the most scripts of the world. And no, designing one within this
> WG is definately out of scope.
>
> -James Seng
>
I am not proposing to redesign the table. I do not know
how you got that impression from my comments. My
proposal is to make the Latin case mapping mechanism
more general, so that the TC/SC character set can be
treated in a similar way by the CJK community. We
do not redefine TC/SC; it is up to the [Tsconv] authors to
provide the TC/SC listing, and they can decide which subset
of TC/SC uses the case folding feature.
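
A rough sketch of what I mean, in Python. Everything here is an
assumption for illustration: the function names are hypothetical, the
table is a hand-picked subset and not the [Tsconv] listing, and
str.lower() only stands in for the case mapping step of [nameprep].
The point is that the TC/SC table is pluggable, so whoever supplies
the listing decides which subset gets folded.

```python
# Hypothetical TC -> SC pairs, hand-picked for illustration only;
# the real listing would come from the table authors.
TC_TO_SC = {"國": "国", "學": "学", "發": "发", "髮": "发"}


def fold(label, tcsc_table=TC_TO_SC):
    """Fold a label one code point at a time.

    Each character is first looked up in the TC/SC table, then
    lowercased. str.lower() is a stand-in for the real case mapping
    and leaves Han characters unchanged.
    """
    out = []
    for ch in label:
        ch = tcsc_table.get(ch, ch)  # TC -> SC, if the table lists it
        out.append(ch.lower())       # Latin (and other cased) folding
    return "".join(out)


def same_name(a, b, tcsc_table=TC_TO_SC):
    """Two labels match if they fold to the same string."""
    return fold(a, tcsc_table) == fold(b, tcsc_table)
```

So same_name("Example.中國", "example.中国") matches under the full
table, while passing a smaller table such as {"國": "国"} leaves "學"
and "学" distinct.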
Liana