[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [idn] opting out of SC/TC equivalence





--On 29. august 2001 03:24 -0700 liana.ydisg@juno.com wrote:

>  Hi, Harald
>
> You are quite right about how Chinese linguistics works: the
> character set is never complete!  The number of formally
> classified characters already exceeds 100,000.  I am not
> advocating including the whole set in [nameprep], nor am I
> advocating excluding any characters, since that depends on
> what users want.  But I do recommend including the characters
> covered by the Big5 and GB standards, about 23,658 code points
> in Unicode, of which 2238 are the TC/SC equivalences and 14
> the radical equivalences we have been discussing.
>
> There are two questions: 1) how do we implement TC/SC
> equivalence in the Unicode standard, and 2) how do we
> implement CJK in [nameprep]?
>
> For 1), I suggested about 1100 (I don't have the table at this
> time) half-width new Unicode code points for putting radicals
> into the Unicode standard, just as other scripts did for
> diacritics.  This gives a base for decomposing a character for
> IDN identifiers (though it is not good enough for a user input
> interface).  Within these radicals, 1886 TC/SC equivalences
> can be addressed.  The other 352 TC/SC equivalences and the
> 14 radical equivalences can be addressed in a supplementary
> document, which should also state how to treat the 1886 TC/SC
> equivalences based on their radicals.  The radical classes
> have varied considerably from early history to the modern era.
> The definition of radical class I am proposing is aimed at a
> Han speaker, departing from what a computer programmer or a
> dictionary editor would use:
>
> The CJK radical set has two sections.  The first section is a
> traditional dictionary radical set, such as the "Kangxi" or
> "Cihai" radical set.  The second section consists of any
> characters that have themselves been used as radicals in other
> characters.  The first set contains about 200 radicals,
> depending on which dictionary we base it on; the second
> contains about 1000.  Since the two sets overlap, my estimate
> is 1100 code points.

The Unicode 3.0 standard contains two blocks of radicals - the 214 
traditional KangXi radicals encoded at U+2F00 through U+2FD5, and the CJK 
Radicals Supplement block, U+2E80 through U+2EF3.
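For what it's worth, the KangXi radical block already maps back to the unified ideographs via compatibility decompositions in the Unicode Character Database, which you can check with Python's unicodedata module (Python here is just for illustration):

```python
# KangXi radicals carry compatibility decompositions to the
# corresponding CJK unified ideographs; NFKD exposes them.
import unicodedata

radical_one = "\u2F00"  # KANGXI RADICAL ONE
unified = unicodedata.normalize("NFKD", radical_one)
assert unified == "\u4E00"  # CJK UNIFIED IDEOGRAPH-4E00 (一)

print(unicodedata.name(radical_one))  # prints "KANGXI RADICAL ONE"
```

So the radical code points themselves are already equated with ordinary ideographs under NFKD; decomposing arbitrary Han characters *into* radicals is the part Unicode does not define.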

There is some language on using those characters to construct unencoded Han 
characters in section 10.1 of Unicode 3.0.

Is this the same kind of thing you are talking about?
If changes to Unicode are needed, this has to go to the Unicode Consortium 
and ISO; this group can't do much about it.....

>
> 2) I assume (since I did not check) that the roughly 23,658
> code points in Unicode 3.0 already include Hanja and Kanji.
> The other code points in Unicode, and future newcomers, can
> be treated on an as-needed basis.  This means that only when
> someone uses a character in a name at registration time, and
> supplies the name with a code point in Unicode, is that code
> point added to the zonefile.  (Not in [nameprep]?)
> If such a character is not in Unicode, then a bitmap of the
> new character has to be provided in the zonefile.  This is
> the reason I propose that a "Request for Reference to be sent"
> protocol be drafted.

Better check....

So you foresee a system where

- User upgrades his data entry system
- User types a new ideograph into his system
- The client software calls out to some global repository for the
  canonical decomposition of the new ideograph
- The DNS system looks up the decomposition, not the original codepoint
- The server knows enough to canonically decompose the zonefile's ideograph
- All this works correctly for software written by Indian programmers for
  American companies?
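Spelled out in code, the flow in that list might look something like the sketch below. The "global repository", the decomposition tuples, and the zonefile contents are all stand-ins invented for illustration - no such repository or protocol exists.

```python
from typing import Optional

# Hypothetical global repository: ideograph -> canonical radical
# decomposition (tuples of made-up radical code points, purely
# illustrative).
REPOSITORY = {
    "\u56FD": ("\u2F1E", "\u2F5D"),
}

# Zonefile keyed by the decomposition rather than the original
# code point, as the proposal would require.
ZONEFILE = {
    ("\u2F1E", "\u2F5D"): "192.0.2.1",
}

def resolve(ideograph: str) -> Optional[str]:
    """Client asks the repository for the canonical decomposition;
    the DNS side then looks the name up by that decomposition."""
    decomposition = REPOSITORY.get(ideograph)
    if decomposition is None:
        return None  # character unknown to the repository
    return ZONEFILE.get(decomposition)

assert resolve("\u56FD") == "192.0.2.1"
assert resolve("\u6771") is None  # not in the repository
```

Even in this toy form, every resolver, registry, and client library has to agree on the repository and on the decomposition data - which is exactly the interoperability burden being questioned.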

Seems complicated to me....