Re: [idn] opting out of SC/TC equivalence
On Wed, 29 Aug 2001 13:24:45 +0200 Harald Tveit Alvestrand
<harald@alvestrand.no> writes:
>
>
> --On 29. august 2001 03:24 -0700 liana.ydisg@juno.com wrote:
>
> > Hi, Harald
> >
> > You are quite right about how Chinese linguistics works:
> > the character set is never complete! The formally classified
> > characters already exceed 100,000. I am not advocating
> > including the whole set in [nameprep], nor am I advocating
> > excluding any characters, since that is up to the users. But I
> > do recommend including the characters covered by the Big5 and
> > GB standards, about 23,658 code points in Unicode, of which
> > 2238 are the TC/SC equivalences and 14 the radical
> > equivalences we have been discussing.
> >
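As a purely illustrative sketch of such a TC/SC fold (the three pairs
below are real traditional/simplified equivalences, but the table and
the names are invented for this example; Python is used only for
concreteness):

    # Hypothetical excerpt of the ~2238-pair TC/SC equivalence table;
    # the real table would have to be published alongside [nameprep].
    TC_TO_SC = {
        0x570B: 0x56FD,   # "country": traditional -> simplified
        0x9F8D: 0x9F99,   # "dragon":  traditional -> simplified
        0x6F22: 0x6C49,   # "Han":     traditional -> simplified
    }

    def fold_tc_to_sc(label):
        # Map each traditional form to its simplified equivalent,
        # leaving every other code point alone.
        return ''.join(chr(TC_TO_SC.get(ord(c), ord(c))) for c in label)
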
> > There are two questions: 1) how do we implement TC/SC
> > equivalence in the Unicode standard, and 2) how do we
> > implement CJK in [nameprep].
> >
> > For 1), I suggested about 1100 (I don't have the table at this
> > time) half-width new Unicode code points to put radicals into
> > the Unicode standard, just as other scripts did for diacritics.
> > This gives a basis for decomposing a character for IDN
> > identifiers (though it is not good enough for a user input
> > interface). With these radicals, 1886 of the TC/SC equivalences
> > can be addressed. The remaining 352 TC/SC equivalences and the
> > 14 radical equivalences can be addressed in a supplement
> > document, which would also state how to treat the other 1886
> > TC/SC equivalences based on their radicals. The radical
> > classification has changed considerably from early history to
> > the present, and the definition I am proposing is aimed at a
> > Han speaker rather than at a computer programmer or a
> > dictionary editor:
> >
> > The CJK radical set has two sections: the first section is the
> > traditional dictionary radical set, such as the "Kangxi" and
> > "Cihai" radical sets. The second section is any character that
> > has been used as a radical in another character. The first set
> > is about 200, depending on which dictionary we base it on, and
> > the second set is about 1000. Since the two sets overlap, my
> > estimate is 1100 code points.
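(To make that arithmetic concrete, a toy sketch of the two sections;
the members shown are a three-character excerpt each, not the real
lists:)

    # Section 1: traditional dictionary radicals (Kangxi/Cihai), ~200 in total.
    dictionary_radicals = {0x4E00, 0x4EBA, 0x6C34}     # excerpt only
    # Section 2: characters used as radicals in other characters, ~1000 in total.
    used_as_radicals    = {0x4EBA, 0x6C34, 0x9A6C}     # excerpt only

    # The proposed CJK radical set is the union; because the two
    # sections overlap, the full union is what the ~1100-code-point
    # estimate above counts.
    cjk_radical_set = dictionary_radicals | used_as_radicals
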
>
> The Unicode 3.0 standard contains two blocks of radicals: the 214
> traditional KangXi radicals encoded at U+2F00 through U+2FD5, and
> the CJK Radicals Supplement block at U+2E80 through U+2EF3.
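For concreteness, here is how those two blocks behave under the
form-KC normalization that [nameprep] builds on (a minimal Python
sketch; the helper names are mine, and it assumes current Unicode
character tables):

    import unicodedata

    def in_kangxi_radicals(cp):
        return 0x2F00 <= cp <= 0x2FD5    # KangXi Radicals block, 214 characters

    def in_radicals_supplement(cp):
        return 0x2E80 <= cp <= 0x2EF3    # CJK Radicals Supplement (U+2E9A is unassigned)

    # The KangXi radicals carry compatibility decompositions to the
    # corresponding unified ideographs, so form KC already folds them:
    assert unicodedata.normalize('NFKC', '\u2F00') == '\u4E00'   # RADICAL ONE -> U+4E00
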
>
> There is some language on using those characters to construct
> unencoded Han characters in section 10.1 of Unicode 3.0.
>
> Is this the same kind of thing you are talking about?
> If changes to Unicode are needed, this has to go to the Unicode
> Consortium and ISO; this group can't do much about it.....
>
> >
> > 2) I assume (since I did not check) that the roughly 23,658
> > code points in Unicode 3.0 already include Hanja and Kanji.
> > The other code points in Unicode, and future newcomers, can be
> > treated on an as-needed basis. That is, only when someone uses
> > a character in a name at registration time, and supplies the
> > name with a code point in Unicode, is that code point added to
> > the zonefile. (Not in [nameprep]?) If such a character is not
> > in Unicode, then a bitmap of the new character has to be
> > provided in the zonefile. This is the reason I propose that a
> > "Request for Reference to be sent" protocol be drafted.
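(A rough sketch of that registration-time rule; the function and
parameter names are invented for illustration, and no such protocol
exists yet:)

    def register_label(label, zone_supported, in_unicode):
        # zone_supported: set of code points this zone already carries.
        # in_unicode: predicate saying whether a character is encoded at all.
        for ch in label:
            if not in_unicode(ch):
                # Not yet encoded anywhere: ask the registrant for a
                # reference (e.g. a bitmap) to store alongside the zone data.
                return ('request-reference', ch)
            if ch not in zone_supported:
                zone_supported.add(ch)    # add the code point on first use
        return ('ok', None)
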
>
> Better check....
>
> So you foresee a system where
> - User upgrades his data entry system
Not needed, provided that 1) the supplier gives them an existing
display-code-to-ACE map which includes all the local TC/SC
equivalence and rendering procedures affecting Han characters, and
2) before a transport protocol is called, the application calls the
[IDNA] ACE conversion and gets the tagged ACE version to ship. In
other words, the application treats [IDNA] as a black-box operation,
roughly as in the sketch below.
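(Only as a sketch: the folding table and to_ace() are placeholders,
and the particular ACE shown, Punycode with an "xn--" prefix, is just
one candidate encoding used here because it is easy to demonstrate,
not the WG's choice:)

    def prepare_for_transport(display_label, tc_to_sc):
        # tc_to_sc is the local TC/SC folding table shipped with the
        # supplier's display-code-to-ACE map.
        folded = ''.join(chr(tc_to_sc.get(ord(c), ord(c))) for c in display_label)
        return to_ace(folded)               # the [IDNA] "black box" step

    def to_ace(label):
        # Stand-in for whatever ACE conversion is finally specified.
        return 'xn--' + label.encode('punycode').decode('ascii')
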
> - User types a new ideograph into his system
This is a future possibility, and it is up to the application to deal
with it; that IS out of this WG's scope. [nameprep] might be updated,
say, once a year to collect the new characters that have already
proved stable, working much like Unicode updates.
> - The client software calls out to some global repository for the
> new canonical decomposition of the new ideograph
Yes, but only at registration time, for administration of such a use.
In fact, only Chinese has this repository, and I believe a few
experts would really like to be able to use it. Why do we need the
Unicode Consortium to make this type of decision?
> - The DNS system looks up the decomposition, not the original
> codepoint
No. The IDN label only gets this from the registering server, and
only if a client asks for it (that is, if the client knows it uses
Unicode as its display code). This is outside DNS; DNS only deals
with identifiers.
> - The server knows enough to canonically decompose the zonefile's
> ideograph
No. There are two separate issues here. DNS does not know about
this. The server that wants to display the zonefile's ideograph has
to know there is a request for the full code to be shipped and put
into its zonefile; it does not need to know the decomposition
procedure. The decomposition can be handled by [IDNA] with a tagged
conversion. That is all the Chinese tagged conversion, reversion and
display procedures do, and the same goes for the tagged Kanji and
Hanja procedures. That is the reason I propose enlarging the radical
section in Unicode to cover about 1000 radicals.
> - All this works correctly for software written by Indian
> programmers for American companies?
Again, that is why I propose enlarging the radical section in Unicode
to cover about 1000 radicals, and why I propose that the WG agree on
a versioning table that maps GB, Big5, KSC and JIS to a
transliterated ACE map, roughly along the lines sketched below.
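(A sketch of what that versioning table could boil down to, with
Python codec names as stand-ins; exactly which charset revisions
count as GB, KSC and JIS is what the table itself would have to pin
down:)

    LEGACY_CODECS = {
        'GB':   'gb2312',     # assumption: GB here means GB 2312
        'Big5': 'big5',
        'KSC':  'euc-kr',     # assumption: KS X 1001 via EUC-KR
        'JIS':  'shift_jis',  # assumption: JIS via Shift_JIS
    }

    def legacy_label_to_unicode(raw_bytes, charset):
        # Step one of the map: legacy encoding -> Unicode.  The result
        # would then go through [nameprep] and the ACE conversion.
        return raw_bytes.decode(LEGACY_CODECS[charset])
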
>
> Seems complicated to me....
>