
Re: [idn] opting out of SC/TC equivalence





On Wed, 29 Aug 2001 13:24:45 +0200 Harald Tveit Alvestrand
<harald@alvestrand.no> writes:
> 
> 
> --On 29. august 2001 03:24 -0700 liana.ydisg@juno.com wrote:
> 
> >  Hi, Harald
> >
> > You are quite right about how Chinese linguistics
> > works; the character set is never complete!  There are
> > already more than 100,000 formally classified characters.
> > I am not advocating including the whole set in [nameprep]
> > at all.  Nor do I advocate excluding any characters, since
> > that is up to what the user wants.  But I do recommend including
> > the characters in the Big5 and GB standards, that is
> > about 23,658 code points in Unicode, of which 2238 are the TC/SC
> > equivalences and 14 the radical equivalences we have been
> > discussing.
> >
> > There are two questions: 1) how do we implement TC/SC
> > in the Unicode standard, and 2) how do we implement CJK in
> > [nameprep].
> >
> > For 1), I said about 1100 (I don't have the table at this time)
> > half-sized new Unicode code points would put the radicals into the
> > Unicode standard, just as other scripts did for diacritics.  This
> > gives a base for decomposing a character for IDN identifiers
> > (but not good enough for a user input interface).
> > Within these radicals, 1886 TC/SC equivalences can be
> > addressed.  The 352 TC/SC equivalences and 14 radical
> > equivalences can be addressed in a supplement document,
> > which shall also state how to treat the other 1886 TC/SC
> > equivalences based on their radicals.  The radical class has
> > changed considerably from early history to the recent era.  The
> > class of radical definition I am proposing is for a Han speaker,
> > departing from that of a computer programmer or a dictionary
> > editor:
> >
> > The CJK radical set has two sections.  The first section is the
> > traditional dictionary radical set, such as the "Kangxi" and
> > "Cihai" radical sets.  The second section is every character that
> > has been used as a radical in other characters.
> > The first set is about 200, depending on which dictionary we
> > base it on; the second set is about 1000.  Since the two sets
> > overlap, my estimate is 1100 code points.
> 
> The Unicode 3.0 standard contains two blocks of radicals - the 214
> traditional KangXi radicals encoded in U+2F00 through U+2FD5, and the
> CJK Radicals Extension block U+2E80 through U+2EF3.
> 
> There is some language on using those characters to construct
> unencoded Han characters in section 10.1 of Unicode 3.0.
> 
> Is this the same kind of thing you are talking about?
> If changes to Unicode are needed, this has to go to the Unicode
> Consortium and ISO; this group can't do much about it.....
> 
> >
> > 2) I assume (since I did not check) that the roughly 23,658 code
> > points in Unicode 3.0 include Hanja and Kanji.
> > The other code points in Unicode, and future newcomers,
> > can be treated on an as-needed basis.  This means that only when
> > someone has used a character in a name at registration time, and
> > supplied the name with a code point in Unicode, is the
> > code point added to the zonefile.  (Not in [nameprep]?)
> > If such a character is not in Unicode, then a bitmap of the
> > new character has to be provided in the zonefile.  This is the
> > reason I propose that a "Request for Reference to be sent"
> > protocol be drafted.
> 
> Better check....
> 
> So you foresee a system where

> - User upgrades his data entry system
Not needed, if 1) the supplier gives them a map from the existing
display code to ACE code which includes all the local TC/SC
equivalence rendering procedures affecting Han characters, and 2)
before a transport protocol call, the application calls for [IDNA]
ACE conversion and gets the tagged ACE version to ship.  So this is
the application calling [IDNA] as a black-box operation.
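As a rough sketch of that black-box call, the application-side step
could look like the following Python.  The two-entry SC_TO_TC table
and the "xn--" prefix are illustrative assumptions for this sketch,
not something taken from [IDNA] or [nameprep] themselves:

```python
# Sketch: normalize a label through a local SC->TC equivalence map,
# then produce an ACE (ASCII-compatible encoding) form via Python's
# punycode codec.  SC_TO_TC here is a toy table; a real deployment
# would carry the full local TC/SC equivalence data.
SC_TO_TC = {
    "\u56fd": "\u570b",  # 国 -> 國 (country)
    "\u4e1c": "\u6771",  # 东 -> 東 (east)
}

def normalize_sc_tc(label: str) -> str:
    """Map each simplified character to its traditional equivalent."""
    return "".join(SC_TO_TC.get(ch, ch) for ch in label)

def to_ace(label: str) -> str:
    """Normalize, then encode non-ASCII labels as an ACE label."""
    normalized = normalize_sc_tc(label)
    if normalized.isascii():
        return normalized  # plain ASCII labels need no ACE form
    return "xn--" + normalized.encode("punycode").decode("ascii")
```

With such a map in place, a label typed with 国 and one typed with 國
come out as the same ACE string, which is the TC/SC folding effect
being discussed.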

> - User types a new ideograph into his system
This is a future possibility, and up to the application to deal with.
That IS out of this WG's scope.  [nameprep] may be updated, say once
a year, to collect the new characters that have already been proven
stable, working like Unicode updates.

> - The client software calls out to some global repository for the
>   new canonical ideograph decomposition of the new ideograph
Yes, at registration time, for administration of such a use only.
In fact, only Chinese has this repository, and I believe a few
experts would really like to be able to use it.  Why do we need the
Unicode Consortium to make this type of decision?

> - The DNS system looks up the decomposition, not the original 
> codepoint
No.  The IDN label only gets this from the registering server, if it
is asked for by a client (that is, if the client knows it uses
Unicode as its display code).  This is outside DNS.  DNS only deals
with identifiers.

> - The server knows enough to canonically decompose the zonefile's 
> ideograph
No.  There are two separate issues here.  DNS doesn't know about
this.  The server that wants to display the zonefile's ideograph has
to know there is a request for the full code to be shipped and put
in its zonefile.  It does not need to know the decomposition
procedure.  The decomposition can be handled by [IDNA] with tagged
conversion.

This is all that a Chinese tagged conversion, reversion, and display
procedure does, and the same goes for the tagged Kanji and Hanja
procedures.  That is the reason I propose to expand the radical
section in Unicode to cover 1000 radicals.
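To make the radical-decomposition idea concrete, here is a minimal
sketch.  The RADICAL_PARTS table is hypothetical (the two entries
shown follow standard dictionary analyses, but the table itself is
invented for illustration; a real one would cover the full radical
inventory discussed above):

```python
# Sketch: decompose Han characters into component radicals using a
# hypothetical RADICAL_PARTS table.  Characters absent from the
# table are treated as atomic.
RADICAL_PARTS = {
    "\u597d": ["\u5973", "\u5b50"],  # 好 = 女 + 子
    "\u660e": ["\u65e5", "\u6708"],  # 明 = 日 + 月
}

def decompose(label):
    """Flatten a label into its radical sequence."""
    parts = []
    for ch in label:
        parts.extend(RADICAL_PARTS.get(ch, [ch]))
    return parts

def same_radical_form(a, b):
    """Two labels match if their radical sequences are identical."""
    return decompose(a) == decompose(b)
```

The point of the tagged conversion is exactly this kind of
table-driven mapping: the server never needs the procedure, only the
resulting identifier.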

> - All this works correctly for software written by Indian 
> programmers for
>   American companies?

That is the reason I propose to expand the radical section in
Unicode to cover 1000 radicals, and for the WG to agree on a
versioning table mapping GB, Big5, KSC, and JIS to transliterated
ACE.
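A rough sketch of what such a versioning table buys: decode the
legacy bytes to Unicode with an existing codec, then apply the ACE
step, so labels entered under different national encodings converge.
Python ships Big5 and GB2312 codecs; the "xn--" punycode convention
below is an illustrative assumption, not a citation of any draft:

```python
def legacy_to_ace(raw: bytes, charset: str) -> str:
    """Decode bytes in a legacy CJK charset (e.g. 'big5', 'gb2312',
    'euc_kr' for KSC, 'shift_jis' for JIS) and emit an ACE label.
    The 'xn--' prefix is an assumed convention for this sketch."""
    label = raw.decode(charset)
    if label.isascii():
        return label  # ASCII labels pass through unchanged
    return "xn--" + label.encode("punycode").decode("ascii")
```

For characters shared by the charsets, the Big5 bytes and the GB
bytes for the same name then map to one and the same ACE label, which
is what a WG-agreed versioning table would pin down.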

> 
> Seems complicated to me....
>