
Re: [idn] opting out of SC/TC equivalence

To: harald@alvestrand.no
Subject: Re: [idn] opting out of SC/TC equivalence
From: liana.ydisg@juno.com
Date: Wed, 29 Aug 2001 03:24:48 -0700
Cc: liana.ydisg@juno.com, James@Seng.cc, tsenglm@cc.ncu.edu.tw, huangk@alum.sinica.edu, idn@ops.ietf.org, dhc@dcrocker.net

 Hi, Harald

You are quite right about how Chinese linguistics works:
the character set is never complete!  Right now, the
formally classified characters already exceed 100,000.
I do not advocate including the whole set in [nameprep]
at all.  Neither do I advocate excluding any characters, since
that depends on what the user wants.  But I do recommend
including the characters in the Big5 and GB standards, that is,
about 23,658 code points in Unicode, of which 2238 are the TC/SC
equivalences and 14 are the radical equivalences we have been
discussing.

There are two questions: 1) how do we implement TC/SC
in the Unicode standard, and 2) how do we implement CJK in
[nameprep].

For 1), the 1100 half-size new Unicode code points I mentioned
(I don't have the table with me at this time) are meant to put
radicals into the Unicode standard, just as other scripts did
for diacritics.  This gives a basis for decomposing a character
for IDN identifiers (but is not good enough for a user input
interface).  With these radicals, 1886 of the TC/SC equivalences
can be addressed.  The remaining 352 TC/SC equivalences and the
14 radical equivalences can be addressed in a supplementary
document, which should also state how to treat the other 1886
TC/SC equivalences based on their radicals.  The classification
of radicals has changed considerably from early history to the
recent era.  The radical definition I am proposing is meant for
a Han speaker, and departs from that of a computer programmer
or a dictionary editor:

The CJK radical set has two sections.  The first section is a
traditional dictionary radical set, such as the "Kangxi" or
"Cihai" radical set.  The second section contains any character
that has been used as a radical in other characters.
The first set is about 200, depending on which dictionary we
base it on; the second set is about 1000.  Since the two sets
overlap, my estimate is 1100 code points.
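
A minimal sketch of the two-tier treatment described under 1), in
Python: TC/SC pairs that differ only in a radical are folded through
a small radical table, and pairs that cannot be derived that way come
from a separate supplement table.  Every table below is a tiny
hand-picked sample, and the names (RADICAL_FOLD, SUPPLEMENT,
COMPONENTS, fold_char, fold_label) are hypothetical.  Unicode does not
provide radical decompositions of this kind, so they are assumed here.

```python
# Illustrative sample data only; these are NOT the proposed tables.

# Radical-level TC -> SC equivalences (sample of the radical table).
RADICAL_FOLD = {"言": "讠", "門": "门"}

# Whole-character TC -> SC pairs that are not derivable from radicals
# (sample of the supplement table).
SUPPLEMENT = {"國": "国"}

# Assumed decomposition of characters into components (hypothetical).
COMPONENTS = {
    "語": ("言", "吾"), "语": ("讠", "吾"),
    "問": ("門", "口"), "问": ("门", "口"),
}
# Reverse lookup: component sequence -> character.
COMPOSE = {parts: ch for ch, parts in COMPONENTS.items()}


def fold_char(ch):
    """Fold one character to its SC form; unknown characters pass through."""
    if ch in SUPPLEMENT:
        return SUPPLEMENT[ch]
    parts = COMPONENTS.get(ch)
    if parts is None:
        return ch  # no decomposition known: leave untouched
    folded = tuple(RADICAL_FOLD.get(p, p) for p in parts)
    return COMPOSE.get(folded, ch)


def fold_label(label):
    """Fold every character of a label."""
    return "".join(fold_char(ch) for ch in label)
```

With these samples, fold_label("國語") and fold_label("国语") give the
same string, which is the property a name comparison would need.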

2) I assume (since I did not check) that the roughly 23,658 code
points in Unicode 3.0 include Hanja and Kanji.
The other code points in Unicode, and future newcomers,
can be treated on an as-needed basis.  This means that only when
someone has used a character in a name at registration time, and
supplied the name with a code point in Unicode, is the
code point added to the zone file.  (Not to [nameprep]?)
If such a character is not in Unicode, then a bitmap of the
new character has to be provided in the zone file.  This is
the reason I propose that a "Request for Reference to be sent"
protocol be drafted.
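
The registration-time rule in 2) could be sketched roughly as follows
in Python.  This is only an illustration under assumptions: BASE is a
stand-in range, not the real Big5/GB repertoire; zone_extras and
register are hypothetical names; and "not in Unicode" is approximated
by the unassigned, private-use and surrogate categories as known to
the Python unicodedata module.

```python
import unicodedata

# Stand-in for the base repertoire admitted up front (assumption: one
# CJK range is used here purely as a placeholder).
BASE = set(range(0x4E00, 0x9FA6))

# General categories treated as "not in Unicode" for this sketch.
NOT_IN_UNICODE = {"Cn", "Co", "Cs"}


def register(name, zone_extras):
    """Check a name at registration time.

    Returns (accepted, missing).  Code points outside BASE that are
    assigned in Unicode are added to zone_extras (the per-zone list).
    Characters in `missing` are not in Unicode and would need a bitmap
    supplied with the zone before the name could be accepted.
    """
    new_points, missing = set(), []
    for ch in name:
        cp = ord(ch)
        if cp in BASE or cp in zone_extras:
            continue
        if unicodedata.category(ch) in NOT_IN_UNICODE:
            missing.append(ch)
        else:
            new_points.add(cp)
    if missing:
        return False, missing  # nothing is added until the bitmap arrives
    zone_extras |= new_points
    return True, []
```

For example, a name containing U+20000 (outside the stand-in BASE but
assigned in Unicode) is accepted and U+20000 is recorded for the zone,
while a name containing a private-use code point is held back.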

Liana

On Wed, 29 Aug 2001 10:20:08 +0200 Harald Tveit Alvestrand
<harald@alvestrand.no> writes:
> 
> 
> --On 28. august 2001 13:40 -0700 liana.ydisg@juno.com wrote:
> 
> > Hi, James and Chinese experts:
> >   You are right on the TC/SC equivalence not in Unicode.
> > I know they wanted to put it in long time ago, so I assumed
> > it is reflected in there somehow.   I have just read a reason
> > that it is not in there, because they think it is too difficult to
> > put it in.   I happened to have an idea that 1100 half size
> > code point may solve part of the problem and another 200
> > TC/SC listing completes it.  This can be used in [nameprep].
> > What do you think?
> 
> I would be happy to see a complete proposal.
> 
> draft-ietf-idn-tsconv-00 describes a TC/SC mapping for 2064
> traditional/simplified pairs, saying that other tables are needed
> for single/many and many/single mappings.
> 
> This means that we have a documented proposal on what to do with
> 4128 characters.
> In Unicode 3.0, there are 23.658 *more* characters classified as
> "Han"; Unicode 3.1 adds 42.711 more, and it has been noted here
> that because of the way Chinese linguistics work, it is almost
> 100% certain that there will be more added.
> I assume (foolishly) that for some large class of these characters,
> the answer is "don't touch them" when mapping TC/SC - but I have no
> way of telling which characters belong in that class.
> 
> If you can come up with a proposal that describes what to do about
> ALL the Han characters in Unicode, I will be very happy to hear it.
> 
> Until then, I have to say that I have not seen any complete
> proposal.
> Remember - the implementations of the algorithm for the non-Chinese
> part of the world will mainly be done by non-Chinese-speaking
> programmers; it's got to be simple & complete enough that even I
> can get it right...
> 
>              Harald
> 
> 
> 
> 
> 
> > .
> > Liana
> >
> > On Mon, 20 Aug 2001 13:53:38 +0800 "James Seng/Personal"
> > <James@Seng.cc> writes:
> >> > The SC character set has been used for decades and has went
> >> > through extensive nationwide testing in China.  SC is stable
> >> > and they are properly reflected in Unicode standard.   The
> >> > question is a definition: is TC/SC a case folding?  It seems
> >> > that in this WG, there has no consensus on this definition yet.
> >>
> >> I am not sure what you mean by "properly reflected" in Unicode
> >> Standard.
> >> If you mean it is in ISO10646 codepoints, then yes, both TC/SC
> >> are in the code points. But if you saying Unicode Consortium
> >> have proper definition of TC/SC, then I afraid to say there is
> >> none.
> >>
> >
> >
> 
>
