Re: [idn] opting out of SC/TC equivalence



Hi, James,

  I am sorry to have offended you, but I just had to
let my feelings out. I have been in the U.S. long enough
that it feels better to let the steam out than to keep it
in, as I was brought up to do.  In fact, in my personal
opinion you have been a very good manager of this group.
Thanks to Lee, who has pointed out that the new definition
of this group is an information-collecting agency, not
a ML.com deployment group.

   I have been trying to keep the discussion more
objective, but when I hit my limit on that, my mind just
wanders off.  Some of the answers to your questions below
require renewed research on my part, so I put a
disclaimer on them, as you have proved that my claim about
TC/SC in Unicode was wrong.  I guess that was due to
"no solution" being made, or to "no one buys it", as the
marketing machine overpowered many of the fundamental
questions.  When we talk about universal access, these
questions have to be revisited.

On Sat, 1 Sep 2001 16:53:41 +0800 "James Seng/Personal" <James@Seng.cc>
writes:
> > There are existing converting table already.
> 
> If there is an existing converting tables, please provide reference.
> This table must be publicly available and ideally without IP on it.

Around 1989, I used such a piece of software, made
in Hong Kong, for my job.

> 
> > What I am
> > proposing is for GB directly exchange with Unicode in
> > [nameprep] such that there will be one step search
> > for other users who don't want unicode set as well.
> 
> What you suggesting is that you wish to do Nameprep with charset in GB.
> This is to provide the unneccessary step to take
> GB->Unicode->Nameprep->ACE then ACE->Unicode->GB.

No, the primary keys in [nameprep] are Unicode and ACE,
and ACE is grouped by tags. Within each tag, ACE is unique.
GB is just an entry in a Unicode row.  When GB becomes out of
date, that entry can be closed, with no effect on the
functions of [nameprep].  The input is Unicode (or GB), the
output is ACE, and the tag is then added in [IDNA].  On the
decoding side, the input is ACE from a tagged area and the
output is Unicode (or GB).  GB only acts as input/output for
end-user display and has no role in comparison.
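
A minimal sketch of the flow described above, in Python, under
several assumptions: NFKC plus case folding stands in for [nameprep],
the standard-library punycode codec stands in for the ACE (no ACE is
named in this message), and the helper names to_key and from_key are
hypothetical.  Tags are left out.  GB appears only at the edges, as an
input and display encoding; the comparison key is derived from Unicode
alone.

```python
import unicodedata

def to_key(label, charset=None):
    """Input is Unicode (str), or bytes in a legacy charset such as GB."""
    if charset is not None:
        label = label.decode(charset)  # GB -> Unicode, an entry step only
    # Stand-in for [nameprep]: case folding plus NFKC normalization.
    prepped = unicodedata.normalize("NFKC", label.casefold())
    # Stand-in for the ACE: Python's built-in punycode codec.
    return prepped.encode("punycode").decode("ascii")

def from_key(ace, charset=None):
    """Decode an ACE string back to Unicode, or to a legacy charset."""
    text = ace.encode("ascii").decode("punycode")
    return text.encode(charset) if charset is not None else text
```

With this shape, to_key(gb_bytes, "gb2312") and to_key(unicode_text)
give the same key for the same characters, and from_key(key, "gb2312")
is used only when the result has to be shown to a GB user.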

Whether we can retire it or not is unclear.  From some of
the messages, it may not be that easy, since they are not
frequently used even in libraries.  I foresee that a lot
more work is ahead of us.

TC/SC to Unicode is a two-column table, consistent with the
current [nameprep] case folding format.
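
To illustrate the two-column idea, here is a small Python sketch.  The
three-row table is a hypothetical sample, not a real conversion table,
and fold_tc_sc is an illustrative name.  It only shows how a TC-to-SC
column could be applied in the same pass as a case folding column, and
it assumes a one-to-one mapping, which does not hold for every TC/SC
pair.

```python
# Hypothetical sample rows: TC code point -> SC code point.
TC_TO_SC = {
    0x570B: 0x56FD,  # 國 -> 国
    0x5B78: 0x5B66,  # 學 -> 学
    0x6F22: 0x6C49,  # 漢 -> 汉
}

def fold_tc_sc(label):
    """Fold Latin case, then map TC code points to SC via the table."""
    return label.casefold().translate(TC_TO_SC)
```

Under these assumptions, a TC spelling and an SC spelling of the same
label fold to the same string before any ACE step.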

> 
> While it seem to make sense on the surface, I think you have
> misunderstood the purpose of Nameprep. Nameprep is for matching purpose.
> The matching result remains in ACE. If I give you two GB string, you do
> a GB->Unicode->Nameprep->ACE then compare the two ACE. You dont reverse
> the ACE back to GB.

No, I don't.  But I do it when the input is ACE at the end of its
travel on the wire.

> 
> And also, your proposal to do GB->Nameprep->ACE(or whatever) for
> comparison would break the matching. GB->Nameprep->ACE may produce the
> same ACE as SJIS->Nameprep->ACE where GB and SJIS may not be the the
> same in the first place.
> 
No.
1) The ACEs differ by transliteration.
2) They also differ by their tags, each of which determines a column
   of ACE.


> > You are right, most of us are care about this problem and
> > contributing to this group for free.  We are discussing this
> > sincerely.  However, I have clear feeling, something
> > otherwise.  I wish, I am wrong, that we can see real
> > solution out of this group.
> 
> As I said to another member of the wg, it is okay to form any conspiracy
> theory in your own private mind but to say it in public would requires
> you to provide substained evidence. Give specific instance or stop been
> disruptive.
> 

This is the message I was referring to:

There is a rumor that some people actually want to use the results of
this working group, and that they want to use the results AS SOON AS
POSSIBLE.

Having discussions that are outside the scope of the working group hurts 
that goal.


> > David, that is the misconception I have referred to in blaming
> > what Unicode has been done on Chinese language.  TC/SC
> > is the same script and same language.  It is used in a similar
> > way with upper/lower case of Latin.  Just like some people
> > want to use uppercase / or printing all the time, but most use
> > mixed cases.  TC/SC is larger set, so it is natural to have
> > more variety of changes.  But the majority is treated like
> > Latin cases.  They are not mixed scripts. Japanese is a
> > mixed scripts.  Korean is a mixed script depending on who's
> > viewpoint you are subscribing.
> 
> I share this misconception as you have above many years back, and blame
> Unicode/ISO10646 for their poor handling of Chinese scripts. (See
> archives of unicore@unicode.org if you wish to see what stupid statement
> I made back them).
> 
> A couple of expert Chinese linguistics quickly brought me to the light
> to the key problems in TC-SC. As one explain to me, the best way to deal
> with TC-SC is to treat them as two separate language all together,
> except they have some scripts in common and the grammers are similar. I
> know it is weird but once you able to accept this concept, you will find
> all the difficulties we have in TC-SC are closer to language translation
> and less on codepoint normalization.
> 
> I learn my place now after that blunder. Chinese language issues are
> best deal by Chinese Linguists, not by Computer Scientists.

I can claim to be a linguist, and a good one, without a
certificate.  Both my parents have been university language
teachers since the '50s.  I will speak from my own view, as
linguists always argue with each other.

> 
> > TC/SC is in dictionaries for kids in China.
> 
> This is interesting. Please provide reference to this dictionaries for
> kids which have TC-SC. I like to see if I can get a copy of it.
> 

When I was in fourth grade, we were taught to use a dictionary.
Three books were commonly brought in by the students: the New
Four Corner Code Dictionary ("Xin Sijiao Haoma Zidian"),
"Xinhua Zidian" and "Xuesheng Zidian".  The "Xinhua Zidian"
had the most in number, but was the slowest for finding a word.
I had a "Sijiao Haoma" at that time, and it has a TC/SC table in
the front pages, followed by an index table.

If you have a "Xiandai Hanyu Cidian", which is the most widely
used and authoritative mid-size dictionary today, similar in
size to the "Xin Sijiao Haoma Zidian", you can see that both TC
and SC are in the index, with TC in parentheses.  In the
entries, the TC is always referred to the SC entry.

> > Some people have experiences that
> > Chinese translation always takes two versions, then
> > TC/SC must be two different languages, that is wrong too.
> 
> Have it occur to you that they *may* be actually right? And if they are
> wrong, explain to me why they are wrong?
> 
> Would you consider Chinese & Japanese same? Probably not but they are
> close enough for some Chinese to read a bit of Japanese and vice versa.
> Would you consider TC & SC same language? Probably yes, since it is very
> much similar (100x more similar than Japanese) but it still have enough
> different words to confuse a casual reader.
> 
> -James Seng
> 

I am not going to argue about your Kanji vs. Han
point.  But I want to tell you about my personal
experiences in the U.S.

When I arrived in the U.S. after one year of English
study in China, I went to the DMV for my driver's license
test. I was very happy to pick up a Chinese version
of the driver's manual and go home to study it.  It was
in TC, and so hard for me to understand that
I had to go back and get an English version.  At the
time of the exam, I had to ask for an English test
for the same reason.   In this case, TC
was foreign to me, not due to the font but due to the
translation.

TC in China is very common in classic novels.
Anyone who is educated can read TC, but
not write it, since it is hard to remember the
strokes.  SC is a literacy effort in general education.
Once people are literate, they have no difficulty
in reading TC.  I only hear people from the TC world
complain about not recognizing SC. I have
never heard people from China, even poorly
educated ones, complain that they cannot read TC.

As an interpreter and translator myself, I have
encountered client requirements on TC vs. SC.
I can tell you that people from the TC world like my
translations better than people from the SC world do.
The reason is that TC people think I did a better job
than others in technical, medical and many other fields.
People from the SC world think I am a normal speaker, no
big deal.  The most difficult interpretation jobs are from
the SC world, not for English to Chinese but for Chinese
to English, because those people have a very high
level of English and think I was not using the
word they consider accurate.  My point is that
the misconception is due mostly to the quality of
understanding of the concept, or of the original intent
of the author, not to the particular words they
have used.

The disjointed Big5 and GB computer software
definitely plays a bad role in promoting this situation.
Because I have to type twice to get the same
translation out, I certainly charge them twice.
Don't you?  :-)

Liana