
Re: [idn] An ignorant question about TC<-> SC

To: ben@cc-www.com
Subject: Re: [idn] An ignorant question about TC<-> SC
From: Kenneth Whistler <kenw@sybase.com>
Date: Fri, 26 Oct 2001 12:49:44 -0700 (PDT)
Cc: idn@ops.ietf.org

Ben said:

>  However, what I am certain
> is that you have illustrated the fact that
> applications/clients/users/servers/etc can be made to take advantage
> of this explicit labeling of what "script" an IDN is in and over time
> (with people writing appropriate applications) can be developed into a
> very powerful and useful system.  (Unlike TLD such as ".gov", ".ca"
> which serves next to no purpose from an IDN's perspective.)

I have to disagree. I am certain that labelling what script an
IDN is in will just cause problems.

At the very least, this will introduce an entire new class of
error conditions, where the label says one thing, but the
character content of the IDN does not in fact match the label.
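
A rough sketch of that error class, in Python. Everything here is
an assumption made for illustration: the helper names, the idea that
a label is a plain string such as "LATIN", and above all the use of
the first word of the Unicode character name as a stand-in for a
real script property (Python's standard library does not expose one).

```python
import unicodedata

def rough_script(ch):
    """Crude script guess: the first word of the Unicode character name.

    This is a heuristic for illustration only, not the Script property.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"

def label_mismatches(declared, text):
    """Return (character, guessed script) pairs that contradict the label."""
    return [(ch, rough_script(ch)) for ch in text
            if rough_script(ch) != declared]

# Hypothetical name labelled "LATIN" whose second letter is really
# CYRILLIC SMALL LETTER A (U+0430), which looks the same as Latin "a".
suspect = "p\u0430ypal"
print(label_mismatches("LATIN", suspect))
print(label_mismatches("LATIN", "paypal"))
```

With this heuristic a digit or a hyphen would also be reported as a
mismatch, since neither belongs to any one script, so even a simple
checker has to decide what to do with shared characters.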

Furthermore, the example we have been talking about here,
traditional versus simplified Chinese, is not even a script
difference in the first place. "Traditional" versus "Simplified"
in a character set context, and as typically implemented,
refers to distinctions between Code Page 950 (Big 5) and
Code Page 936 (GBK, etc.), together with the fonts, input methods,
message resource files, and such, as needed
to support them. And either of those character sets is actually
mixed script, since they both support Latin characters from
ASCII, as well as the basic Greek alphabet and Bopomofo.
"Simplified Chinese" also supports the basic Cyrillic alphabet
and Hiragana and Katakana for Japanese.
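
One way to see the mixed-script point is to ask a codec which sample
characters it can encode. The sketch below assumes Python's built-in
"cp950" and "cp936" codecs are close enough to the Microsoft code
pages for this purpose; the sample table and function names are made
up for the example.

```python
# One sample character per script; the choice of samples is arbitrary.
SAMPLES = {
    "Latin": "A",
    "Greek": "\u03b1",     # GREEK SMALL LETTER ALPHA
    "Bopomofo": "\u3105",  # BOPOMOFO LETTER B
    "Cyrillic": "\u0416",  # CYRILLIC CAPITAL LETTER ZHE
    "Hiragana": "\u3042",  # HIRAGANA LETTER A
    "Katakana": "\u30a2",  # KATAKANA LETTER A
}

def can_encode(text, codec):
    """True if every character of text is representable in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

def coverage(codec):
    """Map each sample script to whether the codec can encode it."""
    return {script: can_encode(ch, codec) for script, ch in SAMPLES.items()}

for codec in ("cp950", "cp936"):
    print(codec, coverage(codec))
```

A single sample per script only shows that the script is present in
the code page, not how completely it is covered.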

Even if you are just talking about Traditional versus
Simplified Chinese characters (ideographs) within the
Han script subparts of Code Page 950 or Code Page 936, the
distinction is not as clean as you might think it would be.
The PRC simplified set, even in its earlier forms in GB 2312,
contains *some* traditional forms for characters. But the
current extensions, first for GBK (~ Microsoft Code Page 936),
and now for GB 18030, incorporate *all* of the Han characters
from the Unicode 3.0 repertoire, which means that a
"Simplified" code page for China now contains *all* of the
traditional characters from Code Page 950, as well as all
the simplified characters from Unicode 3.0.
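
A small check of that claim, again leaning on Python's bundled
codecs as an approximation of the real code pages. The two
traditional/simplified pairs and the helper name are chosen just for
this example.

```python
# (traditional, simplified) pairs: "country" and "dragon".
PAIRS = [("\u570b", "\u56fd"), ("\u9f8d", "\u9f99")]

def encodable_in(codec, ch):
    """True if ch survives an encode/decode round trip through codec."""
    try:
        return ch.encode(codec).decode(codec) == ch
    except UnicodeError:
        return False

# Report, per codec, whether each form of each pair is representable.
for codec in ("gb2312", "gbk", "gb18030", "big5"):
    row = [(encodable_in(codec, trad), encodable_in(codec, simp))
           for trad, simp in PAIRS]
    print(codec, row)
```

If the "Simplified" code pages accept both members of every pair,
then knowing that a name was encoded in one of them says nothing
about which forms it actually uses.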

And of course, Unicode data itself encompasses both simplified
and traditional forms of Chinese ideographs. So what would the
IDN distinction between simplified and traditional mean if
data was encoded in Unicode?

Even the identification of scripts is non-trivial. Many
characters are *shared* between scripts, or are borrowed
from one script to the next. Cyrillic and Latin have a long
history of cross-borrowing forms from one script into the
other, for example, for special uses. And Japanese got all
its Chinese characters (kanji) in the first place by
borrowing them from Chinese.
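
The kanji case is easy to demonstrate: the Han characters in a
Japanese string and in a Chinese string are the same code points.
The sketch below uses the character-name prefix as a rough test for
"Han", which is an assumption of this example and not the Script
property; the sample strings are invented.

```python
import unicodedata

def han_chars(text):
    """Set of characters whose Unicode name marks them as unified Han."""
    return {ch for ch in text
            if unicodedata.name(ch, "").startswith("CJK UNIFIED IDEOGRAPH")}

# Japanese sample: three kanji, one hiragana, three katakana.
japanese = "\u65e5\u672c\u8a9e\u306e\u30c6\u30b9\u30c8"
# Chinese sample (simplified): three hanzi.
chinese = "\u65e5\u672c\u8bed"

# Code points that appear in both strings and are Han.
shared = han_chars(japanese) & han_chars(chinese)
print(sorted(f"U+{ord(ch):04X}" for ch in shared))
```

Nothing in the code points themselves says whether a shared
character is "Japanese" or "Chinese"; that information would have to
come from somewhere outside the text.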

See Unicode Technical Report #24, Script Names, for more
discussion of this:

http://www.unicode.org/unicode/reports/tr24/

Note, in particular, that "Traditional (Chinese)" and
"Simplified (Chinese)" are nowhere mentioned in that report --
those are simply not script distinctions.

--Ken
