Re: [idn] Mixed TC/SC (was Re: Layer 2 and "idn identities")



David Hopwood suggested:

> liana Ye wrote:
> > We are discussing how the registrars to "avoid registering mixed scripts
> > 'names' ". Can you suggest any way of doing this, or any feasible guidelines?
> 
> It's fairly trivial: define a formal syntax that registrars can check
> against for domains that should exclude mixed names.
> 
> The general form of that syntax would be:
> 
> ['\' means set difference, '/' means union, '&' means intersection]
> 
>   SCONLY = ... ; set of Han ideograph code points that are only used in SC
>   TCONLY = ... ; set of Han ideograph code points that are only used in TC


etc., etc.

The problem is that you have to stop right here. This is a misconception
about what "SC" and "TC" are.

"Simplified Chinese" can refer to any of three different things:

1. A set of orthographic reforms instituted by the PRC government,
involving, among other things, a promulgation of a number of
character simplifications (some of them "traditional" simplifications,
and others just radical de novo simplifications), and the codification
(and reduction in number) of characters to be used in education,
newspapers, and such.

2. A lineage of coded character sets derived from GB 2312:1980,
including EUC-GB and Code Page 936. It has undergone extension,
first in the so-called "GBK", and most recently, in the new
Chinese standard, GB 18030.

3. A localization option for computer systems, usually involving
one of the GB-based character sets, input method editors
appropriate for use with simplified characters, fonts that
cover the GB repertoire(s), and dictionaries appropriate for
the country of use.

draft-ietf-idn-tsconv-00.txt says:

"Officially, simplified Chinese is used in Mainland of China
(current standard: GB 18030); In Taiwan, Hong Kong and Macao,
the official written script is traditional Chinese (encoded as
BIG5)."

This is referring to item 1 above (although the terminology is
a little off -- SC and TC are two different *orthographies* for
the same language [written Chinese] using the same script [Han]).
It is also referring to item 2 above -- the different coded
character sets which have been used in the PRC and in Taiwan and
elsewhere for representation of the two different orthographies.

The key point here is the correct assertion that GB 18030 is
the current standard for [the coded character set for]
Simplified Chinese. But since GB 18030 contains *all* of the
Han characters from Unicode 3.0, the clear implication is that
the "set of Han ideograph code points that are only used in TC"
is the null set. There are no such code points, since they are
all incorporated into what is the current standard for Simplified
Chinese.

The opposite is not the case, since not all Unicode Han code points
are contained in BIG5 (yet ;-) ), so the SCONLY set, if conceived
in terms of BIG5, is nonempty. However, BIG5 is not the *only*
traditional Chinese character encoding. CNS 11643 also has had
fairly wide implementation. And many of the simplified characters
not present in BIG5 *are* present in CNS 11643. If you took the
CNS standard as your reference, then the SCONLY set would start
to shrink down to the vanishing point, as well.
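
As a rough illustration of the set arithmetic above, here is a minimal
sketch in Python. It makes several assumptions that are not part of
the discussion itself: Python's built-in "gb2312", "gb18030" and "big5"
codecs stand in for the standards (they are present-day
implementations and may differ in detail from the published tables),
the Han repertoire is approximated by the Extension A and main CJK
Unified Ideographs blocks, and the helper name repertoire() is made up
for the example. CNS 11643 is left out because Python ships no codec
for it.

```python
# Hypothetical sketch: Python codecs approximate the coded character sets.

def repertoire(codec, code_points):
    """Return the subset of code_points that `codec` is able to encode."""
    covered = set()
    for cp in code_points:
        try:
            chr(cp).encode(codec)
        except UnicodeEncodeError:
            continue  # not representable in this coded character set
        covered.add(cp)
    return covered

# Assumed Han range: Extension A (U+3400..U+4DB5) plus the main
# CJK Unified Ideographs block (U+4E00..U+9FA5).
HAN = list(range(0x3400, 0x4DB6)) + list(range(0x4E00, 0x9FA6))

gb2312 = repertoire("gb2312", HAN)
gb18030 = repertoire("gb18030", HAN)
big5 = repertoire("big5", HAN)

# "TCONLY" measured against the old SC standard, then the current one.
tconly_legacy = big5 - gb2312
tconly_current = big5 - gb18030

# "SCONLY" measured against BIG5.
sconly_big5 = gb18030 - big5

print(len(tconly_legacy), len(tconly_current), len(sconly_big5))
```

Under these assumptions the second difference comes out empty, because
the gb18030 codec can encode every code point in the range, while the
other two differences are nonempty. The result depends entirely on
which coded character set is picked as the reference, which is the
point being made here.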

The orthographic reform mentioned in item #1 above is a fact
of life, and computer systems have to deal with it -- including,
increasingly, computer systems outside of the PRC and Singapore,
proper.

On the other hand, the artificial distinction between SC and TC
*coded character sets* is gradually vanishing with the shift over to
Unicode (or GB 18030, if you prefer). And with a common coded
character set basis, localization options (choice of IME, fonts,
and such) can be options on the *same* computer system, using
the same coded character set, so the localization distinction
(item 3 above) eventually becomes a non-issue as well.

Whether the "solution" offered in draft-ietf-idn-tsconv-00.txt
is feasible for meeting end user expectations about which Chinese
domain names count as the same and which as different is a
separate issue, but:

> In any case, a precise definition of what a mixed TC/SC name is, is certainly
> useful independently of where in the DNS namespace they are prohibited, and
> is well within the scope of this WG. It would make sense to delegate to JET
> and the authors of tsconv the task of specifying the sets of characters SCONLY
> and TCONLY, since they've already done work closely related to this.

is certainly not feasible -- and it makes no sense for this
working group to waste time on what is effectively a
recapitulation of character encoding history across the
PRC/non-PRC political divide.

--Ken