
Re: [idn] Mixed TC/SC (was Re: Layer 2 and "idn identities")



-----BEGIN PGP SIGNED MESSAGE-----

Kenneth Whistler wrote:
> David Hopwood suggested:
> > Liana Ye wrote:
> > > We are discussing how the registrars can "avoid registering mixed scripts
> > > 'names' ". Can you suggest any way of doing this, or any feasible
> > > guidelines?
> >
> > It's fairly trivial: define a formal syntax that registrars can check
> > against for domains that should exclude mixed names.
> >
> > The general form of that syntax would be:
> >
> > ['\' means set difference, '/' means union, '&' means intersection]
> >
> >   SCONLY = ... ; set of Han ideograph code points that are only used in SC
> >   TCONLY = ... ; set of Han ideograph code points that are only used in TC
> 
> etc., etc.
> 
> The problem is that you have to stop right here. This is a misconception
> about what "SC" and "TC" are.

I'm well aware that SC and TC are used to refer to several different things,
but it honestly didn't occur to me that anyone familiar with Unicode would
think their use to refer to charsets was anything other than an unfortunate
historical accident. I mean the SC and TC orthographies.

The conflation of these orthographies with specific charsets is not
helpful, since it doesn't apply to Unicode-based systems. In fact
the "SC/TC" problem we're discussing only arises because Unicode, GBK, or
GB 18030-based input methods can produce a superset of the characters used
in the SC and TC orthographies combined. If we only needed to consider input
methods that support the *original* Big5 or GB 2312-80 standards, then it
would be impossible for a user to type a mixed SC/TC name.

(Of course in that case there would be a different problem: TC names could
not be entered by users of GB 2312-80-only input methods, and SC names
could not be entered by users of Big5-only input methods.)

> Simplified Chinese is:
> 
> 1. A set of orthographic reforms instituted by the PRC government,
> involving, among other things, a promulgation of a number of
> character simplifications (some of them "traditional" simplifications,
> and others just radical de novo simplifications), and the codification
> (and reduction in number) of characters to be used in education,
> newspapers, and such.
> 
> 2. A lineage of coded character sets derived from GB 2312:1980,
> including EUC-GB and Code Page 936. It has undergone extension,
> first in the so-called "GBK", and most recently, in the new
> Chinese standard, GB 18030.
> 
> 3. A localization option for computer systems, usually involving
> one of the GB-based character sets, input method editors
> appropriate for use with simplified characters, fonts that
> cover the GB repertoire(s), and dictionar(ies) appropriate for
> the country of use.

Item 2 is not relevant at all. In item 3, the fact that a GB-based
character set is often used isn't relevant either.

> draft-ietf-idn-tsconv-00.txt says:
> 
> "Officially, simplified Chinese is used in Mainland of China
> (current standard: GB 18030); In Taiwan, Hong Kong and Macao,
> the official written script is traditional Chinese (encoded as
> BIG5)."

The parenthesized comments are obviously just plain wrong; simplified
Chinese need not be encoded as GB 18030, and traditional Chinese need
not be encoded as Big5. This is as silly as it would be to say
"The official language of the U.K. is English (encoded as ISO-Latin-1)."

> This is referring to item 1 above (although the terminology is
> a little off -- SC and TC are two different *orthographies* for
> the same language [written Chinese] using the same script [Han]).

I agree; SC and TC are orthographies. Like other orthographies, they
use a particular repertoire/set of characters (which we can assume
for the purposes of IDN to be a subset of characters assigned in
Unicode 3.1). SCONLY is the set that is specific to the SC orthography,
and TCONLY is the set that is specific to the TC orthography.
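
As a rough sketch of the check a registrar could run once those two sets
exist (Python; the three-character sets below are illustrative assumptions
only, not the real SCONLY/TCONLY definitions, and the names classify and
is_mixed are hypothetical):

```python
# Illustrative placeholders only. The real sets would be specified
# elsewhere (e.g. delegated to JET); these three pairs are assumptions.
SCONLY = frozenset("\u56fd\u534e\u4e1c")  # simplified forms: guo, hua, dong
TCONLY = frozenset("\u570b\u83ef\u6771")  # the corresponding traditional forms


def classify(label, sconly=SCONLY, tconly=TCONLY):
    """Return 'mixed', 'sc', 'tc' or 'neutral' for a single label."""
    chars = set(label)
    has_sc = bool(chars & sconly)  # '&' is set intersection
    has_tc = bool(chars & tconly)
    if has_sc and has_tc:
        return "mixed"
    if has_sc:
        return "sc"
    if has_tc:
        return "tc"
    return "neutral"  # only characters common to both orthographies


def is_mixed(label, sconly=SCONLY, tconly=TCONLY):
    """True if the label draws on both SCONLY and TCONLY."""
    return classify(label, sconly, tconly) == "mixed"
```

A label is rejected only when is_mixed() is true; characters common to
both orthographies (U+4E2D, say) never cause a rejection on their own.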

(Off-topic, but as far as I understand, written Chinese is not quite a
single language. It can be used differently to write different spoken
languages, e.g. colloquial Cantonese can be written in Han using
characters that would not make sense if interpreted as Mandarin.
Most text that uses the Han script can be viewed as being in a single
written language, though.)

> It is also referring to item 2 above -- the different coded
> character sets which have been used in the PRC and in Taiwan and
> elsewhere for representation of the two different orthographies.
>
> The key point here is the correct assertion that GB 18030 is
> the current standard for [the coded character set for]
> Simplified Chinese. But since GB 18030 contains *all* of the
> Han characters from Unicode 3.0, the clear implication is that
> the "set of Han ideograph code points that are only used in TC"
> is the null set.

Since the premise that simplified Chinese means GB 18030 is wrong,
so is the conclusion.

[snip other stuff based on that premise]

> Whether the "solution" offered in draft-ietf-idn-tsconv-00.txt
> is feasible for resolving end user expectations about same and
> different regarding Chinese domain names is a different issue,

I don't think it is feasible; that's one reason why I don't support it.

> but:
> 
> > In any case, a precise definition of what a mixed TC/SC name is, is
> > certainly useful independently of where in the DNS namespace they are
> > prohibited, and is well within the scope of this WG. It would make
> > sense to delegate to JET and the authors of tsconv the task of
> > specifying the sets of characters SCONLY and TCONLY, since they've
> > already done work closely related to this.
> 
> is certainly not feasible

Of course it's feasible. Reference [1] from the tsconv draft defines what
SCONLY is. TCONLY is not much more complicated.

[1] A Complete Set of Simplified Chinese Characters, published in 1986
    by the Committee of National Language and Chinese Character of China.

> -- and it makes no sense for this
> working group to waste time on what is effectively a
> recapitulation of character encoding history across the
> PRC/non-PRC political divide.

I have no intention of miring this WG in Chinese language politics; that's
why I suggested delegating the definitions of SCONLY and TCONLY to JET.

The approach I suggested is just a more precisely specified version of the
one that was recommended by the UTC: disallow mixed SC/TC names. That is
the simplest way to answer the objection made by some SC/TC folding
proponents, that O(2^n) names need to be registered to cover all valid
variants of a name - which would otherwise be a valid objection. In fact,
it solves that problem more comprehensively than would SC/TC folding,
because characters that would require 1-many and many-1 mappings can be
included in the SCONLY and TCONLY sets.
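
To make the O(2^n) point concrete, here is a small sketch (Python) that
enumerates the variants of a name. The PAIRS table is a hypothetical
three-entry sample and assumes strictly 1-1 SC/TC pairs; the 1-many and
many-1 cases mentioned above are deliberately not modelled.

```python
from itertools import product

# Hypothetical 1-1 sample of simplified -> traditional pairs (assumption).
PAIRS = {"\u56fd": "\u570b", "\u534e": "\u83ef", "\u4e1c": "\u6771"}


def variants(sc_name, pairs=PAIRS):
    """Every spelling obtained by picking the SC or TC form per character."""
    choices = [(ch, pairs[ch]) if ch in pairs else (ch,) for ch in sc_name]
    return ["".join(combo) for combo in product(*choices)]


def unmixed_variants(sc_name, pairs=PAIRS):
    """Only the all-SC and all-TC spellings, i.e. what is left once
    mixed SC/TC names are disallowed."""
    all_tc = "".join(pairs.get(ch, ch) for ch in sc_name)
    return sorted({sc_name, all_tc})


name = "\u534e\u4e1c\u56fd"  # three characters, each with an SC and a TC form
print(len(variants(name)))          # 2**3 spellings if mixing is allowed
print(len(unmixed_variants(name)))  # at most 2 if mixing is disallowed
```

With mixing allowed the count doubles for every character that has both
forms; with mixing disallowed it stays at two however long the name is.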

- -- 
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPBWezzkCAxeYt5gVAQEj/AgAw9vX9USdXWVPcbm/EcV/Prq1CM73qBGZ
L6DpJ9SgTIOkw/ga3RiVmj2ojCd2i0hgH95mHPcl2TUv7zbF3yIjIOzYr7wHJ0jM
M+qC28naj4UQ/L3oJWGpCqyT8IJCwhKfFasKRZ0TQe7iO8q3u17yVP9LZH1EB77P
BAR2H1jXpZjet+KoQbfcIFG6YLtHtZxs40uRpsbYPjLlO5h80pu0eMFNZOC9ZWqR
qun3EIZpSU15QY+rXvQdcc200gV/sEDYjEwjGf4IgShlkzW6MRzwV8OAEmdfNi+q
KmJ0W7HD+CUV2Rw4sWZCRjz+3SYj+0CYVtyKxE+6x2S454xKEQKQWQ==
=EpN3
-----END PGP SIGNATURE-----