Re: traditional/simplified (Re: [idn] wg milestones update)



James,

> Saying TC-SC is like "color" and "colour" is a naive view of the whole
> TC-SC problem.

And I think this is a misleading characterization of my use of analogy
to try to illustrate the level of complexity of the problem:

> 
> > > Incidentally, for those who cannot directly envision the issues for
> > > trying to match traditional and simplified Chinese domain names, a
> > > roughly comparable problem would be trying to match "traditional"
> > > British English spellings and lexical conventions with "simplified"
> > > American English spellings and lexical conventions, 

The analogy concerns the requirement for lexical analysis and dictionary
lookup to solve the problem of matching alternate forms of representation
of the "same" words in the "same" language.

And the second level of the analogy is to point out that the kind
of variant form problem that TC-SC folding is aiming to avoid is
already tolerated among ASCII-based domain names.

I'm not claiming that TC-SC can just be characterized as a spelling
issue.

Incidentally, for those claiming that simplified/traditional folding
should be done for Chinese for idn matching, it seems naive to assume
that the problem of Han character matching can stop there.

What about the problem of Z-variant folding for all the Han
characters in Unicode? Z-variants included in Unicode (because of
the source separation rule, for example) are *less* distinct, by
any reasonable criterion, than a simplified/traditional pair.

So we have the traditional: U+570B 'country'
        and the simplified: U+56FD 'country'

But what about the Chinese Z-variant: U+56EF guo2 'country'

Or U+797F and U+7984?

And many, many other examples.
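
To make the gap concrete, here is a minimal sketch in Python using only
the standard unicodedata module. It shows that Unicode normalization
(NFKC) leaves the code points above distinct, so any matching would have
to come from a separate folding table. The table FOLD, its fold
directions, and the helper names fold() and same_name() are hypothetical
and invented for illustration; they are not part of any idn proposal.

```python
import unicodedata

# Code points taken from the examples in this message.
TRADITIONAL = "\u570b"  # traditional 'country'
SIMPLIFIED = "\u56fd"   # simplified 'country'
Z_VARIANT = "\u56ef"    # Chinese Z-variant 'country'

# Hypothetical, deliberately incomplete folding table (an assumption for
# illustration only). The choice of target for each pair is arbitrary.
FOLD = {
    SIMPLIFIED: TRADITIONAL,
    Z_VARIANT: TRADITIONAL,
    "\u7984": "\u797f",
}


def nfkc(label: str) -> str:
    """Apply standard Unicode compatibility normalization."""
    return unicodedata.normalize("NFKC", label)


def fold(label: str) -> str:
    """Normalize, then map each character through the toy table."""
    return "".join(FOLD.get(ch, ch) for ch in nfkc(label))


def same_name(a: str, b: str) -> bool:
    """Compare two labels under the toy folding."""
    return fold(a) == fold(b)


if __name__ == "__main__":
    # Normalization alone keeps all three forms distinct.
    print(len({nfkc(TRADITIONAL), nfkc(SIMPLIFIED), nfkc(Z_VARIANT)}))
    # The toy table matches only the pairs someone thought to list.
    print(same_name(TRADITIONAL, Z_VARIANT))
    print(same_name("\u4e2d", "\U00020006"))
```

Every pair missing from the table, such as U+4E2D and U+20006 in the
last line, silently fails to match, which is exactly the completeness
problem.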

As far as I know, no one has a reliable and complete Z-variant folding
table for all of the unified Han in Unicode, especially now with the
recent addition of Vertical Extension B, with another 42,711 ideographs.

And besides the large, but reasonably easy-to-identify set of
simplified ideographs introduced by the PRC, what about:

Japanese- and/or Korean-specific simplifications?

Pre-revolutionary simplifications within the traditional set of
ideographs?

Traditional synonymic alternates that have not been unified
because their gross glyphic structure is distinct?

Kaishu forms of guwen variants that were included in Vertical
Extension B for complete coverage of the traditional Han dictionaries?
(For example, see U+20006, U+20007, and U+20009, which are apparently 
all guwen variants of the common ideograph U+4E2D 'middle'.)

Where exactly do you draw a principled line that will be understood
and that can be reliably implemented everywhere to do idn
name matching?

--Ken