Re: traditional/simplified (Re: [idn] wg milestones update)



Harald Alvestrand responded to Sun Guonian:

> >==============================
> >I think the traditional/simplified Chinese equivalence SHOULD be defined
> >as a single rule anywhere so that users can use IDN, or the Chinese
> >part of an IDN, without confusion.
> 
> If it were possible to write down this equivalence rule on a single sheet of 
> paper (without the codepoints), I would agree.
> Unfortunately not even the definitions of the different classes of 
> traditional/simplified Chinese equivalence seem to fit on a single sheet 
> of paper (the 1-n, n-1 and contextual mappings in particular).
> 
> A complex specification is prone to implementation errors, especially when 
> there are large markets (US, Western Europe) where a sloppy implementation 
> will not be challenged by real-life usage of the functionality.

With respect to the issue of whether it makes sense to have different
localized equivalence behavior in different zones, I have to agree that
what we need are globally determined equivalence rules. Anything else
invites chaotic mismatches.

With respect to the issue of whether traditional/simplified Chinese equivalence
should be included in the specification for idn matching, I quote an
exchange on traditional/simplified conversion that just occurred on
the unicode discussion list. (See below.)

The main point here is that a traditional/simplified Chinese converter
is a full-blown commercial application, which in this case comes with a set
of dictionaries with nearly 170,000 entries in them. This is not the
sort of thing that can be written down as a single-page algorithm
to be included in nameprep. As Harald suggests, it just invites
implementation errors to demand this level of complexity in matching.
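
To make the 1-n problem concrete, here is a minimal sketch in Python.
It is not how any real converter works. The table S2T_CANDIDATES and
the function traditional_candidates are hypothetical names, and the
three entries are a hand-picked sample rather than data taken from
Unihan or from any product dictionary. The sketch only shows that a
per-character table can enumerate candidate traditional spellings but
cannot choose among them.

```python
# Hypothetical, hand-picked sample of simplified -> traditional candidates.
# Illustrative only; a real table would be far larger and still incomplete.
S2T_CANDIDATES = {
    "发": ["發", "髮"],        # "to issue" vs. "hair"
    "干": ["乾", "幹", "干"],  # "dry" vs. "to do" vs. "shield"
    "国": ["國"],              # a simple 1-1 case
}


def traditional_candidates(label):
    """Return every traditional spelling a per-character table allows.

    Characters missing from the table are passed through unchanged.
    """
    results = [""]
    for ch in label:
        options = S2T_CANDIDATES.get(ch, [ch])
        # Each ambiguous character multiplies the number of candidates.
        results = [prefix + opt for prefix in results for opt in options]
    return results


if __name__ == "__main__":
    for name in ("国", "干发"):
        candidates = traditional_candidates(name)
        print(name, len(candidates), candidates)
```

With this sample table the one-character label yields a single
candidate, while the two-character label yields six. Picking the right
one requires knowing which word was meant, and that is a dictionary
lookup, not a table lookup.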

Incidentally, for those who cannot directly envision the issues in
trying to match traditional and simplified Chinese domain names, a
roughly comparable problem would be trying to match "traditional"
British English spellings and lexical conventions with "simplified"
American English spellings and lexical conventions, so that, for
example:

    www.theatre.com  and  www.theater.com

would resolve to the same domain name, to avoid "confusions" among
users who might be using the "traditional" forms or the "simplified"
forms of the "same" name. But of course no such matching is attempted
now for English-based domain names, let alone all Latin-character-based
domain names -- and in fact "theatre.com" and "theater.com" are two
completely distinct existing domains that are, unfortunately or
fortunately (depending on which side of the divide you are on),
easy to confuse. A little digging could turn up dozens or hundreds
more such pairs of domains that have the "same" name already.

--Ken 

> To: "'Michal Gerling'" <michal.gerling@exlibris.co.il>
> Cc: "'John H. Jenkins'" <jenkins@apple.com>,
>         "Magda Danish (Unicode)" <v-magdad@microsoft.com>,
>         unicode@unicode.org
> Subject: RE: FW: chinese conversion tables
> Date: Tue, 1 May 2001 16:01:30 -0400 
> Sender: unicode-bounce@unicode.org
> X-original-sender: Ted@basistech.com
> 
> Hi Michal,
> 
> Our company produces a product that addresses your problem, including all
> the issues mentioned by John Jenkins below. We call it our
> Chinese-to-Chinese Script Converter, or C2C for short.
> 
> In particular it performs not only code-point conversion but also orthographic
> and lexemic conversions, based on a set of cross-idiom dictionaries and word
> identification in streams of Chinese text. It is fully Unicode based
> internally, although conversion to and from other character sets is also
> supported.
> 
> You can read more about it at
> http://www.basistech.com/products/Chinese-Converter.html, or contact me
> directly for more information.
> 
> ==============================
> Ted Peck
> Director of Product Management
> 
> Basis Technology Corp.
> One Kendall Square
> Cambridge, MA 02139
> 
> tel: 617-386-7158
> fax: 617-386-2021
> tpeck@basistech.com
> 
> -----Original Message-----
> From: John H. Jenkins [mailto:jenkins@apple.com]
> Sent: Tuesday, May 01, 2001 2:54 PM
> To: Magda Danish (Unicode); unicode@unicode.org
> Subject: Re: FW: chinese conversion tables
> 
> 
> At 11:21 AM -0700 5/1/01, Magda Danish (Unicode) wrote:
> >-----Original Message-----
> >From: Michal Gerling [mailto:michal.gerling@exlibris.co.il]
> >Sent: Tuesday, May 01, 2001 7:24 AM
> >To: 'info@unicode.org'
> >Subject: chinese conversion tables
> >
> >
> >I am working with UNICODE and the CJK market and need to know: Is there
> >any one table or formula for moving from simplified to traditional
> >characters and back in UNICODE? Thank you very much for your help!
> >Michelle g.
> 
> Partial data to interconvert between simplified and traditional 
> characters is available through the Unihan database.  However, the 
> problem is not a simple one, as there are frequently multiple 
> traditional forms that correspond to a single simplified form. 
> Moreover, the vocabulary used in the PRC with simplified characters 
> differs on occasion from the vocabulary used in Taiwan and elsewhere 
> for traditional ones (e.g., the names of the chemical elements, until 
> recently the word for "computer").  It really isn't possible to 
> convert between simplified and traditional characters without doing a 
> lexical analysis.
> 
> -- 
> =====
> John H. Jenkins
> jenkins@apple.com
> jenkins@mac.com
> http://homepage.mac.com/jenkins/