
Re: Chinese folding (Re: My prod at IDN requirements)



Let's try to explain it a bit more.

At 20:22 00/01/04 +0100, Harald Tveit Alvestrand wrote:
> At 00:37 05.01.00 +0800, James Seng wrote:
> >Harald Tveit Alvestrand wrote:
> > > I haven't, so you know more than me :-)
> > > for my edification, could you please state the names/numbers of the glyphs
> > > that should be folded in Japanese and not in Chinese?
> >
> >there are 2145 traditional chinese characters which can be mapped to
> >simplified, ignoring those 1:N mapping which is context sensitive in the CJK
> >unification area.

A rough count in a dictionary which I have at hand confirms the above
number. But it is not possible to give an exact number, because some
of the Chinese simplifications are systematic, i.e. if you find part
A (at a specific position, e.g. on the left) in a traditional character,
then you have to replace it by part B. So if new ideographs are added
to Unicode (which is happening; Unicode 3.0 already has about 7000 more
than Unicode 2.0), simplified equivalents have to be added as well.
(I'm not sure that this is actually happening; if the characters are
considered to be added for historical purposes only, then they are not
needed for modern use, and won't be added.)


> I still don't get it - is this mapping done as part of the unification that 
> was done when deciding which characters to put in ISO 10646, or are these 
> defined as "equivalences" under some normalization form, but still have 
> separate codepoints in the BMP, or do you mean something different?
> If the first one - are those characters among the ones proposed for 
> addition to Plane 2?

One part of the unification rules (there are others) says that unification
takes place if the visual difference is not too big. This sounds extremely
loose, but the details have been worked out and documented carefully.
The borderline is thin, but without it, things would be much
more inconvenient. The main goal of this rule is to prevent a character
from completely changing shape when sent from one machine to another, because
that would confuse many people. The line has been drawn more
or less at the point where a person knowing only one present-day usage
(e.g. only simplified) may no longer easily see that the other form
(e.g. traditional) is actually the same character. So this rule has been
designed explicitly for general use; ideograph experts may judge it
as too strict (because they know which forms correspond to which, and
wouldn't be confused that much by changes) or too loose (because for
some research, they want to distinguish some details).

If you have a Unicode (2.0) book, then you can have a look at some examples:

Unified:
  - 'grass' radical: Starting at 8278, the traditional form has the
    horizontal line at the top broken in the middle. This is difficult
    to see in the Unicode book; 830D is an example. (The reason
    why this is difficult to see is that the font in Unicode 2.0 was
    streamlined; in the 830D example, the broken line is probably due
    to the fact that this is actually not a grass radical, but maybe
    was a 'sheep' radical or something, and ended up here by a mistake
    of a lexicographer that was carried on in the dictionaries.)
  - 'bone' radical: Starting at 9AA8, again, the differences are
    not visible in the Unicode printing. If you can afford a copy
    of ISO 10646, then the differences between the simplified/
    traditional/Japanese/Korean columns are easy to see, but if
    you are not an expert, you won't know which of the differences
    are significant, and which are just a result of using different fonts
    (such as the Japanese column being a bit bolder than the others).

Not unified:
  - 'thread' radical: Traditional starts at 7CF8, simplified at
    7E9F. (This is an example very close to the borderline, I guess.)
  - 'speak' radical: Traditional starts at 8A00, simplified at 8BA0.

These are examples where whole radical sections have been simplified
systematically; there are also many cases of individually simplified
characters.
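
For those who don't have the book at hand, the "not unified" code points
above can be looked up with a few lines of Python. This is only a rough
sketch: the helper names (describe, report) are mine, and the output depends
on the Unicode data shipped with your Python version.

```python
import unicodedata

# (radical, first traditional code point, first simplified code point),
# exactly as listed in the "Not unified" examples.
NOT_UNIFIED = [
    ("thread", 0x7CF8, 0x7E9F),
    ("speak", 0x8A00, 0x8BA0),
]

def describe(code_point):
    """Return the code point, the character itself and its Unicode name."""
    ch = chr(code_point)
    return f"U+{code_point:04X} {ch} {unicodedata.name(ch)}"

def report(sections=NOT_UNIFIED):
    """Build one line per radical showing both separately encoded forms."""
    lines = []
    for radical, trad, simp in sections:
        lines.append(f"{radical}: {describe(trad)} / {describe(simp)}")
    return lines

if __name__ == "__main__":
    for line in report():
        print(line)
```

The point is simply that each pair consists of two distinct codepoints, so
any folding between them has to come from an explicit mapping table.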

More explanations on unification can be found on page 6-110 in the
Unicode 2.0 book.

So unification is related to simplification, but it depends on
the degree of simplification taken. For us, of course, only those
cases that have separate codepoints matter.


> Sorry to be so stupid - if you just name a couple of examples and what the 
> Unicode databases say about them, I may have a better chance of getting it 
> right.

The Unicode database doesn't contain simplified-traditional mappings.
Such data is available from other sources.

What is important for us, more than all of the above, is that the
mappings are not one-to-one, neither in Chinese nor in Japanese.
In the requirements document, we have to make clear that we have
to deal with such non-one-to-one mappings.
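
To make "not one-to-one" concrete, here is a tiny sketch. The two table
entries are hand-picked by me for illustration (they do not come from the
Unicode database), and the function names are hypothetical.

```python
# Hand-picked sample of simplified -> traditional candidates.
# An assumption for illustration only, not an authoritative table.
SIMPLIFIED_TO_TRADITIONAL = {
    "\u53d1": ["\u767c", "\u9aee"],  # U+53D1 -> U+767C (emit) or U+9AEE (hair)
    "\u7ea2": ["\u7d05"],            # U+7EA2 -> U+7D05 (red), unambiguous
}

def is_one_to_one(mapping):
    """True only if every source character has exactly one target."""
    return all(len(targets) == 1 for targets in mapping.values())

def ambiguous_characters(mapping):
    """Source characters whose target depends on context."""
    return sorted(ch for ch, targets in mapping.items() if len(targets) > 1)

if __name__ == "__main__":
    print(is_one_to_one(SIMPLIFIED_TO_TRADITIONAL))
    print([f"U+{ord(ch):04X}" for ch in ambiguous_characters(SIMPLIFIED_TO_TRADITIONAL)])
```

A single entry with two targets is enough to show that a plain
character-by-character table lookup cannot pick the right form by itself.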


While this is nothing that should go into the requirements document,
here is my take on how I think we most probably will have to address
this problem: Deal with it on the registration side, i.e. register
both a simplified and a traditional variant of a label if that is
desired. This wouldn't work for case folding, because you would have
to register up to 2**n variants for a label of n letters. For simplified/
traditional, it can work easily, because the average length of
a Chinese label will be around 2 characters, and because arbitrary
combinations probably don't have to be taken into account.
Dealing with it at registration time makes sure that this can be
handled by hand. Mapping from traditional to simplified is still
a field where a lot of context sensitivity and linguistic experience
is needed to get to an acceptable result. As such, it's not, in
my opinion, ready for prime time in servers.
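
The difference in the number of registrations can be sketched as follows.
This is a minimal illustration under my own assumptions: ASCII-only case
variants, and small hand-supplied mapping tables (to_simplified,
to_traditional) that a registrant would have checked by hand. Neither
function is part of any existing registry software.

```python
from itertools import product

def case_variants(label):
    """All upper/lower case spellings of an ASCII label (up to 2**n of them)."""
    options = [sorted({ch.lower(), ch.upper()}) for ch in label]
    return ["".join(combo) for combo in product(*options)]

def script_variants(label, to_simplified, to_traditional):
    """The label as given, plus its all-simplified and all-traditional forms.

    Mixed combinations are deliberately not generated, and the tables are
    assumed to have been resolved by hand for this particular label.
    """
    simplified = "".join(to_simplified.get(ch, ch) for ch in label)
    traditional = "".join(to_traditional.get(ch, ch) for ch in label)
    return sorted({label, simplified, traditional})

if __name__ == "__main__":
    print(len(case_variants("example")))  # 2**7 registrations
    # U+4E2D U+56FD ("China", simplified) vs. U+4E2D U+570B (traditional)
    print(script_variants("\u4e2d\u56fd",
                          {"\u570b": "\u56fd"},
                          {"\u56fd": "\u570b"}))
```

So a seven-letter ASCII label would need 128 case variants, while a typical
two-character Chinese label needs at most two or three registrations.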

If you have any more questions, please feel free to ask.


Regards,   Martin.


#-#-#  Martin J. Dürst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org