
Re: [idn] An ignorant question about TC<-> SC

To: liana Ye <liana.ydisg@juno.com>, klensin@jck.com
Subject: Re: [idn] An ignorant question about TC<-> SC
From: Martin Duerst <duerst@w3.org>
Date: Thu, 25 Oct 2001 16:27:10 +0900
Cc: idn@ops.ietf.org

At 13:02 01/10/23 -0700, liana Ye wrote:

>  From screen display point view, TC/SC are different glyph
>  sets(who defines the sets? How is it used by 1/5 of the world
>population? Is Uicode group the only authoritive one? In
>China there are over 600 recorded views on this).

The Han ideographs in Unicode/ISO 10646 are defined by the
IRG (Ideographic Rapporteur Group). This group reports to
ISO/IEC JTC1/SC2/WG2, the ISO WG responsible for ISO 10646.
It is composed of representatives from all the countries
or similar entities interested in Han ideographs. That
includes China, Japan, Korea (both South and North), Taiwan,
Hong Kong, Singapore, and the US (I hope I didn't forget
anybody, and please excuse the maybe politically incorrect
shortcuts). The US is the only participant without a
tradition of using Han ideographs; it usually sends only
a small delegation and mainly helps with wording.
Many of the others send rather large delegations
(given the number of characters, which means a lot of
work, this is no surprise). The Unicode Consortium
participates in the IRG with observer status only.

The IRG has published guidelines for deciding when to
unify two occurrences and when not. Because of the very
large number of characters, there is in some cases indeed
a thin line as to whether something should be unified or
not. And in these cases, the IRG just has to make a decision.

Overall, the guidelines are somewhat difficult to understand
at first, but they are designed mainly with 'least surprise
to the average user' in mind, and I think they have achieved
this goal very well. The guidelines are based on earlier
ones used for the Japanese standard.

The core of the guidelines says that if two characters look
significantly different, then they are encoded with two codes
even if they are, for example, one-to-one SC/TC equivalents.
This is to avoid suddenly changing the appearance of characters
for a user who may not be familiar with the significantly
different shape. On the other hand, cases where there is only
a small difference in shape are unified (i.e. only one code)
unless this small difference in shape makes an actual
difference in meaning.
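
As a small illustration of the first case, here is a sketch in
Python. The character pair (the simplified and traditional forms
of 'country') and the helper names are my own choice for this
example and are not taken from the IRG guidelines. It shows only
that such a pair has two separate codes and that Unicode
normalization does not map one onto the other.

```python
import unicodedata

# Illustrative pair chosen for this sketch (an assumption, not from the
# guidelines): simplified and traditional forms of "country".
SC = "\u56fd"  # simplified form
TC = "\u570b"  # traditional form


def code_points(text):
    """Return the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def same_after_normalization(a, b, form="NFKC"):
    """True if a and b become identical under the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    # The two forms look significantly different, so each has its own code.
    print(code_points(SC), code_points(TC))
    # Normalization leaves both untouched, so they stay distinct.
    print(same_after_normalization(SC, TC, "NFC"))
    print(same_after_normalization(SC, TC, "NFKC"))
```

So any SC/TC equivalence for such a pair has to come from a
separate mapping; it does not fall out of the character codes or
of normalization.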

Overall, the result is that if you present a text
where you change the glyph shapes within the range that
is unified, people who have completed basic education but
don't know about different shapes (e.g. people in Taiwan or
Hong Kong who only know about TC, or people in China or
Singapore who only know about SC, or people in Japan who
only know about the forms used in Japan) will read over
these changes without problems, and might at some points
say 'this looks a bit strange', but will still identify
the character.

There are some exceptions to these rules related to
backwards-compatible roundtripping (source separation rule).

Hope this helps,      Martin.
