Re: [idn] An ignorant question about TC<-> SC
Hi all,
Someone gave me the following information about the UTC joint
meeting:
http://www.unicode.org/unicode/consortium/utc-minutes/UTC-088-200108.html
IETF - IDN Name Preparation
[88-A51] Action Item for Michel Suignard: Give comments to Mark Davis and
Cathy Wissink on IDN name prep for them to forward to the appropriate
people. [L2/01-318]
[88-C5] Consensus: The UTC would like to see a paragraph added to the IDN
Internet Draft, Preparation of Internationalized Host Names, regarding name
preparation, which would clarify that alternate names may need to be
registered to handle variant spellings for different languages and
simplified and traditional forms. [L2/01-318]
[88-A52] Action Item for Mark Davis, Cathy Wissink: Write and submit a
paragraph on alternate names to the IDN working group. Include Michel
Suignard's comments. [L2/01-318]
[88-A53] Action Item for Asmus Freytag: Suggest text for UAX#15 Unicode
Normalization Forms about IDN name preparation. [L2/01-318]
Properties
[88-C6] Consensus: The UTC does not guarantee that casing preserves
normalization form, but in the future such a requirement may be considered.
[L2/01-321, 327]
[88-A54] Action Item for Lisa Moore: Forward the UTC response on preserving
normalization form when casing to Kent Karlsson. [L2/01-321, 327]
==========
From the point of view of TC/SC conversion, no Chinese expert would
try to solve this problem with a single simple approach. Han character
mappings can be divided into 1-1, 1-n, and m-n categories. The 1-n and m-n
mappings are semantically sensitive and language dependent, so they MUST be
handled at a higher level; 88-C5 can be applied to these characters through
alternate name registration.
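To make the three categories concrete, here is a minimal Python sketch that
sorts simplified characters by how they map to traditional ones. The sample
pairs and the `classify` helper are my own illustration, not an official
table; a one-to-many relation in either direction is labelled 1-n here.

```python
from collections import defaultdict

# Hypothetical sample (simplified, traditional) pairs, for illustration only.
PAIRS = [
    ("\u95e8", "\u9580"),  # 门 / 門
    ("\u9a6c", "\u99ac"),  # 马 / 馬
    ("\u53d1", "\u767c"),  # 发 / 發 (to emit)
    ("\u53d1", "\u9aee"),  # 发 / 髮 (hair)
]

def classify(pairs):
    """Label each simplified character as '1-1', '1-n' or 'm-n'."""
    sc_to_tc = defaultdict(set)
    tc_to_sc = defaultdict(set)
    for sc, tc in pairs:
        sc_to_tc[sc].add(tc)
        tc_to_sc[tc].add(sc)

    labels = {}
    for sc, tcs in sc_to_tc.items():
        many_tc = len(tcs) > 1                               # one SC, several TC
        many_sc = any(len(tc_to_sc[tc]) > 1 for tc in tcs)   # a TC shared by several SC
        if many_tc and many_sc:
            labels[sc] = "m-n"
        elif many_tc or many_sc:
            labels[sc] = "1-n"
        else:
            labels[sc] = "1-1"
    return labels

labels = classify(PAIRS)
one_to_one = sorted(sc for sc, kind in labels.items() if kind == "1-1")
```

Only the characters that end up in `one_to_one` would be candidates for the
1-1 table; the others are left to alternate name registration.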
In practice, however, there are PRC-ONLY-SC characters that are derived
directly from TC and are used in China and in some other countries. Most of
these PRC-ONLY-SC characters modify the radical to its quick-written form,
and they are frequently used characters in Chinese text, so there is a high
probability that they will be mixed with, or confused with, their TC forms
when browsing. Some of these PRC-ONLY-SC characters are 1-n; once
recognized, they can be moved to the 1-n category. Some SC characters are
the same as Japan's modified kanji; if they differ in meaning, they can be
moved to a local 1-1 mapping table belonging to another category, to reduce
side effects across languages. The final small set of 1-1 PRC-ONLY-SC/TC
mappings can be taken from the original documents announced by the PRC. If
this table is combined with the AMC-ACE-Z encoder/decoder, with case
annotation in the output ACE string, we get a module that takes TC/SC
Unicode data as input and encodes it to an ACE string with a predefined
case. In reverse, decoding the ACE string with its case recovers each
original TC/SC character. The case may also be forced in order to display a
preferred TC/SC form when an application requires it.
The encoder/decoder recovers the original TC/SC Unicode code points,
so the Unicode tables do not need to change.
The two forms produce the same ACE string, differing only in casing,
and an LDH DNS server compares ACE labels case-insensitively. So a TC
character and its 1-1 mapped PRC-ONLY-SC counterpart act as the same symbol
in a name identifier.
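The idea can be sketched in Python as follows. This is NOT AMC-ACE-Z: it is a
toy encoding invented for illustration (five base-26 letters per code point),
and the two-entry table, `encode`, `decode`, and the `prefer` parameter are
all assumptions. It only shows the principle: fold SC to TC before encoding,
keep the original form in the letter case, and let a case-insensitive
comparison treat both labels as the same name.

```python
import string

# Hypothetical 1-1 table (PRC-only simplified -> traditional), illustration only.
SC_TO_TC = {"\u95e8": "\u9580", "\u9a6c": "\u99ac"}  # 门->門, 马->馬
TC_TO_SC = {tc: sc for sc, tc in SC_TO_TC.items()}
LETTERS = string.ascii_lowercase

def encode(label):
    """Toy ACE: 5 letters per character; upper case marks an SC original."""
    groups = []
    for ch in label:
        was_sc = ch in SC_TO_TC
        cp = ord(SC_TO_TC.get(ch, ch))      # fold SC to its TC code point
        digits = []
        for _ in range(5):                  # 26**5 exceeds the Unicode range
            cp, rem = divmod(cp, 26)
            digits.append(LETTERS[rem])
        group = "".join(reversed(digits))   # most significant letter first
        groups.append(group.upper() if was_sc else group)
    return "".join(groups)

def decode(ace, prefer=None):
    """Recover the original form, or force 'tc' / 'sc' with prefer."""
    out = []
    for i in range(0, len(ace), 5):
        group = ace[i:i + 5]
        was_sc = group[0].isupper()
        cp = 0
        for letter in group.lower():
            cp = cp * 26 + LETTERS.index(letter)
        ch = chr(cp)                        # always the folded (TC) character
        if prefer == "sc" or (prefer is None and was_sc):
            ch = TC_TO_SC.get(ch, ch)
        out.append(ch)
    return "".join(out)

ace_sc = encode("\u95e8\u9a6c")   # label typed in simplified form
ace_tc = encode("\u9580\u99ac")   # same label typed in traditional form
same_name = ace_sc.lower() == ace_tc.lower()   # what a case-insensitive compare sees
```

Decoding `ace_sc` gives back the simplified label and decoding `ace_tc` gives
back the traditional one, while `decode(ace_tc, prefer="sc")` forces the
simplified display form. The code points themselves are never changed.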
The 1-1 PRC-ONLY-SC/TC encoding cannot solve all the problems of
TC/SC conversion. But it can reduce the complexity at the DNS level
(level 1), so that the 2^N problem (a name of N characters, each with two
forms, has 2^N variants) is reduced to a manageable size.
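A small Python sketch of the 2^N growth, and of how folding collapses it. The
two-entry table and the helper names are hypothetical and only illustrate the
counting argument.

```python
from itertools import product

# Hypothetical 1-1 pairs (PRC-only simplified -> traditional), illustration only.
SC_TO_TC = {"\u95e8": "\u9580", "\u9a6c": "\u99ac"}  # 门->門, 马->馬
TC_TO_SC = {tc: sc for sc, tc in SC_TO_TC.items()}

def forms(ch):
    """All forms of one character: (SC, TC) if it is in the table, else itself."""
    if ch in SC_TO_TC:
        return (ch, SC_TO_TC[ch])
    if ch in TC_TO_SC:
        return (TC_TO_SC[ch], ch)
    return (ch,)

def variants(label):
    """Every mixed TC/SC spelling of the label: 2**N for N mapped characters."""
    return {"".join(combo) for combo in product(*(forms(c) for c in label))}

def fold(label):
    """Fold every mapped SC character to its TC form."""
    return "".join(SC_TO_TC.get(c, c) for c in label)

all_spellings = variants("\u95e8\u9a6c")           # 2 mapped characters -> 4 spellings
folded_names = {fold(v) for v in all_spellings}    # all collapse to 1 folded name
```

Without folding, each of the spellings in `all_spellings` would need its own
registration; with folding they all resolve to the single entry in
`folded_names`.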
This is based on the table that the PRC announced and has kept for
more than 40 years, so it can be put to work quickly without spending much
time on argument. This approach does not try to be the best or a complete,
perfect solution. But by solving TC/SC conversion level by level in this
way, many troubles can be reduced and made more manageable.
The number of 1-1 TC/SC mapping pairs is a matter of scale: 2000
pairs is about two orders of magnitude more than the 26 case pairs of
ASCII, and people cannot easily remember them.
L.M.Tseng
----- Original Message -----
From: "Mark Davis" <mark@macchiato.com>
To: <idn@ops.ietf.org>
Sent: Wednesday, October 31, 2001 12:09 AM
Subject: Re: [idn] An ignorant question about TC<-> SC
> I should try to clarify a few things.
>
> There have been some objections on this list to Han Unification. While there
> are often typographic differences between fonts used for Han ideographs in
> different countries, what is encoded represents the fundamental character,
> as defined by the IRG (Ideographic Raporteur Group). This group reports to
> ISO/IEC SC2 WG2, the ISO WG responsible for ISO 10646. As stated elsewhere,
> the IRG includes official representatives from China, Japan, the Republic of
> Korea and DPR of Korea, Taiwan via the Taipei Computer Society (TCA), Hong
> Kong, Singapore, the US, and the Unicode Consortium.
> Han unification is mostly irrelevant to the TC <-> SC issue. The major
> stumbling block is not that the TCs are unified across different countries,
> but that accurate TC<->SC mappings are (a) not 1-1, (b) contextual, (c)
> dictionary-based.
>
> If there were a recognized 1-1 subset mapping between SC characters and TC
> characters, I don't think unification would be an issue in matching. For
> example, it would matter little to Japanese users that an SC character could
> be typed in instead of one of their TC characters, so long as two different
> TC characters used in Japan were *not* matched. This could be done as long
> as the mapping never identifies two different TC characters with the same SC
> character.
>
> It should be also made clear that the Unicode consortium is not at all
> opposed to the development of a TC <-> SC mapping. However, it is not a
> trivial process. The best body to take on that work would be the IRG --
> given the composition and expertise of its membership. The process of
> arriving at a uniform, well-developed, agreed-upon mapping would probably
> take some substantial time, and from that mapping one would have to extract
> the 1-1 subset. As stated elsewhere, the consortium has not seen evidence
> that such a mapping is required in IDN (given other mechanisms), and does
> have considerable concern that it would slow down the IDN development
> process even further.