[idn] draft-ietf-idn-tsconv-01.txt
I've just finished a review of draft-ietf-idn-tsconv-01.txt,
and I would like to make a few comments on the conversion problem they
have tackled and the solution they have presented.
The most important starting point here is certainly to
acknowledge both the work of those striving to find a reasonable
approach to this problem and the problem they wish to
address: there are groups of users for whom the perceived equivalence
of specific Han ideographs is strong enough to cause confusion where
those ideographs are treated differently. The authors of this draft
(and many other people) believe that, for Internet applications to
function correctly for those users, it must be possible to
associate the ideographs in a way that allows simplified characters,
complex characters, or a mixture of the two to be treated as
equivalent when a user would see them as equivalent. The authors
further recommend a method for retaining (as much as possible) the
data required to display the characters in the form originally seen by
the user, so that a check of the mapping among characters does not
seem to involve a transformation. As goals for application design for
that user community, these seem appropriate.
It is not so clear, however, that these goals should or can be
met using the DNS infrastructure as described. Probably one of the
most important issues raised by the draft is in this note:
[Editor's note: As Chinese character's in common use by CJK
people, so such table may be modified after making consensus with
language experts of CJK area.]
As a non-native speaker of Chinese with an even more limited
knowledge of the use of Han characters in kanji and hanja, I am not
qualified to serve as one of the language experts described in the
note. With even my limited knowledge, however, it is clear that the
overlap of characters creates an enormous problem for this approach.
While it might be possible to create a mapping system that fits the
Chinese user community to some reasonable degree, there are a large
number of characters for which that same mapping would not fit either
the Japanese or the Korean user community. The authors apparently
feel that this could be managed with exclusion lists. I believe that
a reasonable list of such exclusions would run into several thousand
and that some of the most common characters would fall into that list.
I think that the use of an exclusion list of that size is likely to
diminish the effectiveness of this approach to the point of
unusability. If the user community cannot know whether two characters
map to equivalence without knowledge of an extensive exclusion list,
they are considerably worse off than if they were dealing with just
complex and simplified character sets.
As a trivial example, the character "guo" used in "zhongguo"
(China) is also used in some form by kanji and hanja. As I said, I am
not qualified to say which forms would be seen as equivalent by the
Japanese or Korean language communities. Given the kanji use, though,
it seems possible that it could fall into the excluded category. If it
did, one of the most basic characters for the Chinese community would
remain variable between complex and simplified. With exclusions of
that type possible, attempting the mappings described within the DNS
simply does not seem to me the best approach.
In previous meetings of the IDN working group, I put forward
the comment that the complex-to-simplified mapping was a one-way
transformation, and that without context there was simply no way to
recover some of the mappings. The authors of this draft have done
their best to provide mappings and restore what can be restored.
Their efforts have, unfortunately, been thwarted by the character
unification of Chinese, Japanese, and Korean within the Unicode standard.
The Unicode folks clearly had engineering reasons to avoid replicating
the characters several times. A consequence of that engineering
trade-off is that the user community which must be considered for this type
of mapping is global.
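The one-way nature of the complex-to-simplified mapping can be
illustrated with a minimal sketch. The table below is a hand-picked,
purely illustrative set of well-known many-to-one pairs; it is not the
mapping table from the draft, and the function names are hypothetical.

```python
# Illustrative table only (an assumption for this sketch): a few
# well-known cases where distinct complex characters share one
# simplified form. This is NOT the table defined in the draft.
COMPLEX_TO_SIMPLIFIED = {
    "發": "发",  # to send, to emit
    "髮": "发",  # hair
    "後": "后",  # after, behind
    "后": "后",  # empress (same form in both sets)
}


def to_simplified(text: str) -> str:
    """Map each character to its simplified form; pass others through."""
    return "".join(COMPLEX_TO_SIMPLIFIED.get(ch, ch) for ch in text)


def complex_candidates(simplified: str) -> list[str]:
    """List every complex character that maps to the given simplified one."""
    return sorted(c for c, s in COMPLEX_TO_SIMPLIFIED.items() if s == simplified)


if __name__ == "__main__":
    # Two different complex characters collapse to the same simplified form.
    print(to_simplified("發") == to_simplified("髮"))
    # Going back, a single character alone cannot pick between candidates.
    print(complex_candidates("发"))
```

Once two complex characters have collapsed into one simplified form,
the reverse lookup returns more than one candidate, and nothing in the
single character itself indicates which one the user originally meant.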
It may be possible to create a more limited context within
specific applications or even registries, based on the user
communities for those applications or registries. For the DNS as a
whole, however, I do not believe that this approach can survive the
information loss of the Unicode CJK unification.
best regards,
Ted Hardie