
Re: [idn] Zone rules (was: wg milestones update)



--On Monday, April 30, 2001 19:54 +0800 Maynard Kang
<maynard@pobox.org.sg> wrote:

>...
> 2) Case-folding is a simple canonical process, and the folding
> rules are the same, I believe (someone please correct me if I'm
> wrong), for most scripts which are able to be represented in
> ASCII (i.e. English, Swahili, Hawaiian,
> Malay, etc).

Well, nearly.

As soon as one moves significantly out of the ASCII subset (not
"English") used in traditional host names, little idiosyncrasies
show up.  Some scripts have characters that have no
representation in the "other" case (the German eszett is one
example); others have mapping rules for accented or
diacritical characters that differ between countries, even within
a script.
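
A minimal sketch of the eszett case, assuming Python 3 and its
built-in Unicode case mappings (the sample word is chosen only for
illustration).  It shows that the mapping changes the character
count and does not round-trip:

```python
# Illustration only: Python's str methods apply the Unicode default
# case mappings, which are not tailored to any language or country.
word = "Straße"

upper = word.upper()        # eszett has no single-character uppercase here
folded = word.casefold()    # full case folding, intended for caseless matching
lowered = word.lower()      # eszett is already lowercase, so it is kept
roundtrip = upper.lower()   # lowercasing the uppercase form

print(upper, len(word), len(upper))
print(folded)
print(roundtrip == lowered)  # the original spelling is not recovered
```

Comparing the folded forms treats "Straße" and "STRASSE" as the
same name, but the result is no longer the same length as the
input, so even this "simple" folding is not a 1:1 character
mapping.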

One of the more significant problems with the Chinese conversion/
translation is that, as I understand it, one character in
traditional may map to a phrase of two or more characters in
simplified and, in general, one cannot expect a 1:1 mapping on
character count.  But, while it is less common, that problem
occurs in Roman-based scripts as well due to historical
representation issues.  For example, the character "u with
umlaut" (one character) is often written as "ue" (two
characters).  "ae" may be a two-character sequence, or a
representational form for a single-character diphthong, or a
representational form for "a with umlaut".  As far as I know, the
second of these never matches the third, but the first can match
either the second, or third, or neither depending on the
underlying language (i.e., one can't tell from script alone).
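
The dependence on language can be sketched as follows.  This is a
hypothetical illustration, not a proposed rule set: the table
contents, the language tags, and the names EQUIVALENTS, match_key
and same_name are all assumptions made up for the example.

```python
# Hypothetical per-language equivalence tables (assumed for
# illustration; real zone rules would be set by each registry).
EQUIVALENTS = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue"},  # umlaut written as vowel + "e"
    "da": {"æ": "ae"},                        # single character written as two letters
    "en": {},                                 # no special equivalences
}

def match_key(label: str, lang: str) -> str:
    """Lowercase the label, then expand characters per the language's table."""
    table = EQUIVALENTS.get(lang, {})
    return "".join(table.get(ch, ch) for ch in label.lower())

def same_name(a: str, b: str, lang: str) -> bool:
    """Two labels match if their keys agree under the given language."""
    return match_key(a, lang) == match_key(b, lang)

for lang in ("de", "da", "en"):
    print(lang,
          same_name("ae", "ä", lang),   # two letters vs. "a with umlaut"
          same_name("ae", "æ", lang),   # two letters vs. single "ae" character
          same_name("æ", "ä", lang))    # the two single characters
```

Under these assumed tables the two-character "ae" matches "a with
umlaut" only for "de", matches the single-character form only for
"da", and matches neither for "en", while the two single
characters never match each other.  The same pair of labels thus
compares differently depending on a language tag that the script
itself does not carry.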

One may, of course, try to make these problems different by
describing some of them as "case folding" and others as something
else, but I'm not convinced that is helpful.  So, while the
scale may be different, I don't think the problems themselves are
really very different.

    john