
Re: [idn] Zone rules (was: wg milestones update)



Case folding does have some edge cases, but is not comparable in complexity.
The use of ae, oe and ue to represent ä, ö, and ü is not a case folding
issue, any more than the use of aa to represent å. In both cases, they don't
represent modern usage, but instead are only used in modern writing in
situations where accents are not permitted, thus serving as fallbacks. These
are an issue for internationalized collation (see
http://www.unicode.net/unicode/reports/tr10/), but not for domain names. The
character ß expands to SS when capitalized (and there are some other cases
in Unicode of that happening), but that is not the issue with TC->SC folding.
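
To make the ß point concrete, here is a small sketch using Python's built-in string methods, which follow the Unicode case mappings. The helper name same_ignoring_case is only an illustration, not something from any IDN draft.

```python
# Case mapping is usually length-preserving, but not always.
sharp_s = "\u00df"  # the character ß

upper = sharp_s.upper()      # full Unicode uppercase mapping
folded = sharp_s.casefold()  # full Unicode case folding


def same_ignoring_case(a: str, b: str) -> bool:
    """Compare two strings after Unicode case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()


print(repr(upper), repr(folded))
print(same_ignoring_case("Stra\u00dfe", "STRASSE"))
# The ae/ä fallback is not a case relationship, so folding does not equate them.
print(same_ignoring_case("\u00e4", "ae"))
```

The expansion changes the length of the string, but it does not merge two characters with different meanings, which is the separate problem discussed next.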

The issue with TC->SC folding is represented by cases where character X maps
to character Y, but X and Y have substantially different ranges of meanings.
There are certain edge cases where this happens with case folding: e.g., "China"
could mean either the country or the pottery, while "china" could only mean
the pottery. For case mappings, these are extremely rare and represent no
significant barrier to comprehension.

I am told that there are a substantial number of problem cases in TC->SC
mappings, where meaning is conflated, and that the mapping is not just 1-n
but m-n. More importantly, this issue has been batted about for a year or so
now, and yet to my knowledge nobody has made a specific proposed mapping
public. Such a mapping should have been presented long ago, so that there
would be enough time for detailed review and assessment by experts in the
field.
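
As a rough sketch of what "m-n" and "conflated meaning" look like, the table below uses a handful of commonly cited traditional/simplified pairs. It is purely illustrative and is an assumption on my part: it is not a proposed mapping, it is nowhere near complete, and the names TC_TO_SC and invert are made up for this example.

```python
# Illustrative only: a few commonly cited pairs, NOT a proposed mapping.
# Each traditional character maps to one or more simplified candidates.
TC_TO_SC = {
    "發": ["发"],        # to emit, to send
    "髮": ["发"],        # hair
    "後": ["后"],        # after, behind
    "后": ["后"],        # queen, empress
    "乾": ["干", "乾"],  # "dry" becomes 干, but 乾 is kept in words like 乾坤
    "幹": ["干"],        # to do; trunk
    "干": ["干"],        # shield; to concern
}


def invert(mapping):
    """Build the reverse (simplified -> set of traditional) relation."""
    inverse = {}
    for tc, scs in mapping.items():
        for sc in scs:
            inverse.setdefault(sc, set()).add(tc)
    return inverse


SC_TO_TC = invert(TC_TO_SC)

# Simplified characters that stand for more than one traditional character:
# folding onto these merges characters with different ranges of meaning.
conflated = {sc: tcs for sc, tcs in SC_TO_TC.items() if len(tcs) > 1}

for sc, tcs in sorted(conflated.items()):
    print(sc, "<-", sorted(tcs))
```

Even in this tiny sample the relation runs many-to-many in both directions, so the correct choice cannot be made character by character without knowing the word.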

Mark

----- Original Message -----
From: "John C Klensin" <klensin@jck.com>
To: "Maynard Kang" <maynard@pobox.org.sg>
Cc: <sun@cnnic.net.cn>; <idn@ops.ietf.org>
Sent: Monday, April 30, 2001 05:53
Subject: Re: [idn] Zone rules (was: wg milestones update)


> --On Monday, April 30, 2001 19:54 +0800 Maynard Kang
> <maynard@pobox.org.sg> wrote:
>
> >...
> > 2) Case-folding is a simple canonical process, and the folding
> > rules are the same, I believe (someone please correct me if I'm
> > wrong), for most scripts which are able to be represented in
> > ASCII (i.e. English, Swahili, Hawaiian,
> > Malay, etc).
>
> Well, nearly.
>
> As soon as one moves significantly out of the ASCII subset (not
> "English") used in traditional host names, little idiosyncrasies
> show up.  Some scripts have characters that don't have
> representations in the "other" case (the German eszett is one
> example), others have mapping rules related to accent or
> diacritical characters that differ between countries, even within
> a script.
>
> One of the more significant problems with the Chinese conversion/
> translation is that, as I understand it, one character in
> traditional may map to a two or more character phrase in
> simplified and, in general, one cannot expect a 1:1 mapping on
> character count.  But, while it is less common, that problem
> occurs in Roman-based scripts as well due to historical
> representation issues.  For example, the character "u with
> umlaut" (one character) is often written as "ue" (two
> characters).  "ae" may be a two-character sequence, or a
> representational form for a single-character diphthong, or a
> representational form for "a with umlaut".  As far as I know, the
> second of these never matches the third, but the first can match
> either the second, or third, or neither depending on the
> underlying language (i.e., one can't tell from script alone).
>
> One may, of course, try to make these problems different by
> describing some of them as "case folding" and others as something
> else, but I'm not convinced that is helpful.  So, while the
> scale may be different, I don't think the problems themselves are
> really very different.
>
>     john
>
>