Re: Character equivalence mapping (was: Re: [idn] SLC minutes)



The reasons lowercase was uniformly chosen for case folding of Latin,
Greek, Cyrillic and Armenian in Unicode are (a) the vast majority of text in
use is in lowercase, and (b) there are a number of lowercase characters that
don't have uppercase forms (but may in the future). Moreover, the uppercase
forms are in general more visually confusable across scripts than the
lowercase forms are. Take a look at:

http://www.unicode.org/charts/PDF/U0370.pdf
http://www.unicode.org/charts/PDF/U0400.pdf
http://www.unicode.org/charts/PDF/U0530.pdf

BTW, there are also some pretty odd looking Latin characters, see:

http://www.unicode.org/charts/PDF/U0180.pdf

Not to mention the IPA characters that are sometimes pressed into service:

http://www.unicode.org/charts/PDF/U0250.pdf


While omega uppercase (Ω) is more distinctive than lowercase (ω), the latter
is not particularly confusable with 'w'; no more than upsilon (υ) is with
'u', or for that matter, uppercase upsilon (Υ) with Y.

(If you have an old emailer that doesn't handle UTF-8 you may not see the
Greek characters in parentheses.)
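
To make the folding direction concrete, here is a minimal sketch. It assumes
Python 3 and uses its built-in str.casefold(), which applies Unicode case
folding; this is a stand-in for illustration, not the nameprep code itself.
The helper name fold() and the choice of kra (U+0138) as an example of a
lowercase letter without an uppercase form are my own.

```python
import unicodedata


def fold(text: str) -> str:
    """Return a caseless-matching key using Unicode case folding."""
    return text.casefold()


# Greek capital omega (U+03A9) and the ohm sign (U+2126) both fold to
# Greek small omega (U+03C9).
for ch in ("\u03a9", "\u2126"):
    folded = fold(ch)
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
        f"U+{ord(folded):04X} {unicodedata.name(folded)}"
    )

# Folding works within a script: Greek omega never becomes Latin w.
assert fold("\u03a9") != fold("W")

# Reason (b): some lowercase letters have no uppercase form to fold to.
# Kra (U+0138) is lowercase and is left unchanged by upper().
kra = "\u0138"
assert kra.islower() and kra.upper() == kra
```

Folding to uppercase instead would leave letters like kra with no target,
which is part of why lowercase is the uniform choice.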

Mark
—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "tedd" <tedd@sperling.com>
To: <idn@ops.ietf.org>
Sent: Friday, January 04, 2002 07:30
Subject: Re: Character equivalence mapping (was: Re: [idn] SLC minutes)


Mark:

Alright, if the Greek script is forced into case-less matching, then
why does it have to be mapped from Upper to Lower Case? I can see
that mapping the upper case Alpha to the lower case alpha makes sense
in terms of not confusing the Latin "A" vs Greek "Alpha" issue. But,
forcing Omega to be mapped to omega only compounds the Latin "w" vs
Greek "w" issue. Are there any considerations for these types of
mapping issues, or is it summarily determined UC -> LC in all matters
regardless?

tedd


>Other scripts do have upper/lowercase correspondences, just like the Latin
>script does. Users of those scripts are just as likely to want caseless
>matching as users of the Latin script (such as you).
>
>For more information, see http://www.unicode.org/unicode/reports/tr21/.
>
>Mark
>-----
>
>Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα - Ὁμήρου Μαργίτῃ
>[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
>
>http://www.macchiato.com
>
>----- Original Message -----
>From: "tedd" <tedd@sperling.com>
>To: <idn@ops.ietf.org>
>Sent: Thursday, January 03, 2002 09:35
>Subject: Re: Character equivalence mapping (was: Re: [idn] SLC minutes)
>
>
>Mark, john, Edmon:
>
>>1. This issue was debated at length some time ago. I suggest that the
>>people arguing for visual confusability as a criterion for matching look at
>>that discussion in detail before proceeding.
>
>I'm not arguing (in this debate) the "look-alike" position. In other
>words, it makes no difference to me if certain glyphs look identical
>in numerous char sets. I am arguing the opposite position -- the
>characters in my example don't look alike.
>
>I am arguing the point that the decision "has been made" to map upper
>case Greek letters to lower case letters. For proof, look at the
>current version of nameprep ( http://www.imc.org/nameprep/  ) and try
>running code point 2126 (ohm sign, canonically equivalent to upper case
>omega, 03A9) through it. You will find that it IS mapped to code point
>03C9 (lower case omega).
>
>My question is "Why?" What's the foundation for this determination?
>What good reason is there to conclude that the upper case Omega
>should be mapped to a lower case omega?
>
>I see no "A.com" to  "a.com" argument/problem here. Clearly, if
>someone registered Ω.com and someone else registered w.com there is
>significant difference in identification between the two names. Those
>two domain names can be completely unique domain names with no
>significant resultant problems. Whereas, in the Latin char set, I can
>see the reason for making W.com and w.com identical (i.e., mapping W
>to w) because there is an UC/LC consideration/distinction in the
>language. But, that's not a problem in the Greek char set -- is it...
>really?
>
>>(i) From observation, when scripts have two cases, the
>>upper-case form is more likely to be highly stylized, and hence
>>differentiated from characters in other scripts, than the
>>lower-case one.  Hence, if one is going to adopt
>>stylization-based (glyph-distinction, if you prefer)
>>canonicalization rules, one is better off treating upper case as
>>the normal form, rather than lower case.
>
>It looks to me as if someone has already made the determination to
>map other languages based upon the Latin char set UC/LC problem
>without concern that other languages may not have the UC/LC
>distinction and thus be absent of the UC/LC problem. I think the
>Greek example I gave above sufficiently demonstrates my observation.
>
>tedd
>
>--
>http://sperling.com


--
http://sperling.com