Re: [idn] SLC minutes

To: Edmon <edmon@neteka.com>, tedd <tedd@sperling.com>
Subject: Re: [idn] SLC minutes
From: John C Klensin <klensin@jck.com>
Date: Wed, 02 Jan 2002 20:10:52 -0500
Cc: idn@ops.ietf.org
In-reply-to: <005501c193ed$cc020060$0601a8c0@neteka.com>
References: <005501c193ed$cc020060$0601a8c0@neteka.com>

Edmon, Tedd,

Please don't interpret what I'm about to say as encouragement to
go down the "character equivalence" path.  I think it leads to
madness, especially when used with the exact-matching (i.e.,
"get it exactly right or fail completely") procedures of the DNS.
But, just as an exercise, I would suggest that one might start
with three principles:

(i) From observation, when scripts have two cases, the
upper-case form is more likely to be highly stylized, and hence
differentiated from characters in other scripts, than the
lower-case one.  Hence, if one is going to adopt
stylization-based (glyph-distinction, if you prefer)
canonicalization rules, one is better off treating upper case as
the normal form, rather than lower case.

(ii) Many alphabets, especially western European
"Roman-character-based" ones, utilize diacritical marks to
distinguish sounds or accents associated with a single base
letter.  However, in many languages and dictionaries, a
character with a diacritical mark (or with one diacritical
and not another) is treated as different from that character
without the marking.  As a typographical convention that has
become established language practice in some places, those
diacritical marks are often dropped when a lower-case character
carrying them is converted to upper case (in other languages and
scripts, or with other characters, they are not dropped).  But,
since they are sometimes dropped, lower-case characters (which
always carry the diacriticals when they are appropriate) provide
better differentiation than upper-case characters.  Hence, if
one is going to adopt differentiation-based canonicalization
rules with recognition of scripts that use diacriticals, one is
better off treating lower case as the normal form, rather than
upper case. 

(iii) There is a theory among scholars of writing systems that
_all_ modern alphabetic writing systems are derived from a
single Old Semitic writing system (which didn't have case
distinctions).  One fairly extreme way to accomplish the sort
of "equivalence" you are looking for is to drop all attempts to
visually distinguish or match these characters (a process that
depends heavily on the fonts chosen for each, which contradicts
a Unicode design principle), map every character onto its
equivalent in that ancient script, and drop all other
differentiation.  Of course, doing this would eliminate much of
the goal of IDN work, as it would treat as equivalent not only
Roman- and Greek-based scripts but also, e.g., the scripts of
the Middle East, Africa, and the Indian subcontinent.  It just
depends on how far one wants to go.

Note that the three possibilities above are contradictory and
fairly close to being mutually exclusive, although a strong case
can be made for each if one starts to do "character
equivalences".
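
As a rough illustration of principle (ii) only: the sketch below
assumes an upper-casing routine that follows the typographical
convention of dropping diacriticals.  The function name and the
sample words are hypothetical, chosen for the example, and this is
not how any IDN or nameprep procedure is defined.  It shows that
the lower-case forms stay distinct while the upper-case forms
collapse onto one string.

```python
import unicodedata


def upper_dropping_marks(text: str) -> str:
    """Upper-case text, then strip combining marks.

    Hypothetical helper: it models the convention of dropping
    diacriticals on capitals, not any standard case mapping.
    """
    decomposed = unicodedata.normalize("NFD", text.upper())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Sample words that differ only in their diacriticals (illustrative).
words = ["cote", "c\u00f4te", "cot\u00e9", "c\u00f4t\u00e9"]

lower_forms = {w.lower() for w in words}
upper_forms = {upper_dropping_marks(w) for w in words}

print(len(lower_forms), "distinct lower-case forms")
print(len(upper_forms), "distinct upper-case forms:", upper_forms)
```

Note that a plain Unicode upper-casing (Python's str.upper())
keeps the marks; the loss appears only when the dropping
convention is applied, which is the case principle (ii) is
concerned with.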

     john

--On Wednesday, 02 January, 2002 19:30 -0500 Edmon
<edmon@neteka.com> wrote:

> Hi Tedd,
> 
> ----- Original Message -----
> From: "tedd" <tedd@sperling.com>
>> Now, the question is specifically: "Why is upper case <OMEGA>
>> mapped to lower case <omega>?"
>> 
> 
> It doesn't have to map to lower case.  It could map to upper
> case.  In fact it could be mapped to some other code point too
> if that also needs to be mapped.  We should however come up
> with a general strategy for Character Equivalence mapping, for
> example, if there are 4 codepoints that could be considered
> "equivalent" in perception then they will all be mapped to the
> lowest codepoint regardless of whether it is uppercase or
> lowercase...
> 
>> 
>> For example, what's the difference between "w.com" and
>> "w.com"?
> 
> I would argue that there is no difference perceptually.  That
> is the precise reason why there needs to be character
> equivalence mapping. Which means that all four codepoints: <W>
> <w> <OMEGA> <omega> should be prepped so that if a person
> registered:
> 
> <W><w><OMEGA><omega>.example
> 
> no matter how he represented it, (e.g. wwww.example) people
> will still be able to get to the unique domain.
> 
>> 
>> If one was trying to solve a problem here, then I claim they
>> didn't think it out. Now, point out where I'm wrong.
>> 
> 
> You are not wrong.  We are talking about the same thing I
> think...
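
For concreteness, the "lowest codepoint" strategy proposed in the
quoted text could be sketched roughly as follows.  The equivalence
class below is taken from the <W> <w> <OMEGA> <omega> example; the
table and function names are hypothetical, and the sketch is an
illustration of the proposal under those assumptions, not a
recommendation or an existing nameprep rule.

```python
# Hypothetical equivalence classes; which code points count as
# "equivalent in perception" is an assumption made for this example.
EQUIVALENCE_CLASSES = [
    {"W", "w", "\u03a9", "\u03c9"},  # <W> <w> <OMEGA> <omega>
]

# Map every member of a class to the member with the lowest code point.
CANONICAL = {
    ch: min(cls, key=ord)
    for cls in EQUIVALENCE_CLASSES
    for ch in cls
}


def canonicalize(label: str) -> str:
    """Replace each character by the lowest code point in its class."""
    return "".join(CANONICAL.get(ch, ch) for ch in label)


registered = canonicalize("Ww\u03a9\u03c9")  # <W><w><OMEGA><omega>
typed = canonicalize("wwww")

print(registered == typed)
```

With this particular class the lowest code point is upper-case
<W>, so both spellings reduce to the same label, and the choice
of "normal form" falls out of code point order rather than out of
any of the three principles above.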
