
Character equivalence mapping (was: Re: [idn] SLC minutes)



Edmon suggested:

> Character Equivalence mapping is to deal with this issue:
> 
> A registrant registers a domain <ALPHA><BETA>.example
> Advertises it to other people as their capital form AB.example
> An end user will not know whether it was Greek or English and attempts to
> access the site with ab.example and does not get to it.
> 
> With Character Equivalence mapping, this situation would not occur.  No
> matter how a domain name is represented, it is always unique.

I think this example nicely points up the contrary problem that
cross-script mapping has. If you start doing cross-script equivalence
mapping to eliminate differences between (to Latin-trained eyes) confusable
letters, you violate the integrity of other scripts and start mapping
the set of possible strings in those scripts even more confusably into
the already crowded domain namespace of Latin strings.

In this particular example, suppose I were a Greek and actually wanted
to register <ALPHA><BETA>.com, in addition to <ALPHA><BETA>.gr for
the <ALPHA><BETA> construction company in Athens. Whoops! I'd be
out of luck since ab.com already exists and is registered to
Allen-Bradley. (See www.ab.com ) Why should I, as a Greek, find my
own Greek namespace unpredictably polluted by some arbitrary list
of equivalences between Greek letters and Latin letters?
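
To make the collision concrete, here is a minimal sketch in Python. The
LOOKALIKE table and the fold() helper are hypothetical, invented only for
this illustration; they are not any proposed or standardized mapping. The
registry set simply stands in for the existing ab.com registration.

```python
# Hypothetical look-alike table: Greek capital -> Latin lowercase.
# Illustration only; not a proposed or standardized equivalence list.
LOOKALIKE = {
    "\u0391": "a",  # GREEK CAPITAL LETTER ALPHA
    "\u0392": "b",  # GREEK CAPITAL LETTER BETA
}

def fold(label):
    """Fold a name through the look-alike table, lowercasing everything else."""
    return "".join(LOOKALIKE.get(ch, ch.lower()) for ch in label)

registered = {"ab.com"}            # already held by a Latin-script registrant

greek_name = "\u0391\u0392.com"    # <ALPHA><BETA>.com

# With cross-script folding the Greek name lands on the Latin one.
collides = fold(greek_name) in registered

# With ordinary within-script case folding it stays a distinct Greek string.
distinct = greek_name.lower() not in registered

print(collides, distinct)
```

Under these assumptions the Greek registration is refused only because of
the folding table, not because of anything in the Greek namespace itself.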

And exactly what equivalences would you suggest? Greek uppercase
eta is basically indistinguishable in shape from a Latin uppercase "H".
So do I equivalence map it to Latin "H", which would make no sense at
all for transliteration and serve only the purposes of dumb equations
for people who know nothing about Greek whatsoever? Or do I equivalence map it
to Latin "I", which is the normal transliteration for eta in Modern Greek?
Or do I equivalence map it to Latin "E", which is the normal transliteration
for eta in Ancient Greek?

So does: <ALPHA><BETA>.<OMICRON><MU><ETA><RHO><OMICRON><SIGMA>

equate to: ab.omhpo<sigma> or ab.omiro<sigma> or ab.omero<sigma> or
           ab.omhpos or ab.omiros or ab.omeros ?

By the way, the 5th example is how the Greeks themselves would Latinize
it. (see www.omiros.gr )
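
The ambiguity can be spelled out mechanically. The following Python sketch
enumerates the six candidates listed above; the three scheme tables
("visual", "modern", "ancient") and the fold() helper are my own labels for
this illustration, and "<sigma>" is kept as a placeholder exactly as in the
list above rather than being resolved to any particular character.

```python
# Greek capitals used in <ALPHA><BETA>.<OMICRON><MU><ETA><RHO><OMICRON><SIGMA>
ALPHA, BETA, ETA = "\u0391", "\u0392", "\u0397"
MU, OMICRON, RHO, SIGMA = "\u039C", "\u039F", "\u03A1", "\u03A3"

# Letters on which all three (hypothetical) schemes agree.
COMMON = {ALPHA: "a", BETA: "b", OMICRON: "o", MU: "m"}

# The schemes differ on eta and rho.
SCHEMES = {
    "visual":  {ETA: "h", RHO: "p"},  # shape-based look-alikes
    "modern":  {ETA: "i", RHO: "r"},  # Modern Greek transliteration
    "ancient": {ETA: "e", RHO: "r"},  # Ancient Greek transliteration
}

def fold(name, scheme, sigma):
    """Map a name using one scheme plus one choice for sigma."""
    table = dict(COMMON)
    table.update(SCHEMES[scheme])
    table[SIGMA] = sigma
    return "".join(table.get(ch, ch) for ch in name)

name = ALPHA + BETA + "." + OMICRON + MU + ETA + RHO + OMICRON + SIGMA

candidates = [
    fold(name, scheme, sigma)
    for sigma in ("<sigma>", "s")              # leave sigma alone, or map to "s"
    for scheme in ("visual", "modern", "ancient")
]

for c in candidates:
    print(c)
```

Every one of the six outputs is defensible under some rule, which is the
problem: nothing in the name itself says which rule to pick.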

The problem of "AB.example" is generally dealt with by context. First
of all "example" would be in Greek if I was really dealing with Greek.
Second, if I wanted people to enter "ab.whatever" I'd be advertising
in *English* to set the expectations. If I wanted people to enter
"<alpha><beta>.....", I'd be advertising in *Greek* to set the expectations,
and people would be using Greek keyboards and expect to enter Greek.

Furthermore, visual confusability quickly runs off the road as the basis
for determining equivalence classes when you start to deal with scripts
that have more complicated rules for the presentation of glyphs than
is typical for the Latin script. Which of several possible forms is
the basis for the confusability used to determine the equivalence?
And this turns into an N-body problem, because you start having to
account for visual confusability between N different scripts -- not just
between N scripts and Latin characters. Where do you draw the line, in
principle? Or do we just end up arguing for the next decade about all
the edge cases?
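
A rough count shows how fast this grows. The sketch below is only
arithmetic: the list of scripts is an arbitrary sample chosen for
illustration, and it counts script pairs, not the individual characters or
glyph forms that would have to be compared within each pair.

```python
from itertools import combinations

# Arbitrary sample of scripts, for illustration only.
scripts = ["Latin", "Greek", "Cyrillic", "Armenian", "Cherokee"]

# Comparing every other script against Latin only: N - 1 pairs.
against_latin = [(s, "Latin") for s in scripts if s != "Latin"]

# Comparing every script against every other: N * (N - 1) / 2 pairs.
all_pairs = list(combinations(scripts, 2))

print(len(against_latin), len(all_pairs))
```

For this sample of five that is 4 pairs versus 10, and the gap widens
quadratically as scripts are added.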

> 
> Bear in mind that this need to happen only during matching of names within
> the DNS server.
> 
> A registrant can register <ALPHA><b>.example all they want.  This is the
> misconception that I wanted to point out.  Character Equivalence mapping
> does not prohibit mixed scripts.

But it does severe damage to the integrity of namespaces in other scripts.

This is Latin- and English-centric thinking, in my opinion, that would
damage the whole point of having IDNs by folding other scripts towards
Latin characters.

--Ken

> 
> Edmon
> 
>