The two references below summarise much that has been said about the
difficulty of dealing with the internationalisation of Domain Names. Let
us agree once and for all:
1. The completely general problem is mathematically */and/*
computationally intractable, even if we use fuzzy mapping;
2. The problem is a typical engineering challenge: to find a workable
solution -- future-proofed as much as possible -- which is minimally complex;
3. If the engineers (us?) don't solve it, the lawyers will have a field
day, the courts will find expensive solutions, the cost of running
the web will blow out, and all of us will have mud all over our faces.
4. Now is the time -- when there are only a very few registered names
with possible clashes -- to do it, before we */have/* to go through the
painful process of unregistering names and upgrading TLD machine codes.
So let's sketch out an approach, using <.com.ru> as an example.
a) The <.com.ru> registrar only accepts Latin characters for that domain
name, or only accepts Cyrillic characters, */no mix/*, and maps the two
as equivalent. Case-equivalence mapping */may/* also be allowed, at a
cost of more complexity. Let the registrar decide that, and let's be
sure that, as far as possible, the issuing authority licensing the TLD to
the registrar ensures legal protection for these */arbitrary/* but
fixed decisions.
b) The first filter selects for special attention those name tags whose
code points (including diacritics, etc.) are not all in the Cyrillic
block, or not all in the Latin block(s).
My guess is that at this point, only a few percent will require special
attention.
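As a minimal sketch of the filter in step (b) -- the block ranges, names,
and the choice of Python here are my own illustrative assumptions, not a
registrar's actual rules -- one might write:

```python
# First-pass sieve: flag for special attention any label whose code
# points are not all Latin or all Cyrillic. The ranges below are a
# simplification; a real registrar would use full Unicode block data.

LATIN_RANGES = [(0x0041, 0x005A), (0x0061, 0x007A),
                (0x00C0, 0x024F)]          # ASCII letters + Latin-1/Extended
CYRILLIC_RANGES = [(0x0400, 0x04FF)]       # Cyrillic block

def in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)

def needs_special_attention(label):
    """True if the label is neither all-Latin nor all-Cyrillic."""
    cps = [ord(c) for c in label]
    all_latin = all(in_ranges(cp, LATIN_RANGES) for cp in cps)
    all_cyrillic = all(in_ranges(cp, CYRILLIC_RANGES) for cp in cps)
    return not (all_latin or all_cyrillic)

print(needs_special_attention("example"))        # all Latin -> False
print(needs_special_attention("\u043f\u0440\u0438\u043c\u0435\u0440"))  # all Cyrillic -> False
print(needs_special_attention("ex\u0430mple"))   # Cyrillic 'а' mixed in -> True
```

Everything that passes this cheap test goes straight through; only the
mixed labels fall through to the later, more expensive rules.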
c) At this point, the <.com.ru> registrar will need to exercise some
common sense. For instance, it seems unreasonable that this domain
should accept codes outside the Latin and Cyrillic code blocks, and if
it does, then mixes should be strongly discouraged. Certainly, the use
of, say, Hebrew vowel pointing with Latin codes, while perhaps
acceptable in the Israeli TLD, should be unacceptable in the Russian
TLD. In fact, as a general rule, mixes of diacritics from one code block
with code points from another should never be allowed.
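A rough sketch of that diacritic rule, using Python's unicodedata module as
a stand-in for real script-property data -- the name-based script guess and
all function names here are my assumptions, not a standard algorithm:

```python
import unicodedata

def mark_script(ch):
    """Crude script guess for a combining mark from its Unicode name;
    generic COMBINING marks (usable with any script) count as neutral."""
    first = unicodedata.name(ch, "UNKNOWN").split()[0]
    return None if first == "COMBINING" else first

def base_script(ch):
    """Crude script guess for a base character (first word of its name)."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def has_cross_script_diacritic(label):
    """True if a combining mark's script differs from its base character's."""
    current = None
    for ch in label:
        if unicodedata.combining(ch):      # ch is a diacritic
            s = mark_script(ch)
            if current and s and s != current:
                return True
        else:
            current = base_script(ch)
    return False

# Latin 'a' + HEBREW POINT HIRIQ (U+05B4): the disallowed kind of mix
print(has_cross_script_diacritic("a\u05B4"))   # True
# Latin 'e' + COMBINING ACUTE ACCENT (U+0301): a generic mark, fine
print(has_cross_script_diacritic("e\u0301"))   # False
```

A production version would consult the Unicode Scripts property rather
than parsing character names, but the shape of the check is the same.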
Further rules can limit the legal sequences of the allowed mixes. For
instance, in alphabetic scripts such as Latin and Cyrillic, isolated
code points from one script found in another make no sense unless
spoofing is intended. Earlier, I suggested that a code-point string of a
single script found mixed with strings of other scripts should be of
minimum length 2. One can also limit the number of separate substrings
of an alternate script found interspersed with a dominant (national?)
script.
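The two run rules above (minimum foreign-run length 2, plus a cap on the
number of foreign runs) can be sketched as follows; the name-based script
heuristic and the default cap of one foreign run are illustrative
assumptions on my part:

```python
import unicodedata
from itertools import groupby

def script_of(ch):
    """Crude script guess from the first word of the Unicode name; a real
    implementation would use the Unicode Scripts property."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def script_runs(label):
    """Collapse the label into (script, run_length) pairs."""
    return [(s, len(list(g))) for s, g in groupby(label, key=script_of)]

def passes_run_rules(label, max_foreign_runs=1):
    runs = script_runs(label)
    if len(runs) <= 1:
        return True                        # single script: nothing to check
    dominant = max(runs, key=lambda r: r[1])[0]
    foreign = [(s, n) for s, n in runs if s != dominant]
    if any(n < 2 for _, n in foreign):     # isolated foreign code point
        return False
    return len(foreign) <= max_foreign_runs

print(passes_run_rules("payp\u0430l"))     # lone Cyrillic 'а' -> False
print(passes_run_rules("example"))         # single script -> True
```

The computational cost is one pass over the label, which supports the
point below that the overhead of such sieves is minimal.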
These sorts of common-sense rules can be easily implemented, and the
computational overhead is minimal. Of course, owners of ridiculous trade
marks (such as <U+004B U+0049 U+039B>, "KIΛ", for the brand name of the
automobile "KIA") will disagree, but realism has to intrude somewhere
into the free market economy.
The problems for universal TLDs (<.com>, <.net>) are far more complex,
because they are required to accept all language scripts. At the TLD
itself, one can allow a limited but finite number of character strings
to be equivalent, including the rule that script mixtures are
inadmissible, but maybe case folding will be allowed.
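If case folding is the one equivalence admitted, a collision check reduces
to comparing casefolded forms; Python's built-in str.casefold() (Unicode
full case folding) is enough to illustrate -- the function name is my own:

```python
def same_registration(a, b):
    """Two labels collide iff their Unicode casefolded forms match."""
    return a.casefold() == b.casefold()

print(same_registration("Example", "EXAMPLE"))       # True
print(same_registration("stra\u00DFe", "STRASSE"))   # True: 'ß' folds to 'ss'
print(same_registration("example", "ex\u0430mple"))  # False: Cyrillic 'а'
```

Note the last case: case folding alone says nothing about cross-script
confusables, which is exactly why the sieve rules are still needed.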
Once again, however, application of some judicious sieve filters and
rules about how mixed scripts may be composed can simplify the handling
of the name tags. There are also sieve rules that can immediately throw
out most inadmissible combinations, such as the string-length rule
mentioned above. Those strings remaining can be tossed to a human, who
will be required to be an expert in orthography (nice new line of
business for many on the Unicode list?).
Now, it doesn't make sense for these rules to be part of a standard on
how to extend domain names to use scripts other than Latin: they are
much better handled as (algorithmic where possible) regulations
specified by the authority for a given TLD, or set of TLDs, in the case
of the universal TLDs.
By using this approach, and starting off with a set of rules that
disallow most forms of script mixes, the rules can later be loosened --
where appeals to common sense and the wishes of a reasonable number of
potential clients suggest it -- with little disruption to the existing
state of affairs.
George
------
On 22 Feb 2005, at 08:40, Doug Ewell wrote:
Hans Aberg <haberg at math dot su dot se> wrote:
The suggestion I made was to use a function to detect confusables by
declaring them equivalent, but retaining the full Unicode character
set for representing the IDNs. If this is used at the registration
level only, the only thing that happens when somebody enters a
confusable is that it is rejected. There is a problem only when an
authority admits parallel, confusable names to be registered.
Granted. The problem, as I have said so often, is determining what the
set of "confusables" is. Don't just say a/а and o/ο, either; that's
only the tip of the iceberg.
On 22 Feb 2005, at 07:03, Erik van der Poel wrote:
Hans Aberg wrote:
Sure you can change it: One can make the equivalence classes smaller
whenever one wants.
As a mathematician, one might be inclined to think that way. But
here, we're not talking about theoretical mathematics. We're talking
about network engineering. A totally different way of thinking.
You can't just change the mapping whenever you want because there
are many (client and server) installations out there that can't be
changed overnight (what is known in network parlance as a "flag day").
For example, even if a registry were to change their mapping, go
through their entire database, and delete the names that are
determined to be duplicates (however one might accomplish that),
there will be people with the old version of the app, which uses the
old mapping, and will not be able to find the name (since it has
been deleted).
Now, this might be a good thing if the name is an evil spoof, but
what about innocent registrations? What if two separate parties have
an equally legitimate claim on a particular name? This happens a lot
in the ASCII DNS, and basically, whoever got there first (or is
willing to pay a lot of money) wins.
One way to continue to support these innocent duplicates is to use a
different prefix (i.e. something other than xn--) in the new
mapping, and keep the old names (with the old prefix) in the
database (instead of deleting them). This way, the old clients
continue to find the old innocent names.
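For concreteness: the xn-- prefix is the ACE (ASCII-compatible encoding)
prefix that IDNA prepends to the Punycode form of a label. Python's
built-in idna codec (IDNA 2003) shows the round trip; the point of a new
prefix would be that old and new encodings of the same Unicode label
differ, so both can coexist in the database:

```python
label = "b\u00FCcher"             # 'bücher'
ace = label.encode("idna")        # Punycode + the "xn--" ACE prefix
print(ace)                        # b'xn--bcher-kva'
print(ace.decode("idna"))         # back to 'bücher'
```

Old clients would keep resolving xn-- names; new clients would query the
new-prefix namespace, with all the migration pain described above.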
But what about the new clients? Now they will suddenly end up on a
different Web site when the user clicks on a link. I suppose the
user will just have to update their client, or the domain name owner
will have to register a different name and update all the Web pages
to point to the different name (assuming that they even have control
over *all* of the Web pages that might contain a link to their site).
And so on. Do you get it now? You can't just change the mapping
"whenever" you want. If you do this at all, you do it as few times
as possible.
Now, you may point out that we are just getting started with IDN and
that not very many names have been registered (and I may even agree
with you), but it would still take a while to come up with a better
mapping (and reach consensus on it -- shudder), and in the meantime,
more names would be registered.
And this still would not negate my main point, which is that you
can't do this "whenever" you want.
Erik