Re: [idn] Determining equivalence in Unicode DNS names
Liana (and others),
Let's try to review how we have gotten here, since I partially
disagree with Patrik (the disagreement leads, however, to the
same conclusion, only more strongly).
--On Sunday, 20 January, 2002 23:01 -0800 liana Ye
<liana.ydisg@juno.com> wrote:
>> So, this wg decided that we will use one and only one
>> matching rule, just like we decided to use only one
>> character set. Both of these (the rule and the charset) are
>> created in the Unicode Consortium.
Actually, I don't believe this is correct -- or, at least, the
language isn't precise enough -- so let me try to restate it. The fact
that there is only one matching rule is a consequence of the
design of the DNS: largely because of the binary nature of the
underlying architecture, the only matching rule that is
possible involves a bit-wise comparison on each octet in turn.
Unless the labels are assumed to be "binary" (i.e., the
case-mapping rule does not apply), that bit-wise comparison is
made under a mask that collapses ASCII upper-case characters
onto their lower-case counterparts.
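For concreteness, here is a minimal sketch (my illustration, not
anything from the WG documents) of the kind of comparison the DNS
actually performs on "character" labels; the function names are
invented for the example:

    /* Sketch only: octet-by-octet comparison of two labels of equal
       length, folding ASCII 'A'..'Z' onto 'a'..'z' by setting the
       0x20 bit.  Nothing here knows about scripts or languages. */
    #include <stddef.h>

    static unsigned char fold(unsigned char c)
    {
        return (c >= 'A' && c <= 'Z') ? (unsigned char)(c | 0x20) : c;
    }

    int labels_equal(const unsigned char *a, const unsigned char *b,
                     size_t len)
    {
        size_t i;
        for (i = 0; i < len; i++)
            if (fold(a[i]) != fold(b[i]))
                return 0;
        return 1;
    }

That, plus an exact octet-for-octet comparison for labels treated
as binary, is the entire repertoire.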
Now, we can map or transform all sorts of things together before
they get injected into the actual DNS. But, as far as the DNS
is concerned, there is only one matching rule -- or two, if
"binary labels" are handled differently from "character
labels" -- and that is it.
Are there ways around this? Of course there are. The WG has
looked at many of them and given up on them long ago. For
example:
(i) One could introduce a completely new set of query types for
labels that might contain non-ASCII strings, and apply a
different matching rule for them. Unfortunately, this would be
hugely complex -- we are having quite enough problems with the
three address-type records/queries we have today -- and might
take forever to deploy.
(ii) One could introduce a single new label and query type whose
sole purpose was to provide a new form of alias -- like CNAME
only with Unicode (encoded somehow) in the label. The WG looked
at variants on that theme early on and decided not to pursue
them. In hindsight, the approach essentially implies "within
the DNS" layering, yet it has all of the disadvantages of an
"above the DNS" layering scheme and few of the advantages.
(iii) One could adopt the "new class" model, using that class to
impose a somewhat different set of matching rules by redefining
the query types. That was, you will recall, proposed and
rejected (or, more accurately, ignored), I think mostly because
people were concerned about the length of time it would take to
deploy.
(iv) One could try to use EDNS to specify alternate mapping
rules or conventions. Again, the problem is deployment,
aggravated by some rather complex issues involving caching and
secondary servers. And, again, the WG considered several
options along these lines and could not reach consensus on
moving forward with any of them.
One thing that all of these --except, possibly, the last-- have
in common is that, while they would permit a _different_ matching
rule, none of them would permit per-language or per-script rules.
You still get only one (or two).
> The decision of use one and only one matching rule is
> at false, because of there is no ONE rule can deal with
> hundreds of different scripts no matter how strong it
> appears that you defend the stand.
The rule doesn't deal with scripts at all. It doesn't deal with
languages either. It deals only with strings of bits and deals
with them in a mathematically simplistic way.
> The charset appears in the form of a table, which gives
> us a better view of these chars and we have a base
> to work with. But when you try to use only ONE rule to
> deal with all of the chars, the rule has to be bias for
> only one type of script, currently Latin.
No. The only "bias" is the ability to use a bit masking
algorithm for ASCII upper-lower case matching. There are severe
and complex problems in dealing with the range of Latin scripts.
> If we let this
> pass from this IDN wg, then all of us here are arguing
> for nothing, as the "Prohibit CDN code points" is the
> only way out suggested by Kenny Huang.
> Isn't the time for the WG to wake up from the "one rule
> for all" scope?
>
> The only way out is to use multiple rules to treat multiple
> scripts, where each rule is identified by a tag.
And where are you going to put the tags? Again, the WG has been
around this question many times. There isn't a logical place to
put the tags unless one further shortens the fraction of the
label-length-space available for user-visible information.
There are difficult problems in doing the tagging from a
user-interface standpoint. Language (or even "script") tagging would
prohibit many combinations of characters for which there is
clear demand, and the WG has been reluctant or unwilling to
start making the policy decisions that "one label, one script"
implies.
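To see why the label-length budget matters, here is a toy
calculation (my own illustration; the 63-octet limit on a label
comes from RFC 1035, and the tag lengths below are purely
hypothetical):

    /* Toy arithmetic only: every octet reserved for a hypothetical
       script/language tag is an octet no longer available for the
       user-visible (encoded) part of a label. */
    #include <stdio.h>

    int main(void)
    {
        const int label_max = 63;   /* RFC 1035 label-length limit */
        int tag_len;
        for (tag_len = 0; tag_len <= 8; tag_len += 4)
            printf("%d-octet tag leaves %d octets for the rest of the label\n",
                   tag_len, label_max - tag_len);
        return 0;
    }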
The bottom line is that, if you really think you know how to do
this, we are anxiously awaiting the Internet Draft. But it
needs to cover all of those complex cases with mixed-language
labels and scripts that can't be neatly mapped onto languages
and vice versa.
My own belief is that many of these things require, not a
complex set of binary matching rules, but enough of a notion of
distance functions to talk, not about "matching" but about
similarities and differences. Cyrillic, Greek, and Latin "A"
look similar, and are "sort of" the same letter (same origins,
overlapping pronunciations), but don't "match". Greek
lower-case omega "looks somewhat like" Latin "w", but it isn't
the same letter (even "sort of"). Really getting TC<->SC
(Traditional<->Simplified Chinese) mappings right may require
context, distance/probability functions, or both (the latter if
there isn't enough context).
We can posit doing those things in other sorts of systems, but
not in the DNS, even with a broader set of matching rules.
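As a purely illustrative toy (nothing like this exists in the DNS,
and "distance" here is only a figure of speech), consider how far
"similar but not equal" is from anything the protocol can express:

    /* Toy illustration: Latin A, Greek Alpha, and Cyrillic A are
       three distinct code points, so a bit-wise comparison can only
       say "no match"; a distance function could instead say "close". */
    #include <stdio.h>

    struct letter { unsigned int cp; const char *name; };

    int main(void)
    {
        struct letter l[3] = {
            { 0x0041, "LATIN CAPITAL LETTER A" },
            { 0x0391, "GREEK CAPITAL LETTER ALPHA" },
            { 0x0410, "CYRILLIC CAPITAL LETTER A" }
        };
        int i, j;
        for (i = 0; i < 3; i++)
            for (j = i + 1; j < 3; j++)
                printf("%s (U+%04X) vs %s (U+%04X): bit-wise match = %s\n",
                       l[i].name, l[i].cp, l[j].name, l[j].cp,
                       l[i].cp == l[j].cp ? "yes" : "no");
        return 0;
    }

"Similar" is a graded judgment; DNS matching is a yes/no comparison
of bits, and there is no place in the protocol to hang the former.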
Now, I know you don't like that answer, and that saddens me.
But wishing, and even unhappiness, won't change the mathematics
that underlie DNS matching.
john