
Re: [idn] The layers and character handling in DNS



--On Sunday, 18 February, 2001 06:59 -0800 Patrik Fältström
<paf@cisco.com> wrote:

>> So I find an IDN domain on the network, and, by the time its
>> name emerges from my application, it is in X, having been
>> passed through F1'(Unicode,Z).  I pass X to Keith out of band
>> (e.g., in an email message with "text/plain; charset=X").
>
> I think you send "X" as data, where the charset is "Z".

Yes, sorry -- typographical error.

> The whole goal of the normalization which UTC has done is
> that the number of X and Fn which make Nameprep(F1(X,Z)) !=
> Nameprep(Fn(X,Z)) is as small as possible. We can argue until
> we die how many of these X and n exist which break the
> pattern, and because of this, I think this discussion is mostly
> academic.

I wouldn't try to make a quantitative argument.  However, I
suggest that, if there are _any_ major "local" character sets
that have ambiguous and/or non-reversible mappings to 10646, then
we are in trouble with these sorts of algorithms.  I would define
"major" in terms of either the size or importance of the
potential user group: if we are in trouble with Chinese,
Japanese, Korean, or even with Latin-based alphabets with
different composition rules in different
interpretations/mappings, then I suggest that we are in big
trouble indeed.
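
As a rough illustration of what "non-reversible" means in practice,
here is a sketch (in Python, purely for exposition; the charset names
are just examples) of a round-trip test for whether a local charset X
maps cleanly into 10646 and back:

    def round_trips(octets: bytes, charset: str) -> bool:
        """True if octets -> Unicode -> octets is lossless for this charset."""
        try:
            text = octets.decode(charset)          # F(X,Z): local charset to Unicode
            return text.encode(charset) == octets  # inverse mapping, if one exists
        except UnicodeError:
            return False                           # unmappable code point: not reversible

    # Two decoders for "the same" charset may disagree: some Shift_JIS
    # tables map 0x5C to U+005C, others to U+00A5, so F1(X,Z) != F2(X,Z)
    # even when each F taken alone is reversible.
    print(round_trips(b"\x5c", "shift_jis"))

If any major charset fails this sort of test, or passes it only under
one vendor's mapping table, the trouble described above is real.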

> BUT: I think the key message John has, which the IDN group have
> to be aware of is that the IDN wg seems to have made a choice
> that Unicode with the normalization rules which are defined by
> UTC (not IETF) is the absolutely best functions that can be
> found. Further, the problems brought up _will_ happen, and we
> in the IETF can only rely on UTC being smart enough to develop
> normalization tables and do good marketing to see that the
> number of Fn is relatively small for each X.

I think this position works iff:

(i) There are _zero_ nameprep mapping rules or discounted
characters not specified in Unicode (and preferably ISO)
standards.   If we need a many-paged Nameprep document for any
purpose other than to provide a stable version of a UTC spec --I
consider it a problem if we need even that, but it is another,
and arguably separate, problem-- then the "hope UTC gets it
right" story fails.  A single IETF-specific rule, whether for
URIs, funny characters, or anything else, IMO kills this story
and turns it all into an IETF problem.
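
To make (i) concrete, the whole of Nameprep would then be little more
than the following sketch (Python, illustrative only -- the exact
ordering and tables are precisely what is at issue):

    import unicodedata

    def nameprep_sketch(label: str) -> str:
        # only UTC-defined operations, in a fixed order, nothing IETF-specific
        folded = label.casefold()                     # Unicode case folding
        return unicodedata.normalize("NFKC", folded)  # UTC compatibility normalization

Anything the document adds beyond calls like these is an IETF-specific
rule in the sense of (i).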

(ii) We have a really good versioning model so that someone
applying Nameprep can know what rules to use.  If
NamePrep2001(string) != NamePrep2002(string) and one can't tell
which one should be applied, I think that, long-term, the whole
thing deteriorates badly.   Given an obvious variation of the F1
!= F2 argument, I suspect (but haven't worked through the cases
to my satisfaction yet) that this means the relevant version
information needs to be encoded in the names themselves.  I.e.,
the real test is whether
  NamePrepN(F1(X,Z)) = NamePrepM(F2(X,Z))   and whether
  M and N can be determined in all places in which it is relevant
(which would be especially important if the equality doesn't
hold --it won't, or one wouldn't need NamePrep versions-- and
one needs to know which version to apply).
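
A sketch of why this matters (the version tags and tables below are
invented for illustration, not taken from any draft):

    import unicodedata

    # two hypothetical Nameprep versions that differ only in whether
    # they case-fold before normalizing
    NAMEPREP_VERSIONS = {
        "v1": lambda s: unicodedata.normalize("NFKC", s),
        "v2": lambda s: unicodedata.normalize("NFKC", s.casefold()),
    }

    def prep(label: str, version: str) -> str:
        return NAMEPREP_VERSIONS[version](label)

    x = "Stra\u00dfe"                      # capital S plus sharp s
    print(prep(x, "v1") == prep(x, "v2"))  # False: the versions diverge on this label

If a registry stored the name under "v1" and a resolver can only guess
which version to apply, the lookup fails for no reason visible to the
user -- hence the suggestion that the version be recoverable from the
name itself.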

> Now, the big question which John doesn't ask explicitly is
> whether these mappings are good enough, and if the IDN wg is
> aware of the fact that it should have been able to say "no, too
> many Fn exist for too many X, so we don't believe these
> mappings are good enough".

Since I don't believe that, if things don't work, users will
forgive us if we just say "don't blame us, it is UTC's problem," I
tend to ask a higher-level question.  And I'm getting
increasingly dissatisfied with the answer I get.

> Personally, I have been thinking of this a lot the last 6-8
> months, and my conclusion is:
>
> (1) The IETF is NOT a forum where we have the knowledge on
> making a decision whether a mapping function is good enough or
> not for characters. The only thing the IETF can do is to choose
> someone which works in this area and trust them doing an as
> good job as possible. UTC is doing a good job, but they have
> "bugs" and problems with their mapping tables (as anyone would
> have) and we in the IETF will inherit them -- for good and for
> bad.

Agreed.  But that implies, as above, that we make _zero_ of our
own rules.  If the Nameprep document is more complex than "take
the UTC rules and apply them" (perhaps in some specific order
relative to other things that need doing), I think we are out of
this space and into applying our own judgement... which I think we
agree we should not do.

> (2) Because of the issues John lists, or even more importantly the
> fact that we (amateurs) in the IETF already know that there exist
> two code points in Unicode which are not normalized to the same
> value but still look the same, the only way of solving the
> problem is to (a) have special normalization rules (not the UTC
> ones) in the IETF or (b) regardless of IDN try to push all
> applications to be aware of this issue so they start looking at
> dictionary approaches in places of the user interface where
> misunderstanding can happen. And (a) is NOT a path to take due
> to the argument in (1) above.

Indeed, that is the conclusion I keep reaching.
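
The problem is easy to demonstrate: Latin "a" (U+0061) and Cyrillic "a"
(U+0430) are indistinguishable in many fonts, yet no UTC normalization
form maps one to the other, so no nameprep-style processing will unify
them.  A small check (Python, illustrative only):

    import unicodedata

    latin_a, cyrillic_a = "\u0061", "\u0430"
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        same = (unicodedata.normalize(form, latin_a) ==
                unicodedata.normalize(form, cyrillic_a))
        print(form, same)   # False for every form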

> So, consistent behaviour is something we will get only if F1
> and Fn for the same X map to at least two code points which are
> normalized to the same in the nameprep phase -- and we in the
> IETF _have_ to rely on UTC for this. Eventual misunderstandings
> that can happen (or rather, WILL happen) because of
> non-normalization happening -- for example between latin, greek
> and cyrillic -- will not be solved by nameprep.

And the alternative, of course, is to move these issues into an
environment in which the ambiguities are tolerable or can be
passed back to the user.  That isn't the DNS.
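
What can be done at that other level is, for example, the kind of
mixed-script warning sketched below (Python, illustrative only;
Python's unicodedata has no script property, so the first word of the
character name is used here as a crude stand-in):

    import unicodedata

    def scripts_in(label: str) -> set:
        return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

    def looks_suspicious(label: str) -> bool:
        # mixed scripts in one label: ask the user, or consult a directory
        return len(scripts_in(label)) > 1

    print(looks_suspicious("example"))        # False: all LATIN
    print(looks_suspicious("ex\u0430mple"))   # True: LATIN mixed with CYRILLIC

That sort of check, or the dictionary approach Patrik mentions, belongs
in the application or directory layer, not in the DNS lookup path.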

> Or else we should not do this at all.
>
> My personal opinion is that we SHOULD do this, but know about
> the limitations, and regardless of the IDN solution with
> nameprep understand that we need a dictionary _as well_ because
> of the limitations that exist.

And I've gotten a bit more pessimistic and --faced with
approximate agreement on the facts and first-level conclusions--
am getting close to concluding that

   * we should _not_ do this at all;
   * we should press forward with a directory approach as the
     only one likely to solve the problem;
   * we should ask ourselves whether, if a directory approach is
     needed anyway, it makes sense to add this much complexity to
     the DNS when the real work will need to be done at another
     level.

And, coming back to the ongoing Last Call at a 10,000 meter
level, if the Requirements document doesn't draw out these types
of issues and at least point to the need for solutions or
explicit decisions, it isn't, IMO, ready for publication.

     john