[idn] Time to reconsider

To: idn@ops.ietf.org
Subject: [idn] Time to reconsider
From: John C Klensin <klensin@jck.com>
Date: Thu, 24 May 2001 09:24:56 -0400
Delivery-date: Thu, 24 May 2001 06:25:22 -0700
Envelope-to: idn-data@psg.com
I started to write the attached as a response to a note in a
side-conversation.  I think it belongs on the list.  Since I
haven't asked the author of the quoted material/questions, I've
anonymized those remarks (variations on which, actually, could
have been made by several people in the threads of the last few
days although I particularly like this formulation).

Quick summary: I believe that we should not agree to either
Nameprep or the Requirements document as they stand because the
underlying assumptions behind what we are doing have been shown,
by the WG's work, to be fundamentally flawed.  I try to explain
"why" below, followed by a quick summary of what we should be
doing instead.  More on the latter within the next day or two.

(referring to the former (as of draft -06) requirement [30] and
other "localization" ideas)...
> Is scope a requirement? Is scopelessness a requirement?

My personal response would be that this is more evidence that
trying to do these things in the DNS is a mistake.  The issue
isn't that of one coding versus another.  It is that we are
expecting a lookup-and-exact-match system to do the impossible
-- fuzzy matching that doesn't astonish a user whose cultural
and linguistic assumptions cannot be predicted or identified
given the tools at hand.

> There was substantive discussion involving CNNIC, and the
> question of TC and SC equivalence. [...]  at Minneapolis [...]
> I'd prefer to see technical
> consensus.

My personal belief is that you will never get it.  The problem
is that interpretations of "matching" vary locally.  That means
there will be cases of ambiguity -- things that do, or do not,
match depending on context or the user's assumptions.
Equivalence between simplified and traditional Chinese is either
a matching issue or a language-translation issue, depending on
those assumptions: the requirement is clear for those who need
the matching, but agreement that it is universally necessary or
desirable is problematic.

And that may be one of the cases with an answer on which global
consensus can be reached, but the more general problem is
certainly not unique to one language or script.  As we keep
looking, it seems that we have at least a few examples per
script.  Even ASCII (although not the "hostname" subset) has
them if one goes back far enough or starts looking carefully.
E.g., are "tilde" (Ux007e) and "not sign" (Ux00AC) the same
character?   Well, several programming languages and database
systems made them the same (the glyphs were treated as alternate
stylizations of the same character) back in what could be
described as the "unified Roman" days of ISO 646.  For some,
"macron" (Ux00AF) joined the same group.  If, as some registries
are proposing (or
actually registering), we can have dingbats in the DNS, there is
certainly no reason to ban these characters.  There are several
single-line optical scanner products on the market that purport
to read URLs and enter them directly into browsers.  Do these
code points match?    How about "vertical line" (Ux007C) and
"broken bar" (Ux00A6)?  That pair and the tilde/not-sign one are
treated as equivalent in character set translations between ASCII
and another _very_ popular coded character set.   To a person not
trained or socialized in Roman alphabets, the pairs may "look"
more similar than two Han characters, or two characters from
different Han-based language scripts look to an untrained person
such as myself.

But they are separated in Unicode because there are good reasons
to separate them, especially if one is worried about typography.
Should they be matched (treated as equivalent)?  Well,
sometimes: no "yes" or "no" answer is going to satisfy everyone
and, worse, the same answer may be unsatisfactory to the same
person on different occasions.  One needs context --cultural or
use-- and the DNS cannot supply or support that context in an
unambiguous way, at least without kludges which are much more
horrible than any we have contemplated so far.
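
To make the point about separate code points concrete, here is a
minimal sketch in Python.  It assumes that "exact match" means plain
string equality and uses Unicode compatibility normalization (NFKC,
via the standard unicodedata module) as one candidate for a
context-free matching rule.  The helper names exact_match and
nfkc_match are illustrative only and do not come from any draft.

```python
import unicodedata

# The two pairs discussed above, written as escapes so the code
# points are unambiguous regardless of font or mail encoding.
PAIRS = [
    ("\u007e", "\u00ac"),  # tilde / not sign
    ("\u007c", "\u00a6"),  # vertical line / broken bar
]


def exact_match(a: str, b: str) -> bool:
    """Lookup-and-exact-match: code point sequences must be identical."""
    return a == b


def nfkc_match(a: str, b: str) -> bool:
    """Compare after compatibility normalization (one context-free rule)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


for a, b in PAIRS:
    print(
        f"U+{ord(a):04X} {unicodedata.name(a)} vs "
        f"U+{ord(b):04X} {unicodedata.name(b)}: "
        f"exact={exact_match(a, b)} nfkc={nfkc_match(a, b)}"
    )
```

Neither rule treats either pair as equal, which is consistent with
the argument: whatever equivalence some users expect here has to come
from context that a fixed, global rule does not carry.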

Consequently, there is no way to get technical agreement without
favoring the social/political/cultural preferences of one
group over another.  I believe that the WG has managed to prove
a negative by exhaustion, and that we need to stop before we
wreck the DNS only to get an unsatisfactory and incomplete
result.

So, my formulation to respond to Marc's "time to move" question
is that we should:

(i) Carefully review the requirements document again.  The
issues over [30] (which I saw as a "localization" -- every zone
can have its own matching rules-- hook) may exist in other
places.  The resulting document should either focus on solutions
that are not limited to changes to the DNS or what is permitted
in it or how its names are interpreted (I consider IDNA, etc.,
to be cases of such changes, disclaimers that the interpretation
is elsewhere notwithstanding), or should clearly identify where
important requirements cannot be met in the DNS alone.

(ii) We should make another sweep over "nameprep", to change the
matching and mapping lists from "yes" or "no" to "yes", "no",
or "sometimes" (or an equivalent three-way choice).  The
latter, ambiguous, category should include cases like those
above as well as the "is 'A' in Roman, Cyrillic, or Greek?"
issue we have kicked around before, and all of their relatives
and parallels in other scripts.
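
As a rough illustration of what a three-way list might look like,
here is a small Python sketch.  Everything in it is hypothetical: the
names Match, AMBIGUOUS, classify, and resolve are invented for the
example, the table holds only the pairs mentioned in this note, and
the "context" is reduced to a single boolean supplied by the caller.
It is not a proposal for the actual nameprep tables.

```python
from enum import Enum


class Match(Enum):
    YES = "yes"
    NO = "no"
    SOMETIMES = "sometimes"


# Hypothetical "sometimes" table: unordered pairs whose equivalence
# depends on cultural or usage context.
AMBIGUOUS = {
    frozenset({"\u007e", "\u00ac"}),  # tilde / not sign
    frozenset({"\u007c", "\u00a6"}),  # vertical line / broken bar
    frozenset({"\u0041", "\u0410"}),  # Latin A / Cyrillic A
    frozenset({"\u0041", "\u0391"}),  # Latin A / Greek Alpha
    frozenset({"\u0410", "\u0391"}),  # Cyrillic A / Greek Alpha
}


def classify(a: str, b: str) -> Match:
    """Return the three-way verdict for two single characters."""
    if a == b:
        return Match.YES
    if frozenset({a, b}) in AMBIGUOUS:
        return Match.SOMETIMES
    return Match.NO


def resolve(a: str, b: str, context_says_same: bool) -> bool:
    """Let caller-supplied context decide only the ambiguous cases."""
    verdict = classify(a, b)
    if verdict is Match.SOMETIMES:
        return context_says_same
    return verdict is Match.YES
```

The design point is that "sometimes" is a distinct answer that the
table can record but cannot settle; resolve() needs an extra input,
and that input has to come from somewhere outside the table itself.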

(iii) And then we should concentrate IETF's efforts on searching
and matching systems that permit context and user choice over
the ambiguous cases, rather than hoping that "sneak this into
the DNS by clever coding and hope the problems will somehow
solve themselves elsewhere --perhaps by the world adapting its
natural languages and cultures to our conventions" will be
adequate.  It won't.

     john