Re: [idn] ZWNJ

To: Roozbeh Pournader <roozbeh@sharif.edu>
Subject: Re: [idn] ZWNJ
From: John C Klensin <klensin@jck.com>
Date: Mon, 30 Jul 2001 08:35:14 -0400
cc: idn@ops.ietf.org
--On Sunday, 29 July, 2001 15:46 +0430 Roozbeh Pournader
<roozbeh@sharif.edu> wrote:

> You're right about the identifier nature of DNS names. Being
> brought up in such a world, I'm already well familiar with the
> way this impacts the language. For a good example, see Arthur
> C Clarke's The Light of Other Days, ISBN 0812576403, where
> words like SearchEngine are common. The DNS and other
> identifier restrictions have changed the shape of the English
> language, for sure.

Conceptually, however, these are _new_ words and new
identifiers, albeit with funny spellings.  The association
between "SearchEngine" and "search engine" is mostly in the mind
of the observer.  It is certainly not in the DNS.

But, to the extent that preservation of those forms is an
advantage, an external-to-the-database process such as nameprep
throws them away.  Either nameprep would have to be reversible
(which seems impossible) or we would need to store the "raw"
names in the DNS, put nameprep in the servers, and make matching
comparisons based on
  nameprep(query) = nameprep(stored-raw-value)
or, worse, I suspect
  encoding(nameprep(query)) = nameprep(encoding(stored-raw-value))

Those approaches don't seem likely to be workable and certainly
would require massive protocol and server changes -- changes
which the WG has been trying to avoid and which it has been told
are out of its scope.
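
To make the first comparison concrete, here is a minimal sketch in
Python. The function toy_nameprep is a hypothetical stand-in (NFKC
normalization, case folding, and removal of the joiner controls); it
is not the actual nameprep profile, and server_side_match is an
illustrative name, not anything a DNS server implements.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def toy_nameprep(label: str) -> str:
    """Hypothetical stand-in for nameprep, NOT the real profile.

    Normalizes to NFKC, case-folds, and drops the joiner controls.
    """
    folded = unicodedata.normalize("NFKC", label).casefold()
    return "".join(ch for ch in folded if ch not in (ZWNJ, ZWJ))


def server_side_match(query: str, stored_raw: str) -> bool:
    """Compare nameprep(query) with nameprep(stored-raw-value)."""
    return toy_nameprep(query) == toy_nameprep(stored_raw)


# Persian "houses", with and without ZWNJ before the plural suffix.
HOUSES_RAW = "\u062e\u0627\u0646\u0647\u200c\u0647\u0627"
HOUSES_STRIPPED = HOUSES_RAW.replace(ZWNJ, "")

print(server_side_match("searchengine", "SearchEngine"))
print(server_side_match(HOUSES_STRIPPED, HOUSES_RAW))
```

The point of the sketch is that the raw form survives only because the
server keeps it and applies the mapping on both sides at comparison
time, which is exactly the server change described above.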

Perhaps that is part of the point you are trying to make.
 
> Getting back to the thread, Arabic lacks many of the
> possibilities of the Latin script, for getting a distinguished
> sense out of a sequence of letters (which we will call
> identifiers). I consider the use of ZWNJ to be equivalent to
> the use of inter-identifier capitalization. Just like that, it
> should be ignored, just like that, it will help the reader,
> and just like that, the original should be retrievable in some
> way.

I think I have understood this, although I haven't been
expressing it very well.  I think that almost every language has
one or another of these issues.  The FooBar convention, for
which I believe C and UNIX get the credit (or blame), rather than
the DNS, is really not satisfactory for English, although it is
a hack that sort of works there and not for Arabic.  The basic
Han-folding decisions of Unicode submerge distinctions among the
several languages using derivations of those characters.  And so
on.

> Please note that even in single words, ZWNJ is used. In many
> single words like the Persian words for "houses", "circular",
> "eraser", "compatriot", and "synonymous", or single-word names
> of places, it may not be dropped in any way, or the word
> becomes completely unreadable.

If it is an integral part of the word, such that the word
changes or loses meaning without it, then you, indeed, have a
strong case for its preservation.  I would think that what would
then be needed would be a simple rule (i.e., one that does not
require a program to understand Arabic or consult dictionaries)
to know when it can (and should) be removed and when not.

> Arabic is connected, unlike Latin where the letters are
> separate enough that you can sometimes omit the space (like in
> domain names, or German). It's also unlike Han, where there is
> a good boundary between the words, without even the need for
> spaces. So it should use spaces and ZWNJ heavily to stop
> joining where it will ruin the meaning or readability of the
> phrases. Please note that ZWNJ is somehow considered a
> *nothing* in the Unicode recommendation. It should only affect
> contextual shaping, and nothing else...

There is a now-traditional nasty comment about standards that
have their origins in printer drivers, but we are stuck with the
thing.

> While I see the use of space-like characters in Latin as
> problematic (mainly because of indistinguishability of the
> written word), the case is different with ZWNJ. It is not a
> space character.

But it seems to me that, despite the fact (if I have understood
you) that the same character is used in both situations, ZWNJ is
used both within words (your examples from Persian above) and to
separate words.  The "identifier" notion says that we don't get
phrases in identifiers in alphabetic languages.  Non-alphabetic
languages and character sets get lucky in this case, since there
doesn't seem to be any point in confining them to single "words"
rather than short phrases.

> BTW, there are also many other needs for being able to
> retrieve the original non-nameprepped name. Have you thought
> about national digit shapes (as used in Arabic and Indic
> scripts), for example? Many countries do not use European
> digits (which Europeans call Arabic).

Of course.  The many examples of this sort of thing, and more
flexible treatment for characters and separators that are not
part of the basic DNS model, are among the things that have
driven me to conclude that we need something with search-like,
rather than lookup, properties to handle international names
without doing violence to the underlying languages.  Any
canonicalization scheme, and any scheme that maps one character
into another, will lose information that would be desirable in
some context, and that isn't a good thing.
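
The digit case shows the information loss in a few lines.  The sketch
below is only an illustration: fold_digits is a hypothetical mapping
that folds every Unicode decimal digit to its European form, and
nothing here claims that nameprep performs this particular mapping.

```python
import unicodedata


def fold_digits(text: str) -> str:
    """Hypothetical mapping: replace any decimal digit (category Nd)
    with the ASCII digit of the same value; leave other characters."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            out.append(str(unicodedata.digit(ch)))
        else:
            out.append(ch)
    return "".join(out)


ARABIC_INDIC = "\u0661\u0662\u0663"  # Arabic-Indic digits one, two, three
PERSIAN = "\u06f1\u06f2\u06f3"       # Extended Arabic-Indic digits one, two, three
EUROPEAN = "123"

# Three distinct inputs collapse to a single output, so the original
# digit shapes cannot be recovered from the folded form.
folded = {fold_digits(s) for s in (ARABIC_INDIC, PERSIAN, EUROPEAN)}
print(len(folded))
```

Once the three spellings share one canonical form, no inverse function
can tell which of them the registrant actually typed.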

But, in the more limited situations you identify, it seems to me
that it would be possible to do a bit of new protocol work and
then adopt some conventions that would be of help.  Suppose, for
example, that we created a new RR type, specifically for
non-"Hosttable"-format labels, for which the "data" (target) was
a raw (no nameprep or other canonicalization or mappings)
Unicode string.  And suppose a convention were adopted such that
any non-traditional label was associated with one of these
records as well as whatever A, MX, AAAA, or equivalent RR it
pointed to (this might force us to make an exception to the "one
record only" rule for CNAME, but maybe that is just a minor
detail).

Now, the IDN WG could clearly not do this -- we would need to
get DNSEXT or some other group involved.  But, if the goal is to
be able to retrieve and examine the original name, then, it
seems to me that the way to do that is to store the original
name somewhere, rather than hoping that nameprep can both do its
job and be reversible.

   john