
Re: [idn] ZWNJ




John,

You're right about the identifier nature of DNS names. Being brought up in
such a world, I'm already well familiar with the way this impacts the
language. For a good example, see Arthur C Clarke's The Light of Other
Days, ISBN 0812576403, where words like SearchEngine are common. The DNS
and other identifier restrictions have changed the shape of the English
language, for sure.

Getting back to the thread, Arabic script lacks many of the possibilities
that Latin script offers for making the sense of a sequence of letters
(which we will call identifiers) stand out. I consider the use of ZWNJ to
be equivalent to the use of capitalization inside identifiers. Like
capitalization, it should be ignored in comparison; like capitalization,
it helps the reader; and like capitalization, the original should be
retrievable in some way.
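
A minimal sketch of that idea, assuming a hypothetical registry (the names
comparison_key and Registry are illustrative only, not part of nameprep or
any IDN draft): ZWNJ and case are dropped when labels are matched, while
the form originally registered stays retrievable for display. The Persian
sample is meant to be the word for "houses", written with escapes so the
ZWNJ is visible in the source.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def comparison_key(label: str) -> str:
    """Hypothetical matching key: drop ZWNJ and fold case."""
    return label.replace(ZWNJ, "").casefold()

class Registry:
    """Toy registry that matches on the key but remembers the original."""

    def __init__(self) -> None:
        self._display = {}  # comparison key -> label as first registered

    def register(self, label: str) -> str:
        key = comparison_key(label)
        # Keep the first registered spelling as the display form.
        self._display.setdefault(key, label)
        return key

    def display_form(self, label: str):
        """Return the original spelling for any equivalent label, or None."""
        return self._display.get(comparison_key(label))

# Assumed sample: khane + ZWNJ + ha ("houses").
HOUSES = "\u062e\u0627\u0646\u0647" + ZWNJ + "\u0647\u0627"

registry = Registry()
registry.register("SearchEngine")
registry.register(HOUSES)

# A lookup without capitals or without ZWNJ still finds the original form.
print(registry.display_form("searchengine"))
print(registry.display_form(HOUSES.replace(ZWNJ, "")) == HOUSES)
```

The point of the design is that the readable form is never what gets
compared; it is only what gets shown back to the user.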

Please note that even in single words, ZWNJ is used. In many single words
like the Persian words for "houses", "circular", "eraser", "compatriot",
and "synonymous", or single-word names of places, it may not be dropped in
any way, or the word becomes completely unreadable.

Arabic is connected, unlike Latin where the letters are separate enough
that you can sometimes omit the space (like in domain names, or German).
It's also unlike Han, where there is a good boundary between the words,
without even the need for spaces. So it should use spaces and ZWNJ heavily
to stop joining where it will ruin the meaning or readability of the
phrases. Please note that ZWNJ is more or less considered a *nothing* in the
Unicode recommendation. It should only affect contextual shaping, and
nothing else...

While I see the use of space-like characters in Latin as problematic
(mainly because the written forms become indistinguishable), the case is
different with ZWNJ. It is not a space character.
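
As a rough illustration, the character properties can be inspected with
Python's standard unicodedata module. The describe helper below is a
hypothetical name used only for this sketch; it reports ZWNJ as a format
character (general category Cf) rather than a space separator (Zs).

```python
import unicodedata

ZWNJ = "\u200c"
SPACE = " "

def describe(ch: str) -> dict:
    """Collect a few Unicode properties of a single character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),  # e.g. "Cf" or "Zs"
        "is_space": ch.isspace(),
    }

zwnj_info = describe(ZWNJ)
space_info = describe(SPACE)
print(zwnj_info)
print(space_info)
```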

BTW, there are also many other needs for being able to retrieve the
original non-nameprepped name. Have you thought about national digit
shapes (as used in Arabic and Indic scripts), for example? Many countries
do not use European digits (which Europeans call Arabic).
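
To make the digit case concrete, here is a small sketch, assuming one
wanted to map national decimal digits to European ones for matching while
keeping the original string for display. The function name
to_european_digits is hypothetical; unicodedata.decimal is the standard
library call that returns the decimal value of a digit character.

```python
import unicodedata

def to_european_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII equivalent."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digits
        out.append(str(value) if value is not None else ch)
    return "".join(out)

# The year 2001 in three digit sets, written with escapes.
persian = "\u06f2\u06f0\u06f0\u06f1"      # Extended Arabic-Indic digits
arabic = "\u0662\u0660\u0660\u0661"       # Arabic-Indic digits
devanagari = "\u0968\u0966\u0966\u0967"   # Devanagari digits

for original in (persian, arabic, devanagari):
    # Keep the original next to the folded form so it can be shown again.
    print(repr(original), "->", to_european_digits(original))
```

All three fold to the same ASCII string, which is exactly why the
original has to be stored somewhere if it is ever to be displayed again.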

roozbeh

On Sat, 28 Jul 2001, John C Klensin wrote:

> DNS names are identifiers and, as identifiers, are subject to
> certain restrictions.  Length is one of them -- we even have
> words in English that are over 63 characters long, although there
> aren't many of them, and they can't be used in domain names.  A
> second is that, as identifiers, labels are what computer language
> syntax folks call "atoms" or "atomic".  That typically means
> "one label, one word".  There is a long history of pushing
> words together to make one DNS label and using either the one
> joining character available (hyphen) or just catenating them.
> In the latter case, we just hope the user will figure out what
> is going on to preserve whatever mnemonic value we intend.
>
> Parenthetically, this is one place where our colleagues with
> ideographic languages have a huge advantage: they can actually
> write multi-word phrases into DNS labels/identifiers without
> doing violence to the natural rules of the writing system.
>
> So I don't know quite what to do with ZWNJ and other separators
> or near-separators without opening the door to other characters
> normally used in other languages as near-separators or
> punctuation or near-punctuation, e.g., ":", "'", "!", or "&",
> which have been used, normally or artistically, in Indo-European
> languages using Roman-derived character sets for many years, and
> even recognizing distinct interpretations for some of the
> distinct spacing characters and hyphenation ones in Unicode.
>
> For example, while I would _strongly_ not recommend going down
> this path, we could, in principle, adopt presentation and coding
> rules that would permit, e.g.,
>
>    "O'Reilly & Associates"
>
> to use the domain name
>
>    www."O'Reilly & Associates".com
>
> by coding the key second label as
>    O(U+0027)Reilly(U+00A0)(U+0026)(U+00A0)Associates
>
> Again, I don't suggest doing this, but, ultimately, the DNS
> itself would have no problems with it and, if one starts
> introducing near-space characters from other scripts, then there
> is little justification for prohibiting this type of usage in
> Roman-based ones.
>
>      john
>