Combining characters (was: Re: [idn] hostname historyhell)



--On Wednesday, 21 November, 2001 09:08 -0800 Kenneth Whistler
<kenw@sybase.com> wrote:

>> We must though be very careful not to inadvertently exclude
>> scripts/characters that are used by some languages even
>> though we thought they were merely symbols.
> 
> The list you are looking for is provided by the Unicode
> Consortium:
> 
> http://www.unicode.org/Public/UNIDATA/Scripts.txt
> 
> That gives script assignments for Unicode characters (Latin,
> Greek, Cyrillic, Devanagari, Bengali, Han, ...), and provides
>...
> Note that many scripts inherently include combining
> characters. I absolutely agree with Kent that a blanket
> prohibition of combining characters is unacceptable. In a
> discussion dominated by English, Chinese, and Korean
> speaker/writers, it might seem o.k., but I assure you that if
> there were as many Arabic, Urdu, Hindi, and Bengali
> speaker/writers participating, it would *not* seem o.k.

Ken, I may not have been reading closely enough, but I don't
believe this discussion has included a proposal to ban combining
characters.  I do have an issue with them, but I think it is
separable (see below).

> Otherwise, deciding to omit punctuation, space characters,
> format control characters, and symbols is fine as a
> conservative approach to the problem, however.

Good to hear this.

The combining character problem (if it is a problem) is that, so
far, we have no proposals on the table that would require that a
DNS label be a valid name in any particular language, or even
that it be drawn, homogeneously, from any particular script.
Until and unless one of those rules is made (my guess is that it
would be nearly impossible to do so, but this is not my area of
expertise), we are thrown back on the traditional DNS rule that,
subject to the hyphen-placement rule, any valid character of the
chosen CCS can appear in any relationship to any other valid
character of the CCS.  In particular, there is no way to require
or assume script-homogeneity.
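
To make the traditional rule concrete, here is a minimal sketch, not a normative definition. It assumes the CCS is passed in as a plain set of characters and that "hyphen-placement rule" means no leading or trailing hyphen; the name is_valid_label is hypothetical. Length limits and any encoding step are deliberately left out.

```python
import string


def is_valid_label(label, ccs):
    """Sketch of the traditional rule: CCS membership plus hyphen placement.

    ASSUMPTION: `ccs` is a set of permitted characters, and the hyphen rule
    is "no hyphen as the first or last character". Nothing here looks at
    which script a character belongs to, or at its neighbours.
    """
    if not label:
        return False
    if label[0] == "-" or label[-1] == "-":
        return False
    # Every character is judged on its own; position and context are ignored.
    return all(ch in ccs for ch in label)


# Illustrative CCS: the familiar letters-digits-hyphen repertoire.
LDH = set(string.ascii_lowercase + string.digits + "-")

print(is_valid_label("a-b", LDH))   # hyphen in the middle
print(is_valid_label("-ab", LDH))   # leading hyphen
```

The point of the sketch is what it lacks: once a larger CCS is substituted for LDH, nothing in a check of this shape can require that neighbouring characters come from the same script.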

If, to use your example, we have a selection of Arabic, Urdu,
Hindi, and Bengali, with characters from each script designated
by the script's first letter, then
   AUHBBHUA
ought to be a valid label.  While we know how to construct AAAA,
UUUU, HHHH, and BBBB, regardless of whether a given character is
combining or non-combining, I worry about interpretation and
ambiguity if combining characters (or partial breaks, etc.) are
taken from one of these scripts and surrounded by characters
from an unrelated script.  Maybe it is not a problem, but I'd
like someone to assure me that is the case.
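
The worry above can at least be made detectable. The following is a rough sketch only: Python's standard unicodedata module does not expose the Unicode Script property, so the first word of the character name is used as a crude stand-in for the script (an assumption, and one that mislabels the generic "COMBINING ..." marks). The names script_proxy and cross_script_marks are hypothetical.

```python
import unicodedata


def script_proxy(ch):
    # ASSUMPTION: the first word of the character name approximates the
    # script ("ARABIC", "DEVANAGARI", "BENGALI", ...). This is NOT the real
    # Script property from Scripts.txt.
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def is_combining(ch):
    # General categories Mn, Mc and Me are the combining marks.
    return unicodedata.category(ch).startswith("M")


def cross_script_marks(label):
    """Return (index, mark, base) for each mark whose base looks foreign.

    `base` is the nearest preceding non-mark character, or None when the
    label starts with a mark.
    """
    findings = []
    base = None
    for i, ch in enumerate(label):
        if is_combining(ch):
            if base is None or script_proxy(ch) != script_proxy(base):
                findings.append((i, ch, base))
        else:
            base = ch
    return findings


# ARABIC LETTER BEH followed by DEVANAGARI VOWEL SIGN AA: flagged.
print(cross_script_marks("\u0628\u093e"))
# DEVANAGARI LETTER KA followed by DEVANAGARI VOWEL SIGN AA: nothing flagged.
print(cross_script_marks("\u0915\u093e"))
```

This only locates the suspicious sequences; whether they are actually ambiguous when rendered or compared is the question that still needs an answer from someone who knows the scripts.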

     john