
Re: Combining characters (was: Re: [idn] hostname historyhell)



John Klensin said:

> > Note that many scripts inherently include combining
> > characters. I absolutely agree with Kent that a blanket
> > prohibition of combining characters is unacceptable. In a
> > discussion dominated by English, Chinese, and Korean
> > speaker/writers, it might seem o.k., but I assure you that if
> > there were as many Arabic, Urdu, Hindi, and Bengali
> > speaker/writers participating, it would *not* seem o.k.
> 
> Ken, I may not have been reading closely enough, but I don't
> believe this discussion has included a proposal to ban combining
> characters.  I do have an issue with them, but I think it is
> separable (see below).

Kent and I were responding to Eric Hall's first proposal for 
a "safe set":

> How about this as a believable compromise: We start with a "safe set" of
> alphanumeric characters and specifically exclude punctuation, spacing,
> symbols, and combining characters.

John continued:

> The combining character problem (if it is a problem) is that, so
> far, we have no proposals on the table that would require that a
> DNS label be a valid name in any particular language, or even
> that it be drawn, homogeneously, from any particular script.
> Until and unless one of those rules is made (my guess is that it
> would be nearly impossible to do so, but this is not my area of
> expertise), 

It is impossible to do so, and hopeless to attempt. The concept of
what is "a valid name in any particular language" is hopelessly
compromised by borrowing of terminology and names back and forth. Is
"bach.org" a German name or an English name? How about "alliance.com"
versus "alliance.fr" (both exist) -- one is English and one is
French. Cross-script names are less common, but you do run into
orthographies that borrow one or more letters from another script
into their own alphabets, and the status of such borrowed letters
changes over time.

> we are thrown back on the traditional DNS rule that,
> subject to the hyphen-placement rule, any valid character of the
> chosen CCS can appear in any relationship to any other valid
> character of the CCS.  In particular, there is no way to require
> or assume script-homogeneity.

Correct. Nor would you want to.

And this would be a problem even if you limited the IDN to,
for example, 8859-5 (Latin/Cyrillic) -- an 8-bit CCS with no
combining characters. This isn't a problem introduced by
Unicode -- it is just more obvious now because Unicode has so
*many* scripts in it.

And I don't think that combining characters have anything to
do with it.
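
To make the 8859-5 point concrete, here is a minimal Python sketch. The
helper name scripts_in_label and the sample label are made up for
illustration, and guessing a script from the first word of a character's
Unicode name is only a rough heuristic (digits and hyphens, for example,
come back as "DIGIT" and "HYPHEN-MINUS"). It shows a three-byte 8859-5
label that already mixes two scripts while containing no combining
characters at all.

```python
import unicodedata

def scripts_in_label(raw, encoding="iso8859_5"):
    """Rough script guess: first word of each character's Unicode name."""
    text = raw.decode(encoding)
    return {unicodedata.name(ch).split()[0] for ch in text}

# Hypothetical label: Latin "c", Cyrillic small a (byte 0xD0), Latin "t".
label = b"c\xd0t"

print(scripts_in_label(label))
# No character in the label is a combining character.
print(any(unicodedata.combining(ch) for ch in label.decode("iso8859_5")))
```

The label looks like "cat" on screen, yet it is script-mixed, which is
the homogeneity issue arising in an 8-bit CCS without Unicode.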

> 
> If, to use your example, we have a selection of Arabic, Urdu,
> Hindi, and Bengali, with characters from each script designated
> by its first character,  
>    AUHBBHUA
> ought to be a valid label.  While we know how to construct AAAA,
> UUUU, HHHH, and BBBB, regardless of whether a given character is
> combining or non-combining, I worry about interpretation and
> ambiguity if combining characters (or partial breaks, etc.) are
> taken from one of these scripts and surrounded by characters
> from an unrelated script.  Maybe it is not a problem, but I'd
> like someone to assure me that is the case.

It is not a problem.

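If you have AUHBBHUA, which would, by the way, actually be

            AADBBDAA (where A=Arabic script, used for both Arabic and Urdu;
                      D=Devanagari, used for Hindi; B=Bengali script)
            12345678   (logical order)
and which would (assuming right-to-left context) display as:
            AADBBDAA
            <<>>>><<   (resolved direction of each character)
            87345621   (logical position of each displayed character)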

the presence of any combining characters amongst the various characters
in the string could influence the resolved direction of some characters,
and hence their display, but would not result in any ambiguities in
character *identity* or the string itself. The label is simply defined
by the underlying sequence of characters in their logical order.
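
The reordering in the diagram above can be reproduced with a small
Python sketch. This is a heavily simplified illustration, not the full
Unicode bidirectional algorithm: it assumes a right-to-left paragraph,
accepts only strong characters plus nonspacing marks, and lets each
nonspacing mark take the level of the character before it. The function
name visual_order_rtl and the sample characters are chosen for
illustration only.

```python
import unicodedata

def visual_order_rtl(label):
    """Return 1-based logical positions in left-to-right display order.

    Simplified: right-to-left paragraph, strong characters and
    nonspacing marks only; everything else is rejected.
    """
    levels = []
    for ch in label:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in ("R", "AL"):
            levels.append(1)            # right-to-left character
        elif bidi_class == "L":
            levels.append(2)            # left-to-right run embedded in RTL
        elif bidi_class == "NSM":
            # A combining mark follows its base (level 1 at the very start).
            levels.append(levels[-1] if levels else 1)
        else:
            raise ValueError(f"unsupported bidi class {bidi_class!r}")

    order = list(range(len(label)))
    # Reverse each maximal run at or above a level, highest level first.
    for level in (2, 1):
        i = 0
        while i < len(order):
            j = i
            while j < len(order) and levels[order[j]] >= level:
                j += 1
            if j > i:
                order[i:j] = order[i:j][::-1]
                i = j
            else:
                i += 1
    return [k + 1 for k in order]

# Arabic, Arabic, Devanagari, Bengali, Bengali, Devanagari, Arabic, Arabic
label = "\u0627\u0628\u0915\u0995\u0996\u0916\u062c\u062f"
print("".join(map(str, visual_order_rtl(label))))   # 87345621
```

Only the display order is computed here. The label itself is still the
same eight code points in logical order, so comparing two labels is a
plain comparison of those sequences, whatever they look like on screen.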

Bidi can get you into visual contortions, but that is the result of
the bidi itself, and not of the presence of combining characters amidst
the bidi. You'd have the same problem with just Hebrew (no points) and
Latin (no accents).

--Ken

> 
>      john