
Re: [idn] hostname history hell



Tim said:

> I think Eric's proposal may have legs. Or at least something along these
> lines. I agree with John's point that we need to start conservative and
> expand from that base-line. To be too inclusive at first means it would be
> nearly impossible to go back.
> 
> We must though be very careful not to inadvertently exclude
> scripts/characters that are used by some languages even though we thought
> they were merely symbols.

The list you are looking for is provided by the Unicode Consortium:

http://www.unicode.org/Public/UNIDATA/Scripts.txt

That gives script assignments for Unicode characters (Latin, Greek,
Cyrillic, Devanagari, Bengali, Han, ...), and provides guidance for
what not to leave out if you are simply trying to make a conservative
decision without leaving some languages essentially unrepresentable.
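
As a rough illustration of how that file can be consumed, here is a
minimal Python sketch. It assumes the usual UCD line layout
("XXXX..YYYY ; Script # comment"). The names parse_scripts, script_of,
SAMPLE, and RANGES are made up for this example, and the sample lines
are only an illustrative excerpt in the file's format, not an
authoritative copy of any particular version.

```python
# Illustrative excerpt in the Scripts.txt line format (not authoritative;
# real ranges vary by Unicode version).
SAMPLE = """\
# comment lines and blank lines are skipped

0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0915..0939    ; Devanagari # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
094D          ; Devanagari # Mn       DEVANAGARI SIGN VIRAMA
"""


def parse_scripts(lines):
    """Return a list of (first, last, script) tuples from Scripts.txt-style lines."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        span, script = (field.strip() for field in data.split(";"))
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point entry
        ranges.append((start, end, script))
    return ranges


def script_of(ch, ranges):
    """Return the script name for one character, or None if it is not listed."""
    cp = ord(ch)
    for start, end, script in ranges:
        if start <= cp <= end:
            return script
    return None


RANGES = parse_scripts(SAMPLE.splitlines())
```

With the real file in place of SAMPLE, script_of("A", RANGES) would
report Latin and the Devanagari virama would report Devanagari, which
is the kind of lookup a conservative inclusion list could be built on.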

Note that many scripts inherently include combining characters. I
absolutely agree with Kent that a blanket prohibition of combining
characters is unacceptable. In a discussion dominated by English,
Chinese, and Korean speaker/writers, it might seem o.k., but I
assure you that if there were as many Arabic, Urdu, Hindi, and
Bengali speaker/writers participating, it would *not* seem o.k.
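
To make the point concrete, here is a small Python sketch using the
standard unicodedata module. It treats any character whose General
Category starts with "M" as a combining mark; the helper names and the
choice of sample word are assumptions made for this illustration.

```python
import unicodedata

# The word "Hindi" written in Devanagari, spelled out as code points:
# HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
HINDI = "\u0939\u093f\u0928\u094d\u0926\u0940"


def is_mark(ch):
    """True if the character's General Category is Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")


def marks_in(text):
    """List the combining marks that appear in text, in order."""
    return [ch for ch in text if is_mark(ch)]


def strip_marks(text):
    """Show what a blanket ban on combining characters would leave behind."""
    return "".join(ch for ch in text if not is_mark(ch))
```

Half of the six characters in HINDI are marks, so strip_marks(HINDI)
leaves only the three bare consonants, which is no longer the word.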

Beyond that, deciding to omit punctuation, space characters, format
control characters, and symbols is fine as a conservative approach
to the problem.
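
A conservative filter along those lines might be sketched in Python as
follows. Mapping "punctuation, space characters, format control
characters, and symbols" onto the General Category values P*, Z*, Cf,
and S* is my assumption for this example, as is the function name;
note that it deliberately leaves letters and combining marks alone.

```python
import unicodedata

# Assumed mapping of the omitted classes onto General Category values:
# P* = punctuation, Z* = separators (spaces), S* = symbols, Cf = format controls.
EXCLUDED_MAJOR_CLASSES = {"P", "Z", "S"}
EXCLUDED_CATEGORIES = {"Cf"}


def is_conservatively_excluded(ch):
    """True if ch falls in one of the classes omitted by the conservative rule."""
    category = unicodedata.category(ch)
    return category[0] in EXCLUDED_MAJOR_CLASSES or category in EXCLUDED_CATEGORIES
```

Under this sketch "!" (punctuation), " " (space), U+200D (a format
control), and the euro sign (a symbol) are excluded, while Latin
letters, Han ideographs, and the Devanagari virama pass through.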

--Ken