Re: [idn] IRIs ought to use internationalized *host* names
----- Original Message -----
From: "Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord>
> The Unicode character database classifies each character as belonging to
> exactly one of the following broad classes:
>
> L: letter
> M: mark
> N: number
> P: punctuation
> S: symbol
> Z: separator
> C: other
May I add this?
U: unassigned code points.
>
> We can start by examining which of these classes of ASCII characters are
> allowed in ASCII host labels.
>
> L: 52 exist, all are allowed
> M: 0 exist
> N: 10 exist, all are allowed
> P: 23 exist, only hyphen-minus is allowed
> S: 9 exist, none are allowed
> Z: 1 exists, it is not allowed
> C: 33 exist, none are allowed
U: indefinite, all are allowed.
>
> We can trivially extend these results to form a simple rule covering the
> entire Unicode repertoire, except that we have no precedent for class
> M. Since characters in class M tend to be things like diacritics, they
> should be allowed. So the proposed rule is:
>
> All characters in classes L (letter), M (mark), and N (number) are
> allowed, and U+002D (hyphen-minus) is also allowed. Everything else is
> forbidden.
U should also be allowed, in addition to L, M, and N.
But in later versions of Unicode, U may be partitioned into L' through C' and a smaller U'.
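
A minimal sketch of the rule, for illustration only. It assumes Python's standard unicodedata module as the source of character classes, and it treats general category Cn as the U class discussed above. The function names and the allow_unassigned switch are hypothetical, not part of any draft.

```python
import unicodedata
from collections import Counter

HYPHEN_MINUS = "\u002d"


def broad_class(ch: str) -> str:
    """Return the broad class letter of one character.

    Assumption: unassigned code points (general category Cn) are reported
    as a separate class "U" instead of being folded into "C".
    """
    category = unicodedata.category(ch)
    if category == "Cn":
        return "U"
    return category[0]


def is_allowed(ch: str, allow_unassigned: bool = False) -> bool:
    """Apply the proposed rule: L, M, N and U+002D are allowed.

    With allow_unassigned=True, class U is allowed as well.
    """
    if ch == HYPHEN_MINUS:
        return True
    cls = broad_class(ch)
    if cls in ("L", "M", "N"):
        return True
    return allow_unassigned and cls == "U"


def label_allowed(label: str, allow_unassigned: bool = False) -> bool:
    """Check every character of a host label against the rule."""
    return all(is_allowed(ch, allow_unassigned) for ch in label)


def ascii_class_counts() -> Counter:
    """Count the ASCII code points (0 to 127) per broad class."""
    return Counter(broad_class(chr(cp)) for cp in range(128))
```

Which code points fall into U depends on the Unicode version the library was built with (unicodedata.unidata_version), so a label rejected today for containing a U character may be classified differently after a later version assigns that code point.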
Soobok Lee