Re: [idn] Using a new class for IDN
Dan Oscarsson <Dan.Oscarsson@trab.se> wrote:
> - The count of characters that can fit into 63 octets differ when
> using ACE-names and native UCS-names.
True. As an extreme example, consider a label consisting of many
repetitions of the same character outside plane 0. UTF-8, UTF-16, and
UTF-32 all use 4 octets per character, while Punycode uses about 1.
As an extreme example the other way, consider a label consisting of
random characters from plane 0. UTF-16 uses 2 octets per character,
while Punycode uses about 3.5.
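For anyone who wants to check these figures, here is a rough sketch using the codecs in Python's standard library. The helper name octets and the two sample labels are my own choices for illustration; the second label is a hand-picked stand-in for "random" plane 0 characters, so its Punycode length will only approximate the figure above.

```python
def octets(label):
    """Return the encoded length of `label`, in octets, for each encoding."""
    return {
        "utf-8": len(label.encode("utf-8")),
        "utf-16": len(label.encode("utf-16-be")),   # -be variant: no byte-order mark
        "utf-32": len(label.encode("utf-32-be")),
        "punycode": len(label.encode("punycode")),  # bare Punycode, without the ACE prefix
    }

# Many repetitions of one character outside plane 0 (U+10400).
repeated = "\U00010400" * 20

# Assumption: a few scattered plane 0 characters standing in for random ones.
scattered = "\u0101\u4e2d\u0e01\u20ac\uac00\u05d0\u3042\u0416"

for name, label in (("repeated", repeated), ("scattered", scattered)):
    print(name, len(label), "chars:", octets(label))
```

The repeated label costs 4 octets per character in all three UTF forms, while Punycode pays for the large code point once and then spends roughly one octet on each repetition.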
> To make things easier for the future, IDNA should require that the IDN
> in the ToUnicode form must not be longer than 63 octets.
ToUnicode does not output octets; it outputs code points. Which
encoding form did you have in mind: UTF-8, UTF-16, or UTF-32?
UTF-32 is always at least as large as UTF-16, sometimes larger, so I'll
assume you don't want that one.
If you go with UTF-16, then all existing ASCII labels over 31 characters
become retroactively invalid, which seems very bad.
If you go with UTF-8, then Indian scripts can fit only 21 characters per
label, versus about 40 for ACE. It seems a shame to halve the limit
for a billion users. I'd really rather not.
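The 31 and 21 figures are just integer division of the 63-octet limit by the per-character cost. A minimal sketch, assuming every character in the label costs the same number of octets (max_chars is a name I made up, and U+0915 DEVANAGARI LETTER KA is an arbitrary sample character):

```python
LIMIT = 63  # maximum label length in octets

def max_chars(octets_per_char, limit=LIMIT):
    """Largest number of equal-cost characters that fit within `limit` octets."""
    return limit // octets_per_char

# ASCII letters cost 2 octets each in UTF-16.
ascii_in_utf16 = max_chars(len("a".encode("utf-16-be")))

# U+0915 costs 3 octets in UTF-8, as do the other Devanagari letters.
devanagari_in_utf8 = max_chars(len("\u0915".encode("utf-8")))

print(ascii_in_utf16, devanagari_in_utf8)  # prints: 31 21
```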
AMC