Re: [idn] ToUnicode output can be longer than input



Hi Adam,

----- Original Message -----
From: "Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord>
> For example, consider the input:
>
> x n - - fi fi - a ffl u e n t - s o u ffl - viii - u i c
>
> The spaces are not really there, they just indicate the clusters, which
> represent single code points (ligatures and roman numerals: U+FB01,
> U+FB04, U+2177).  That's 24 code points.

If I counted correctly, there are 33 "code points" in the above ACE
string. (I agree with your assessment, however; please see below. But the
example doesn't seem to illustrate your point...)

> The IDNA spec contains an incidental statement that was intended to be
> helpful, in section 4.2:
>
>     The ToUnicode output never contains more code points than its input.
>
> Oops, that's not true, because Nameprep can cause strings to expand.

I can understand this possibility.
Basically, if the Unicode form that Nameprep produces for one or more
characters in the string is longer than their form in the ACE input, and the
total excess across all characters in the string is more than 4 (compensating
for the "xn--" prefix), then the ToUnicode output will be longer than the input.
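
The counting can be checked with a short Python sketch. It is only a rough
illustration: it assumes NFKC normalization is an adequate stand-in for
Nameprep on this particular string (real Nameprep also does case mapping and
prohibited-character checks), and it uses the Punycode codec from the Python
standard library in place of the decoder step of ToUnicode. The variable
names are made up for the illustration.

```python
import unicodedata

# The example input: 24 code points, where U+FB01, U+FB04 and U+2177
# stand in for the clusters "fi", "ffl" and "viii".
raw = "xn--\ufb01\ufb01-a\ufb04uent-sou\ufb04-\u2177-uic"

# Assumption: NFKC approximates Nameprep for this string. The ligatures and
# the roman numeral expand into several ASCII letters each.
prepped = unicodedata.normalize("NFKC", raw)

# Strip the ACE prefix, then run the Punycode decoder on the remainder.
decoded = prepped[len("xn--"):].encode("ascii").decode("punycode")

# Code point counts: original input, after normalization, after decoding.
print(len(raw), len(prepped), len(decoded))
print(decoded)
```

Under these assumptions the first count is 24 (the input as given) and the
second is 33 (the string after the ligatures and the numeral expand), which
would account for both figures. The third count is the one that matters for
the statement in section 4.2: if it comes out above 24, the output is longer
than the input.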

> So the statement needs to be removed or altered if/when the RFC is
> revised.  It would be correct to say that the Punycode decoder cannot
> output more code points than it inputs, but Nameprep can, and therefore
> ToUnicode can.

Seems reasonable.

Edmon