
[idn] ToUnicode output can be longer than input



The IDNA spec contains an incidental statement that was intended to be
helpful, in section 4.2:

    The ToUnicode output never contains more code points than its input.

Oops, that's not true, because Nameprep can cause strings to expand.
For example, consider the input:

x n - - fi fi - a ffl u e n t - s o u ffl - viii - u i c

The spaces are not really there; they just separate the clusters, each
of which represents a single code point (the ligatures U+FB01 and
U+FB04, and the roman numeral U+2177).  That's 24 code points.

ToUnicode would apply Nameprep (which expands the ligatures and the
roman numeral to their ASCII equivalents), strip the ACE prefix, then
apply the Punycode decoder, yielding:

fifi-affluent-soufflé-viii

(For the Latin-1 impaired, the non-ASCII character is e with an acute
accent.)  That's 26 code points.  26 > 24.
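
For anyone who wants to check the arithmetic, here is a rough Python
sketch of the example above.  It is not a conforming ToUnicode: it
assumes that NFKC normalization alone is an adequate stand-in for
Nameprep on this particular input, and it uses Python's built-in
"punycode" codec for the decoding step.  The names to_unicode_sketch
and ACE_PREFIX are made up for illustration.

```python
import unicodedata

ACE_PREFIX = "xn--"

def to_unicode_sketch(label: str) -> str:
    """Rough stand-in for ToUnicode: normalize, strip prefix, decode."""
    # Assumption: NFKC alone approximates Nameprep for this input
    # (real Nameprep also maps, case-folds, and checks prohibited
    # code points).
    prepped = unicodedata.normalize("NFKC", label)
    if not prepped.startswith(ACE_PREFIX):
        return label  # not an ACE label, so return the input unchanged
    # Strip the ACE prefix and run the Punycode decoder on the rest.
    return prepped[len(ACE_PREFIX):].encode("ascii").decode("punycode")

# The 24-code-point input: U+FB01 (fi), U+FB04 (ffl), U+2177 (viii).
label = "xn--\ufb01\ufb01-a\ufb04uent-sou\ufb04-\u2177-uic"
result = to_unicode_sketch(label)

print(len(label), len(result))  # prints: 24 26
```

The expansion happens entirely in the normalization step; the decoder
itself only ever sees the already-expanded ASCII string.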

So the statement needs to be removed or altered if/when the RFC is
revised.  It would be correct to say that the Punycode decoder cannot
output more code points than it inputs, but Nameprep can, and therefore
ToUnicode can.

AMC