[idn] ToUnicode output can be longer than input

To: IETF idn working group <idn@ops.ietf.org>
Subject: [idn] ToUnicode output can be longer than input
From: "Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord>
Date: Thu, 24 Apr 2003 20:45:53 +0000
Reply-to: IETF idn working group <idn@ops.ietf.org>
User-agent: Mutt/1.4i

The IDNA spec contains an incidental statement that was intended to be
helpful, in section 4.2:

    The ToUnicode output never contains more code points than its input.

Oops, that's not true, because Nameprep can cause strings to expand.
For example, consider the input:

x n - - fi fi - a ffl u e n t - s o u ffl - viii - u i c

The spaces are not really there; they just separate the clusters, each
of which represents a single code point.  The multi-letter clusters are
ligatures and a roman numeral: fi is U+FB01, ffl is U+FB04, and viii is
U+2177.  That's 24 code points.

ToUnicode would apply Nameprep (which expands the ligatures and the
roman numeral to their ASCII equivalents, giving
xn--fifi-affluent-souffl-viii-uic), strip the ACE prefix, then apply
the Punycode decoder, yielding:

fifi-affluent-soufflé-viii

(For the Latin-1 impaired, the non-ASCII character is e with an acute
accent.)  That's 26 code points.  26 > 24.
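
The counts can be checked mechanically.  What follows is only a minimal
sketch, assuming CPython's standard-library encodings.idna module; its
nameprep and ToUnicode helpers are that module's own implementation of
the steps, not anything defined by this message, and the variable names
are purely illustrative.

```python
from encodings import idna

# The 24-code-point input, with the single-code-point clusters written
# as escapes: U+FB01 (fi), U+FB04 (ffl), U+2177 (viii).
label = "xn--\uFB01\uFB01-a\uFB04uent-sou\uFB04-\u2177-uic"

# Nameprep alone: expands the ligatures and the roman numeral.
prepped = idna.nameprep(label)

# Full ToUnicode: Nameprep, strip the ACE prefix, Punycode-decode.
decoded = idna.ToUnicode(label)

print(len(label), len(prepped), len(decoded))   # 24 33 26
print(decoded)
```

The middle number is the length of the Nameprep output before the ACE
prefix is stripped and the Punycode decoder runs.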

So the statement needs to be removed or altered if/when the RFC is
revised.  It would be correct to say that the Punycode decoder cannot
output more code points than it inputs, but Nameprep can, and therefore
ToUnicode can.
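
To see the two halves of that claim separately, here is a rough sketch,
again assuming CPython (its built-in "punycode" codec and the nameprep
helper in encodings.idna).  It checks one example of each behaviour; it
is an illustration, not a proof of the general bound on the decoder.

```python
from encodings import idna

# Punycode decoder alone: the ASCII form is at least as long as the
# decoded form.
encoded = b"fifi-affluent-souffl-viii-uic"
decoded = encoded.decode("punycode")
assert len(decoded) <= len(encoded)
print(len(encoded), len(decoded))      # 29 26

# Nameprep alone: one ligature code point becomes three letters.
ligature = "\uFB04"
expanded = idna.nameprep(ligature)
print(len(ligature), len(expanded))    # 1 3
```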

AMC
