
Re: [idn] ToUnicode output can be longer than input



Edmon Chung <edmon@neteka.com> wrote:

> Right now, the ACE string provided is not valid because it contains
> characters beyond A-z, 0-9, -.

It is valid.  Some labels are ASCII and some are not.  Some labels are
ACE and some are not.  All four combinations are possible (ASCII ACE,
ASCII non-ACE, non-ASCII ACE, non-ASCII non-ACE).

An ACE label is formally defined as a label that ToUnicode would alter.
A (valid) internationalized label is formally defined as a label to
which ToASCII can be applied without failing.  It can be shown that all
ACE labels are (valid) internationalized labels.
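
As a rough illustration of those definitions, here is a minimal Python
sketch that classifies labels into the four combinations.  It is not a
reference implementation.  The helper names (to_unicode, is_ace,
classify) are hypothetical, the step numbers follow my reading of the
ToUnicode steps in the IDNA specification (RFC 3490), and the sample
labels are illustrative only.  Nameprep, ToASCII and the Punycode codec
come from Python's standard library.

```python
from encodings import idna

ACE_PREFIX = "xn--"

def to_unicode(label: str) -> str:
    """ToUnicode sketch: returns the input unchanged whenever a step fails."""
    try:
        # Steps 1-2: Nameprep is applied only to non-ASCII input.
        work = label if label.isascii() else idna.nameprep(label)
        # Step 3: without the ACE prefix there is nothing to decode.
        if not work.lower().startswith(ACE_PREFIX):
            return label
        # Steps 4-5: strip the prefix and run the Punycode decoder.
        decoded = work[len(ACE_PREFIX):].encode("ascii").decode("punycode")
        # Steps 6-7: the result must map back to the same ACE string.
        if idna.ToASCII(decoded).decode("ascii").lower() != work.lower():
            return label
        return decoded
    except UnicodeError:
        return label

def is_ace(label: str) -> bool:
    """An ACE label is one that ToUnicode would alter."""
    return to_unicode(label) != label

def classify(label: str) -> tuple:
    return ("ASCII" if label.isascii() else "non-ASCII",
            "ACE" if is_ace(label) else "non-ACE")

if __name__ == "__main__":
    samples = [
        "xn--bcher-kva",        # ASCII, ACE
        "example",              # ASCII, non-ACE
        "\uff58n--bcher-kva",   # fullwidth x: non-ASCII, ACE after Nameprep
        "b\u00fccher",          # non-ASCII, non-ACE
    ]
    for s in samples:
        print(ascii(s), classify(s))
```

The third sample is the interesting one: it is not ASCII, yet Nameprep
folds the fullwidth letter to a plain "x", so ToUnicode still decodes
it and the label counts as ACE.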

> I think it would be better to find an example that is a valid Punycode
> string that when ToUnicode is performed will exceed the number of
> codepoints of the original.

The Punycode decoder cannot output more code points than it inputs.
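
The reason is that every basic code point is copied through literally,
and every non-basic code point in the output costs at least one digit
of input.  A small Python check of that bound, using the standard
library's "punycode" codec; the function name and the sample strings
are only illustrative:

```python
def punycode_decode(encoded: str) -> str:
    """Decode a bare Punycode string (no ACE prefix) with the stdlib codec."""
    return encoded.encode("ascii").decode("punycode")

if __name__ == "__main__":
    for encoded in ["bcher-kva", "mnchen-3ya", "abc-"]:
        decoded = punycode_decode(encoded)
        # The decoder never emits more code points than it consumed.
        assert len(decoded) <= len(encoded)
        print(encoded, len(encoded), ascii(decoded), len(decoded))
```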

If the input of ToUnicode is ASCII, then Nameprep will not be applied,
and therefore the output of ToUnicode cannot contain more code points
than the input.  It's Nameprep that can cause strings to grow, not the
Punycode decoder.
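
To make the contrast concrete, here is a short Python sketch using the
nameprep and ToUnicode functions from the standard library's
encodings.idna module.  The helper names and the sample inputs are my
own choices for illustration, not examples from this thread.

```python
from encodings import idna

def nameprep_growth(text: str) -> int:
    """Number of code points that Nameprep adds to (or removes from) text."""
    return len(idna.nameprep(text)) - len(text)

def ascii_to_unicode_growth(label: str) -> int:
    """Growth of ToUnicode on ASCII input, where Nameprep is skipped."""
    assert label.isascii()
    return len(idna.ToUnicode(label)) - len(label)

if __name__ == "__main__":
    # Nameprep can lengthen a string: sharp s becomes "ss",
    # and the "fi" ligature becomes two letters.
    for text in ["\u00df", "\ufb01"]:
        print(ascii(text), "->", idna.nameprep(text), nameprep_growth(text))

    # With ASCII input, the output is never longer than the input.
    for label in ["xn--bcher-kva", "example"]:
        print(label, ascii_to_unicode_growth(label))
```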

AMC