Re: [idn] length restrictions on IDN label
Soobok Lee <lsb@postel.co.kr> wrote:
> I have a punycode label of length 63 octets:
> L1: zq--o39AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>
> L2=ToUnicode(L1) produces: U+AC00 x 56 times ( Hangul "KA" repeated 56 times)
>
> But this L2 can be encoded in various unicode/legacy encodings into
> various lengths of octets:
>
> UTF8 : 3 x 56 = 168 octets
> UCS2 : 2 x 56 = 112 octets
> UCS4 : 4 x 56 = 224 octets
> KSX1001/EUC-KR : 2 x 56 = 112 octets
>
> Many internet applications impose or assume the 63-octet limit on
> label lengths.
IDN-unaware applications use this simple 63-octet limit. These
applications also assume that the domain label is ASCII. IDN-aware
applications will be careful to use the ASCII form when talking
to IDN-unaware applications. Applications that use non-ASCII
representations will know the more complex syntax rule for non-ASCII
labels (namely, that the label is valid if and only if ToASCII can be
applied to it without failing).
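
As a rough illustration of that rule, here is a minimal sketch using the
ToASCII implementation in Python's standard library. Note the assumptions:
Python implements the final IDNA specification (RFC 3490) with the "xn--"
ACE prefix, not the draft-era "zq--" prefix quoted above (both are four
characters long), and the helper name is_internationalized_label is
hypothetical.

```python
from encodings import idna


def is_internationalized_label(label: str) -> bool:
    """Return True if ToASCII can be applied to the label without failing.

    Hypothetical helper; relies on Python's encodings.idna.ToASCII,
    which raises UnicodeError when the label is rejected (for example,
    when the resulting ASCII form is empty or longer than 63 octets).
    """
    try:
        idna.ToASCII(label)
    except UnicodeError:
        return False
    return True


ka = "\uac00"  # Hangul syllable "KA"

# 56 repetitions: ASCII form is 4 (prefix) + 59 (Punycode) = 63 octets.
print(is_internationalized_label(ka * 56))

# 57 repetitions: ASCII form would be 64 octets, so ToASCII fails.
print(is_internationalized_label(ka * 57))
```

The check never looks at the UTF-8 or UCS-2 length of the label; only the
length of the ASCII form decides the outcome.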
> From an implementer's point of view, a more precise specification is
> needed about whether IDN labels/FQDNs have *NEW* length restrictions in
> various character encodings.
Section 2 defines "internationalized label" as a label to which the
ToASCII operation can be applied without failing. There is no other
restriction on IDN label syntax.
> Implementors have a practical, security-related need to impose some
> limits on IDN labels in non-ACE encodings (for example, to avoid
> buffer overflow errors due to expanded ToUnicode labels).
That's true. A cursory examination of the Punycode algorithm reveals
that each ASCII character can represent at most one code point;
therefore an internationalized label can represent at most 63 code
points, whether it's ACE or not (fewer in practice, because the ACE
prefix takes up part of the 63 octets). A given encoding uses a bounded
number of octets per code point, so you can allocate your buffers based
on that.
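
The buffer arithmetic can be sketched as follows. This is only an
illustration: the table of per-code-point maxima and the function name
worst_case_buffer_size are my own assumptions for the example, and the
encodings are named by their Python codec names.

```python
# Upper bound on code points in one internationalized label, taken from
# the argument above (one ASCII character encodes at most one code point).
MAX_LABEL_CODE_POINTS = 63

# Maximum octets one code point can occupy in each encoding (assumed
# table for illustration; extend it for other encodings as needed).
MAX_OCTETS_PER_CODE_POINT = {
    "utf-8": 4,      # 1 to 4 octets per code point
    "utf-16-be": 4,  # 2 octets, or 4 for a surrogate pair
    "utf-32-be": 4,  # always 4 octets
    "euc_kr": 2,     # 1 or 2 octets
}


def worst_case_buffer_size(encoding: str) -> int:
    """Octets that are always enough to hold one label in `encoding`."""
    return MAX_LABEL_CODE_POINTS * MAX_OCTETS_PER_CODE_POINT[encoding]


label = "\uac00" * 56  # the label from this thread
for encoding in MAX_OCTETS_PER_CODE_POINT:
    needed = len(label.encode(encoding))
    # The actual size never exceeds the worst-case bound.
    print(encoding, needed, worst_case_buffer_size(encoding))
```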
> The unit of the length restriction matters: number of code points or
> number of octets? That should be made clearer. RFC1035 uses "octets",
> not characters or code points.
RFC 1035 limits domain labels to 63 octets, but RFC 1035 predates IDNA,
and it speaks under the explicit assumption that text is ASCII. Because
DNS is IDN-unaware, all internationalized labels in DNS are in their
ASCII forms. For these reasons, the 63-octet limit applies only to the
ASCII forms of internationalized labels.
IDNA does not introduce any new length restrictions. The 63-octet limit
on ASCII labels is the only length restriction on internationalized
labels.
> Then, is U+AC00 x 56 times (in my previous posting) a valid label
> conforming to RFC1035?
No, it's not, and that's why IDNA requires that it be converted to its
ASCII form before being passed into an IDN-unaware protocol like DNS.
> Are UTF8-encoded IDN labels not governed by RFC1035 length
> restrictions?
Not directly. The 63-octet limit applies to the ASCII form, not the
UTF-8 form. It would be absurd to apply the 63-octet limit to every
possible encoding form. You'd have to transcode a label into every
possible encoding just to check whether it's valid.
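
To make the point concrete, here is a small sketch that reproduces the
octet counts quoted earlier in this thread. The function name
octet_lengths is hypothetical, UCS2 and UCS4 are approximated by
Python's "utf-16-be" and "utf-32-be" codecs (equivalent for this label,
which contains no supplementary characters), and the ACE length assumes
a four-character prefix.

```python
ACE_PREFIX_LENGTH = 4  # "zq--" in the draft quoted here, "xn--" in RFC 3490


def octet_lengths(label: str) -> dict:
    """Return the encoded length of `label` in several encodings."""
    lengths = {
        "UTF8": len(label.encode("utf-8")),
        "UCS2": len(label.encode("utf-16-be")),
        "UCS4": len(label.encode("utf-32-be")),
        "EUC-KR": len(label.encode("euc_kr")),
    }
    # The ASCII form is the ACE prefix followed by the Punycode encoding.
    lengths["ACE"] = ACE_PREFIX_LENGTH + len(label.encode("punycode"))
    return lengths


for name, length in octet_lengths("\uac00" * 56).items():
    print(f"{name}: {length} octets")
```

Only the ACE figure is compared against the 63-octet limit; the other
four lengths play no part in deciding whether the label is valid.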
> Does IDNA contain brand-new length restrictions for 8-bit labels that
> obsolete RFC1035?
No, it contains no new length restrictions. The RFC 1035 restriction
on the ASCII form is still the only restriction on the length of
internationalized labels.
AMC