Re: [idn] question about cidnuc
- To: idn@ops.ietf.org
- Subject: Re: [idn] question about cidnuc
- From: Paul Hoffman / IMC <phoffman@imc.org>
- Date: Fri, 10 Mar 2000 07:46:54 -0800
- Delivery-date: Fri, 10 Mar 2000 07:47:13 -0800
- Envelope-to: idn-data@psg.com
>i made up two examples for the first two cases.
>are they correct?
>
> 1) no compression: 0x0061 1100 1162
> 2) compressed/one-octet header : 0x1100 1162 -> 0x 11 00 62
> 3) compressed/two-octet header: examples???
This is not correct. There is only one way to encode any input, as required
by the IDN requirements document. In cidnuc, section 2.4.1, Step 1 says
that all the upper octets *must* match in order to use the greater
compression. In the case above, 0x00 does not match 0x11. Thus, the output
of the compression step is 0xD8006111001162.
Yes, that's not a compression, but a slight expansion. It is not expected
that many names will contain letters from widely-disparate scripts, but
even those that do suffer only a one-octet expansion for the whole name.
If the example above were instead 0x1161 1100 1162 (that is, all from Korean
Hangul Jamo), which is more likely, the compressed string would be 0x11610062.
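
To make the two cases above concrete, here is a minimal Python sketch of the
compression step as described in this message only. The function name
compress_sketch is hypothetical, and the sketch assumes just two outcomes:
a one-octet header holding the shared upper octet, or a 0xD8 header followed
by the uncompressed 16-bit values. It does not cover the two-octet header
mode mentioned in the question, nor any other rule in section 2.4.1 of the
draft, so treat it as an illustration rather than the algorithm itself.

```python
def compress_sketch(code_units):
    """Illustrative only: mirrors the two examples in this message.

    code_units is a list of 16-bit values, e.g. [0x1161, 0x1100, 0x1162].
    Assumption: inputs whose shared upper octet is itself 0xD8 are not
    handled specially here.
    """
    upper_octets = {unit >> 8 for unit in code_units}

    if len(upper_octets) == 1:
        # Every upper octet matches: emit it once, then only the lower octets.
        shared = upper_octets.pop()
        return bytes([shared] + [unit & 0xFF for unit in code_units])

    # Upper octets differ: emit the 0xD8 header, then every octet unchanged.
    out = bytearray([0xD8])
    for unit in code_units:
        out += unit.to_bytes(2, "big")
    return bytes(out)


mixed = compress_sketch([0x0061, 0x1100, 0x1162])
jamo = compress_sketch([0x1161, 0x1100, 0x1162])
print(mixed.hex().upper())  # D8006111001162
print(jamo.hex().upper())   # 11610062
```

The mixed-script input grows from six octets to seven, while the all-Jamo
input shrinks from six octets to four.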
Just to be clear, the compression algorithm doesn't do much for short
strings. The purpose is for long strings that might hit the 63-character
limit after encoding with Base64. The script you gave, Hangul Jamo, is a
prime example of where cidnuc's compression helps. In downcasing UTF-8, the
limit for Hangul Jamo is 8 characters; in UTF-5, it is 15 characters; in
cidnuc, it is 37 characters.
--Paul Hoffman, Director
--Internet Mail Consortium