Re: [idn] surrogates in draft-ietf-idn-nameprep
- To: idn@ops.ietf.org
- Subject: Re: [idn] surrogates in draft-ietf-idn-nameprep
- From: Paul Hoffman / IMC <phoffman@imc.org>
- Date: Tue, 15 Aug 2000 20:26:49 -0700
- Delivery-date: Tue, 15 Aug 2000 20:32:17 -0700
- Envelope-to: idn-data@psg.com
At 6:38 AM +1000 8/16/00, Frank Ernens wrote:
>Section 3.7.2 says
>
>> So far, all proposals for binary encodings of internationalized name
>> parts have specified UTF-8 as the encoding format. In such an encoding,
>> surrogate characters MUST NOT be used. Therefore, for UTF-8 encodings,
>> the following are prohibited:
>>
>> D800-DFFF [SURROGATE CHARACTERS]
>
>This is incorrect. A pair of surrogates corresponds to a character in
>the 31-bit ISO 10646 code space, and according to RFC2044 anything
>up to 2**31 - 1 can be encoded in UTF-8. Simply transform the
>UCS-2 to UCS-4 and then into UTF-8.
You may have misunderstood the draft: it deals with character code
points, and no encoding is assumed for the input. Surrogate code
points only make sense when using the UTF-16 encoding.
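
To illustrate the distinction between code points and encodings, here
is a minimal sketch in Python. It is not taken from the draft. The
helper names combine_surrogates and is_surrogate are hypothetical and
chosen only for this example. It shows that a UTF-16 surrogate pair
stands for a single code point outside plane 0, and that this code
point, not the two surrogate values, is what gets encoded in UTF-8.

```python
def is_surrogate(cp: int) -> bool:
    """Return True if cp lies in the surrogate range D800-DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 surrogate pair to the single code point it represents.

    Hypothetical helper for illustration; raises ValueError if the two
    values are not a high surrogate followed by a low surrogate.
    """
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair D800 DC00 is the UTF-16 form of the code point U+10000.
code_point = combine_surrogates(0xD800, 0xDC00)

# UTF-8 encodes the resulting code point directly, as four bytes.
utf8_bytes = chr(code_point).encode("utf-8")

# A lone surrogate value is not a character, and Python's strict
# UTF-8 codec refuses to encode it.
try:
    chr(0xD800).encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True
```

Under these assumptions, a check on code points only needs to reject
values for which is_surrogate returns True. Characters outside plane 0
are unaffected, because they are identified by their own code points
rather than by surrogate values.
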
>What might have been meant was that some current implementations of
>UTF-8 mishandle surrogates. Actually, the most likely near-term
>use for them is in user-defined ideographs (e.g. obscure Chinese
>and Japanese personal names) and therefore it is reasonable
>to disallow them - just not for the stated reason. Said another
>way, since all ISO 10646 characters in the range representable
>by pairs of surrogates are currently undefined (except for private
>use characters), and the document elsewhere prohibits undefined
>characters, we don't need this section at all.
I fully disagree. By the time IDN is finished, ISO 10646 will contain
assigned characters outside plane 0. These will include more than
"obscure" Han characters. IDN should be able to handle these just as
well as any other characters.
--Paul Hoffman, Director
--Internet Mail Consortium