
Re: [idn] Fw: Moving Towards UTF8 vs ASCII(ACE) Forever



Donald Eastlake 3rd <dee3@torque.pothole.com> wrote on the IETF list:

> There is now a standard way to encode URIs containing arbitrary
> UNICODE characters. This is described in RFC 3275 (which is
> currently a Draft Standard), in Section 4.3.3.1, and in the
> corresponding W3C document and has appeared in other W3C documents,
> for example XML Base.

So U+00E1 LATIN SMALL LETTER A WITH ACUTE (á), which is 0xC3 0xA1 in
UTF-8, is encoded as "%C3%A1" (six bytes) according to RFC 3275.  All
BMP characters above U+07FF, including all CJK characters, take three
UTF-8 bytes and thus nine bytes once escaped per RFC 3275.
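
For anyone who wants to check the arithmetic, here is a rough sketch in
Python.  It assumes the escaping in question is ordinary %XX escaping of
each UTF-8 byte, which is what the standard library's urllib.parse.quote
does; the helper name encoded_sizes and the sample character U+4E2D are
my own choices for illustration, not anything taken from the RFC.

```python
from urllib.parse import quote


def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, percent-escaped byte count) for one character.

    Hypothetical helper for illustration only.
    """
    utf8 = ch.encode("utf-8")
    # safe="" so that every non-unreserved byte is escaped as %XX
    escaped = quote(ch, safe="")
    return len(utf8), len(escaped)


# U+00E1 (two UTF-8 bytes) and U+4E2D, a CJK ideograph (three UTF-8 bytes)
for ch in ("\u00e1", "\u4e2d"):
    print(f"U+{ord(ch):04X}", quote(ch, safe=""), encoded_sizes(ch))
```

Each UTF-8 byte becomes three ASCII bytes, so the escaped form is always
three times the size of the UTF-8 form for non-ASCII characters.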

I thought CJK users and others wanted *better* compression.

(No, David, I know you're not all the same person.  I heard lots of
voices saying the same thing.)

-Doug Ewell
 Fullerton, California