Re: [idn] Fw: Moving Towards UTF8 vs ASCII(ACE) Forever
Donald Eastlake 3rd <dee3@torque.pothole.com> wrote on the IETF list:
> There is now a standard way to encode URIs containing arbitrary
> UNICODE characters. This is described in RFC 3275 (which is
> currently a Draft Standard), in Section 4.3.3.1, and in the
> corresponding W3C document and has appeared in other W3C documents,
> for example XML Base.
So U+00E1 LATIN SMALL LETTER A WITH ACUTE (á), which is 0xC3 0xA1 in
UTF-8, is encoded as "%C3%A1" (six bytes) according to RFC 3275. All
BMP characters above U+07FF, including all CJK characters, take three
UTF-8 bytes and thus nine bytes once escaped per RFC 3275.
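
To make the arithmetic concrete, here is a minimal sketch in Python (my
choice of language; nothing in the RFC prescribes it). It assumes the
escaping rule is simply "convert to UTF-8, then write each byte as %HH".
The helper name escaped_length is made up for illustration, and U+4E2D is
just an arbitrary CJK sample character.

```python
from urllib.parse import quote


def escaped_length(ch):
    """Return (escaped form, UTF-8 byte count, escaped byte count) for ch."""
    utf8 = ch.encode("utf-8")
    # safe="" forces every non-unreserved byte to be written as %HH
    escaped = quote(ch, safe="")
    return escaped, len(utf8), len(escaped)


# U+00E1 (two UTF-8 bytes) and U+4E2D (a CJK ideograph, three UTF-8 bytes)
for ch in ("\u00e1", "\u4e2d"):
    escaped, n_utf8, n_escaped = escaped_length(ch)
    print(f"U+{ord(ch):04X}: {n_utf8} UTF-8 bytes -> {escaped} ({n_escaped} bytes)")
```

Each UTF-8 byte becomes three ASCII bytes, so the escaped form is always
three times the size of the UTF-8 form for any non-ASCII character.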
I thought CJK users and others wanted *better* compression.
(No, David, I know you're not all the same person. I heard lots of
voices saying the same thing.)
-Doug Ewell
Fullerton, California