Re: [idn] FACE: Friendly ASCII-Compatible Encoding
- To: "Adam M. Costello" <amc@cs.berkeley.edu>
- Subject: Re: [idn] FACE: Friendly ASCII-Compatible Encoding
- From: James Seng <James@Seng.cc>
- Date: Tue, 05 Sep 2000 08:59:09 +0800
- Cc: idn working group <idn@ops.ietf.org>
- Delivery-date: Tue, 05 Sep 2000 12:14:15 -0700
- Envelope-to: idn-data@psg.com
Great. :-) It would be better if you could write this up as an I-D and submit
it as a WG doc.
-James Seng
"Adam M. Costello" wrote:
>
> Please forgive me for jumping in with no background, but I just stumbled
> across this working group's web page, found some of the internet drafts
> interesting, and whipped up this idea, which you all may or may not have
> any use for.
>
> AMC
>
> Friendly ASCII-Compatible Encoding (FACE)
> version 0.0.0 (2000-Sep-04-Mon)
> Adam M. Costello <amc@cs.berkeley.edu>
>
> Goals:
>
> 1) To encode Unicode text as an ASCII string in such a way that
> substrings that were already ASCII to begin with remain visible, for
> the benefit of users whose software does not understand the Unicode
> text.
>
> 2) To achieve reasonable efficiency for non-ASCII characters.
>
> 3) To require only the characters [A-Z0-9-] (like DNS labels).
>
> 4) To be simple to describe and implement.
>
> Notation: Let the symbol # denote any of the characters from the set
> [0-9A-V], which represent quintet values in that order:
>
> "0" = 0 = 00000
> "1" = 1 = 00001
> ...
> "9" = 9 = 01001
> "A" = 10 = 01010
> "B" = 11 = 01011
> ...
> "V" = 31 = 11111
>
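As an illustration of the notation above, here is a minimal Python sketch of the quintet alphabet. The names QUINTET_ALPHABET and quintets are made up for this example and are not part of the proposal.

```python
import string

# Quintet alphabet from the notation above: "0"-"9" are 0-9, "A"-"V" are 10-31.
# (Name chosen for this sketch only.)
QUINTET_ALPHABET = string.digits + string.ascii_uppercase[:22]


def quintets(value, count):
    """Render value as exactly `count` base-32 symbols, most significant first."""
    if not 0 <= value < 32 ** count:
        raise ValueError("value does not fit in %d quintets" % count)
    return "".join(
        QUINTET_ALPHABET[(value >> (5 * i)) & 31] for i in reversed(range(count))
    )
```

For instance, quintets(0xE9, 2) splits 233 into the quintets 7 and 9.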
> To encode a sequence of Unicode characters as a sequence of ASCII
> characters:
>
> A maximal nonempty subsequence of ASCII characters is encoded
> literally, except that any instances of "-" are replaced by "--". If
> the result does not begin with "-", then "-" is prepended. If the
> result does not end with "-", then "-" is appended, except at the
> very end of the whole sequence.
>
> A Unicode character in the range [0x80, 0x3ff] is encoded as "##" in
> base 32 (most significant quintet first).
>
> A Unicode character in the range [0x400, 0x7fff] is encoded as
> "W###" in base 32.
>
> A Unicode character in the range [0x8000, 0xffff] is encoded as
> "X###", where the base 32 number is the offset from 0x8000.
>
> A Unicode character in the range [0x10000, 0x10ffff] is encoded as
> "Y####", where the base 32 number is the offset from 0x10000.
>
> There aren't ever supposed to be any Unicode characters beyond that
> (because they couldn't be represented in UTF-16), but we still have
> "Z" unused in case we need an escape hatch.
>
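The encoding rules above could be sketched in Python roughly as follows. This is one reading of the rules, not a reference implementation. The names face_encode, _b32 and _encode_char are hypothetical, and the sketch assumes the input is a Python str of Unicode code points.

```python
import re

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # quintet values 0..31


def _b32(value, count):
    """Exactly `count` base-32 symbols, most significant quintet first."""
    return "".join(ALPHABET[(value >> (5 * i)) & 31] for i in reversed(range(count)))


def _encode_char(cp):
    """Encode one non-ASCII code point (cp >= 0x80) using the four forms above."""
    if cp <= 0x3FF:
        return _b32(cp, 2)
    if cp <= 0x7FFF:
        return "W" + _b32(cp, 3)
    if cp <= 0xFFFF:
        return "X" + _b32(cp - 0x8000, 3)
    return "Y" + _b32(cp - 0x10000, 4)


def face_encode(text):
    out = []
    # Walk over maximal ASCII runs and individual non-ASCII characters.
    for m in re.finditer(r"[\x00-\x7f]+|[^\x00-\x7f]", text):
        piece = m.group()
        if ord(piece[0]) < 0x80:
            run = piece.replace("-", "--")
            if not run.startswith("-"):
                run = "-" + run
            # No trailing "-" at the very end of the whole sequence.
            if not run.endswith("-") and m.end() != len(text):
                run += "-"
            out.append(run)
        else:
            out.append(_encode_char(ord(piece)))
    return "".join(out)
```

With this sketch, face_encode("champs-elys\u00e9e") gives "-champs--elys-79-e".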
> To decode a sequence of ASCII characters into a sequence of Unicode
> characters, make one pass from the beginning:
>
> Start in base-32 mode.
>
> In base-32 mode, decode the various sizes of base-32 numbers
> depending on whether the first character is #, W, X, or Y. Allow
> both upper and lower case letters.
>
> In ASCII mode, all characters are literal except for "-".
>
> "--" encountered in either mode decodes as "-" and sets the decoder
> to ASCII mode.
>
> A "-" followed by something other than "-" toggles between ASCII
> mode and base-32 mode (and does not consume the character following
> the "-").
>
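A one-pass decoder following the rules above might look like the Python sketch below. The name face_decode and the FORMS table are made up for this example. Two choices are assumptions the text does not specify: malformed input raises ValueError, and a lone "-" at the very end of the input is treated as a mode toggle.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"  # quintet values 0..31

# Leading letter -> (number of quintets that follow, offset added to the value).
FORMS = {"W": (3, 0), "X": (3, 0x8000), "Y": (4, 0x10000)}


def face_decode(encoded):
    out = []
    ascii_mode = False  # the decoder starts in base-32 mode
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == "-":
            if encoded[i + 1:i + 2] == "-":
                # "--" is a literal hyphen and forces ASCII mode.
                out.append("-")
                ascii_mode = True
                i += 2
            else:
                # A single "-" toggles the mode; the next character is not consumed.
                ascii_mode = not ascii_mode
                i += 1
        elif ascii_mode:
            out.append(ch)
            i += 1
        else:
            lead = ch.upper()  # both upper and lower case are accepted
            if lead in FORMS:
                count, offset = FORMS[lead]
                i += 1
            else:
                count, offset = 2, 0  # plain "##" form
            digits = encoded[i:i + count].upper()
            if len(digits) != count or any(d not in ALPHABET for d in digits):
                raise ValueError("malformed base-32 group at position %d" % i)
            value = 0
            for d in digits:
                value = value * 32 + ALPHABET.index(d)
            out.append(chr(value + offset))
            i += count
    return "".join(out)
```

For example, face_decode("-champs--elys-79-e") returns "champs-elys\u00e9e", and lower-case input such as "wms9--a" decodes the same way as its upper-case form.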
> Examples:
>
> Suppose the string we wish to encode is
> "AMURONAMIE-with-super-monkeys", where AMURONAMIE refers to a
> particular sequence of five Japanese characters, whose iso-2022-jp
> encoding is:
>
> $B0B<<F`H~7C(B
>
> The corresponding Unicode values are:
>
> U+5B89 U+5BA4 U+5948 U+7F8E U+6075.
>
> The encoded string is:
>
> WMS9WMT4WMA8WVSEWO3L--with--super--monkeys
>
> The encoding of "champs-elysee", with an acute accent over the
> second-last "e", is:
>
> -champs--elys-79-e
>
> Notice how the hyphens help humans pick out the readable ASCII parts
> and ignore the base-32 gibberish.
>
> Use with DNS:
>
> It is recommended that a standard prefix (such as "u--") be chosen
> for all domain labels that use this encoding, so that they can be
> distinguished from ASCII labels, and so that they never begin with a
> hyphen. A 3-character prefix leaves room for fifteen 16-bit Unicode
> characters.
>
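The "fifteen" figure above can be checked with a little arithmetic, sketched here in Python. The 63-character label limit comes from the DNS specification (RFC 1035). The constant and function names are made up for this example, and "u--" is only the example prefix from the text, not a chosen standard.

```python
MAX_LABEL_LENGTH = 63   # maximum length of a DNS label (RFC 1035)
PREFIX = "u--"          # example prefix from the text; hypothetical, not standardized
CHARS_PER_16BIT = 4     # "W###" or "X###": one marker plus three quintets


def max_16bit_chars(prefix=PREFIX):
    """How many 4-character encoded units fit in a label after the prefix."""
    return (MAX_LABEL_LENGTH - len(prefix)) // CHARS_PER_16BIT
```

Here max_16bit_chars() evaluates to (63 - 3) // 4. Characters below 0x400 need only two encoded characters each, so a label made of those can hold more.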
> Hostnames are case insensitive, and that goes for the base-32 parts
> as well as the ASCII parts. However, since existing ASCII domain
> names are usually stored in lower case, it is recommended that the
> base-32 portions of encoded names be stored in upper case, to help
> humans with old software distinguish the ASCII from the base-32.
> Humans with new software that interprets the encoding will, of
> course, see the Unicode characters rather than the base-32 encoding.
>
> Acknowledgements:
>
> Some ideas for FACE were taken from UTF-5, RACE, and SACE.