[idn] FACE: Friendly ASCII-Compatible Encoding
- To: idn working group <idn@ops.ietf.org>
- Subject: [idn] FACE: Friendly ASCII-Compatible Encoding
- From: "Adam M. Costello" <amc@cs.berkeley.edu>
- Date: Mon, 4 Sep 2000 12:57:31 +0000
- Delivery-date: Mon, 04 Sep 2000 05:58:52 -0700
- Envelope-to: idn-data@psg.com
- User-Agent: Mutt/1.2.5i
Please forgive me for jumping in with no background, but I just stumbled
across this working group's web page, found some of the internet drafts
interesting, and whipped up this idea, which you all may or may not have
any use for.
AMC
Friendly ASCII-Compatible Encoding (FACE)
version 0.0.0 (2000-Sep-04-Mon)
Adam M. Costello <amc@cs.berkeley.edu>
Goals:
1) To encode Unicode text as an ASCII string in such a way that
substrings that were already ASCII to begin with remain visible, for
the benefit of users whose software does not understand the Unicode
text.
2) To achieve reasonable efficiency for non-ASCII characters.
3) To require only the characters [A-Z0-9-] (like DNS labels).
4) To be simple to describe and implement.
Notation: Let the symbol # denote any of the characters from the set
[0-9A-V], which represent quintet values in that order:
"0" = 0 = 00000
"1" = 1 = 00001
...
"9" = 9 = 01001
"A" = 10 = 01010
"B" = 11 = 01011
...
"V" = 31 = 11111
To encode a sequence of Unicode characters as a sequence of ASCII
characters:
A maximal nonempty subsequence of ASCII characters is encoded
literally, except that any instances of "-" are replaced by "--". If
the result does not begin with "-", then "-" is prepended. If the
result does not end with "-", then "-" is appended, except at the
very end of the whole sequence.
A Unicode character in the range [0x80, 0x3ff] is encoded as "##" in
base 32 (most significant quintet first).
A Unicode character in the range [0x400, 0x7fff] is encoded as
"W###" in base 32.
A Unicode character in the range [0x8000, 0xffff] is encoded as
"X###", where the base 32 number is the offset from 0x8000.
A Unicode character in the range [0x10000, 0x10ffff] is encoded as
"Y####", where the base 32 number is the offset from 0x10000.
There aren't ever supposed to be any Unicode characters beyond that
(because they couldn't be represented in UTF-16), but we still have
"Z" unused in case we need an escape hatch.
To decode a sequence of ASCII characters into a sequence of Unicode
characters, make one pass from the beginning:
Start in base-32 mode.
In base-32 mode, decode the various sizes of base-32 numbers
depending on whether the first character is #, W, X, or Y. Allow
both upper and lower case letters.
In ASCII mode, all characters are literal except for "-".
"--" encountered in either mode decodes as "-" and sets the decoder
to ASCII mode.
A "-" followed by something other than "-" toggles between ASCII
mode and base-32 mode (and does not consume the character following
the "-").
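The decoding pass above might be sketched in Python roughly as follows.
This is only an illustration, not a reference implementation: the names
face_decode, PREFIXES and _number are made up for this sketch, and
treating a lone "-" at the very end of the input as a plain mode toggle
is an assumption that the rules above do not spell out.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

# Prefix letter -> (number of base-32 digits that follow, offset to add).
PREFIXES = {"W": (3, 0), "X": (3, 0x8000), "Y": (4, 0x10000)}


def _number(digits: str) -> int:
    """Interpret a string of base-32 digits, most significant first."""
    value = 0
    for ch in digits:
        index = DIGITS.find(ch)
        if index < 0:
            raise ValueError(f"invalid base-32 digit: {ch!r}")
        value = value * 32 + index
    return value


def face_decode(encoded: str) -> str:
    out = []
    ascii_mode = False  # the decoder starts in base-32 mode
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch == "-":
            if encoded[i + 1:i + 2] == "-":
                # "--" is a literal hyphen and forces ASCII mode.
                out.append("-")
                ascii_mode = True
                i += 2
            else:
                # A single "-" toggles the mode; the next character
                # is not consumed. (End of input: assumed harmless.)
                ascii_mode = not ascii_mode
                i += 1
        elif ascii_mode:
            out.append(ch)
            i += 1
        else:
            prefix = ch.upper()
            if prefix in PREFIXES:
                count, offset = PREFIXES[prefix]
                digits = encoded[i + 1:i + 1 + count]
                i += 1 + count
            else:
                # No prefix: a two-digit number for [0x80, 0x3ff].
                count, offset = 2, 0
                digits = encoded[i:i + 2]
                i += 2
            if len(digits) != count:
                raise ValueError("truncated base-32 number")
            # Upper-casing accepts both upper and lower case letters.
            out.append(chr(offset + _number(digits.upper())))
    return "".join(out)
```

For instance, face_decode("wms9") and face_decode("WMS9") both yield the
single character U+5B89.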
Examples:
Suppose the string we wish to encode is
"AMURONAMIE-with-super-monkeys", where AMURONAMIE refers to a
particular sequence of five Japanese characters, whose iso-2022-jp
encoding is:
$B0B<<F`H~7C(B
The corresponding Unicode values are:
U+5B89 U+5BA4 U+5948 U+7F8E U+6075.
The encoded string is:
WMS9WMT4WMA8WVSEWO3L--with--super--monkeys
The encoding of "champs-elysee", with an acute accent over the
second-last "e", is:
-champs--elys-79-e
Notice how the hyphens help humans pick out the readable ASCII parts
and ignore the base-32 gibberish.
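The encoding rules given earlier might be sketched in Python roughly as
follows. This is a sketch under assumptions, not a reference
implementation: the names face_encode, _base32 and _encode_char are
invented here, and the code applies the rules literally, one maximal
ASCII run at a time.

```python
import re

DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUV"


def _base32(number: int, width: int) -> str:
    """Write number as exactly `width` quintets, most significant first."""
    return "".join(
        DIGITS[(number >> (5 * i)) & 31] for i in reversed(range(width))
    )


def _encode_char(cp: int) -> str:
    """Encode one non-ASCII code point according to its range."""
    if 0x80 <= cp <= 0x3FF:
        return _base32(cp, 2)
    if 0x400 <= cp <= 0x7FFF:
        return "W" + _base32(cp, 3)
    if 0x8000 <= cp <= 0xFFFF:
        return "X" + _base32(cp - 0x8000, 3)
    if 0x10000 <= cp <= 0x10FFFF:
        return "Y" + _base32(cp - 0x10000, 4)
    raise ValueError(f"code point out of range: {cp:#x}")


def face_encode(text: str) -> str:
    pieces = []
    # Each match is either a maximal ASCII run or one non-ASCII character.
    for match in re.finditer(r"[\x00-\x7f]+|[^\x00-\x7f]", text):
        chunk = match.group()
        if ord(chunk[0]) < 0x80:
            literal = chunk.replace("-", "--")
            if not literal.startswith("-"):
                literal = "-" + literal
            at_end = match.end() == len(text)
            # No trailing "-" at the very end of the whole sequence.
            if not literal.endswith("-") and not at_end:
                literal += "-"
            pieces.append(literal)
        else:
            pieces.append(_encode_char(ord(chunk)))
    return "".join(pieces)


print(face_encode("champs-elys\u00e9e"))  # -champs--elys-79-e
```

One caveat of following the rules literally: an ASCII run that ends in
a hyphen and is immediately followed by a non-ASCII character gets no
extra "-" appended, so the decoder would still be in ASCII mode when it
reaches the base-32 digits. The sketch does not try to resolve that
case.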
Use with DNS:
It is recommended that a standard prefix (such as "u--") be chosen
for all domain labels that use this encoding, so that they can be
distinguished from ASCII labels, and so that they never begin with a
hyphen. Because a DNS label is limited to 63 characters, a 3-character
prefix leaves room for fifteen 16-bit Unicode characters encoded at four
characters each.
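The arithmetic behind that figure can be checked with a few lines of
Python. The function name max_characters is hypothetical; the 63-octet
limit is the usual DNS bound on label length, and the widths are the
encoded lengths of the "##", "W###"/"X###" and "Y####" forms.

```python
MAX_LABEL_LENGTH = 63  # DNS limit on the length of a single label


def max_characters(prefix: str, encoded_width: int) -> int:
    """How many encoded characters of a given width fit after the prefix."""
    return (MAX_LABEL_LENGTH - len(prefix)) // encoded_width


print(max_characters("u--", 4))  # 15 (W### or X### forms)
print(max_characters("u--", 2))  # 30 (## form)
print(max_characters("u--", 5))  # 12 (Y#### form)
```

Labels that mix in ASCII runs cost one character per ASCII character,
plus the delimiting hyphens.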
Hostnames are case insensitive, and that goes for the base-32 parts
as well as the ASCII parts. However, since existing ASCII domain
names are usually stored in lower case, it is recommended that the
base-32 portions of encoded names be stored in upper case, to help
humans with old software distinguish the ASCII from the base-32.
Humans with new software that interprets the encoding will, of
course, see the Unicode characters rather than the base-32 encoding.
Acknowledgements:
Some ideas for FACE were taken from UTF-5, RACE, and SACE.