[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[idn] Re: FACE: Friendly ASCII-Compatible Encoding
- To: idn working group <idn@ops.ietf.org>
- Subject: [idn] Re: FACE: Friendly ASCII-Compatible Encoding
- From: "Adam M. Costello" <amc@cs.berkeley.edu>
- Date: Tue, 5 Sep 2000 00:26:01 +0000
- Delivery-date: Mon, 04 Sep 2000 17:28:01 -0700
- Envelope-to: idn-data@psg.com
- User-Agent: Mutt/1.2.5i
Okay, forget that last idea. The first time I read the SACE draft, I
didn't understand it. Now I see the efficiency of keeping more state in
the encoder/decoder. Below is a new version that encodes each non-ASCII
character as a signed difference from the previous non-ASCII character,
represented as a variable-length base-32 integer.
I suspect that this will have an efficiency similar to SACE (which
is much more efficient than UTF-5, and more efficient than RACE for
Japanese and Korean and strings containing outlier characters).
Preserving the readability of ASCII comes at no cost to text that
doesn't use ASCII. It comes at a small cost to text that uses ASCII
sporadically, as is of course efficient for text that is mostly ASCII.
Is there a list of test-strings that I should encode to see how large
they get?
I'll turn this proposal into an internet draft if anyone requests it,
otherwise I'm content to drop it. Or feel free to dissect it and use
the parts you like.
AMC
Friendly ASCII-Compatible Encoding (FACE)
version 0.2.0 (2000-Sep-04-Mon)
Adam M. Costello <amc@cs.berkeley.edu>
Goals:
1) To achieve efficient encoding.
2) To require only the characters [A-Z0-9-] (like DNS labels).
3) To be simple to describe and implement.
4) To encode Unicode text as an ASCII string in such a way that
substrings that were already ASCII to begin with remain visible, for
the benefit of users whose software does not understand the Unicode
text.
Notation: Let the symbol # denote any of the characters from the set
[2-9A-KMNP-Z], which represent quintet values in that order:
"2" = 0 = 00000
"3" = 1 = 00001
"4" = 2 = 00010
"5" = 3 = 00011
"6" = 4 = 00100
"7" = 5 = 00101
"8" = 6 = 00110
"9" = 7 = 00111
"A" = 8 = 01000
"B" = 9 = 01001
"C" = 10 = 01010
"D" = 11 = 01011
"E" = 12 = 01100
"F" = 13 = 01101
"G" = 14 = 01110
"H" = 15 = 01111
"I" = 16 = 10000
"J" = 17 = 10001
"K" = 18 = 10010
"M" = 19 = 10011
"N" = 20 = 10100
"P" = 21 = 10101
"Q" = 22 = 10110
"R" = 23 = 10111
"S" = 24 = 11000
"T" = 25 = 11001
"U" = 26 = 11010
"V" = 27 = 11011
"W" = 28 = 11100
"X" = 29 = 11101
"Y" = 30 = 11110
"Z" = 31 = 11111
The digits "0" and "1" and the letters "O" and "L" ("l") are not used,
to avoid transcription errors.
To encode a sequence of Unicode characters as a sequence of ASCII
characters:
There are two modes, ASCII mode and base-32 mode. Begin in base-32
mode. There is also a "previous non-ASCII character". Initially it
is character U+1A0.
In base-32 mode non-ASCII characters are encoded as differences
from the previous non-ASCII character. The difference is first
represented as a two's complement binary integer. Then the most
significant bits are discarded to bring the length down to one of
the lengths in the table below. As many bits are discarded as
possible without changing the two's complement value, except that
bit 31 may always be discarded. The resulting bit string is then
encoded as a sequence of two to seven characters, depending on its
length, as follows:
length encoding
------ ---------------------------------------------------
9 ## = 0xxxx xxxxx
13 ### = 10xxx xxxxx xxxxx
17 #### = 110xx xxxxx xxxxx xxxxx
21 ##### = 1110x xxxxx xxxxx xxxxx xxxxx
31 ####### = 1111x xxxxx xxxxx xxxxx xxxxx xxxxx xxxxx
Bit-order and quintet order are both big-endian (most significant
first).
In ASCII mode, characters in the range U+0..U+7F, except for U+2D
("-"), are encoded literally as themselves.
In either mode, character U+2D ("-") is encoded as "--".
If the current mode is incapable of encoding the next character,
output "-" and switch modes.
The previous non-ASCII character is not forgotten during excursions
into ASCII mode.
To decode a sequence of ASCII characters into a sequence of Unicode
characters:
Start in base-32 mode, with a previous non-ASCII character of U+1A0.
In base-32 mode, decode the various sizes of base-32 numbers as
indicated in the table above (the format is implied by the first
character). Sign-extend the value to 32 bits and add it to the
previous non-ASCII character. If bit 31 is set in the result, clear
it. Output the result, and remember it for the next non-ASCII
character.
In ASCII mode, decode all characters as themselves, except for 2D
("-").
In either mode, decode "--" as "-".
In either mode, "-" not followed by "-" switches modes. This
requires one character lookahead.
The previous non-ASCII character is not forgotten during excursions
into ASCII mode.
Examples:
Suppose the string we wish to encode is
"AMURONAMIE-with-super-monkeys", where AMURONAMIE refers to a
particular sequence of five Japanese characters, whose iso-2022-jp
encoding is:
$B0B<<F`H~7C(B
The corresponding Unicode values are:
U+5B89 U+5BA4 U+5948 U+7F8E U+6075.
The 32-bit differences are:
0000 0000 0000 0000 0101 1001 1110 1001
0000 0000 0000 0000 0000 0000 0001 1011
1111 1111 1111 1111 1111 1101 1010 0100
0000 0000 0000 0000 0010 0110 0100 0110
1111 1111 1111 1111 1110 0000 1110 0111
The truncated bit strings are:
0 0101 1001 1110 1001
0 0001 1011
1 1101 1010 0100
0 0010 0110 0100 0110
1 1110 0000 1110 0111
The quintet encodings are:
11000 10110 01111 01001
00000 11011
10111 01101 00100
11000 01001 10010 00110
11011 11000 00111 00111
The final encoded string is:
SQHB2VRF6SBK8VS99---with--super--monkeys
The encoding of "champs-elysee", with an acute accent over the
second-last "e", is:
-champs--elys-CB-e
Notice how the hyphens help humans pick out the readable ASCII parts
and ignore the base-32 gibberish.
Use with DNS:
It is recommended that a standard prefix (such as "u--") be chosen
for all domain labels that use this encoding, so that they can
be distinguished from ASCII labels, and so that they never begin
with a hyphen. A 3-character prefix leaves room for about thirty
characters in a small script like Cyrillic or Hebrew, or about
sixteen Han ideographs.
Hostnames are case insensitive, and that goes for the base-32 parts
as well as the ASCII parts. However, since existing ASCII domain
names are usually stored in lower case, it is recommended that the
base-32 portions of encoded names be stored in upper case, to help
humans with old software distinguish the ASCII from the base-32.
Humans with new software that interprets the encoding will, of
course, see the Unicode characters rather than the base-32 encoding.
Comparisons with other schemes:
FACE probably has similar efficiency to SACE, but seems conceptually
simpler. FACE is a little more human-friendly because gibberish and
ASCII are always separated by a hyphen.
FACE and SACE are both more efficient and much more human-friendly
than UTF-5.
RACE is quite efficient for strings that are limited to Latin-1 plus
one other cluster of 256 characters, but one deviation from this set
causes the compression to break down for the entire string. FACE
and SACE are more tolerant of outliers, more efficient for Japanese
and Korean (which use small clusters of syllabic characters in
addition to thousands of Han ideographs), and more human-friendly.
References:
FACE borrows heavily from UTF-5, SACE, and RACE.
http://www.i-d-n.net/