[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[idn] [draft] Improving ACE using code point reordering v0.5
Could DUDE be improved for Hangul,Vietnames and European languages
without hurting its simplicity and elegance? Yes! :-)
THis draft will appear in IDN Intenet-draft directory soon.
Soobok Lee
lsb@postel.co.kr
---------------------------------------------------------------------------
INTERNET-DRAFT Soobok Lee
draft-ietf-idn-lsb-00.txt
Expires 2001-Dec-26 2001-Jun-26
Improving ACE using code point reordering v0.5
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note
that other groups may also distribute working documents as
Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other documents
at any time. It is inappropriate to use Internet-Drafts as
reference material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html
Distribution of this document is unlimited. Please send comments to
the authors or to the idn working group at idn@ops.ietf.org.
Abstract
This draft describes statistically-backed Unicode code point reordering
to improve ACE compression. This code point reordering may be inserted
into existing ACE algorithms with only 1~2 lines of modification.
This reordering requires simple character mapping or range-wide mapping
which are simple and easy to implement.
Experiments with DUDE implementation of this idea show great
improvements for Hangul, Vietnames and European domains.
Contents
Overview
Hangul
Basic Latin
Extended Latin and Combining Diacritical Marks
Unified Han
Other character sets
Modified Encoding procedure of DUDE implementation of this idea
Modified Decoding procedure of DUDE implementation of this idea
Example strings
Security considerations
References
Author
LDUDE: Example implementation into DUDE
Overview
Pursuing shorter ACE labels is justified to save memory resources and
to reduce internet traffic even for domains of average length
in various application/core internet protocols.
Both 11172 Hangul syllables and 24000 or more CJK Han syllables
occupy roughly half of the entire unicode space.
Their lexicographical ordering( not in frequency ordering) makes
various ACE compression technique work poorly for them. Frequently
used syllables are spread evenly through out those wide ranges.
Even, Latin characters code range, including 'a' - 'z' has
lexicographical order that does not reflect the fact that 's', 't' and
'r' are more frequently used than 'j','k' and 'h'.
Each Hangul code point integer have useful bit structure in it.
That reflects its hangul jamo(consonant/vowel) combination which
provides us the hint for applying reordering by usage frequency.
Most ACE algorithms show good compression ratio when frequently used
characters are re-located into narrower code areas.
Especially, to reduce DUDE XOR distance, we can make the narrow area
fit in 16,256 and 4096 bytes-long page boundaries.
Hangul
Each code point integer of modern hangul syllables (u+ac00 ~ u+bd7f) can be
decomposed to produce the following 3 jamo(hangul consonant/vowel) indices,
1) L: a leading consonant index in modern 19 leading consonants array,
2) V: a vowel index in modern 21 vowels array,
3) T: an optional trailing consonant index in modern 28 trailing
consonants array.
In fact, Unicode hangul range (u+ac00 ~ u+bd7f) is a 3-dimensional array
Hangul[L][V][T] with base 0xac00 and each hangul syllable has
code point integer determined by L*21*28 + V*28 + T + 0xac00.
Statistically, roughly six of 28 hangul trailing consonants are frequent in
korean nouns. Leading consontants and vowels show relatively even
frequency distributions compared with that of trailing consonants.
We can make a frequency mapping table f(T) for T so that
f(T)= FrequencyOrderOf(T) in the array of modern trailing consonants.
Now, Hangul[L][V][T] is to be reordered into another 3-dimensional array
Hangul[f(T)][f(V)][f(L)] with index T augmented to the highest order.
Each hangul syllable in this reorderd range has a new code point value
determined by f(T)*21*19 + f(V)*19 + f(L) + 0xac00.
Average f(T) value should be sufficiently lower than f(L) and F(V), so that
the lowered code point value and induced lowered successive XOR-differences
contribute to produce shorter ACE labes.
Basic Latin
Basic Latin row u+0070 ~ u+007f has 'p','r','s','t' and 'u' which
are more frequently used in European nouns than '`','j','k','f' and 'g'
in u+0060 ~ u+006f row which includes most frequently used 'a'~'o' .
Let's swap two sets of these 5 characters by character by character basis,
so that 'p','r','s','t','u' go into the u+0060 ~ u+006f row.
Single row fits in 16 bytes boundary and any character sequences
from only this single row make XOR-distance or code window length
shorter than 0x10 and make DUDE and other ACEs do good compression.
Extended Latin and Combining Diacritical Marks
First 6 rows from Latin Extension A(u+0100 ~ u+015f) and 6 rows from
Basic Latin & Latin-1 Supplement (u+0000 ~ u+002f and u+0080 ~ u+00a0)
are swapped.
First 3 rows from Combining Diacritical Marks(u+0300 ~ u+032f) and
3 rows from Latin-1 Supplement (u+00B0 ~ u+00df) are also swapped.
This makes frequently used parts of Latin Extended-A and
Combining Diacritical Marks go into first 256-bytes page(u+0000 ~ u+00ff).
Any character sequences from this single 256-bytes-long page make
XOR-distance or code window length much shorter than 0x100.
This improvement benefits especially East-European and Vietnamese
domains which use Latin Extented A and Combining Diacritical Marks.
Unified Han
I have no statistical data to optimize for this code ranges, yet.
CJK Unified Han syllables (u+4e00 ~ u+a48f) are ordered by their radicals.
We could consider casting most frequently used 256 Han syllables into
single 256-bytes page.
We could consider casting most frequently used 4096 Han syllables into
single 4096-bytes page ,too.
Other character sets
I have no statistical data to optimize for this code ranges, yet.
Japapanes gatakana and hiragana (u+3040 ~ u+30ff) code points have
lexicographical order and if we could put most frequent ones in single row,
it may greatly improve gatakana-only or hiragana-only domains.
For Arabic,Hebrew,Cyrllic and Hindi, we could devise similiar optimizations,
if relavent statistical data are avaiable.
Modified Encoding procedure of DUDE implementation of this idea
All ordering of nybbles and quintets is big-endian (most significant
first). A nybble is 4 bits. XOR is bitwise exclusive or.
This modification are hyphen-safe.
Hyphen encoding/decoding is not affected by this modification.
let prev = 96
for each input integer n (in order) do begin
if n == 45 then output hyphen minus
else begin
n = reorder(n) // ******** ADDED **********
let diff = prev XOR n
extract the least significant nybbles of diff, as few as are
sufficient to hold all the nonzero bits (but at least one)
prepend 0 to the last nybble and 1 to the rest
output base-32 characters corresponding to the quintets
let prev = n
end
end
The encoder must either correctly handle all integer values that can
be represented in the type of its input, or it must check whether
the input contains values that it cannot handle and return an error
if so. Under no circumstances may it produce incorrect output.
Modified Decoding procedure of DUDE implementation of this idea
let prev = 96
while the input string is not exhausted do begin
if the next character is hyphen-minus then output 45
else begin
input characters and convert them to quintets until
encountering a quintet beginning with 0
fail upon encountering a non-base-32 character or end-of-input
strip the first bit of each quintet
concatenate the resulting nybbles to form diff
let prev = prev XOR diff
output restore_order(prev) // ******** MODIFIED **********
end
end
encode the output sequence and compare it to the input string
fail if they are not equal
Example strings
about 20% improvement in DUDE compression ratio is measured in these
Hangul examples.
(A) Korean Strings 1: ( 24 hangul syllables )
U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C
DUDE-02: 6txIy79Ny53Nz79A8wIzwwNzzuAvyIzv3AtuuIz2vBy27Jz66Iz8sI\
tusAuIyz5I23Az96Iz6zE3xAz2tD96Ry3sI ( 89 chars )
LDUDE : 5szf8pt3bttat3jt6iywhu3bw9qt5m37r2vmxxjxsg2mtvat3auykz\
wpxubc8xd42fw6p ( 70 chars )
(B) Korean Strings 2: ( 6 hangul syllables )
U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC
DUDE-02: 6txiy79ny53nz79a8wizwwn ( 24 chars )
LDUDE : 5szf8pt3bttat3jt6i ( 19 chars )
LDUDE improvents show better compression ratios for other scripts.
(C) Vietnamese: ( 38 syllables using diacritical marks )
Ta<dotbelow>isaoho<dotbelow>kh<ocirc>ngth<ecirc><hookabove>chi\
<hookabove>no<acute>iti<ecirc><acute>ngVi<ecirc><dotbelow>t
U+0054 u+0061 u+0323 u+0069 u+0073 u+0061 u+006F u+0068 u+006F
u+0323 u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+00EA
u+0309 u+0063 u+0068 u+0069 u+0309 u+006E u+006F u+0301 u+0069
u+0074 u+0069 u+00EA u+0301 u+006E u+0067 U+0056 u+0069 u+00EA
u+0323 u+0074
DUDE-02: vEvfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\
yitptp2dv8mvyrjvBvr2dv6jvxh ( 82 chars )
LDUDE : uGuh5c5kckqhh5n4atm3n3ktmtdq2cxd7kmb7a7hb7q7irr2dxm7rt\
muDvr2dvj5f (66 chars , 16 chars(19%) shorter)
(D) Spanish: ( using basic Latin & Latin Supplement )
Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
u+0061 u+00F1 u+006F u+006C
DUDE-02: vAvrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmuMvgdt\
b3a3qd (61 chars)
LDUDE : uAurftmtg2q2hbrhcbbmfcepnjiimidpjdqpmrmuMuqmb3a3qd
(51 chars, 10 chars (16%) shorter)
(E) Czech: (using Latin Extended A)
Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
u+0065 u+0073 u+006B u+0079
DUDE-02: vAuctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (45 chars)
LDUDE : uAukfycypkfepzpzfmibmtb3m8ayiqtik (34 chars, 24% shorter )
Security considerations
ACE-encoded reordered code points are restored in reverse ACE translation
with no problem, and this improvement do not introdude any new security threat
in ACE.
References
[DUDE02] Mark Welter, Brian Spolarich, Adam Costello,
"DUDE: Differential Unicode Domain Encoding", 2001-May-31,
draft-ietf-idn-dude-02.
[AMCACEW] Adam Costello, "AMC-ACE-W version 0.1.0",
2001-May-31, draft-ietf-idn-amc-ace-w-00, latest version at
http://www.cs.berkeley.edu/~amc/charset/amc-ace-w.
[UNICODE] The Unicode Consortium, "The Unicode Standard",
http://www.unicode.org/unicode/standard/standard.html.
[IDNA] Patrik Falstrom, Paul Hoffman, "Internationalizing Host
Names In Applications (IDNA)", draft-ietf-idn-idna-01
[NAMEPREP] Paul Hoffman, Marc Blanchet, "Preparation of
Internationalized Host Names", Feb 2001,
draft-ietf-idn-nameprep-03
Author
Soobok Lee <lsb@postel.co.kr>
http://www.postel.co.kr
Example implementation
This idea is applicable to any ACEs.
LDUDE (in this example implementation) is merely a name for
DUDE implementation of this idea.
Embedded hangul jamo and Latin frequency tables are subject
to change with further studies in the next revision of this draft.
In Unix, save this example source code into ldude.c
% cc -o ldude ldude.c
% ./ldude -e < input_file > output_file
% ./ldude -d < output_file
A input file should contains u+????-form code points
delimited with spaces or newlines.
/******************************************/
/* ldude.c 0.1 (2001-Jun-11) */
/* Soobok Lee <lsb@postel.co.kr> */
/* Adam M. Costello <amc@cs.berkeley.edu> */
/******************************************/
/* This is ANSI C code (C89) implementing */
/* DUDE (draft-ietf-idn-ldude-01). */
/************************************************************/
/* Public interface (would normally go in its own .h file): */
#include <stdio.h>
#include <limits.h>
enum dude_status {
dude_success,
dude_bad_input,
dude_big_output /* Output would exceed the space provided. */
};
enum case_sensitivity { case_sensitive, case_insensitive };
#if UINT_MAX >= 0x1FFFFF
typedef unsigned int u_code_point;
#else
typedef unsigned long u_code_point;
#endif
enum dude_status dude_encode(
unsigned int input_length,
const u_code_point input[],
const unsigned char uppercase_flags[],
unsigned int *output_size,
char output[] );
/* dude_encode() converts Unicode to DUDE (without any */
/* signature). The input must be represented as an array */
/* of Unicode code points (not code units; surrogate pairs */
/* are not allowed), and the output will be represented as */
/* null-terminated ASCII. The input_length is the number of code */
/* points in the input. The output_size is an in/out argument: */
/* the caller must pass in the maximum number of characters */
/* that may be output (including the terminating null), and on */
/* successful return it will contain the number of characters */
/* actually output (including the terminating null, so it will be */
/* one more than strlen() would return, which is why it is called */
/* output_size rather than output_length). The uppercase_flags */
/* array must hold input_length boolean values, where nonzero */
/* means the corresponding Unicode character should be forced */
/* to uppercase after being decoded, and zero means it is */
/* caseless or should be forced to lowercase. Alternatively, */
/* uppercase_flags may be a null pointer, which is equivalent */
/* to all zeros. The encoder always outputs lowercase base-32 */
/* characters except when nonzero values of uppercase_flags */
/* require otherwise. The return value may be any of the */
/* dude_status values defined above; if not dude_success, then */
/* output_size and output may contain garbage. On success, the */
/* encoder will never need to write an output_size greater than */
/* input_length*k+1 if all the input code points are less than 1 */
/* << (4*k), because of how the encoding is defined. */
enum dude_status dude_decode(
enum case_sensitivity case_sensitivity,
char scratch_space[],
const char input[],
unsigned int *output_length,
u_code_point output[],
unsigned char uppercase_flags[] );
/* dude_decode() converts DUDE (without any signature) to */
/* Unicode. The input must be represented as null-terminated */
/* ASCII, and the output will be represented as an array of */
/* Unicode code points. The case_sensitivity argument influences */
/* the check on the well-formedness of the input string; it */
/* must be case_sensitive if case-sensitive comparisons are */
/* allowed on encoded strings, case_insensitive otherwise. */
/* The scratch_space must point to space at least as large */
/* as the input, which will get overwritten (this allows the */
/* decoder to avoid calling malloc()). The output_length is */
/* an in/out argument: the caller must pass in the maximum */
/* number of code points that may be output, and on successful */
/* return it will contain the actual number of code points */
/* output. The uppercase_flags array must have room for at */
/* least output_length values, or it may be a null pointer if */
/* the case information is not needed. A nonzero flag indicates */
/* that the corresponding Unicode character should be forced to */
/* uppercase by the caller, while zero means it is caseless or */
/* should be forced to lowercase. The return value may be any */
/* of the dude_status values defined above; if not dude_success, */
/* then output_length, output, and uppercase_flags may contain */
/* garbage. On success, the decoder will never need to write */
/* an output_length greater than the length of the input (not */
/* counting the null terminator), because of how the encoding is */
/* defined. */
/**********************************************************/
/* Implementation (would normally go in its own .c file): */
#include <string.h>
/* Character utilities: */
/* base32[q] is the lowercase base-32 character representing */
/* the number q from the range 0 to 31. Note that we cannot */
/* use string literals for ASCII characters because an ANSI C */
/* compiler does not necessarily use ASCII. */
static const char base32[] = {
97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, /* a-k */
109, 110, /* m-n */
112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, /* p-z */
50, 51, 52, 53, 54, 55, 56, 57 /* 2-9 */
};
/* base32_decode(c) returns the value of a base-32 character, in the */
/* range 0 to 31, or the constant base32_invalid if c is not a valid */
/* base-32 character. */
enum { base32_invalid = 32 };
static unsigned int base32_decode(char c)
{
if (c < 50) return base32_invalid;
if (c <= 57) return c - 26;
if (c < 97) c += 32;
if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
return c - 97 - (c > 108) - (c > 111);
}
/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
/* are equal, 1 otherwise. If case_sensitivity is case_insensitive, */
/* then ASCII A-Z are considered equal to a-z respectively. */
static int unequal( enum case_sensitivity case_sensitivity,
const char s1[], const char s2[] )
{
char c1, c2;
if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;
for (;;) {
c1 = *s1;
c2 = *s2;
if (c1 >= 65 && c1 <= 90) c1 += 32;
if (c2 >= 65 && c2 <= 90) c2 += 32;
if (c1 != c2) return 1;
if (c1 == 0) return 0;
++s1, ++s2;
}
}
/* Begin of Improvement Code by LSB */
/* LANGUAGE-SPECIFIC IMPROVEMENTS TO DUDE BASED ON CODE REORDERING */
/* Common Constants for Hangul */
static u_code_point
SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
LCount = 19, VCount = 21, TCount = 28, NCount, SCount;
static u_code_point NCount = 588; // VCount * TCount
static u_code_point SCount = 11172; // LCount * NCount
static u_code_point MCount = 399; // LCount * VCount
/* Hangul Jamo Frequency Mapping/Reverse Mapping Table */
int JAMO_L_FREQ[19][2] = {
{1,11}, // "G" 0
{14,0}, // "GG"
{9,9}, // "N"
{5,12}, // "D"
{15,7}, // "DD"
{13,3}, // "R" 5
{7,18}, // "M"
{4,6 }, // "B"
{16,14}, // "BB"
{2, 2}, // "S"
{17,17}, // "SS" 10
{0,16}, // "O"
{3,15}, // "J"
{18,5 }, // "JJ"
{8,1}, // "C"
{12,4}, // "K" 15
{11,8}, // "T"
{10,10}, // "P"
{6 ,13} // "H"
};
int JAMO_V_FREQ[21][2] = {
{2,20}, // "A" 0
{7,5}, // "AE"
{15,0}, // "YA"
{20,13}, // "YAE"
{5,18}, // "EO"
{1, 4}, // "E" 5
{9,8}, // "YEO"
{13,1},// "YE"
{6,13}, // "O"
{10, 6}, // "WA"
{18,9}, // "WAE" 10
{16,17}, // "OE"
{17,14}, // "YO"
{8, 7}, // "U"
{12,16}, // "WEO"
{19, 2}, // "WE" 15
{14,11}, // "WI"
{11,12}, // "YU"
{4,10}, // "EU"
{19,19}, // "YI"
{0,3} // "I" 20
};
int JAMO_T_FREQ[28][2] = {
{0,0}, // "" 0
{5,4}, // "G"
{15,21}, // "GG"
{16,8}, // "GS"
{1,16}, // "N"
{27,1 }, // "NJ" 5
{17,17}, // "NH"
{10,19}, // "D"
{3,24}, // "L"
{18,27}, // "LG"
{19, 7}, // "LM" 10
{20,22}, // "LB"
{21,25}, // "LS"
{22,26}, // "LT"
{23,27}, // "LP"
{24,2}, // "LH" 15
{4,3}, // "M"
{6,6}, // "B"
{25,9}, // "BS"
{7,10}, // "S"
{26,11}, // "SS" 20
{2,12}, // "NG"
{11,13}, // "J"
{10,14}, // "C"
{8 ,15}, // "K"
{12,18}, // "T" 25
{13,20}, // "P"
{9 ,5} // "H" 27
};
/* Hangul Decomposition */
int isHANGUL(u_code_point s) {
int SIndex = s - SBase;
if (SIndex < 0 || SIndex >= SCount) {
return 0;
}
return 1;
};
int isUNIHAN(u_code_point s) {
if (s >= 0x4E00 && s <= 0x9FAF) {
return 1;
}
return 0;
};
int isLatins(u_code_point s) {
if (s < 0x370) {
return 1;
}
return 0;
};
u_code_point reorder_hangul(u_code_point s) {
u_code_point SIndex = s - SBase;
int zL = JAMO_L_FREQ[SIndex / NCount][0];
int zV = JAMO_V_FREQ[(SIndex % NCount) / TCount][0];
int zT = JAMO_T_FREQ[SIndex % TCount][0];
int idx= (zT*MCount + zV*LCount + zL);
if(idx < SCount-0x400) {
idx += 0x400;
} else {
idx = idx - (SCount -0x400);
};
return idx + SBase;
}
u_code_point restore_order_hangul(u_code_point z) {
int T,V,L;
u_code_point zIndex = z-SBase;
if(zIndex < 0x400) {
zIndex += SCount - 0x400;
} else {
zIndex = zIndex - 0x400;
};
T = JAMO_T_FREQ[(zIndex / MCount)][1];
V = JAMO_V_FREQ[(zIndex % MCount) / LCount][1];
L = JAMO_L_FREQ[(zIndex % LCount)][1];
return (L*NCount + V*TCount + T) + SBase;
}
u_code_point reorder_unihan(u_code_point s) {
return s;
}
u_code_point restore_order_unihan(u_code_point z) {
return z;
}
#define MAPCHAR(x,A,B,bytes) if(A<=x && x< (A+bytes)) return(x+(B-A)); if(B<=x && x< (B+bytes)) return(x+(A-B))
#define MAP16BL(x,A,B,block) if(A<=x && x< (A+(block<<4))) return(x+(B-A)); if(B<=x && x< (B+(block<<4))) return(x+(A-B))
u_code_point reorder_latins(u_code_point s) {
MAP16BL(s,0x0100,0x0000,3); // Latin Extension A
MAP16BL(s,0x0130,0x0080,2);
MAP16BL(s,0x0150,0x00A0,1);
MAP16BL(s,0x0300,0x00B0,3); // Combining Diacritical Marks
MAPCHAR(s,0x0070,0x0060,1); // p,`
MAPCHAR(s,0x0072,0x006A,1); // r,j
MAPCHAR(s,0x0073,0x006B,1); // s,k
MAPCHAR(s,0x0074,0x0066,1); // t,f
MAPCHAR(s,0x0075,0x0067,1); // u,g
MAPCHAR(s,0x0050,0x0040,1); // P,@ UPPER
MAPCHAR(s,0x0052,0x004A,1); // R,J UPPER
MAPCHAR(s,0x0053,0x004B,1); // S,K UPPER
MAPCHAR(s,0x0054,0x0046,1); // T,F UPPER
MAPCHAR(s,0x0055,0x0047,1); // U,G UPPER
MAPCHAR(s,0x0160,0x003A,6); // Latin Extension A
MAPCHAR(s,0x0166,0x005B,5);
MAPCHAR(s,0x016B,0x007B,5);
return s;
}
u_code_point restore_order_latins(u_code_point z) {
return reorder_latins(z);
}
u_code_point reorder(u_code_point s) {
if(isHANGUL(s)) return reorder_hangul(s);
if(isUNIHAN(s)) return reorder_unihan(s);
if(isLatins(s)) return reorder_latins(s);
return s;
}
u_code_point restore_order(u_code_point s) {
if(isHANGUL(s)) return restore_order_hangul(s);
if(isUNIHAN(s)) return restore_order_unihan(s);
if(isLatins(s)) return restore_order_latins(s);
return s;
}
/* End of Improvement Code */
/* Encoder: */
enum dude_status dude_encode(
unsigned int input_length,
const u_code_point input[],
const unsigned char uppercase_flags[],
unsigned int *output_size,
char output[] )
{
unsigned int max_out, in, out, k, j;
u_code_point prev, codept, diff, tmp;
char shift;
prev = 0x60;
max_out = *output_size;
for (in = out = 0; in < input_length; ++in) {
/* At the start of each iteration, in and out are the number of */
/* items already input/output, or equivalently, the indices of */
/* the next items to be input/output. */
codept = input[in];
if (codept == 0x2D) {
/* Hyphen-minus stands for itself. */
if (max_out - out < 1) return dude_big_output;
output[out++] = 0x2D;
continue;
}
codept = reorder(codept); // by LSB
diff = prev^codept;
/* Compute the number of base-32 characters (k): */
for (tmp = diff >> 4, k = 1; tmp != 0; ++k, tmp >>= 4);
//fprintf(stderr,"diff %x,%x = prev %x ^ codept %x \n", k,diff,prev,codept);
if (max_out - out < k) return dude_big_output;
shift = uppercase_flags && uppercase_flags[in] ? 32 : 0;
/* shift controls the case of the last base-32 digit. */
/* Each quintet has the form 1xxxx except the last is 0xxxx. */
/* Computing the base-32 digits in reverse order is easiest. */
out += k;
output[out - 1] = base32[diff & 0xF] - shift;
for (j = 2; j <= k; ++j) {
diff >>= 4;
output[out - j] = base32[0x10 | (diff & 0xF)];
}
prev = codept;
}
/* Append the null terminator: */
if (max_out - out < 1) return dude_big_output;
output[out++] = 0;
*output_size = out;
return dude_success;
}
/* Decoder: */
enum dude_status dude_decode(
enum case_sensitivity case_sensitivity,
char scratch_space[],
const char input[],
unsigned int *output_length,
u_code_point output[],
unsigned char uppercase_flags[] )
{
u_code_point prev, q, diff;
char c;
unsigned int max_out, in, out, scratch_size;
enum dude_status status;
prev = 0x60;
max_out = *output_length;
for (c = input[in = 0], out = 0; c != 0; c = input[++in], ++out) {
/* At the start of each iteration, in and out are the number of */
/* items already input/output, or equivalently, the indices of */
/* the next items to be input/output. */
if (max_out - out < 1) return dude_big_output;
if (c == 0x2D) output[out] = c; /* hyphen-minus is literal */
else {
/* Base-32 sequence. Decode quintets until 0xxxx is found: */
for (diff = 0; ; c = input[++in]) {
q = base32_decode(c);
if (q == base32_invalid){ return dude_bad_input; };
diff = (diff << 4) | (q & 0xF);
if (q >> 4 == 0) break;
}
// prev = output[out] = prev ^ diff; commented by LSB
prev = prev ^ diff;
output[out] = restore_order(prev); // LSB
}
/* Case of last character determines uppercase flag: */
if (uppercase_flags) uppercase_flags[out] = c >= 65 && c <= 90;
}
/* Enforce the uniqueness of the encoding by re-encoding */
/* the output and comparing the result to the input: */
scratch_size = ++in;
status = dude_encode(out, output, uppercase_flags,
&scratch_size, scratch_space);
if (status != dude_success || scratch_size != in ||
unequal(case_sensitivity, scratch_space, input)
) return dude_bad_input;
*output_length = out;
return dude_success;
}
/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a */
/* command-line option. */
enum {
unicode_max_length = 256,
ace_max_size = 256,
test_case_sensitivity = case_insensitive
/* suitable for host names */
};
static void usage(char **argv)
{
fprintf(stderr,
"%s -e reads code points and writes a DUDE string.\n"
"%s -d reads a DUDE string and writes code points.\n"
"Input and output are plain text in the native character set.\n"
"Code points are in the form u+hex separated by whitespace.\n"
"A DUDE string is a newline-terminated sequence of LDH characters\n"
"(without any signature).\n"
"The case of the u in u+hex is the force-to-uppercase flag.\n"
, argv[0], argv[0]);
exit(EXIT_FAILURE);
}
static void fail(const char *msg)
{
fputs(msg,stderr);
exit(EXIT_FAILURE);
}
static const char too_big[] =
"input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";
/* The following string is used to convert LDH */
/* characters between ASCII and the native charset: */
static const char ldh_ascii[] =
"................"
"................"
".............-.."
"0123456789......"
".ABCDEFGHIJKLMNO"
"PQRSTUVWXYZ....."
".abcdefghijklmno"
"pqrstuvwxyz";
int main(int argc, char **argv)
{
enum dude_status status;
int r;
char *p;
if (argc != 2) usage(argv);
if (argv[1][0] != '-') usage(argv);
if (argv[1][2] != 0) usage(argv);
if (argv[1][1] == 'e') {
u_code_point input[unicode_max_length];
unsigned long codept;
unsigned char uppercase_flags[unicode_max_length];
char output[ace_max_size], uplus[3];
unsigned int input_length, output_size, i;
/* Read the input code points: */
input_length = 0;
for (;;) {
r = scanf("%2s%lx", uplus, &codept);
if (ferror(stdin)) fail(io_error);
if (r == EOF || r == 0) break;
if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
fail(invalid_input);
}
if (input_length == unicode_max_length) fail(too_big);
if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
else fail(invalid_input);
input[input_length++] = codept;
}
/* Encode: */
output_size = ace_max_size;
status = dude_encode(input_length, input, uppercase_flags,
&output_size, output);
if (status == dude_bad_input) fail(invalid_input);
if (status == dude_big_output) fail(too_big);
assert(status == dude_success);
/* Convert to native charset and output: */
for (p = output; *p != 0; ++p) {
i = *p;
assert(i <= 122 && ldh_ascii[i] != '.');
*p = ldh_ascii[i];
}
r = puts(output);
if (r == EOF) fail(io_error);
return EXIT_SUCCESS;
}
if (argv[1][1] == 'd') {
char input[ace_max_size], scratch[ace_max_size], *pp;
u_code_point output[unicode_max_length];
unsigned char uppercase_flags[unicode_max_length];
unsigned int input_length, output_length, i;
/* Read the DUDE input string and convert to ASCII: */
fgets(input, ace_max_size, stdin);
if (ferror(stdin)) fail(io_error);
if (feof(stdin)) fail(invalid_input);
input_length = strlen(input);
if (input[input_length - 1] != '\n') fail(too_big);
input[--input_length] = 0;
for (p = input; *p != 0; ++p) {
pp = strchr(ldh_ascii, *p);
if (pp == 0) fail(invalid_input);
*p = pp - ldh_ascii;
}
/* Decode: */
output_length = unicode_max_length;
status = dude_decode(test_case_sensitivity, scratch, input,
&output_length, output, uppercase_flags);
if (status == dude_bad_input) fail(invalid_input);
if (status == dude_big_output) fail(too_big);
assert(status == dude_success);
/* Output the result: */
for (i = 0; i < output_length; ++i) {
r = printf("%s+%04lX\n",
uppercase_flags[i] ? "U" : "u",
(unsigned long) output[i] );
if (r < 0) fail(io_error);
}
return EXIT_SUCCESS;
}
usage(argv);
return EXIT_SUCCESS; /* not reached, but quiets compiler warning */
}