[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [idn] [draft] Improving ACE using code point reordering v0.5
Well, some techniques could be used to make it compress better (for example,
look at some of the techniques in the "binary-ordered compression for
Unicode" --
http://www-106.ibm.com/developerworks/unicode/library/u-binary.html), but I
doubt that that the result will remain as simple to implement.
Mark
----- Original Message -----
From: "Soobok Lee" <lsb@postel.co.kr>
To: <idn@ops.ietf.org>
Cc: <lsb@postel.co.kr>; <internet-drafts@ietf.org>
Sent: Monday, June 25, 2001 21:57
Subject: [idn] [draft] Improving ACE using code point reordering v0.5
>
> Could DUDE be improved for Hangul,Vietnames and European languages
> without hurting its simplicity and elegance? Yes! :-)
>
> THis draft will appear in IDN Intenet-draft directory soon.
>
> Soobok Lee
> lsb@postel.co.kr
>
> --------------------------------------------------------------------------
-
>
>
> INTERNET-DRAFT Soobok
Lee
> draft-ietf-idn-lsb-00.txt
> Expires 2001-Dec-26
2001-Jun-26
>
> Improving ACE using code point reordering v0.5
>
> Status of this Memo
>
> This document is an Internet-Draft and is in full conformance with
> all provisions of Section 10 of RFC2026.
>
> Internet-Drafts are working documents of the Internet Engineering
> Task Force (IETF), its areas, and its working groups. Note
> that other groups may also distribute working documents as
> Internet-Drafts.
>
> Internet-Drafts are draft documents valid for a maximum of six
> months and may be updated, replaced, or obsoleted by other documents
> at any time. It is inappropriate to use Internet-Drafts as
> reference material or to cite them other than as "work in progress."
>
> The list of current Internet-Drafts can be accessed at
> http://www.ietf.org/ietf/1id-abstracts.txt
>
> The list of Internet-Draft Shadow Directories can be accessed at
> http://www.ietf.org/shadow.html
>
> Distribution of this document is unlimited. Please send comments to
> the authors or to the idn working group at idn@ops.ietf.org.
>
>
> Abstract
>
> This draft describes statistically-backed Unicode code point
reordering
> to improve ACE compression. This code point reordering may be inserted
> into existing ACE algorithms with only 1~2 lines of modification.
> This reordering requires simple character mapping or range-wide
mapping
> which are simple and easy to implement.
>
> Experiments with DUDE implementation of this idea show great
> improvements for Hangul, Vietnames and European domains.
>
>
> Contents
>
> Overview
> Hangul
> Basic Latin
> Extended Latin and Combining Diacritical Marks
> Unified Han
> Other character sets
> Modified Encoding procedure of DUDE implementation of this idea
> Modified Decoding procedure of DUDE implementation of this idea
> Example strings
> Security considerations
> References
> Author
> LDUDE: Example implementation into DUDE
>
>
>
> Overview
>
> Pursuing shorter ACE labels is justified to save memory resources and
> to reduce internet traffic even for domains of average length
> in various application/core internet protocols.
>
> Both 11172 Hangul syllables and 24000 or more CJK Han syllables
> occupy roughly half of the entire unicode space.
> Their lexicographical ordering( not in frequency ordering) makes
> various ACE compression technique work poorly for them. Frequently
> used syllables are spread evenly through out those wide ranges.
>
> Even, Latin characters code range, including 'a' - 'z' has
> lexicographical order that does not reflect the fact that 's', 't' and
> 'r' are more frequently used than 'j','k' and 'h'.
>
> Each Hangul code point integer have useful bit structure in it.
> That reflects its hangul jamo(consonant/vowel) combination which
> provides us the hint for applying reordering by usage frequency.
>
> Most ACE algorithms show good compression ratio when frequently used
> characters are re-located into narrower code areas.
> Especially, to reduce DUDE XOR distance, we can make the narrow area
> fit in 16,256 and 4096 bytes-long page boundaries.
>
>
> Hangul
>
> Each code point integer of modern hangul syllables (u+ac00 ~ u+bd7f)
can be
> decomposed to produce the following 3 jamo(hangul consonant/vowel)
indices,
> 1) L: a leading consonant index in modern 19 leading consonants
array,
> 2) V: a vowel index in modern 21 vowels array,
> 3) T: an optional trailing consonant index in modern 28 trailing
> consonants array.
>
> In fact, Unicode hangul range (u+ac00 ~ u+bd7f) is a 3-dimensional
array
> Hangul[L][V][T] with base 0xac00 and each hangul syllable has
> code point integer determined by L*21*28 + V*28 + T + 0xac00.
>
> Statistically, roughly six of 28 hangul trailing consonants are
frequent in
> korean nouns. Leading consontants and vowels show relatively even
> frequency distributions compared with that of trailing consonants.
>
> We can make a frequency mapping table f(T) for T so that
> f(T)= FrequencyOrderOf(T) in the array of modern trailing consonants.
>
> Now, Hangul[L][V][T] is to be reordered into another 3-dimensional
array
> Hangul[f(T)][f(V)][f(L)] with index T augmented to the highest order.
>
> Each hangul syllable in this reorderd range has a new code point value
> determined by f(T)*21*19 + f(V)*19 + f(L) + 0xac00.
>
> Average f(T) value should be sufficiently lower than f(L) and F(V), so
that
> the lowered code point value and induced lowered successive
XOR-differences
> contribute to produce shorter ACE labes.
>
>
> Basic Latin
>
> Basic Latin row u+0070 ~ u+007f has 'p','r','s','t' and 'u' which
> are more frequently used in European nouns than '`','j','k','f' and
'g'
> in u+0060 ~ u+006f row which includes most frequently used 'a'~'o' .
>
> Let's swap two sets of these 5 characters by character by character
basis,
> so that 'p','r','s','t','u' go into the u+0060 ~ u+006f row.
>
> Single row fits in 16 bytes boundary and any character sequences
> from only this single row make XOR-distance or code window length
> shorter than 0x10 and make DUDE and other ACEs do good compression.
>
>
> Extended Latin and Combining Diacritical Marks
>
> First 6 rows from Latin Extension A(u+0100 ~ u+015f) and 6 rows from
> Basic Latin & Latin-1 Supplement (u+0000 ~ u+002f and u+0080 ~ u+00a0)
> are swapped.
> First 3 rows from Combining Diacritical Marks(u+0300 ~ u+032f) and
> 3 rows from Latin-1 Supplement (u+00B0 ~ u+00df) are also swapped.
>
> This makes frequently used parts of Latin Extended-A and
> Combining Diacritical Marks go into first 256-bytes page(u+0000 ~
u+00ff).
> Any character sequences from this single 256-bytes-long page make
> XOR-distance or code window length much shorter than 0x100.
>
> This improvement benefits especially East-European and Vietnamese
> domains which use Latin Extented A and Combining Diacritical Marks.
>
>
> Unified Han
>
> I have no statistical data to optimize for this code ranges, yet.
>
> CJK Unified Han syllables (u+4e00 ~ u+a48f) are ordered by their
radicals.
> We could consider casting most frequently used 256 Han syllables into
> single 256-bytes page.
> We could consider casting most frequently used 4096 Han syllables into
> single 4096-bytes page ,too.
>
>
> Other character sets
>
> I have no statistical data to optimize for this code ranges, yet.
>
> Japapanes gatakana and hiragana (u+3040 ~ u+30ff) code points have
> lexicographical order and if we could put most frequent ones in single
row,
> it may greatly improve gatakana-only or hiragana-only domains.
>
> For Arabic,Hebrew,Cyrllic and Hindi, we could devise similiar
optimizations,
> if relavent statistical data are avaiable.
>
>
> Modified Encoding procedure of DUDE implementation of this idea
>
> All ordering of nybbles and quintets is big-endian (most significant
> first). A nybble is 4 bits. XOR is bitwise exclusive or.
>
> This modification are hyphen-safe.
> Hyphen encoding/decoding is not affected by this modification.
>
> let prev = 96
> for each input integer n (in order) do begin
> if n == 45 then output hyphen minus
> else begin
>
> n = reorder(n) // ******** ADDED **********
>
> let diff = prev XOR n
> extract the least significant nybbles of diff, as few as are
> sufficient to hold all the nonzero bits (but at least one)
> prepend 0 to the last nybble and 1 to the rest
> output base-32 characters corresponding to the quintets
> let prev = n
> end
> end
>
> The encoder must either correctly handle all integer values that can
> be represented in the type of its input, or it must check whether
> the input contains values that it cannot handle and return an error
> if so. Under no circumstances may it produce incorrect output.
>
>
> Modified Decoding procedure of DUDE implementation of this idea
>
> let prev = 96
> while the input string is not exhausted do begin
> if the next character is hyphen-minus then output 45
> else begin
> input characters and convert them to quintets until
> encountering a quintet beginning with 0
> fail upon encountering a non-base-32 character or end-of-input
> strip the first bit of each quintet
> concatenate the resulting nybbles to form diff
> let prev = prev XOR diff
>
> output restore_order(prev) // ******** MODIFIED **********
>
> end
> end
> encode the output sequence and compare it to the input string
> fail if they are not equal
>
>
>
> Example strings
>
> about 20% improvement in DUDE compression ratio is measured in these
> Hangul examples.
>
> (A) Korean Strings 1: ( 24 hangul syllables )
> U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
> U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
> U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C
>
> DUDE-02: 6txIy79Ny53Nz79A8wIzwwNzzuAvyIzv3AtuuIz2vBy27Jz66Iz8sI\
> tusAuIyz5I23Az96Iz6zE3xAz2tD96Ry3sI ( 89 chars )
> LDUDE : 5szf8pt3bttat3jt6iywhu3bw9qt5m37r2vmxxjxsg2mtvat3auykz\
> wpxubc8xd42fw6p ( 70 chars )
>
>
> (B) Korean Strings 2: ( 6 hangul syllables )
> U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC
>
> DUDE-02: 6txiy79ny53nz79a8wizwwn ( 24 chars )
> LDUDE : 5szf8pt3bttat3jt6i ( 19 chars )
>
>
>
> LDUDE improvents show better compression ratios for other scripts.
>
> (C) Vietnamese: ( 38 syllables using diacritical marks )
> Ta<dotbelow>isaoho<dotbelow>kh<ocirc>ngth<ecirc><hookabove>chi\
> <hookabove>no<acute>iti<ecirc><acute>ngVi<ecirc><dotbelow>t
> U+0054 u+0061 u+0323 u+0069 u+0073 u+0061 u+006F u+0068 u+006F
> u+0323 u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+00EA
> u+0309 u+0063 u+0068 u+0069 u+0309 u+006E u+006F u+0301 u+0069
> u+0074 u+0069 u+00EA u+0301 u+006E u+0067 U+0056 u+0069 u+00EA
> u+0323 u+0074
>
> DUDE-02: vEvfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\
> yitptp2dv8mvyrjvBvr2dv6jvxh ( 82 chars )
> LDUDE : uGuh5c5kckqhh5n4atm3n3ktmtdq2cxd7kmb7a7hb7q7irr2dxm7rt\
> muDvr2dvj5f (66 chars , 16 chars(19%) shorter)
>
>
> (D) Spanish: ( using basic Latin & Latin Supplement )
> Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
> U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
> u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
> u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
> u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
> u+0061 u+00F1 u+006F u+006C
>
> DUDE-02: vAvrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmuMvgdt\
> b3a3qd (61 chars)
> LDUDE : uAurftmtg2q2hbrhcbbmfcepnjiimidpjdqpmrmuMuqmb3a3qd
> (51 chars, 10 chars (16%) shorter)
>
>
> (E) Czech: (using Latin Extended A)
> Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
> U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
> u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
> u+0065 u+0073 u+006B u+0079
>
> DUDE-02: vAuctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (45 chars)
>
> LDUDE : uAukfycypkfepzpzfmibmtb3m8ayiqtik (34 chars, 24%
shorter )
>
>
>
> Security considerations
>
> ACE-encoded reordered code points are restored in reverse ACE
translation
> with no problem, and this improvement do not introdude any new
security threat
> in ACE.
>
>
> References
>
> [DUDE02] Mark Welter, Brian Spolarich, Adam Costello,
> "DUDE: Differential Unicode Domain Encoding", 2001-May-31,
> draft-ietf-idn-dude-02.
>
> [AMCACEW] Adam Costello, "AMC-ACE-W version 0.1.0",
> 2001-May-31, draft-ietf-idn-amc-ace-w-00, latest version at
> http://www.cs.berkeley.edu/~amc/charset/amc-ace-w.
>
>
> [UNICODE] The Unicode Consortium, "The Unicode Standard",
> http://www.unicode.org/unicode/standard/standard.html.
>
> [IDNA] Patrik Falstrom, Paul Hoffman, "Internationalizing Host
> Names In Applications (IDNA)", draft-ietf-idn-idna-01
>
> [NAMEPREP] Paul Hoffman, Marc Blanchet, "Preparation of
> Internationalized Host Names", Feb 2001,
> draft-ietf-idn-nameprep-03
>
>
>
> Author
>
> Soobok Lee <lsb@postel.co.kr>
> http://www.postel.co.kr
>
>
>
> Example implementation
>
> This idea is applicable to any ACEs.
> LDUDE (in this example implementation) is merely a name for
> DUDE implementation of this idea.
>
> Embedded hangul jamo and Latin frequency tables are subject
> to change with further studies in the next revision of this draft.
>
> In Unix, save this example source code into ldude.c
>
> % cc -o ldude ldude.c
> % ./ldude -e < input_file > output_file
> % ./ldude -d < output_file
>
> A input file should contains u+????-form code points
> delimited with spaces or newlines.
>
>
> /******************************************/
> /* ldude.c 0.1 (2001-Jun-11) */
> /* Soobok Lee <lsb@postel.co.kr> */
> /* Adam M. Costello <amc@cs.berkeley.edu> */
> /******************************************/
>
> /* This is ANSI C code (C89) implementing */
> /* DUDE (draft-ietf-idn-ldude-01). */
>
>
> /************************************************************/
> /* Public interface (would normally go in its own .h file): */
>
> #include <stdio.h>
> #include <limits.h>
>
> enum dude_status {
> dude_success,
> dude_bad_input,
> dude_big_output /* Output would exceed the space provided. */
> };
>
> enum case_sensitivity { case_sensitive, case_insensitive };
>
> #if UINT_MAX >= 0x1FFFFF
> typedef unsigned int u_code_point;
> #else
> typedef unsigned long u_code_point;
> #endif
>
> enum dude_status dude_encode(
> unsigned int input_length,
> const u_code_point input[],
> const unsigned char uppercase_flags[],
> unsigned int *output_size,
> char output[] );
>
> /* dude_encode() converts Unicode to DUDE (without any */
> /* signature). The input must be represented as an array */
> /* of Unicode code points (not code units; surrogate pairs */
> /* are not allowed), and the output will be represented as */
> /* null-terminated ASCII. The input_length is the number of code */
> /* points in the input. The output_size is an in/out argument: */
> /* the caller must pass in the maximum number of characters */
> /* that may be output (including the terminating null), and on */
> /* successful return it will contain the number of characters */
> /* actually output (including the terminating null, so it will be */
> /* one more than strlen() would return, which is why it is called */
> /* output_size rather than output_length). The uppercase_flags */
> /* array must hold input_length boolean values, where nonzero */
> /* means the corresponding Unicode character should be forced */
> /* to uppercase after being decoded, and zero means it is */
> /* caseless or should be forced to lowercase. Alternatively, */
> /* uppercase_flags may be a null pointer, which is equivalent */
> /* to all zeros. The encoder always outputs lowercase base-32 */
> /* characters except when nonzero values of uppercase_flags */
> /* require otherwise. The return value may be any of the */
> /* dude_status values defined above; if not dude_success, then */
> /* output_size and output may contain garbage. On success, the */
> /* encoder will never need to write an output_size greater than */
> /* input_length*k+1 if all the input code points are less than 1 */
> /* << (4*k), because of how the encoding is defined. */
>
> enum dude_status dude_decode(
> enum case_sensitivity case_sensitivity,
> char scratch_space[],
> const char input[],
> unsigned int *output_length,
> u_code_point output[],
> unsigned char uppercase_flags[] );
>
> /* dude_decode() converts DUDE (without any signature) to */
> /* Unicode. The input must be represented as null-terminated */
> /* ASCII, and the output will be represented as an array of */
> /* Unicode code points. The case_sensitivity argument influences */
> /* the check on the well-formedness of the input string; it */
> /* must be case_sensitive if case-sensitive comparisons are */
> /* allowed on encoded strings, case_insensitive otherwise. */
> /* The scratch_space must point to space at least as large */
> /* as the input, which will get overwritten (this allows the */
> /* decoder to avoid calling malloc()). The output_length is */
> /* an in/out argument: the caller must pass in the maximum */
> /* number of code points that may be output, and on successful */
> /* return it will contain the actual number of code points */
> /* output. The uppercase_flags array must have room for at */
> /* least output_length values, or it may be a null pointer if */
> /* the case information is not needed. A nonzero flag indicates */
> /* that the corresponding Unicode character should be forced to */
> /* uppercase by the caller, while zero means it is caseless or */
> /* should be forced to lowercase. The return value may be any */
> /* of the dude_status values defined above; if not dude_success, */
> /* then output_length, output, and uppercase_flags may contain */
> /* garbage. On success, the decoder will never need to write */
> /* an output_length greater than the length of the input (not */
> /* counting the null terminator), because of how the encoding is */
> /* defined. */
>
>
> /**********************************************************/
> /* Implementation (would normally go in its own .c file): */
>
> #include <string.h>
>
> /* Character utilities: */
>
> /* base32[q] is the lowercase base-32 character representing */
> /* the number q from the range 0 to 31. Note that we cannot */
> /* use string literals for ASCII characters because an ANSI C */
> /* compiler does not necessarily use ASCII. */
>
> static const char base32[] = {
> 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, /* a-k */
> 109, 110, /* m-n */
> 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, /* p-z */
> 50, 51, 52, 53, 54, 55, 56, 57 /* 2-9 */
> };
>
> /* base32_decode(c) returns the value of a base-32 character, in the */
> /* range 0 to 31, or the constant base32_invalid if c is not a valid */
> /* base-32 character. */
>
> enum { base32_invalid = 32 };
>
> static unsigned int base32_decode(char c)
> {
> if (c < 50) return base32_invalid;
> if (c <= 57) return c - 26;
> if (c < 97) c += 32;
> if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
> return c - 97 - (c > 108) - (c > 111);
> }
>
> /* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
> /* are equal, 1 otherwise. If case_sensitivity is case_insensitive, */
> /* then ASCII A-Z are considered equal to a-z respectively. */
>
> static int unequal( enum case_sensitivity case_sensitivity,
> const char s1[], const char s2[] )
> {
> char c1, c2;
>
> if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;
>
> for (;;) {
> c1 = *s1;
> c2 = *s2;
> if (c1 >= 65 && c1 <= 90) c1 += 32;
> if (c2 >= 65 && c2 <= 90) c2 += 32;
> if (c1 != c2) return 1;
> if (c1 == 0) return 0;
> ++s1, ++s2;
> }
> }
>
>
>
> /* Begin of Improvement Code by LSB */
> /* LANGUAGE-SPECIFIC IMPROVEMENTS TO DUDE BASED ON CODE REORDERING */
>
> /* Common Constants for Hangul */
> static u_code_point
> SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
> LCount = 19, VCount = 21, TCount = 28, NCount, SCount;
> static u_code_point NCount = 588; // VCount * TCount
> static u_code_point SCount = 11172; // LCount * NCount
> static u_code_point MCount = 399; // LCount * VCount
>
> /* Hangul Jamo Frequency Mapping/Reverse Mapping Table */
> int JAMO_L_FREQ[19][2] = {
> {1,11}, // "G" 0
> {14,0}, // "GG"
> {9,9}, // "N"
> {5,12}, // "D"
> {15,7}, // "DD"
> {13,3}, // "R" 5
> {7,18}, // "M"
> {4,6 }, // "B"
> {16,14}, // "BB"
> {2, 2}, // "S"
> {17,17}, // "SS" 10
> {0,16}, // "O"
> {3,15}, // "J"
> {18,5 }, // "JJ"
> {8,1}, // "C"
> {12,4}, // "K" 15
> {11,8}, // "T"
> {10,10}, // "P"
> {6 ,13} // "H"
> };
>
> int JAMO_V_FREQ[21][2] = {
> {2,20}, // "A" 0
> {7,5}, // "AE"
> {15,0}, // "YA"
> {20,13}, // "YAE"
> {5,18}, // "EO"
> {1, 4}, // "E" 5
> {9,8}, // "YEO"
> {13,1},// "YE"
> {6,13}, // "O"
> {10, 6}, // "WA"
> {18,9}, // "WAE" 10
> {16,17}, // "OE"
> {17,14}, // "YO"
> {8, 7}, // "U"
> {12,16}, // "WEO"
> {19, 2}, // "WE" 15
> {14,11}, // "WI"
> {11,12}, // "YU"
> {4,10}, // "EU"
> {19,19}, // "YI"
> {0,3} // "I" 20
> };
>
> int JAMO_T_FREQ[28][2] = {
> {0,0}, // "" 0
> {5,4}, // "G"
> {15,21}, // "GG"
> {16,8}, // "GS"
> {1,16}, // "N"
> {27,1 }, // "NJ" 5
> {17,17}, // "NH"
> {10,19}, // "D"
> {3,24}, // "L"
> {18,27}, // "LG"
> {19, 7}, // "LM" 10
> {20,22}, // "LB"
> {21,25}, // "LS"
> {22,26}, // "LT"
> {23,27}, // "LP"
> {24,2}, // "LH" 15
> {4,3}, // "M"
> {6,6}, // "B"
> {25,9}, // "BS"
> {7,10}, // "S"
> {26,11}, // "SS" 20
> {2,12}, // "NG"
> {11,13}, // "J"
> {10,14}, // "C"
> {8 ,15}, // "K"
> {12,18}, // "T" 25
> {13,20}, // "P"
> {9 ,5} // "H" 27
> };
>
> /* Hangul Decomposition */
>
>
> int isHANGUL(u_code_point s) {
> int SIndex = s - SBase;
> if (SIndex < 0 || SIndex >= SCount) {
> return 0;
> }
> return 1;
> };
> int isUNIHAN(u_code_point s) {
> if (s >= 0x4E00 && s <= 0x9FAF) {
> return 1;
> }
> return 0;
> };
> int isLatins(u_code_point s) {
> if (s < 0x370) {
> return 1;
> }
> return 0;
> };
>
> u_code_point reorder_hangul(u_code_point s) {
> u_code_point SIndex = s - SBase;
> int zL = JAMO_L_FREQ[SIndex / NCount][0];
> int zV = JAMO_V_FREQ[(SIndex % NCount) / TCount][0];
> int zT = JAMO_T_FREQ[SIndex % TCount][0];
> int idx= (zT*MCount + zV*LCount + zL);
> if(idx < SCount-0x400) {
> idx += 0x400;
> } else {
> idx = idx - (SCount -0x400);
> };
> return idx + SBase;
> }
>
> u_code_point restore_order_hangul(u_code_point z) {
> int T,V,L;
> u_code_point zIndex = z-SBase;
> if(zIndex < 0x400) {
> zIndex += SCount - 0x400;
> } else {
> zIndex = zIndex - 0x400;
> };
> T = JAMO_T_FREQ[(zIndex / MCount)][1];
> V = JAMO_V_FREQ[(zIndex % MCount) / LCount][1];
> L = JAMO_L_FREQ[(zIndex % LCount)][1];
> return (L*NCount + V*TCount + T) + SBase;
> }
>
> u_code_point reorder_unihan(u_code_point s) {
> return s;
> }
>
> u_code_point restore_order_unihan(u_code_point z) {
> return z;
> }
>
>
> #define MAPCHAR(x,A,B,bytes) if(A<=x && x< (A+bytes))
return(x+(B-A)); if(B<=x && x< (B+bytes)) return(x+(A-B))
> #define MAP16BL(x,A,B,block) if(A<=x && x< (A+(block<<4)))
return(x+(B-A)); if(B<=x && x< (B+(block<<4))) return(x+(A-B))
>
> u_code_point reorder_latins(u_code_point s) {
> MAP16BL(s,0x0100,0x0000,3); // Latin Extension A
> MAP16BL(s,0x0130,0x0080,2);
> MAP16BL(s,0x0150,0x00A0,1);
> MAP16BL(s,0x0300,0x00B0,3); // Combining Diacritical Marks
> MAPCHAR(s,0x0070,0x0060,1); // p,`
> MAPCHAR(s,0x0072,0x006A,1); // r,j
> MAPCHAR(s,0x0073,0x006B,1); // s,k
> MAPCHAR(s,0x0074,0x0066,1); // t,f
> MAPCHAR(s,0x0075,0x0067,1); // u,g
> MAPCHAR(s,0x0050,0x0040,1); // P,@ UPPER
> MAPCHAR(s,0x0052,0x004A,1); // R,J UPPER
> MAPCHAR(s,0x0053,0x004B,1); // S,K UPPER
> MAPCHAR(s,0x0054,0x0046,1); // T,F UPPER
> MAPCHAR(s,0x0055,0x0047,1); // U,G UPPER
> MAPCHAR(s,0x0160,0x003A,6); // Latin Extension A
> MAPCHAR(s,0x0166,0x005B,5);
> MAPCHAR(s,0x016B,0x007B,5);
> return s;
> }
>
> u_code_point restore_order_latins(u_code_point z) {
> return reorder_latins(z);
> }
>
> u_code_point reorder(u_code_point s) {
> if(isHANGUL(s)) return reorder_hangul(s);
> if(isUNIHAN(s)) return reorder_unihan(s);
> if(isLatins(s)) return reorder_latins(s);
> return s;
> }
> u_code_point restore_order(u_code_point s) {
> if(isHANGUL(s)) return restore_order_hangul(s);
> if(isUNIHAN(s)) return restore_order_unihan(s);
> if(isLatins(s)) return restore_order_latins(s);
> return s;
> }
>
> /* End of Improvement Code */
>
> /* Encoder: */
>
> enum dude_status dude_encode(
> unsigned int input_length,
> const u_code_point input[],
> const unsigned char uppercase_flags[],
> unsigned int *output_size,
> char output[] )
> {
> unsigned int max_out, in, out, k, j;
> u_code_point prev, codept, diff, tmp;
> char shift;
>
> prev = 0x60;
> max_out = *output_size;
>
> for (in = out = 0; in < input_length; ++in) {
>
> /* At the start of each iteration, in and out are the number of */
> /* items already input/output, or equivalently, the indices of */
> /* the next items to be input/output. */
>
> codept = input[in];
>
> if (codept == 0x2D) {
> /* Hyphen-minus stands for itself. */
> if (max_out - out < 1) return dude_big_output;
> output[out++] = 0x2D;
> continue;
> }
>
> codept = reorder(codept); // by LSB
>
> diff = prev^codept;
>
>
> /* Compute the number of base-32 characters (k): */
> for (tmp = diff >> 4, k = 1; tmp != 0; ++k, tmp >>= 4);
> //fprintf(stderr,"diff %x,%x = prev %x ^ codept %x \n",
k,diff,prev,codept);
>
> if (max_out - out < k) return dude_big_output;
> shift = uppercase_flags && uppercase_flags[in] ? 32 : 0;
> /* shift controls the case of the last base-32 digit. */
>
> /* Each quintet has the form 1xxxx except the last is 0xxxx. */
> /* Computing the base-32 digits in reverse order is easiest. */
>
> out += k;
> output[out - 1] = base32[diff & 0xF] - shift;
>
> for (j = 2; j <= k; ++j) {
> diff >>= 4;
> output[out - j] = base32[0x10 | (diff & 0xF)];
> }
>
> prev = codept;
> }
>
> /* Append the null terminator: */
> if (max_out - out < 1) return dude_big_output;
> output[out++] = 0;
>
> *output_size = out;
> return dude_success;
> }
>
>
> /* Decoder: */
>
> enum dude_status dude_decode(
> enum case_sensitivity case_sensitivity,
> char scratch_space[],
> const char input[],
> unsigned int *output_length,
> u_code_point output[],
> unsigned char uppercase_flags[] )
> {
> u_code_point prev, q, diff;
> char c;
> unsigned int max_out, in, out, scratch_size;
> enum dude_status status;
>
> prev = 0x60;
> max_out = *output_length;
>
> for (c = input[in = 0], out = 0; c != 0; c = input[++in], ++out) {
>
> /* At the start of each iteration, in and out are the number of */
> /* items already input/output, or equivalently, the indices of */
> /* the next items to be input/output. */
>
> if (max_out - out < 1) return dude_big_output;
>
> if (c == 0x2D) output[out] = c; /* hyphen-minus is literal */
> else {
> /* Base-32 sequence. Decode quintets until 0xxxx is found: */
>
> for (diff = 0; ; c = input[++in]) {
> q = base32_decode(c);
> if (q == base32_invalid){ return dude_bad_input; };
> diff = (diff << 4) | (q & 0xF);
> if (q >> 4 == 0) break;
> }
>
> // prev = output[out] = prev ^ diff; commented by LSB
> prev = prev ^ diff;
> output[out] = restore_order(prev); // LSB
>
> }
>
> /* Case of last character determines uppercase flag: */
> if (uppercase_flags) uppercase_flags[out] = c >= 65 && c <= 90;
> }
>
> /* Enforce the uniqueness of the encoding by re-encoding */
> /* the output and comparing the result to the input: */
>
> scratch_size = ++in;
> status = dude_encode(out, output, uppercase_flags,
> &scratch_size, scratch_space);
> if (status != dude_success || scratch_size != in ||
> unequal(case_sensitivity, scratch_space, input)
> ) return dude_bad_input;
> *output_length = out;
> return dude_success;
> }
>
>
> /******************************************************************/
> /* Wrapper for testing (would normally go in a separate .c file): */
>
> #include <assert.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> /* For testing, we'll just set some compile-time limits rather than */
> /* use malloc(), and set a compile-time option rather than using a */
> /* command-line option. */
>
> enum {
> unicode_max_length = 256,
> ace_max_size = 256,
> test_case_sensitivity = case_insensitive
> /* suitable for host names */
> };
>
>
> static void usage(char **argv)
> {
> fprintf(stderr,
> "%s -e reads code points and writes a DUDE string.\n"
> "%s -d reads a DUDE string and writes code points.\n"
> "Input and output are plain text in the native character set.\n"
> "Code points are in the form u+hex separated by whitespace.\n"
> "A DUDE string is a newline-terminated sequence of LDH characters\n"
> "(without any signature).\n"
> "The case of the u in u+hex is the force-to-uppercase flag.\n"
> , argv[0], argv[0]);
> exit(EXIT_FAILURE);
> }
>
>
> static void fail(const char *msg)
> {
> fputs(msg,stderr);
> exit(EXIT_FAILURE);
> }
>
> static const char too_big[] =
> "input or output is too large, recompile with larger limits\n";
> static const char invalid_input[] = "invalid input\n";
> static const char io_error[] = "I/O error\n";
>
>
> /* The following string is used to convert LDH */
> /* characters between ASCII and the native charset: */
>
> static const char ldh_ascii[] =
> "................"
> "................"
> ".............-.."
> "0123456789......"
> ".ABCDEFGHIJKLMNO"
> "PQRSTUVWXYZ....."
> ".abcdefghijklmno"
> "pqrstuvwxyz";
>
>
> int main(int argc, char **argv)
> {
> enum dude_status status;
> int r;
> char *p;
>
> if (argc != 2) usage(argv);
> if (argv[1][0] != '-') usage(argv);
> if (argv[1][2] != 0) usage(argv);
>
> if (argv[1][1] == 'e') {
> u_code_point input[unicode_max_length];
> unsigned long codept;
> unsigned char uppercase_flags[unicode_max_length];
> char output[ace_max_size], uplus[3];
> unsigned int input_length, output_size, i;
>
> /* Read the input code points: */
>
> input_length = 0;
>
> for (;;) {
> r = scanf("%2s%lx", uplus, &codept);
> if (ferror(stdin)) fail(io_error);
> if (r == EOF || r == 0) break;
>
> if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
> fail(invalid_input);
> }
>
> if (input_length == unicode_max_length) fail(too_big);
>
> if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
> else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
> else fail(invalid_input);
>
> input[input_length++] = codept;
> }
>
> /* Encode: */
>
> output_size = ace_max_size;
> status = dude_encode(input_length, input, uppercase_flags,
> &output_size, output);
> if (status == dude_bad_input) fail(invalid_input);
> if (status == dude_big_output) fail(too_big);
> assert(status == dude_success);
>
> /* Convert to native charset and output: */
>
> for (p = output; *p != 0; ++p) {
> i = *p;
> assert(i <= 122 && ldh_ascii[i] != '.');
> *p = ldh_ascii[i];
> }
>
> r = puts(output);
> if (r == EOF) fail(io_error);
> return EXIT_SUCCESS;
> }
>
> if (argv[1][1] == 'd') {
> char input[ace_max_size], scratch[ace_max_size], *pp;
> u_code_point output[unicode_max_length];
> unsigned char uppercase_flags[unicode_max_length];
> unsigned int input_length, output_length, i;
>
> /* Read the DUDE input string and convert to ASCII: */
>
> fgets(input, ace_max_size, stdin);
> if (ferror(stdin)) fail(io_error);
> if (feof(stdin)) fail(invalid_input);
> input_length = strlen(input);
> if (input[input_length - 1] != '\n') fail(too_big);
> input[--input_length] = 0;
>
> for (p = input; *p != 0; ++p) {
> pp = strchr(ldh_ascii, *p);
> if (pp == 0) fail(invalid_input);
> *p = pp - ldh_ascii;
> }
>
> /* Decode: */
>
> output_length = unicode_max_length;
> status = dude_decode(test_case_sensitivity, scratch, input,
> &output_length, output, uppercase_flags);
> if (status == dude_bad_input) fail(invalid_input);
> if (status == dude_big_output) fail(too_big);
> assert(status == dude_success);
>
> /* Output the result: */
>
> for (i = 0; i < output_length; ++i) {
> r = printf("%s+%04lX\n",
> uppercase_flags[i] ? "U" : "u",
> (unsigned long) output[i] );
> if (r < 0) fail(io_error);
> }
>
> return EXIT_SUCCESS;
> }
>
> usage(argv);
> return EXIT_SUCCESS; /* not reached, but quiets compiler warning */
> }
>
>
>
>
>