[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [idn] [draft] Improving ACE using code point reordering v0.5

To: "Soobok Lee" <lsb@postel.co.kr>,<idn@ops.ietf.org>
Subject: Re: [idn] [draft] Improving ACE using code point reordering v0.5
From: "Mark Davis" <mark@macchiato.com>
Date: Tue, 26 Jun 2001 14:37:59 -0700
Cc: <lsb@postel.co.kr>,<internet-drafts@ietf.org>
Delivery-date: Tue, 26 Jun 2001 14:37:17 -0700
Envelope-to: idn-data@psg.com
Well, some techniques could be used to make it compress better (for example,
look at some of the techniques in the "binary-ordered compression for
Unicode" --
http://www-106.ibm.com/developerworks/unicode/library/u-binary.html), but I
doubt that that the result will remain as simple to implement.

Mark

----- Original Message -----
From: "Soobok Lee" <lsb@postel.co.kr>
To: <idn@ops.ietf.org>
Cc: <lsb@postel.co.kr>; <internet-drafts@ietf.org>
Sent: Monday, June 25, 2001 21:57
Subject: [idn] [draft] Improving ACE using code point reordering v0.5


>
> Could DUDE be improved for Hangul,Vietnames and European languages
> without hurting its simplicity and elegance?  Yes! :-)
>
> THis draft will appear in IDN Intenet-draft directory soon.
>
> Soobok Lee
> lsb@postel.co.kr
>
> --------------------------------------------------------------------------
-
>
>
> INTERNET-DRAFT                                                    Soobok
Lee
> draft-ietf-idn-lsb-00.txt
> Expires 2001-Dec-26
2001-Jun-26
>
>          Improving ACE using code point reordering v0.5
>
> Status of this Memo
>
>     This document is an Internet-Draft and is in full conformance with
>     all provisions of Section 10 of RFC2026.
>
>     Internet-Drafts are working documents of the Internet Engineering
>     Task Force (IETF), its areas, and its working groups.  Note
>     that other groups may also distribute working documents as
>     Internet-Drafts.
>
>     Internet-Drafts are draft documents valid for a maximum of six
>     months and may be updated, replaced, or obsoleted by other documents
>     at any time.  It is inappropriate to use Internet-Drafts as
>     reference material or to cite them other than as "work in progress."
>
>     The list of current Internet-Drafts can be accessed at
>     http://www.ietf.org/ietf/1id-abstracts.txt
>
>     The list of Internet-Draft Shadow Directories can be accessed at
>     http://www.ietf.org/shadow.html
>
>     Distribution of this document is unlimited.  Please send comments to
>     the authors or to the idn working group at idn@ops.ietf.org.
>
>
> Abstract
>
>     This draft describes statistically-backed Unicode code point
reordering
>     to improve ACE compression. This code point reordering may be inserted
>     into existing ACE algorithms with only 1~2 lines of modification.
>     This reordering requires simple character mapping or range-wide
mapping
>     which are simple and easy to implement.
>
>     Experiments with DUDE implementation of this idea show great
>     improvements for Hangul, Vietnames and European domains.
>
>
> Contents
>
>     Overview
>     Hangul
>     Basic Latin
>     Extended Latin and Combining Diacritical Marks
>     Unified Han
>     Other character sets
>     Modified Encoding procedure of DUDE implementation of this idea
>     Modified Decoding procedure of DUDE implementation of this idea
>     Example strings
>     Security considerations
>     References
>     Author
>     LDUDE: Example implementation into DUDE
>
>
>
> Overview
>
>     Pursuing shorter ACE labels is justified to save memory resources and
>     to reduce internet traffic  even for domains of average length
>     in various application/core internet protocols.
>
>     Both 11172 Hangul syllables and 24000 or more CJK Han syllables
>     occupy roughly half of the entire unicode space.
>     Their lexicographical ordering( not in frequency ordering) makes
>     various ACE compression technique work poorly for them. Frequently
>     used syllables are spread evenly through out those wide ranges.
>
>     Even, Latin characters code range, including 'a' - 'z' has
>     lexicographical order that does not reflect the fact that 's', 't' and
>     'r' are more frequently used than 'j','k' and 'h'.
>
>     Each Hangul code point integer have useful bit structure in it.
>     That reflects its hangul jamo(consonant/vowel) combination  which
>     provides us the hint for applying reordering by usage frequency.
>
>     Most ACE algorithms show good compression ratio when frequently used
>     characters are re-located into narrower code areas.
>     Especially, to reduce DUDE XOR distance, we can make the narrow area
>     fit in 16,256 and 4096 bytes-long page boundaries.
>
>
> Hangul
>
>     Each code point integer of modern hangul syllables (u+ac00 ~ u+bd7f)
can be
>     decomposed to produce the following 3 jamo(hangul consonant/vowel)
indices,
>         1) L: a leading consonant index in modern 19 leading consonants
array,
>         2) V: a vowel index in modern 21 vowels array,
>         3) T: an optional trailing consonant index in modern 28 trailing
>            consonants array.
>
>     In fact, Unicode hangul range (u+ac00 ~ u+bd7f) is a 3-dimensional
array
>     Hangul[L][V][T] with base 0xac00 and each hangul syllable has
>     code point integer determined by  L*21*28 + V*28 + T +  0xac00.
>
>     Statistically, roughly six of 28 hangul trailing consonants are
frequent in
>     korean nouns.  Leading consontants and vowels show relatively even
>     frequency distributions compared with that of trailing consonants.
>
>     We can make a frequency mapping table f(T) for T so that
>     f(T)= FrequencyOrderOf(T) in the array of modern trailing consonants.
>
>     Now, Hangul[L][V][T]  is to be reordered into another 3-dimensional
array
>     Hangul[f(T)][f(V)][f(L)] with index T augmented to the highest order.
>
>     Each hangul syllable in this reorderd range has a new code point value
>     determined by  f(T)*21*19 + f(V)*19 + f(L) + 0xac00.
>
>     Average f(T) value should be sufficiently lower than f(L) and F(V), so
that
>     the lowered code point value and induced lowered successive
XOR-differences
>     contribute to produce shorter ACE labes.
>
>
> Basic Latin
>
>     Basic Latin row u+0070 ~ u+007f has 'p','r','s','t' and 'u' which
>     are more frequently used in European nouns than '`','j','k','f' and
'g'
>     in u+0060 ~ u+006f row which includes most frequently used 'a'~'o' .
>
>     Let's swap two sets of these 5 characters by character by character
basis,
>     so that 'p','r','s','t','u' go into the u+0060 ~ u+006f row.
>
>     Single row fits in 16 bytes boundary and any character sequences
>     from only this single row make XOR-distance or code window length
>     shorter than 0x10 and make DUDE and other ACEs do good compression.
>
>
> Extended Latin and Combining Diacritical Marks
>
>     First 6 rows from Latin Extension A(u+0100 ~ u+015f)  and 6 rows from
>     Basic Latin & Latin-1 Supplement (u+0000 ~ u+002f and u+0080 ~ u+00a0)
>     are swapped.
>     First 3 rows from Combining Diacritical Marks(u+0300 ~ u+032f)  and
>     3 rows from Latin-1 Supplement (u+00B0 ~ u+00df) are also swapped.
>
>     This makes frequently used parts of Latin Extended-A and
>     Combining Diacritical Marks go into first 256-bytes page(u+0000 ~
u+00ff).
>     Any character sequences from this single 256-bytes-long page make
>     XOR-distance or code window length  much shorter than 0x100.
>
>     This improvement benefits especially East-European and Vietnamese
>     domains which use Latin Extented A and Combining Diacritical Marks.
>
>
> Unified Han
>
>     I have no statistical data to optimize for this code ranges, yet.
>
>     CJK Unified Han syllables (u+4e00 ~ u+a48f) are ordered by their
radicals.
>     We could consider casting most frequently used 256 Han syllables into
>     single 256-bytes page.
>     We could consider casting most frequently used 4096 Han syllables into
>     single 4096-bytes page ,too.
>
>
> Other character sets
>
>     I have no statistical data to optimize for this code ranges, yet.
>
>     Japapanes gatakana and hiragana (u+3040 ~ u+30ff) code points have
>     lexicographical order and if we could put most frequent ones in single
row,
>     it may greatly improve gatakana-only or hiragana-only domains.
>
>     For Arabic,Hebrew,Cyrllic and Hindi, we could devise similiar
optimizations,
>     if relavent statistical data are avaiable.
>
>
> Modified Encoding procedure of DUDE implementation of this idea
>
>     All ordering of nybbles and quintets is big-endian (most significant
>     first).  A nybble is 4 bits.  XOR is bitwise exclusive or.
>
>     This modification are hyphen-safe.
>     Hyphen encoding/decoding is not affected by this modification.
>
>     let prev = 96
>     for each input integer n (in order) do begin
>       if n == 45 then output hyphen minus
>       else begin
>
>         n = reorder(n)   //  ******** ADDED **********
>
>         let diff = prev XOR n
>         extract the least significant nybbles of diff, as few as are
>           sufficient to hold all the nonzero bits (but at least one)
>         prepend 0 to the last nybble and 1 to the rest
>         output base-32 characters corresponding to the quintets
>         let prev = n
>       end
>     end
>
>     The encoder must either correctly handle all integer values that can
>     be represented in the type of its input, or it must check whether
>     the input contains values that it cannot handle and return an error
>     if so.  Under no circumstances may it produce incorrect output.
>
>
> Modified Decoding procedure of DUDE implementation of this idea
>
>     let prev = 96
>     while the input string is not exhausted do begin
>       if the next character is hyphen-minus then output 45
>       else begin
>         input characters and convert them to quintets until
>           encountering a quintet beginning with 0
>         fail upon encountering a non-base-32 character or end-of-input
>         strip the first bit of each quintet
>         concatenate the resulting nybbles to form diff
>         let prev = prev XOR diff
>
>         output restore_order(prev)  //  ******** MODIFIED **********
>
>       end
>     end
>     encode the output sequence and compare it to the input string
>     fail if they are not equal
>
>
>
> Example strings
>
>     about 20% improvement in DUDE compression ratio is measured in these
>     Hangul examples.
>
>     (A) Korean Strings 1: ( 24 hangul syllables )
>         U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
>         U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
>         U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C
>
>         DUDE-02: 6txIy79Ny53Nz79A8wIzwwNzzuAvyIzv3AtuuIz2vBy27Jz66Iz8sI\
>                  tusAuIyz5I23Az96Iz6zE3xAz2tD96Ry3sI ( 89 chars )
>         LDUDE  : 5szf8pt3bttat3jt6iywhu3bw9qt5m37r2vmxxjxsg2mtvat3auykz\
>                  wpxubc8xd42fw6p ( 70 chars )
>
>
>     (B) Korean Strings 2:  ( 6 hangul syllables )
>         U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC
>
>         DUDE-02: 6txiy79ny53nz79a8wizwwn  ( 24 chars )
>         LDUDE  : 5szf8pt3bttat3jt6i  ( 19 chars )
>
>
>
>     LDUDE improvents show better compression ratios for other scripts.
>
>     (C) Vietnamese: ( 38 syllables using diacritical marks )
>         Ta<dotbelow>isaoho<dotbelow>kh<ocirc>ngth<ecirc><hookabove>chi\
>         <hookabove>no<acute>iti<ecirc><acute>ngVi<ecirc><dotbelow>t
>         U+0054 u+0061 u+0323 u+0069 u+0073 u+0061 u+006F u+0068 u+006F
>         u+0323 u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+00EA
>         u+0309 u+0063 u+0068 u+0069 u+0309 u+006E u+006F u+0301 u+0069
>         u+0074 u+0069 u+00EA u+0301 u+006E u+0067 U+0056 u+0069 u+00EA
>         u+0323 u+0074
>
>         DUDE-02: vEvfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\
>                  yitptp2dv8mvyrjvBvr2dv6jvxh ( 82 chars )
>         LDUDE  : uGuh5c5kckqhh5n4atm3n3ktmtdq2cxd7kmb7a7hb7q7irr2dxm7rt\
>                  muDvr2dvj5f (66 chars , 16 chars(19%) shorter)
>
>
>     (D) Spanish: ( using basic Latin & Latin Supplement )
>         Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
>         U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
>         u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
>         u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
>         u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
>         u+0061 u+00F1 u+006F u+006C
>
>         DUDE-02: vAvrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmuMvgdt\
>                  b3a3qd  (61 chars)
>         LDUDE  : uAurftmtg2q2hbrhcbbmfcepnjiimidpjdqpmrmuMuqmb3a3qd
>                  (51 chars, 10 chars (16%) shorter)
>
>
>     (E) Czech:  (using Latin Extended A)
>         Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
>         U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
>         u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
>         u+0065 u+0073 u+006B u+0079
>
>         DUDE-02: vAuctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (45 chars)
>
>         LDUDE  : uAukfycypkfepzpzfmibmtb3m8ayiqtik (34 chars, 24%
shorter )
>
>
>
> Security considerations
>
>     ACE-encoded reordered code points are restored in reverse ACE
translation
>     with no problem, and this improvement do not introdude any new
security threat
>     in ACE.
>
>
> References
>
>     [DUDE02] Mark Welter, Brian Spolarich, Adam Costello,
>     "DUDE: Differential Unicode Domain Encoding", 2001-May-31,
>     draft-ietf-idn-dude-02.
>
>     [AMCACEW] Adam Costello, "AMC-ACE-W version 0.1.0",
>     2001-May-31, draft-ietf-idn-amc-ace-w-00, latest version at
>     http://www.cs.berkeley.edu/~amc/charset/amc-ace-w.
>
>
>     [UNICODE] The Unicode Consortium, "The Unicode Standard",
>     http://www.unicode.org/unicode/standard/standard.html.
>
>     [IDNA]  Patrik Falstrom, Paul Hoffman, "Internationalizing Host
>     Names In Applications (IDNA)",  draft-ietf-idn-idna-01
>
>     [NAMEPREP]  Paul Hoffman, Marc Blanchet,  "Preparation of
>     Internationalized Host Names",  Feb 2001,
>     draft-ietf-idn-nameprep-03
>
>
>
> Author
>
>     Soobok Lee <lsb@postel.co.kr>
>     http://www.postel.co.kr
>
>
>
> Example implementation
>
>     This idea is applicable to any ACEs.
>     LDUDE (in this example implementation) is merely a name for
>     DUDE implementation of this idea.
>
>     Embedded hangul jamo and Latin frequency tables are subject
>     to change with further studies in the next revision of this draft.
>
>     In Unix, save this example source code into ldude.c
>
>     % cc -o ldude ldude.c
>     % ./ldude -e < input_file > output_file
>     % ./ldude -d < output_file
>
>     A input file should contains u+????-form code points
>     delimited with spaces or newlines.
>
>
> /******************************************/
> /* ldude.c 0.1  (2001-Jun-11)             */
> /* Soobok  Lee  <lsb@postel.co.kr>        */
> /* Adam M. Costello <amc@cs.berkeley.edu> */
> /******************************************/
>
> /* This is ANSI C code (C89) implementing */
> /* DUDE (draft-ietf-idn-ldude-01).        */
>
>
> /************************************************************/
> /* Public interface (would normally go in its own .h file): */
>
> #include <stdio.h>
> #include <limits.h>
>
> enum dude_status {
>   dude_success,
>   dude_bad_input,
>   dude_big_output  /* Output would exceed the space provided. */
> };
>
> enum case_sensitivity { case_sensitive, case_insensitive };
>
> #if UINT_MAX >= 0x1FFFFF
> typedef unsigned int u_code_point;
> #else
> typedef unsigned long u_code_point;
> #endif
>
> enum dude_status dude_encode(
>   unsigned int input_length,
>   const u_code_point input[],
>   const unsigned char uppercase_flags[],
>   unsigned int *output_size,
>   char output[] );
>
>     /* dude_encode() converts Unicode to DUDE (without any            */
>     /* signature).  The input must be represented as an array         */
>     /* of Unicode code points (not code units; surrogate pairs        */
>     /* are not allowed), and the output will be represented as        */
>     /* null-terminated ASCII.  The input_length is the number of code */
>     /* points in the input.  The output_size is an in/out argument:   */
>     /* the caller must pass in the maximum number of characters       */
>     /* that may be output (including the terminating null), and on    */
>     /* successful return it will contain the number of characters     */
>     /* actually output (including the terminating null, so it will be */
>     /* one more than strlen() would return, which is why it is called */
>     /* output_size rather than output_length).  The uppercase_flags   */
>     /* array must hold input_length boolean values, where nonzero     */
>     /* means the corresponding Unicode character should be forced     */
>     /* to uppercase after being decoded, and zero means it is         */
>     /* caseless or should be forced to lowercase.  Alternatively,     */
>     /* uppercase_flags may be a null pointer, which is equivalent     */
>     /* to all zeros.  The encoder always outputs lowercase base-32    */
>     /* characters except when nonzero values of uppercase_flags       */
>     /* require otherwise.  The return value may be any of the         */
>     /* dude_status values defined above; if not dude_success, then    */
>     /* output_size and output may contain garbage.  On success, the   */
>     /* encoder will never need to write an output_size greater than   */
>     /* input_length*k+1 if all the input code points are less than 1  */
>     /* << (4*k), because of how the encoding is defined.              */
>
> enum dude_status dude_decode(
>   enum case_sensitivity case_sensitivity,
>   char scratch_space[],
>   const char input[],
>   unsigned int *output_length,
>   u_code_point output[],
>   unsigned char uppercase_flags[] );
>
>     /* dude_decode() converts DUDE (without any signature) to         */
>     /* Unicode.  The input must be represented as null-terminated     */
>     /* ASCII, and the output will be represented as an array of       */
>     /* Unicode code points.  The case_sensitivity argument influences */
>     /* the check on the well-formedness of the input string; it       */
>     /* must be case_sensitive if case-sensitive comparisons are       */
>     /* allowed on encoded strings, case_insensitive otherwise.        */
>     /* The scratch_space must point to space at least as large        */
>     /* as the input, which will get overwritten (this allows the      */
>     /* decoder to avoid calling malloc()).  The output_length is      */
>     /* an in/out argument: the caller must pass in the maximum        */
>     /* number of code points that may be output, and on successful    */
>     /* return it will contain the actual number of code points        */
>     /* output.  The uppercase_flags array must have room for at       */
>     /* least output_length values, or it may be a null pointer if     */
>     /* the case information is not needed.  A nonzero flag indicates  */
>     /* that the corresponding Unicode character should be forced to   */
>     /* uppercase by the caller, while zero means it is caseless or    */
>     /* should be forced to lowercase.  The return value may be any    */
>     /* of the dude_status values defined above; if not dude_success,  */
>     /* then output_length, output, and uppercase_flags may contain    */
>     /* garbage.  On success, the decoder will never need to write     */
>     /* an output_length greater than the length of the input (not     */
>     /* counting the null terminator), because of how the encoding is  */
>     /* defined.                                                       */
>
>
> /**********************************************************/
> /* Implementation (would normally go in its own .c file): */
>
> #include <string.h>
>
> /* Character utilities: */
>
> /* base32[q] is the lowercase base-32 character representing  */
> /* the number q from the range 0 to 31.  Note that we cannot  */
> /* use string literals for ASCII characters because an ANSI C */
> /* compiler does not necessarily use ASCII.                   */
>
> static const char base32[] = {
>   97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
>   109, 110,                                               /* m-n */
>   112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
>   50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
> };
>
> /* base32_decode(c) returns the value of a base-32 character, in the */
> /* range 0 to 31, or the constant base32_invalid if c is not a valid */
> /* base-32 character.                                                */
>
> enum { base32_invalid = 32 };
>
> static unsigned int base32_decode(char c)
> {
>   if (c < 50) return base32_invalid;
>   if (c <= 57) return c - 26;
>   if (c < 97) c += 32;
>   if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
>   return c - 97 - (c > 108) - (c > 111);
> }
>
> /* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
> /* are equal, 1 otherwise.  If case_sensitivity is case_insensitive,  */
> /* then ASCII A-Z are considered equal to a-z respectively.           */
>
> static int unequal( enum case_sensitivity case_sensitivity,
>                     const char s1[], const char s2[]        )
> {
>   char c1, c2;
>
>   if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;
>
>   for (;;) {
>     c1 = *s1;
>     c2 = *s2;
>     if (c1 >= 65 && c1 <= 90) c1 += 32;
>     if (c2 >= 65 && c2 <= 90) c2 += 32;
>     if (c1 != c2) return 1;
>     if (c1 == 0) return 0;
>     ++s1, ++s2;
>   }
> }
>
>
>
> /* Begin of Improvement Code by LSB  */
> /* LANGUAGE-SPECIFIC IMPROVEMENTS TO DUDE  BASED ON CODE REORDERING */
>
> /* Common Constants for Hangul */
> static u_code_point
>         SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
>         LCount = 19, VCount = 21, TCount = 28, NCount, SCount;
> static u_code_point NCount = 588;   // VCount * TCount
> static u_code_point SCount = 11172; // LCount * NCount
> static u_code_point MCount = 399; // LCount * VCount
>
> /* Hangul Jamo Frequency Mapping/Reverse Mapping Table */
> int JAMO_L_FREQ[19][2] = {
>         {1,11}, // "G"    0
>         {14,0}, // "GG"
>         {9,9}, // "N"
>         {5,12}, // "D"
>         {15,7}, // "DD"
>         {13,3}, // "R"    5
>         {7,18}, // "M"
>         {4,6 }, // "B"
>         {16,14}, // "BB"
>         {2, 2}, // "S"
>         {17,17}, // "SS" 10
>         {0,16}, // "O"
>         {3,15}, // "J"
>         {18,5 }, // "JJ"
>         {8,1}, // "C"
>         {12,4}, // "K"   15
>         {11,8}, // "T"
>         {10,10}, // "P"
>         {6 ,13}  // "H"
>     };
>
>  int JAMO_V_FREQ[21][2] = {
>         {2,20}, // "A"    0
>         {7,5}, // "AE"
>         {15,0}, // "YA"
>         {20,13}, // "YAE"
>         {5,18}, // "EO"
>         {1, 4}, // "E"     5
>         {9,8}, // "YEO"
>         {13,1},// "YE"
>         {6,13}, // "O"
>         {10, 6}, // "WA"
>         {18,9}, // "WAE"  10
>         {16,17}, // "OE"
>         {17,14}, // "YO"
>         {8, 7}, // "U"
>         {12,16}, // "WEO"
>         {19, 2}, // "WE"   15
>         {14,11}, // "WI"
>         {11,12}, // "YU"
>         {4,10}, // "EU"
>         {19,19}, // "YI"
>         {0,3}  // "I"     20
>     };
>
>   int JAMO_T_FREQ[28][2] = {
>         {0,0}, // ""      0
>         {5,4}, // "G"
>         {15,21}, // "GG"
>         {16,8}, // "GS"
>         {1,16}, // "N"
>         {27,1 }, // "NJ"   5
>         {17,17}, // "NH"
>         {10,19}, // "D"
>         {3,24}, // "L"
>         {18,27}, // "LG"
>         {19, 7}, // "LM"   10
>         {20,22}, // "LB"
>         {21,25}, // "LS"
>         {22,26}, // "LT"
>         {23,27}, // "LP"
>         {24,2}, // "LH"   15
>         {4,3}, // "M"
>         {6,6}, // "B"
>         {25,9}, // "BS"
>         {7,10}, // "S"
>         {26,11}, // "SS"   20
>         {2,12}, // "NG"
>         {11,13}, // "J"
>         {10,14}, // "C"
>         {8 ,15}, // "K"
>         {12,18}, // "T"    25
>         {13,20}, // "P"
>         {9 ,5}  // "H"    27
>     };
>
> /* Hangul Decomposition */
>
>
> int isHANGUL(u_code_point s) {
>         int SIndex = s - SBase;
>         if (SIndex < 0 || SIndex >= SCount) {
>             return 0;
>         }
>         return 1;
> };
> int isUNIHAN(u_code_point s) {
>         if (s >= 0x4E00 && s <= 0x9FAF) {
>             return 1;
>         }
>         return 0;
> };
> int isLatins(u_code_point s) {
>         if (s <  0x370) {
>             return 1;
>         }
>         return 0;
> };
>
> u_code_point reorder_hangul(u_code_point s) {
>         u_code_point SIndex = s - SBase;
>         int zL = JAMO_L_FREQ[SIndex / NCount][0];
>         int zV = JAMO_V_FREQ[(SIndex % NCount) / TCount][0];
>         int zT = JAMO_T_FREQ[SIndex % TCount][0];
>         int idx= (zT*MCount +  zV*LCount +  zL);
>         if(idx < SCount-0x400) {
>            idx += 0x400;
>         } else {
>            idx  = idx - (SCount -0x400);
>         };
>         return  idx + SBase;
> }
>
> u_code_point restore_order_hangul(u_code_point z) {
>         int T,V,L;
>         u_code_point zIndex = z-SBase;
>         if(zIndex < 0x400) {
>            zIndex += SCount - 0x400;
>         } else {
>            zIndex  = zIndex - 0x400;
>         };
>         T = JAMO_T_FREQ[(zIndex / MCount)][1];
>         V = JAMO_V_FREQ[(zIndex % MCount) / LCount][1];
>         L = JAMO_L_FREQ[(zIndex % LCount)][1];
>         return (L*NCount + V*TCount + T) + SBase;
> }
>
> u_code_point reorder_unihan(u_code_point s) {
>         return s;
> }
>
> u_code_point restore_order_unihan(u_code_point z) {
>         return z;
> }
>
>
> #define MAPCHAR(x,A,B,bytes) if(A<=x && x< (A+bytes))
return(x+(B-A)); if(B<=x && x< (B+bytes))      return(x+(A-B))
> #define MAP16BL(x,A,B,block) if(A<=x && x< (A+(block<<4)))
return(x+(B-A)); if(B<=x && x< (B+(block<<4))) return(x+(A-B))
>
> u_code_point reorder_latins(u_code_point s) {
>         MAP16BL(s,0x0100,0x0000,3); // Latin Extension A
>         MAP16BL(s,0x0130,0x0080,2);
>         MAP16BL(s,0x0150,0x00A0,1);
>         MAP16BL(s,0x0300,0x00B0,3); // Combining Diacritical Marks
>         MAPCHAR(s,0x0070,0x0060,1); // p,`
>         MAPCHAR(s,0x0072,0x006A,1); // r,j
>         MAPCHAR(s,0x0073,0x006B,1); // s,k
>         MAPCHAR(s,0x0074,0x0066,1); // t,f
>         MAPCHAR(s,0x0075,0x0067,1); // u,g
>         MAPCHAR(s,0x0050,0x0040,1); // P,@  UPPER
>         MAPCHAR(s,0x0052,0x004A,1); // R,J  UPPER
>         MAPCHAR(s,0x0053,0x004B,1); // S,K  UPPER
>         MAPCHAR(s,0x0054,0x0046,1); // T,F  UPPER
>         MAPCHAR(s,0x0055,0x0047,1); // U,G  UPPER
>         MAPCHAR(s,0x0160,0x003A,6); // Latin Extension A
>         MAPCHAR(s,0x0166,0x005B,5);
>         MAPCHAR(s,0x016B,0x007B,5);
>         return s;
> }
>
> u_code_point restore_order_latins(u_code_point z) {
>         return reorder_latins(z);
> }
>
> u_code_point reorder(u_code_point s) {
>         if(isHANGUL(s)) return reorder_hangul(s);
>         if(isUNIHAN(s)) return reorder_unihan(s);
>         if(isLatins(s)) return reorder_latins(s);
>         return s;
> }
> u_code_point restore_order(u_code_point s) {
>         if(isHANGUL(s)) return restore_order_hangul(s);
>         if(isUNIHAN(s)) return restore_order_unihan(s);
>         if(isLatins(s)) return restore_order_latins(s);
>         return s;
> }
>
> /* End of Improvement Code */
>
> /* Encoder: */
>
> enum dude_status dude_encode(
>   unsigned int input_length,
>   const u_code_point input[],
>   const unsigned char uppercase_flags[],
>   unsigned int *output_size,
>   char output[] )
> {
>   unsigned int max_out, in, out, k, j;
>   u_code_point prev, codept, diff, tmp;
>   char shift;
>
>   prev = 0x60;
>   max_out = *output_size;
>
>   for (in = out = 0;  in < input_length;  ++in) {
>
>     /* At the start of each iteration, in and out are the number of */
>     /* items already input/output, or equivalently, the indices of  */
>     /* the next items to be input/output.                           */
>
>     codept = input[in];
>
>     if (codept == 0x2D) {
>       /* Hyphen-minus stands for itself. */
>       if (max_out - out < 1) return dude_big_output;
>       output[out++] = 0x2D;
>       continue;
>     }
>
>     codept = reorder(codept); // by LSB
>
>     diff = prev^codept;
>
>
>     /* Compute the number of base-32 characters (k): */
>     for (tmp = diff >> 4, k = 1;  tmp != 0;  ++k, tmp >>= 4);
>     //fprintf(stderr,"diff %x,%x = prev %x ^ codept %x \n",
k,diff,prev,codept);
>
>     if (max_out - out < k) return dude_big_output;
>     shift = uppercase_flags && uppercase_flags[in] ? 32 : 0;
>     /* shift controls the case of the last base-32 digit. */
>
>     /* Each quintet has the form 1xxxx except the last is 0xxxx. */
>     /* Computing the base-32 digits in reverse order is easiest. */
>
>     out += k;
>     output[out - 1] = base32[diff & 0xF] - shift;
>
>     for (j = 2;  j <= k;  ++j) {
>       diff >>= 4;
>       output[out - j] = base32[0x10 | (diff & 0xF)];
>     }
>
>     prev = codept;
>   }
>
>   /* Append the null terminator: */
>   if (max_out - out < 1) return dude_big_output;
>   output[out++] = 0;
>
>   *output_size = out;
>   return dude_success;
> }
>
>
> /* Decoder: */
>
> enum dude_status dude_decode(
>   enum case_sensitivity case_sensitivity,
>   char scratch_space[],
>   const char input[],
>   unsigned int *output_length,
>   u_code_point output[],
>   unsigned char uppercase_flags[] )
> {
>   u_code_point prev, q, diff;
>   char c;
>   unsigned int max_out, in, out, scratch_size;
>   enum dude_status status;
>
>   prev = 0x60;
>   max_out = *output_length;
>
>   for (c = input[in = 0], out = 0;  c != 0;  c = input[++in], ++out) {
>
>     /* At the start of each iteration, in and out are the number of */
>     /* items already input/output, or equivalently, the indices of  */
>     /* the next items to be input/output.                           */
>
>     if (max_out - out < 1) return dude_big_output;
>
>     if (c == 0x2D) output[out] = c;  /* hyphen-minus is literal */
>     else {
>       /* Base-32 sequence.  Decode quintets until 0xxxx is found: */
>
>       for (diff = 0;  ;  c = input[++in]) {
>         q = base32_decode(c);
>         if (q == base32_invalid){ return dude_bad_input; };
>         diff = (diff << 4) | (q & 0xF);
>         if (q >> 4 == 0) break;
>       }
>
>       // prev = output[out] = prev ^ diff; commented by LSB
>       prev = prev ^ diff;
>       output[out] = restore_order(prev); // LSB
>
>     }
>
>     /* Case of last character determines uppercase flag: */
>     if (uppercase_flags) uppercase_flags[out] = c >= 65 && c <= 90;
>   }
>
>   /* Enforce the uniqueness of the encoding by re-encoding */
>   /* the output and comparing the result to the input:     */
>
>   scratch_size = ++in;
>   status = dude_encode(out, output, uppercase_flags,
>                        &scratch_size, scratch_space);
>   if (status != dude_success || scratch_size != in ||
>       unequal(case_sensitivity, scratch_space, input)
>      ) return dude_bad_input;
>   *output_length = out;
>   return dude_success;
> }
>
>
> /******************************************************************/
> /* Wrapper for testing (would normally go in a separate .c file): */
>
> #include <assert.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
>
> /* For testing, we'll just set some compile-time limits rather than */
> /* use malloc(), and set a compile-time option rather than using a  */
> /* command-line option.                                             */
>
> enum {
>   unicode_max_length = 256,
>   ace_max_size = 256,
>   test_case_sensitivity = case_insensitive
>                           /* suitable for host names */
> };
>
>
> static void usage(char **argv)
> {
>   fprintf(stderr,
>     "%s -e reads code points and writes a DUDE string.\n"
>     "%s -d reads a DUDE string and writes code points.\n"
>     "Input and output are plain text in the native character set.\n"
>     "Code points are in the form u+hex separated by whitespace.\n"
>     "A DUDE string is a newline-terminated sequence of LDH characters\n"
>     "(without any signature).\n"
>     "The case of the u in u+hex is the force-to-uppercase flag.\n"
>     , argv[0], argv[0]);
>   exit(EXIT_FAILURE);
> }
>
>
> static void fail(const char *msg)
> {
>   fputs(msg,stderr);
>   exit(EXIT_FAILURE);
> }
>
> static const char too_big[] =
>   "input or output is too large, recompile with larger limits\n";
> static const char invalid_input[] = "invalid input\n";
> static const char io_error[] = "I/O error\n";
>
>
> /* The following string is used to convert LDH      */
> /* characters between ASCII and the native charset: */
>
> static const char ldh_ascii[] =
>   "................"
>   "................"
>   ".............-.."
>   "0123456789......"
>   ".ABCDEFGHIJKLMNO"
>   "PQRSTUVWXYZ....."
>   ".abcdefghijklmno"
>   "pqrstuvwxyz";
>
>
> int main(int argc, char **argv)
> {
>   enum dude_status status;
>   int r;
>   char *p;
>
>   if (argc != 2) usage(argv);
>   if (argv[1][0] != '-') usage(argv);
>   if (argv[1][2] != 0) usage(argv);
>
>   if (argv[1][1] == 'e') {
>     u_code_point input[unicode_max_length];
>     unsigned long codept;
>     unsigned char uppercase_flags[unicode_max_length];
>     char output[ace_max_size], uplus[3];
>     unsigned int input_length, output_size, i;
>
>     /* Read the input code points: */
>
>     input_length = 0;
>
>     for (;;) {
>       r = scanf("%2s%lx", uplus, &codept);
>       if (ferror(stdin)) fail(io_error);
>       if (r == EOF || r == 0) break;
>
>       if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
>         fail(invalid_input);
>       }
>
>       if (input_length == unicode_max_length) fail(too_big);
>
>       if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
>       else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
>       else fail(invalid_input);
>
>       input[input_length++] = codept;
>     }
>
>     /* Encode: */
>
>     output_size = ace_max_size;
>     status = dude_encode(input_length, input, uppercase_flags,
>                          &output_size, output);
>     if (status == dude_bad_input) fail(invalid_input);
>     if (status == dude_big_output) fail(too_big);
>     assert(status == dude_success);
>
>     /* Convert to native charset and output: */
>
>     for (p = output;  *p != 0;  ++p) {
>       i = *p;
>       assert(i <= 122 && ldh_ascii[i] != '.');
>       *p = ldh_ascii[i];
>     }
>
>     r = puts(output);
>     if (r == EOF) fail(io_error);
>     return EXIT_SUCCESS;
>   }
>
>   if (argv[1][1] == 'd') {
>     char input[ace_max_size], scratch[ace_max_size], *pp;
>     u_code_point output[unicode_max_length];
>     unsigned char uppercase_flags[unicode_max_length];
>     unsigned int input_length, output_length, i;
>
>     /* Read the DUDE input string and convert to ASCII: */
>
>     fgets(input, ace_max_size, stdin);
>     if (ferror(stdin)) fail(io_error);
>     if (feof(stdin)) fail(invalid_input);
>     input_length = strlen(input);
>     if (input[input_length - 1] != '\n') fail(too_big);
>     input[--input_length] = 0;
>
>     for (p = input;  *p != 0;  ++p) {
>       pp = strchr(ldh_ascii, *p);
>       if (pp == 0) fail(invalid_input);
>       *p = pp - ldh_ascii;
>     }
>
>     /* Decode: */
>
>     output_length = unicode_max_length;
>     status = dude_decode(test_case_sensitivity, scratch, input,
>                          &output_length, output, uppercase_flags);
>     if (status == dude_bad_input) fail(invalid_input);
>     if (status == dude_big_output) fail(too_big);
>     assert(status == dude_success);
>
>     /* Output the result: */
>
>     for (i = 0;  i < output_length;  ++i) {
>       r = printf("%s+%04lX\n",
>                  uppercase_flags[i] ? "U" : "u",
>                  (unsigned long) output[i] );
>       if (r < 0) fail(io_error);
>     }
>
>     return EXIT_SUCCESS;
>   }
>
>   usage(argv);
>   return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
> }
>
>
>
>
>
Prev by Date: Re: [idn] Report from the ACE design team
Next by Date: Re: [idn] Report from the ACE design team
Prev by thread: [idn] [draft] Improving ACE using code point reordering v0.5
Next by thread: [idn] Why do IETF people focus so much on face-to-face meetings (like London)...??
Index(es):
- Date
- Thread