[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[idn] [draft] Improving ACE using code point reordering v0.5

To: <idn@ops.ietf.org>
Subject: [idn] [draft] Improving ACE using code point reordering v0.5
From: Soobok Lee <lsb@postel.co.kr>
Date: Tue, 26 Jun 2001 13:57:52 +0900 (KST)
cc: <lsb@postel.co.kr>, <internet-drafts@ietf.org>
Delivery-date: Mon, 25 Jun 2001 21:59:45 -0700
Envelope-to: idn-data@psg.com
Could DUDE be improved for Hangul,Vietnames and European languages
without hurting its simplicity and elegance?  Yes! :-)

THis draft will appear in IDN Intenet-draft directory soon.

Soobok Lee
lsb@postel.co.kr

---------------------------------------------------------------------------


INTERNET-DRAFT                                                    Soobok Lee
draft-ietf-idn-lsb-00.txt
Expires 2001-Dec-26                                              2001-Jun-26

         Improving ACE using code point reordering v0.5

Status of this Memo

    This document is an Internet-Draft and is in full conformance with
    all provisions of Section 10 of RFC2026.

    Internet-Drafts are working documents of the Internet Engineering
    Task Force (IETF), its areas, and its working groups.  Note
    that other groups may also distribute working documents as
    Internet-Drafts.

    Internet-Drafts are draft documents valid for a maximum of six
    months and may be updated, replaced, or obsoleted by other documents
    at any time.  It is inappropriate to use Internet-Drafts as
    reference material or to cite them other than as "work in progress."

    The list of current Internet-Drafts can be accessed at
    http://www.ietf.org/ietf/1id-abstracts.txt

    The list of Internet-Draft Shadow Directories can be accessed at
    http://www.ietf.org/shadow.html

    Distribution of this document is unlimited.  Please send comments to
    the authors or to the idn working group at idn@ops.ietf.org.


Abstract

    This draft describes statistically-backed Unicode code point reordering
    to improve ACE compression. This code point reordering may be inserted
    into existing ACE algorithms with only 1~2 lines of modification.
    This reordering requires simple character mapping or range-wide mapping
    which are simple and easy to implement.

    Experiments with DUDE implementation of this idea show great
    improvements for Hangul, Vietnames and European domains.


Contents

    Overview
    Hangul
    Basic Latin
    Extended Latin and Combining Diacritical Marks
    Unified Han
    Other character sets
    Modified Encoding procedure of DUDE implementation of this idea
    Modified Decoding procedure of DUDE implementation of this idea
    Example strings
    Security considerations
    References
    Author
    LDUDE: Example implementation into DUDE



Overview

    Pursuing shorter ACE labels is justified to save memory resources and
    to reduce internet traffic  even for domains of average length
    in various application/core internet protocols.

    Both 11172 Hangul syllables and 24000 or more CJK Han syllables
    occupy roughly half of the entire unicode space.
    Their lexicographical ordering( not in frequency ordering) makes
    various ACE compression technique work poorly for them. Frequently
    used syllables are spread evenly through out those wide ranges.

    Even, Latin characters code range, including 'a' - 'z' has
    lexicographical order that does not reflect the fact that 's', 't' and
    'r' are more frequently used than 'j','k' and 'h'.

    Each Hangul code point integer have useful bit structure in it.
    That reflects its hangul jamo(consonant/vowel) combination  which
    provides us the hint for applying reordering by usage frequency.

    Most ACE algorithms show good compression ratio when frequently used
    characters are re-located into narrower code areas.
    Especially, to reduce DUDE XOR distance, we can make the narrow area
    fit in 16,256 and 4096 bytes-long page boundaries.


Hangul

    Each code point integer of modern hangul syllables (u+ac00 ~ u+bd7f) can be
    decomposed to produce the following 3 jamo(hangul consonant/vowel) indices,
        1) L: a leading consonant index in modern 19 leading consonants array,
        2) V: a vowel index in modern 21 vowels array,
        3) T: an optional trailing consonant index in modern 28 trailing
           consonants array.

    In fact, Unicode hangul range (u+ac00 ~ u+bd7f) is a 3-dimensional array
    Hangul[L][V][T] with base 0xac00 and each hangul syllable has
    code point integer determined by  L*21*28 + V*28 + T +  0xac00.

    Statistically, roughly six of 28 hangul trailing consonants are frequent in
    korean nouns.  Leading consontants and vowels show relatively even
    frequency distributions compared with that of trailing consonants.

    We can make a frequency mapping table f(T) for T so that
    f(T)= FrequencyOrderOf(T) in the array of modern trailing consonants.

    Now, Hangul[L][V][T]  is to be reordered into another 3-dimensional array
    Hangul[f(T)][f(V)][f(L)] with index T augmented to the highest order.

    Each hangul syllable in this reorderd range has a new code point value
    determined by  f(T)*21*19 + f(V)*19 + f(L) + 0xac00.

    Average f(T) value should be sufficiently lower than f(L) and F(V), so that
    the lowered code point value and induced lowered successive XOR-differences
    contribute to produce shorter ACE labes.


Basic Latin

    Basic Latin row u+0070 ~ u+007f has 'p','r','s','t' and 'u' which
    are more frequently used in European nouns than '`','j','k','f' and 'g'
    in u+0060 ~ u+006f row which includes most frequently used 'a'~'o' .

    Let's swap two sets of these 5 characters by character by character basis,
    so that 'p','r','s','t','u' go into the u+0060 ~ u+006f row.

    Single row fits in 16 bytes boundary and any character sequences
    from only this single row make XOR-distance or code window length
    shorter than 0x10 and make DUDE and other ACEs do good compression.


Extended Latin and Combining Diacritical Marks

    First 6 rows from Latin Extension A(u+0100 ~ u+015f)  and 6 rows from
    Basic Latin & Latin-1 Supplement (u+0000 ~ u+002f and u+0080 ~ u+00a0)
    are swapped.
    First 3 rows from Combining Diacritical Marks(u+0300 ~ u+032f)  and
    3 rows from Latin-1 Supplement (u+00B0 ~ u+00df) are also swapped.

    This makes frequently used parts of Latin Extended-A and
    Combining Diacritical Marks go into first 256-bytes page(u+0000 ~ u+00ff).
    Any character sequences from this single 256-bytes-long page make
    XOR-distance or code window length  much shorter than 0x100.

    This improvement benefits especially East-European and Vietnamese
    domains which use Latin Extented A and Combining Diacritical Marks.


Unified Han

    I have no statistical data to optimize for this code ranges, yet.

    CJK Unified Han syllables (u+4e00 ~ u+a48f) are ordered by their radicals.
    We could consider casting most frequently used 256 Han syllables into
    single 256-bytes page.
    We could consider casting most frequently used 4096 Han syllables into
    single 4096-bytes page ,too.


Other character sets

    I have no statistical data to optimize for this code ranges, yet.

    Japapanes gatakana and hiragana (u+3040 ~ u+30ff) code points have
    lexicographical order and if we could put most frequent ones in single row,
    it may greatly improve gatakana-only or hiragana-only domains.

    For Arabic,Hebrew,Cyrllic and Hindi, we could devise similiar optimizations,
    if relavent statistical data are avaiable.


Modified Encoding procedure of DUDE implementation of this idea

    All ordering of nybbles and quintets is big-endian (most significant
    first).  A nybble is 4 bits.  XOR is bitwise exclusive or.

    This modification are hyphen-safe.
    Hyphen encoding/decoding is not affected by this modification.

    let prev = 96
    for each input integer n (in order) do begin
      if n == 45 then output hyphen minus
      else begin

        n = reorder(n)   //  ******** ADDED **********

        let diff = prev XOR n
        extract the least significant nybbles of diff, as few as are
          sufficient to hold all the nonzero bits (but at least one)
        prepend 0 to the last nybble and 1 to the rest
        output base-32 characters corresponding to the quintets
        let prev = n
      end
    end

    The encoder must either correctly handle all integer values that can
    be represented in the type of its input, or it must check whether
    the input contains values that it cannot handle and return an error
    if so.  Under no circumstances may it produce incorrect output.


Modified Decoding procedure of DUDE implementation of this idea

    let prev = 96
    while the input string is not exhausted do begin
      if the next character is hyphen-minus then output 45
      else begin
        input characters and convert them to quintets until
          encountering a quintet beginning with 0
        fail upon encountering a non-base-32 character or end-of-input
        strip the first bit of each quintet
        concatenate the resulting nybbles to form diff
        let prev = prev XOR diff

        output restore_order(prev)  //  ******** MODIFIED **********

      end
    end
    encode the output sequence and compare it to the input string
    fail if they are not equal



Example strings

    about 20% improvement in DUDE compression ratio is measured in these
    Hangul examples.

    (A) Korean Strings 1: ( 24 hangul syllables )
        U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC U+B78C U+B4E4 U+C774
        U+D55C U+AD6D U+C5B4 U+B97C U+C774 U+D574 U+D55C U+B2E4 U+BA74
        U+C5BC U+B9C8 U+B098 U+C88B U+C744 U+AE4C

        DUDE-02: 6txIy79Ny53Nz79A8wIzwwNzzuAvyIzv3AtuuIz2vBy27Jz66Iz8sI\
                 tusAuIyz5I23Az96Iz6zE3xAz2tD96Ry3sI ( 89 chars )
        LDUDE  : 5szf8pt3bttat3jt6iywhu3bw9qt5m37r2vmxxjxsg2mtvat3auykz\
                 wpxubc8xd42fw6p ( 70 chars )


    (B) Korean Strings 2:  ( 6 hangul syllables )
        U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC

        DUDE-02: 6txiy79ny53nz79a8wizwwn  ( 24 chars )
        LDUDE  : 5szf8pt3bttat3jt6i  ( 19 chars )



    LDUDE improvents show better compression ratios for other scripts.

    (C) Vietnamese: ( 38 syllables using diacritical marks )
        Ta<dotbelow>isaoho<dotbelow>kh<ocirc>ngth<ecirc><hookabove>chi\
        <hookabove>no<acute>iti<ecirc><acute>ngVi<ecirc><dotbelow>t
        U+0054 u+0061 u+0323 u+0069 u+0073 u+0061 u+006F u+0068 u+006F
        u+0323 u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+00EA
        u+0309 u+0063 u+0068 u+0069 u+0309 u+006E u+006F u+0301 u+0069
        u+0074 u+0069 u+00EA u+0301 u+006E u+0067 U+0056 u+0069 u+00EA
        u+0323 u+0074

        DUDE-02: vEvfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\
                 yitptp2dv8mvyrjvBvr2dv6jvxh ( 82 chars )
        LDUDE  : uGuh5c5kckqhh5n4atm3n3ktmtdq2cxd7kmb7a7hb7q7irr2dxm7rt\
                 muDvr2dvj5f (66 chars , 16 chars(19%) shorter)


    (D) Spanish: ( using basic Latin & Latin Supplement )
        Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
        U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
        u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
        u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
        u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
        u+0061 u+00F1 u+006F u+006C

        DUDE-02: vAvrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmuMvgdt\
                 b3a3qd  (61 chars)
        LDUDE  : uAurftmtg2q2hbrhcbbmfcepnjiimidpjdqpmrmuMuqmb3a3qd
                 (51 chars, 10 chars (16%) shorter)


    (E) Czech:  (using Latin Extended A)
        Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
        U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
        u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
        u+0065 u+0073 u+006B u+0079

        DUDE-02: vAuctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (45 chars)

        LDUDE  : uAukfycypkfepzpzfmibmtb3m8ayiqtik (34 chars, 24% shorter )



Security considerations

    ACE-encoded reordered code points are restored in reverse ACE translation
    with no problem, and this improvement do not introdude any new security threat
    in ACE.


References

    [DUDE02] Mark Welter, Brian Spolarich, Adam Costello,
    "DUDE: Differential Unicode Domain Encoding", 2001-May-31,
    draft-ietf-idn-dude-02.

    [AMCACEW] Adam Costello, "AMC-ACE-W version 0.1.0",
    2001-May-31, draft-ietf-idn-amc-ace-w-00, latest version at
    http://www.cs.berkeley.edu/~amc/charset/amc-ace-w.


    [UNICODE] The Unicode Consortium, "The Unicode Standard",
    http://www.unicode.org/unicode/standard/standard.html.

    [IDNA]  Patrik Falstrom, Paul Hoffman, "Internationalizing Host
    Names In Applications (IDNA)",  draft-ietf-idn-idna-01

    [NAMEPREP]  Paul Hoffman, Marc Blanchet,  "Preparation of
    Internationalized Host Names",  Feb 2001,
    draft-ietf-idn-nameprep-03



Author

    Soobok Lee <lsb@postel.co.kr>
    http://www.postel.co.kr



Example implementation

    This idea is applicable to any ACEs.
    LDUDE (in this example implementation) is merely a name for
    DUDE implementation of this idea.

    Embedded hangul jamo and Latin frequency tables are subject
    to change with further studies in the next revision of this draft.

    In Unix, save this example source code into ldude.c

    % cc -o ldude ldude.c
    % ./ldude -e < input_file > output_file
    % ./ldude -d < output_file

    A input file should contains u+????-form code points
    delimited with spaces or newlines.


/******************************************/
/* ldude.c 0.1  (2001-Jun-11)             */
/* Soobok  Lee  <lsb@postel.co.kr>        */
/* Adam M. Costello <amc@cs.berkeley.edu> */
/******************************************/

/* This is ANSI C code (C89) implementing */
/* DUDE (draft-ietf-idn-ldude-01).        */


/************************************************************/
/* Public interface (would normally go in its own .h file): */

#include <stdio.h>
#include <limits.h>

enum dude_status {
  dude_success,
  dude_bad_input,
  dude_big_output  /* Output would exceed the space provided. */
};

enum case_sensitivity { case_sensitive, case_insensitive };

#if UINT_MAX >= 0x1FFFFF
typedef unsigned int u_code_point;
#else
typedef unsigned long u_code_point;
#endif

enum dude_status dude_encode(
  unsigned int input_length,
  const u_code_point input[],
  const unsigned char uppercase_flags[],
  unsigned int *output_size,
  char output[] );

    /* dude_encode() converts Unicode to DUDE (without any            */
    /* signature).  The input must be represented as an array         */
    /* of Unicode code points (not code units; surrogate pairs        */
    /* are not allowed), and the output will be represented as        */
    /* null-terminated ASCII.  The input_length is the number of code */
    /* points in the input.  The output_size is an in/out argument:   */
    /* the caller must pass in the maximum number of characters       */
    /* that may be output (including the terminating null), and on    */
    /* successful return it will contain the number of characters     */
    /* actually output (including the terminating null, so it will be */
    /* one more than strlen() would return, which is why it is called */
    /* output_size rather than output_length).  The uppercase_flags   */
    /* array must hold input_length boolean values, where nonzero     */
    /* means the corresponding Unicode character should be forced     */
    /* to uppercase after being decoded, and zero means it is         */
    /* caseless or should be forced to lowercase.  Alternatively,     */
    /* uppercase_flags may be a null pointer, which is equivalent     */
    /* to all zeros.  The encoder always outputs lowercase base-32    */
    /* characters except when nonzero values of uppercase_flags       */
    /* require otherwise.  The return value may be any of the         */
    /* dude_status values defined above; if not dude_success, then    */
    /* output_size and output may contain garbage.  On success, the   */
    /* encoder will never need to write an output_size greater than   */
    /* input_length*k+1 if all the input code points are less than 1  */
    /* << (4*k), because of how the encoding is defined.              */

enum dude_status dude_decode(
  enum case_sensitivity case_sensitivity,
  char scratch_space[],
  const char input[],
  unsigned int *output_length,
  u_code_point output[],
  unsigned char uppercase_flags[] );

    /* dude_decode() converts DUDE (without any signature) to         */
    /* Unicode.  The input must be represented as null-terminated     */
    /* ASCII, and the output will be represented as an array of       */
    /* Unicode code points.  The case_sensitivity argument influences */
    /* the check on the well-formedness of the input string; it       */
    /* must be case_sensitive if case-sensitive comparisons are       */
    /* allowed on encoded strings, case_insensitive otherwise.        */
    /* The scratch_space must point to space at least as large        */
    /* as the input, which will get overwritten (this allows the      */
    /* decoder to avoid calling malloc()).  The output_length is      */
    /* an in/out argument: the caller must pass in the maximum        */
    /* number of code points that may be output, and on successful    */
    /* return it will contain the actual number of code points        */
    /* output.  The uppercase_flags array must have room for at       */
    /* least output_length values, or it may be a null pointer if     */
    /* the case information is not needed.  A nonzero flag indicates  */
    /* that the corresponding Unicode character should be forced to   */
    /* uppercase by the caller, while zero means it is caseless or    */
    /* should be forced to lowercase.  The return value may be any    */
    /* of the dude_status values defined above; if not dude_success,  */
    /* then output_length, output, and uppercase_flags may contain    */
    /* garbage.  On success, the decoder will never need to write     */
    /* an output_length greater than the length of the input (not     */
    /* counting the null terminator), because of how the encoding is  */
    /* defined.                                                       */


/**********************************************************/
/* Implementation (would normally go in its own .c file): */

#include <string.h>

/* Character utilities: */

/* base32[q] is the lowercase base-32 character representing  */
/* the number q from the range 0 to 31.  Note that we cannot  */
/* use string literals for ASCII characters because an ANSI C */
/* compiler does not necessarily use ASCII.                   */

static const char base32[] = {
  97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107,     /* a-k */
  109, 110,                                               /* m-n */
  112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122,  /* p-z */
  50, 51, 52, 53, 54, 55, 56, 57                          /* 2-9 */
};

/* base32_decode(c) returns the value of a base-32 character, in the */
/* range 0 to 31, or the constant base32_invalid if c is not a valid */
/* base-32 character.                                                */

enum { base32_invalid = 32 };

static unsigned int base32_decode(char c)
{
  if (c < 50) return base32_invalid;
  if (c <= 57) return c - 26;
  if (c < 97) c += 32;
  if (c < 97 || c == 108 || c == 111 || c > 122) return base32_invalid;
  return c - 97 - (c > 108) - (c > 111);
}

/* unequal(case_sensitivity,s1,s2) returns 0 if the strings s1 and s2 */
/* are equal, 1 otherwise.  If case_sensitivity is case_insensitive,  */
/* then ASCII A-Z are considered equal to a-z respectively.           */

static int unequal( enum case_sensitivity case_sensitivity,
                    const char s1[], const char s2[]        )
{
  char c1, c2;

  if (case_sensitivity != case_insensitive) return strcmp(s1,s2) != 0;

  for (;;) {
    c1 = *s1;
    c2 = *s2;
    if (c1 >= 65 && c1 <= 90) c1 += 32;
    if (c2 >= 65 && c2 <= 90) c2 += 32;
    if (c1 != c2) return 1;
    if (c1 == 0) return 0;
    ++s1, ++s2;
  }
}



/* Begin of Improvement Code by LSB  */
/* LANGUAGE-SPECIFIC IMPROVEMENTS TO DUDE  BASED ON CODE REORDERING */

/* Common Constants for Hangul */
static u_code_point
        SBase = 0xAC00, LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
        LCount = 19, VCount = 21, TCount = 28, NCount, SCount;
static u_code_point NCount = 588;   // VCount * TCount
static u_code_point SCount = 11172; // LCount * NCount
static u_code_point MCount = 399; // LCount * VCount

/* Hangul Jamo Frequency Mapping/Reverse Mapping Table */
int JAMO_L_FREQ[19][2] = {
        {1,11}, // "G"    0
        {14,0}, // "GG"
        {9,9}, // "N"
        {5,12}, // "D"
        {15,7}, // "DD"
        {13,3}, // "R"    5
        {7,18}, // "M"
        {4,6 }, // "B"
        {16,14}, // "BB"
        {2, 2}, // "S"
        {17,17}, // "SS" 10
        {0,16}, // "O"
        {3,15}, // "J"
        {18,5 }, // "JJ"
        {8,1}, // "C"
        {12,4}, // "K"   15
        {11,8}, // "T"
        {10,10}, // "P"
        {6 ,13}  // "H"
    };

 int JAMO_V_FREQ[21][2] = {
        {2,20}, // "A"    0
        {7,5}, // "AE"
        {15,0}, // "YA"
        {20,13}, // "YAE"
        {5,18}, // "EO"
        {1, 4}, // "E"     5
        {9,8}, // "YEO"
        {13,1},// "YE"
        {6,13}, // "O"
        {10, 6}, // "WA"
        {18,9}, // "WAE"  10
        {16,17}, // "OE"
        {17,14}, // "YO"
        {8, 7}, // "U"
        {12,16}, // "WEO"
        {19, 2}, // "WE"   15
        {14,11}, // "WI"
        {11,12}, // "YU"
        {4,10}, // "EU"
        {19,19}, // "YI"
        {0,3}  // "I"     20
    };

  int JAMO_T_FREQ[28][2] = {
        {0,0}, // ""      0
        {5,4}, // "G"
        {15,21}, // "GG"
        {16,8}, // "GS"
        {1,16}, // "N"
        {27,1 }, // "NJ"   5
        {17,17}, // "NH"
        {10,19}, // "D"
        {3,24}, // "L"
        {18,27}, // "LG"
        {19, 7}, // "LM"   10
        {20,22}, // "LB"
        {21,25}, // "LS"
        {22,26}, // "LT"
        {23,27}, // "LP"
        {24,2}, // "LH"   15
        {4,3}, // "M"
        {6,6}, // "B"
        {25,9}, // "BS"
        {7,10}, // "S"
        {26,11}, // "SS"   20
        {2,12}, // "NG"
        {11,13}, // "J"
        {10,14}, // "C"
        {8 ,15}, // "K"
        {12,18}, // "T"    25
        {13,20}, // "P"
        {9 ,5}  // "H"    27
    };

/* Hangul Decomposition */


int isHANGUL(u_code_point s) {
        int SIndex = s - SBase;
        if (SIndex < 0 || SIndex >= SCount) {
            return 0;
        }
        return 1;
};
int isUNIHAN(u_code_point s) {
        if (s >= 0x4E00 && s <= 0x9FAF) {
            return 1;
        }
        return 0;
};
int isLatins(u_code_point s) {
        if (s <  0x370) {
            return 1;
        }
        return 0;
};

u_code_point reorder_hangul(u_code_point s) {
        u_code_point SIndex = s - SBase;
        int zL = JAMO_L_FREQ[SIndex / NCount][0];
        int zV = JAMO_V_FREQ[(SIndex % NCount) / TCount][0];
        int zT = JAMO_T_FREQ[SIndex % TCount][0];
        int idx= (zT*MCount +  zV*LCount +  zL);
        if(idx < SCount-0x400) {
           idx += 0x400;
        } else {
           idx  = idx - (SCount -0x400);
        };
        return  idx + SBase;
}

u_code_point restore_order_hangul(u_code_point z) {
        int T,V,L;
        u_code_point zIndex = z-SBase;
        if(zIndex < 0x400) {
           zIndex += SCount - 0x400;
        } else {
           zIndex  = zIndex - 0x400;
        };
        T = JAMO_T_FREQ[(zIndex / MCount)][1];
        V = JAMO_V_FREQ[(zIndex % MCount) / LCount][1];
        L = JAMO_L_FREQ[(zIndex % LCount)][1];
        return (L*NCount + V*TCount + T) + SBase;
}

u_code_point reorder_unihan(u_code_point s) {
        return s;
}

u_code_point restore_order_unihan(u_code_point z) {
        return z;
}


#define MAPCHAR(x,A,B,bytes) if(A<=x && x< (A+bytes))      return(x+(B-A)); if(B<=x && x< (B+bytes))      return(x+(A-B))
#define MAP16BL(x,A,B,block) if(A<=x && x< (A+(block<<4))) return(x+(B-A)); if(B<=x && x< (B+(block<<4))) return(x+(A-B))

u_code_point reorder_latins(u_code_point s) {
        MAP16BL(s,0x0100,0x0000,3); // Latin Extension A
        MAP16BL(s,0x0130,0x0080,2);
        MAP16BL(s,0x0150,0x00A0,1);
        MAP16BL(s,0x0300,0x00B0,3); // Combining Diacritical Marks
        MAPCHAR(s,0x0070,0x0060,1); // p,`
        MAPCHAR(s,0x0072,0x006A,1); // r,j
        MAPCHAR(s,0x0073,0x006B,1); // s,k
        MAPCHAR(s,0x0074,0x0066,1); // t,f
        MAPCHAR(s,0x0075,0x0067,1); // u,g
        MAPCHAR(s,0x0050,0x0040,1); // P,@  UPPER
        MAPCHAR(s,0x0052,0x004A,1); // R,J  UPPER
        MAPCHAR(s,0x0053,0x004B,1); // S,K  UPPER
        MAPCHAR(s,0x0054,0x0046,1); // T,F  UPPER
        MAPCHAR(s,0x0055,0x0047,1); // U,G  UPPER
        MAPCHAR(s,0x0160,0x003A,6); // Latin Extension A
        MAPCHAR(s,0x0166,0x005B,5);
        MAPCHAR(s,0x016B,0x007B,5);
        return s;
}

u_code_point restore_order_latins(u_code_point z) {
        return reorder_latins(z);
}

u_code_point reorder(u_code_point s) {
        if(isHANGUL(s)) return reorder_hangul(s);
        if(isUNIHAN(s)) return reorder_unihan(s);
        if(isLatins(s)) return reorder_latins(s);
        return s;
}
u_code_point restore_order(u_code_point s) {
        if(isHANGUL(s)) return restore_order_hangul(s);
        if(isUNIHAN(s)) return restore_order_unihan(s);
        if(isLatins(s)) return restore_order_latins(s);
        return s;
}

/* End of Improvement Code */

/* Encoder: */

enum dude_status dude_encode(
  unsigned int input_length,
  const u_code_point input[],
  const unsigned char uppercase_flags[],
  unsigned int *output_size,
  char output[] )
{
  unsigned int max_out, in, out, k, j;
  u_code_point prev, codept, diff, tmp;
  char shift;

  prev = 0x60;
  max_out = *output_size;

  for (in = out = 0;  in < input_length;  ++in) {

    /* At the start of each iteration, in and out are the number of */
    /* items already input/output, or equivalently, the indices of  */
    /* the next items to be input/output.                           */

    codept = input[in];

    if (codept == 0x2D) {
      /* Hyphen-minus stands for itself. */
      if (max_out - out < 1) return dude_big_output;
      output[out++] = 0x2D;
      continue;
    }

    codept = reorder(codept); // by LSB

    diff = prev^codept;


    /* Compute the number of base-32 characters (k): */
    for (tmp = diff >> 4, k = 1;  tmp != 0;  ++k, tmp >>= 4);
    //fprintf(stderr,"diff %x,%x = prev %x ^ codept %x \n", k,diff,prev,codept);

    if (max_out - out < k) return dude_big_output;
    shift = uppercase_flags && uppercase_flags[in] ? 32 : 0;
    /* shift controls the case of the last base-32 digit. */

    /* Each quintet has the form 1xxxx except the last is 0xxxx. */
    /* Computing the base-32 digits in reverse order is easiest. */

    out += k;
    output[out - 1] = base32[diff & 0xF] - shift;

    for (j = 2;  j <= k;  ++j) {
      diff >>= 4;
      output[out - j] = base32[0x10 | (diff & 0xF)];
    }

    prev = codept;
  }

  /* Append the null terminator: */
  if (max_out - out < 1) return dude_big_output;
  output[out++] = 0;

  *output_size = out;
  return dude_success;
}


/* Decoder: */

enum dude_status dude_decode(
  enum case_sensitivity case_sensitivity,
  char scratch_space[],
  const char input[],
  unsigned int *output_length,
  u_code_point output[],
  unsigned char uppercase_flags[] )
{
  u_code_point prev, q, diff;
  char c;
  unsigned int max_out, in, out, scratch_size;
  enum dude_status status;

  prev = 0x60;
  max_out = *output_length;

  for (c = input[in = 0], out = 0;  c != 0;  c = input[++in], ++out) {

    /* At the start of each iteration, in and out are the number of */
    /* items already input/output, or equivalently, the indices of  */
    /* the next items to be input/output.                           */

    if (max_out - out < 1) return dude_big_output;

    if (c == 0x2D) output[out] = c;  /* hyphen-minus is literal */
    else {
      /* Base-32 sequence.  Decode quintets until 0xxxx is found: */

      for (diff = 0;  ;  c = input[++in]) {
        q = base32_decode(c);
        if (q == base32_invalid){ return dude_bad_input; };
        diff = (diff << 4) | (q & 0xF);
        if (q >> 4 == 0) break;
      }

      // prev = output[out] = prev ^ diff; commented by LSB
      prev = prev ^ diff;
      output[out] = restore_order(prev); // LSB

    }

    /* Case of last character determines uppercase flag: */
    if (uppercase_flags) uppercase_flags[out] = c >= 65 && c <= 90;
  }

  /* Enforce the uniqueness of the encoding by re-encoding */
  /* the output and comparing the result to the input:     */

  scratch_size = ++in;
  status = dude_encode(out, output, uppercase_flags,
                       &scratch_size, scratch_space);
  if (status != dude_success || scratch_size != in ||
      unequal(case_sensitivity, scratch_space, input)
     ) return dude_bad_input;
  *output_length = out;
  return dude_success;
}


/******************************************************************/
/* Wrapper for testing (would normally go in a separate .c file): */

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* For testing, we'll just set some compile-time limits rather than */
/* use malloc(), and set a compile-time option rather than using a  */
/* command-line option.                                             */

enum {
  unicode_max_length = 256,
  ace_max_size = 256,
  test_case_sensitivity = case_insensitive
                          /* suitable for host names */
};


static void usage(char **argv)
{
  fprintf(stderr,
    "%s -e reads code points and writes a DUDE string.\n"
    "%s -d reads a DUDE string and writes code points.\n"
    "Input and output are plain text in the native character set.\n"
    "Code points are in the form u+hex separated by whitespace.\n"
    "A DUDE string is a newline-terminated sequence of LDH characters\n"
    "(without any signature).\n"
    "The case of the u in u+hex is the force-to-uppercase flag.\n"
    , argv[0], argv[0]);
  exit(EXIT_FAILURE);
}


static void fail(const char *msg)
{
  fputs(msg,stderr);
  exit(EXIT_FAILURE);
}

static const char too_big[] =
  "input or output is too large, recompile with larger limits\n";
static const char invalid_input[] = "invalid input\n";
static const char io_error[] = "I/O error\n";


/* The following string is used to convert LDH      */
/* characters between ASCII and the native charset: */

static const char ldh_ascii[] =
  "................"
  "................"
  ".............-.."
  "0123456789......"
  ".ABCDEFGHIJKLMNO"
  "PQRSTUVWXYZ....."
  ".abcdefghijklmno"
  "pqrstuvwxyz";


int main(int argc, char **argv)
{
  enum dude_status status;
  int r;
  char *p;

  if (argc != 2) usage(argv);
  if (argv[1][0] != '-') usage(argv);
  if (argv[1][2] != 0) usage(argv);

  if (argv[1][1] == 'e') {
    u_code_point input[unicode_max_length];
    unsigned long codept;
    unsigned char uppercase_flags[unicode_max_length];
    char output[ace_max_size], uplus[3];
    unsigned int input_length, output_size, i;

    /* Read the input code points: */

    input_length = 0;

    for (;;) {
      r = scanf("%2s%lx", uplus, &codept);
      if (ferror(stdin)) fail(io_error);
      if (r == EOF || r == 0) break;

      if (r != 2 || uplus[1] != '+' || codept > (u_code_point)-1) {
        fail(invalid_input);
      }

      if (input_length == unicode_max_length) fail(too_big);

      if (uplus[0] == 'u') uppercase_flags[input_length] = 0;
      else if (uplus[0] == 'U') uppercase_flags[input_length] = 1;
      else fail(invalid_input);

      input[input_length++] = codept;
    }

    /* Encode: */

    output_size = ace_max_size;
    status = dude_encode(input_length, input, uppercase_flags,
                         &output_size, output);
    if (status == dude_bad_input) fail(invalid_input);
    if (status == dude_big_output) fail(too_big);
    assert(status == dude_success);

    /* Convert to native charset and output: */

    for (p = output;  *p != 0;  ++p) {
      i = *p;
      assert(i <= 122 && ldh_ascii[i] != '.');
      *p = ldh_ascii[i];
    }

    r = puts(output);
    if (r == EOF) fail(io_error);
    return EXIT_SUCCESS;
  }

  if (argv[1][1] == 'd') {
    char input[ace_max_size], scratch[ace_max_size], *pp;
    u_code_point output[unicode_max_length];
    unsigned char uppercase_flags[unicode_max_length];
    unsigned int input_length, output_length, i;

    /* Read the DUDE input string and convert to ASCII: */

    fgets(input, ace_max_size, stdin);
    if (ferror(stdin)) fail(io_error);
    if (feof(stdin)) fail(invalid_input);
    input_length = strlen(input);
    if (input[input_length - 1] != '\n') fail(too_big);
    input[--input_length] = 0;

    for (p = input;  *p != 0;  ++p) {
      pp = strchr(ldh_ascii, *p);
      if (pp == 0) fail(invalid_input);
      *p = pp - ldh_ascii;
    }

    /* Decode: */

    output_length = unicode_max_length;
    status = dude_decode(test_case_sensitivity, scratch, input,
                         &output_length, output, uppercase_flags);
    if (status == dude_bad_input) fail(invalid_input);
    if (status == dude_big_output) fail(too_big);
    assert(status == dude_success);

    /* Output the result: */

    for (i = 0;  i < output_length;  ++i) {
      r = printf("%s+%04lX\n",
                 uppercase_flags[i] ? "U" : "u",
                 (unsigned long) output[i] );
      if (r < 0) fail(io_error);
    }

    return EXIT_SUCCESS;
  }

  usage(argv);
  return EXIT_SUCCESS;  /* not reached, but quiets compiler warning */
}
Prev by Date: [idn] Why do IETF people focus so much on face-to-face meetings (like London)...??
Next by Date: Re: [idn] Report from the ACE design team
Prev by thread: [idn] I-D ACTION:draft-ietf-idn-ace-eval-jp-00.txt
Next by thread: Re: [idn] [draft] Improving ACE using code point reordering v0.5
Index(es):
- Date
- Thread