
[idn] Re: permission <draft-ietf-idn-ace37-00.txt (attach)



At/À 11:48 2001-07-05 -0400, Edmon you wrote/vous écriviez:
>Hi all,
>
>I was unaware that the workgroup no longer accepts new drafts.

see:

Message-Id: <5.1.0.14.1.20010626000156.03d85e50@mail.viagenie.qc.ca>
Date: Tue, 26 Jun 2001 00:05:28 -0400
To: idn@ops.ietf.org
From: Marc Blanchet <Marc.Blanchet@viagenie.qc.ca>
Subject: [idn] wg next steps

and:

Message-Id: <5.1.0.14.1.20010629080012.02042a10@mail.viagenie.qc.ca>
Date: Fri, 29 Jun 2001 08:06:14 -0400
To: idn@ops.ietf.org
From: Marc Blanchet <Marc.Blanchet@viagenie.qc.ca>
Subject: [idn] document pools active

And, as I wrote in the email, you are _encouraged_ to submit it as an
individual submission. The only differences are the filename and no listing
on the ietf idn wg charter web page.

Marc.


>   Anyway, I
>have drafted a new ACE based on the simplicity of DUDE which has hugely
>improved compression.  Worst case scenario CJK could have 21 han characters!
>Attached below is a copy of the draft (for my original submission), you can
>also find it at http://www.dnsii.org/idn-ace37-00.txt (easier to read) and
>hopefully in the i-d-n.net website soon.
>
>ACE37 is based on the one-pass one-mode scheme of DUDE (differential XOR),
>then utilizes a simple code block shifting (similar to the reference points
>in the AMC series) to hugely increase the capacity for CJK (worst case
>scenario 21 han characters!) and then utilizes base-32 for compression (as
>in LACE) (DUDE and AMC-w/v use base-32 only for flagging).  In addition to
>base-32, a base-4 scheme is introduced by using the remaining characters
>{wxyz}.  These contain 2 bits of character information and double as an
>indicator for codepoint brackets.  All the while, the algorithm is kept
>as simple as DUDE.
>
>Hopefully you might find that it is interesting and appropriate to be
>considered as an ACE within the IETF.  After all, it was intended to be an
>integrated version of the three primary ACEs: DUDE, LACE and the AMC series,
>identified by the ACE design team report.
>
>Looking forward to all your inputs.
>
>Edmon
>
>PS. I have created Excel worksheets to illustrate the Encoding and
>Decoding procedures as well; you can find them at
>http://www.dnsii.org/ace37/ace37-encode.xls and
>http://www.dnsii.org/ace37/ace37-decode.xls respectively.
>
>
>
>----- Original Message -----
>From: "Marc Blanchet" <Marc.Blanchet@viagenie.qc.ca>
>To: "Natalia Syracuse" <nsyracus@ietf.org>; <edmon@neteka.com>;
><david@neteka.com>
>Cc: <jseng@pobox.org.sg>
>Sent: Thursday, July 05, 2001 8:50 AM
>Subject: Re: permission <draft-ietf-idn-ace37-00.txt (attach)
>
>
> > I'm sorry but the new wg policy is not to accept drafts unless there is
> > demonstrated support. But drafts are _highly_ encouraged to be published
> > as individual submissions. I would recommend putting idn in the filename
> > and using this filenaming convention: draft-<yourname>-idn-ace37-00.txt.
> > After publication as an internet-draft, the author should announce it on
> > the wg mailing list and I'll put a reference to it on the wg web page.
> >
> > So please publish it as individual submission.
> >
> > Marc.
> >
> > At/À 08:34 2001-07-05 -0400, Natalia Syracuse you wrote/vous écriviez:
> > >
> > >
> > >
> > >Internet Draft                                 Edmon Chung, Neteka Inc.
> > ><draft-ietf-idn-ace37-00.txt>                  David Leung, Neteka Inc.
> > >                                                               June 2001
> > >
> > >
> > >
> > >           ACE Utilizing All 37 Alphanumeric Characters (ACE37)
> > >
> > >
> > >STATUS OF THIS MEMO
> > >
> > >    This document is an Internet-Draft and is in full conformance with
> > >    all provisions of Section 10 of RFC2026.
> > >
> > >    Internet-Drafts are working documents of the Internet Engineering
> > >    Task Force (IETF), its areas, and its working groups.  Note that
> > >    other groups may also distribute working documents as Internet-
> > >    Drafts.  Internet-Drafts are draft documents valid for a maximum of
> > >    six months and may be updated, replaced, or obsoleted by other
> > >    documents at any time.  It is inappropriate to use Internet-Drafts
> > >    as reference material or to cite them other than as "work in
> > >    progress."
> > >
> > >    The reader is cautioned not to depend on the values that appear in
> > >    examples to be current or complete, since their purpose is primarily
> > >    educational.  Distribution of this memo is unlimited.
> > >
> > >    The list of current Internet-Drafts can be accessed at
> > >    http://www.ietf.org/ietf/1id-abstracts.txt
> > >    The list of Internet-Draft Shadow Directories can be accessed at
> > >    http://www.ietf.org/shadow.html.
> > >
> > >Abstract
> > >
> > >    ACE37 is a combination of DUDE-02, AMC-W/V and LACE.  ACE37 utilizes
> > >    the simple one pass algorithm of DUDE, the character block
> > >    considerations of AMC-W/V and the Base-32 compression of LACE.  It
> > >    also fully utilizes the entire LDH set currently allowed in the DNS
> > >    (A-Z, a-z, 0-9 and "-") within its character repertoire to optimize
> > >    performance and compression.  Even in the worst-case scenario for
> > >    ACE37, any name can hold at least 21 characters, including Chinese,
> > >    Japanese and Korean names.  Two Excel spreadsheets for ACE37
> > >    encoding and decoding can be found at
> > >    http://www.dnsii.org/ace37/ace37-encode.xls and
> > >    http://www.dnsii.org/ace37/ace37-decode.xls respectively.
> > >
> > >    While DUDE-02 provides a very efficient differential mechanism, its
> > >    compression is inefficient as it fails to take advantage of the
> > >    base-32 scheme by using all 5 bits for character information.  The
> > >    AMC series is highly efficient in compression but requires
> > >    complicated mode changes and is therefore inefficient to process.
> > >    LACE is moderately efficient and requires a two-pass mechanism, but
> > >    utilizes base-32 for good compression.
> > >
> > >
> > >Chung & Leung                                                  [Page 1]
> > >ACE37       ACE Utilizing All 37 Alphanumeric Characters      July 2001
> > >
> > >    ACE37 uses simple character block shifting to achieve the
> > >    compression efficiency of the AMC series, retains the one-pass and
> > >    one mode XOR differential mechanism used by DUDE while embracing the
> > >    base-32 compression used by LACE for efficient character bit
> > >    information.
> > >
> > >Terminology
> > >
> > >    The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
> > >    and "MAY" in this document are to be interpreted as described in RFC
> > >    2119 [RFC2119].
> > >
> > >    LDH: Letters, Digits and Hyphens: a string of characters that
> > >    consists only of hyphens ("-"), English letters (A-Z, a-z) and
> > >    digits (0-9), which might not be a result of an algorithm for
> > >    transcoding multilingual characters. For example:
> > >    whatever-you-want.example
> > >
> > >    ACE - ASCII Compatible Encoding: a string of characters resulting
> > >    from a particular algorithm for transforming multilingual character
> > >    information into an alphanumeric form acceptable by the existing
> > >    DNS.  For example: bq--3bhc2zmh.tld.  In essence, ACE is a subset of
> > >    LDH.
> > >
> > >    Hexadecimal values are shown preceded by "0x".  For example, 0x60
> > >    is decimal 96.  Binary values are shown preceded by "0b"; for
> > >    example, "0b1000" is decimal 8.  As in the Unicode Standard
> > >    [UNICODE], Unicode code points are denoted by "U+" followed by four
> > >    to six hexadecimal digits, while a range of code points (or
> > >    hexadecimal numbers) is denoted by two hexadecimal numbers separated
> > >    by "..", with no prefixes.
> > >
> > >    Octets: sequences of 8 bits; Quintets: sequences of 5 bits;
> > >    Quartets: sequences of 4 bits; Duplets: sequences of 2 bits.
> > >
> > >    XOR: bitwise exclusive or.  Given 2 nonnegative integers A and B, A
> > >    XOR B is the nonnegative integer value whose binary representation
> > >    has a 1 wherever A and B disagree, and a 0 wherever they agree.
> > >
> > >Table Of Contents
> > >
> > >    1. Introduction
> > >    2. Code Block Shifting
> > >    3. Base-32 Characters
> > >    4. Base-4 Characters
> > >    5. LDH Considerations
> > >    6. Encoding Procedure
> > >    7. Decoding Procedure
> > >    8. Examples
> > >    9. Summary & Comparisons
> > >    10. Security Considerations
> > >    11. References
> > >
> > >
> > >1. Introduction
> > >
> > >    ACE37 takes into account the recommendations and findings of the ACE
> > >    design team to create a "super-ACE" that incorporates the key
> > >    advantages of the various considered ACEs without complicated mode
> > >    changes.  The encoding (Section 6) and decoding (Section 7) process
> > >    is largely similar to and as simple as DUDE-02.  The encoding
> > >    processes for ACE37 in comparison with DUDE-02 could be summarized:
> > >
> > >         ACE37 Encoding Procedure     |     DUDE Encoding Procedure
> > >     ---------------------------------+---------------------------------
> > >     (1) let initial prev = 0x00      | (1) let initial prev = 0x60
> > >     (2) if n = LDH output "-n"       | (2) if n = hyphen output "-"
> > >     (3) code block shift to obtain   | (3) diff = prev XOR n
> > >           ACE37 shifted n (Section 2)| (4) prepend "0" to the last
> > >     (4) diff = prev XOR n            |      quartet and "1" to others
> > >     (5) output in appropriate base-4 | (5) output a base-32 character
> > >           and base-32 form           |      for each corresponding
> > >           (Sections 3&4)             |      quintet
> > >     (6) let prev = n                 | (6) let prev = n
> > >
> > >    Similarly, the decoding process can be described and compared:
> > >
> > >         ACE37 Decoding Procedure     |     DUDE Decoding Procedure
> > >     ---------------------------------+---------------------------------
> > >     (1) let initial prev = 0x00      | (1) let initial prev = 0x60
> > >     (2) if char = hyphen discard "-" | (2) if char = hyphen consume
> > >           and output next char       |       and output 0x002D
> > >     (3) consume and convert char into| (3) consume and convert to
> > >           duplets and quintets       |       quintets until encoun-
> > >           (according to Sections 3&4)|       tering a quintet with "0"
> > >     (4) concatenate to form diff     |       as first bit
> > >           (based on Sections 4.1&4.2)| (4) strip all first bits off
> > >     (5) let prev = prev XOR diff     | (5) concatenate to form diff
> > >     (6) reverse code block shifting  | (6) let prev = prev XOR diff
> > >     (7) output Unicode code point    | (7) output Unicode code point
> > >
> > >    The features of ACE37 include:
> > >
> > >    Unique & Reversible - the ACE37 encoding scheme yields a unique and
> > >    consistent result string for a given set of Unicode code points.
> > >    The encoded string could be decoded back to the original Unicode
> > >    code points without loss of character data.
> > >
> > >    Simple - ACE37 utilizes a one-pass system and the XOR differential
> > >    function to encode and decode.  Code block shifting is done by a
> > >    simple calculation instead of mapping or creation of arbitrary
> > >    reference points. Complex mode changes are not required.
> > >
> > >    Spacious - With the code block shifting coupled with a base-32
> > >    scheme, ACE37 can accommodate up to 21 unique Han characters
> > >    (including CJK) within the 63 octets allowed by the DNS.  Other
> > >    Latin based scripts can reach up to 31 characters.
> > >
> > >
> > >    Completeness - any sequence of Unicode code points
> > >    (U+0000..U+10FFFF) could be encoded.  Restrictions on allowed code
> > >    points are not discussed, but it is expected that Nameprep
> > >    [Nameprep] will be used prior to ACE37 encoding.
> > >
> > >    In essence, it captures the key criteria discussed by the
> > >    workgroup ACE design team - reversibility, simplicity and
> > >    compression capability.  Moreover, ACE37 utilizes a very simple code
> > >    block shifting (Section 2) mechanism to allow any 21 CJK
> > >    ideographs to be encoded within the 63-octet constraint.
> > >
> > >2. Code Block Shifting
> > >
> > >    While the DNS was not originally designed for multilingual
> > >    characters, Unicode was not designed with the DNS in mind and
> > >    therefore code points were apparently not allocated in an ACE-
> > >    friendly way.
> > >
> > >    The AMC series [AMC-W & AMC-V] utilizes a number of reference points
> > >    to achieve better compression efficiency by anticipating and
> > >    minimizing delta between characters.  For ACE37, a much simpler
> > >    rendering is used.  More specifically, the entire character block
> > >    U+3000..U+9FFF for CJK ideographs is shifted down by 0x3000.  That
> > >    is U+3000 will become 0x0000, U+4000 becomes 0x1000, and so on.  To
> > >    compensate for the downwards shift, the general script and symbol
> > >    characters in U+0000..U+2FFF will be shifted upwards by 0x7000.
> > >    Therefore, U+0100 will become 0x7100, U+2000 becomes 0x9000, and so
> > >    on.  All other code points (U+A000..U+10FFFF) are unchanged.
> > >
> > >       Original Unicode Allocation   |     ACE37 Code Block Shifted
> > >     --------------------------------|-------------------------------
> > >       General Scripts  U+0000 -+    |     +- 0x0000 CJK Misc
> > >                        U+1000  |    |     |  0x1000 CJK Ideographs
> > >                                +-   |  -> |  0x2000
> > >       Symbols          U+2000 -+ \  | /   |  0x3000
> > >                                   \ |/    |  0x4000
> > >       CJK Misc         U+3000 -+   \/     |  0x5000
> > >       CJK Ideographs   U+4000  |   /\     +- 0x6000
> > >                        U+5000  |  / |\
> > >                        U+6000  +--  | \   +- 0x7000 General Scripts
> > >                        U+7000  |    |  -> |  0x8000
> > >                        U+8000  |    |     |
> > >                        U+9000 -+    |     +- 0x9000 Symbols
> > >                                     |
> > >       Hangul           U+A000 -+    |     +- 0xA000 Hangul
> > >                        U+B000  |    |     |  0xB000
> > >                        U+C000  +----|---> |  0xC000
> > >                        U+D000  |    |     |  0xD000
> > >         :                 :   -+    |     +-    :      :
> > >                                     |
> > >
> > >
> > >
> > >    This shifting effectively moves the entire Han library to within
> > >    0x6FFF and therefore could be represented in 15-bits or exactly 3
> > >    base-32 characters.  (details on base-32 characters in Section 3)
> > >
> > >    For example, the Chinese character for <change> with the original
> > >    Unicode code point at U+8F49, will be shifted to 0x5F49 and can be
> > >    represented in 3 quintets, and in turn with 3 base-32 characters:
> > >
> > >                     Character: <change>
> > >            Unicode Code Point: U+8F49
> > >                 ACE37 Shifted: 0x5F49
> > >        Corresponding Quartets: 0101 1111 0100 1001
> > >            Resulting Quintets: 10111 11010 01001
> > >                       Base-32: nq9   (further discussed in Section 3)
> > >
> > >    This in turn means that any Chinese character could be represented
> > >    with 3 base-32 characters making the total possible characters
> > >    within a label, even without further compression introduced by the
> > >    XOR differential process (Section 6), to be at least 21.  The ACE37
> > >    code block shifting process could be described as follows:
> > >
> > >       for each input code point = n
> > >       if n <= 0x9FFF
> > >          n = n - 0x3000      /*downwards shifting*/
> > >          if n < 0
> > >             n = 0xA000 + n   /*compensation for U+0000..U+2FFF*/
> > >
> > >    The character block shifting introduced here is extremely simple and
> > >    utilizes simple calculation that requires no mapping function.  At
> > >    the same time, it achieves the goal in adjusting the Unicode
> > >    allocation so that it becomes more ACE friendly.
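The shifting described in Section 2 can be sketched in Python as below. This is only an illustration: the function names shift and unshift are hypothetical, and the wrap-around constant is taken as 0xA000 so that the results agree with the examples given in the text (U+0100 becomes 0x7100, U+2000 becomes 0x9000, U+8F49 becomes 0x5F49).

```python
def shift(n):
    """Map a Unicode code point to its ACE37 shifted value (Section 2)."""
    if n <= 0x9FFF:
        n -= 0x3000          # U+3000..U+9FFF moves down to 0x0000..0x6FFF
        if n < 0:
            n += 0xA000      # U+0000..U+2FFF wraps up to 0x7000..0x9FFF
    return n


def unshift(n):
    """Reverse the code block shifting (assumed inverse of shift)."""
    if n <= 0x9FFF:
        n += 0x3000          # 0x0000..0x6FFF moves back up to U+3000..U+9FFF
        if n > 0x9FFF:
            n -= 0xA000      # 0x7000..0x9FFF wraps back to U+0000..U+2FFF
    return n


print(hex(shift(0x8F49)))    # the <change> example from the text
print(hex(shift(0x0100)))
print(hex(unshift(0x7100)))
```

Code points from U+A000 upwards pass through both functions unchanged, as stated in the text.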
> > >
> > >3. Base-32 Characters
> > >
> > >    Base-32 characters are used in LACE for compression, while DUDE-02
> > >    and the AMC series utilize them only for quartet flagging to indicate
> > >    the last quartet of each encoded code point.  ACE37 utilizes base-32
> > >    characters for compression while base-4 characters, which will be
> > >    introduced in Section 4, determine the compressed code point
> > >    brackets.
> > >
> > >    The following table shows the 32 base-32 characters and their
> > >    corresponding quintets:
> > >
> > >    Base-32 Character =to= Corresponding Quintet
> > >        0 = 00000       8 = 01000       g = 10000       o = 11000
> > >        1 = 00001       9 = 01001       h = 10001       p = 11001
> > >        2 = 00010       a = 01010       i = 10010       q = 11010
> > >        3 = 00011       b = 01011       j = 10011       r = 11011
> > >        4 = 00100       c = 01100       k = 10100       s = 11100
> > >        5 = 00101       d = 01101       l = 10101       t = 11101
> > >        6 = 00110       e = 01110       m = 10110       u = 11110
> > >        7 = 00111       f = 01111       n = 10111       v = 11111
> > >
> > >
> > >    With this layout of base-32 characters, it is also possible to
> > >    implement a computation based base-32 conversion instead of having
> > >    to resort to mapping and lookup tables:
> > >
> > >       For each quintet = q
> > >           if q <= 0x0F
> > >              then hex dump q to form base-32 character
> > >           if 0x10 <= q <= 0x1F
> > >              then q = q - 0x10
> > >                 and char(q + 0x67) to form base-32 character
> > >
> > >    Note that 0x67 is the code value for the letter "g".  Therefore, for
> > >    example if the quintet is 0b10001 its base-32 character can be
> > >    obtained by:
> > >
> > >       0x10 <= q=0b10001=0x11 <= 0x1F
> > >       therefore q = q - 0x10 = 0x11 - 0x10 = 0x01
> > >             and base-32 character = char(0x01 + 0x67)
> > >                 char(0x68) = "h"
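The computation-based conversion above might look like this in Python. It is a sketch under the assumption that only lowercase output is wanted; the helper names quintet_to_char, char_to_quintet and to_base32 are hypothetical and do not come from the draft.

```python
def quintet_to_char(q):
    """Convert a 5-bit value to its base-32 character without a table."""
    if not 0 <= q <= 0x1F:
        raise ValueError("a quintet must be in 0x00..0x1F")
    if q <= 0x0F:
        return format(q, "x")            # 0..9, a..f (plain hex digit)
    return chr(q - 0x10 + 0x67)          # 0x67 is "g", so 0x10..0x1F -> g..v


def char_to_quintet(c):
    """Inverse of quintet_to_char for a single lowercase character."""
    if len(c) == 1 and c in "0123456789abcdef":
        return int(c, 16)
    if len(c) == 1 and "g" <= c <= "v":
        return ord(c) - 0x67 + 0x10
    raise ValueError("not a base-32 character")


def to_base32(value, count):
    """Render the low 5*count bits of value, most significant quintet first."""
    return "".join(quintet_to_char((value >> (5 * i)) & 0x1F)
                   for i in reversed(range(count)))


print(quintet_to_char(0b10001))   # the worked example above
print(to_base32(0x5F49, 3))       # the shifted <change> character from Section 2
```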
> > >
> > >4. Base-4 Characters
> > >
> > >    ACE37 goes beyond the 32 characters (base-32) to include the
> > >    remaining 4 characters {w,x,y,z} in the alphabet.  These base-4
> > >    characters enable ACE37 to better utilize the existing "resources"
> > >    (the allowed characters) to represent IDN character information,
> > >    therefore making its encoding more efficient.
> > >
> > >    The set of base-4 characters is {w,x,y,z}; they are used to
> > >    represent the following duplets (duplets are groups containing 2
> > >    bits):
> > >
> > >    Base-4 Character =to= Corresponding Duplet
> > >                   w   =  00
> > >                   x   =  01
> > >                   y   =  10
> > >                   z   =  11
> > >
> > >4.1 Base-4 Indicators
> > >
> > >    Base-4 characters, while carrying character information, also double
> > >    as indicators for code point brackets.  In DUDE-02, an extra bit
> > >    is prepended to each quartet; the last quartet of each encoded
> > >    code point is prepended with "0", marking the end of the code
> > >    point.  In ACE37, base-4 characters determine the length
> > >    (number of ACE37 characters) of the encoded code point.  To be
> > >    more precise, the encoded bits are in fact the "diff" and not
> > >    the code point itself (diff carries the same meaning as in DUDE-02
> > >    and is further discussed in Sections 6 & 7).
> > >
> > >
> > >
> > >
> > >    The following table explains how base-4 characters are combined with
> > >    base-32 characters to form a representation of a diff (key: b4=base-
> > >    4, b32=base-32):
> > >
> > >              diff value         |bits|       ACE37 Form
> > >        -------------------------|----|----------------------------
> > >                  diff<=0x7F     |  7 | <b4><b32>
> > >            0x80<=diff<=0x7FFF   | 15 | <b32><b32><b32>
> > >          0x8000<=diff<=0x1FFFF  | 17 | w<b4><b32><b32><b32>
> > >         0x20000<=diff<=0xFFFFF  | 20 | ww<b32><b32><b32><b32>
> > >        0x100000<=diff<=0x10FFFF | 22 | <b4>w<b32><b32><b32><b32>
> > >
> > >    Note that the "bits" column represents the maximum number of
> > >    significant bits for the given diff value.  For example when
> > >    diff<=0x7F, the maximum value is 0b1111111, therefore the number of
> > >    significant bits is 7.
> > >
> > >    Note also that to encode a 17-bit diff, the letter "w" is used as an
> > >    indicator to distinguish the sequence from the 7 bit diff where a
> > >    base-32 character is expected to follow a base-4 character.  Since
> > >    "w" represents "00" that has no value, it will not be used in the
> > >    base-4 representation for a 17-bit diff (if a "00" is used, it means
> > >    that there are only 15 significant bits and therefore should use the
> > >    15 bit diff form).  This is the case for the 20-bit form as well.
> > >    The "w" is used as an arbitrary indicator in the 22-bit form and
> > >    MUST be discarded during decoding.
> > >
> > >    By analyzing the ACE37 form, an encoded string could be successfully
> > >    returned to its original form.  There is no overlap and the form can
> > >    be determined precisely.  The following 5 rules dictate the 5
> > >    different ACE37 forms:
> > >
> > >    (1) Encode: if diff<=0x7F
> > >        Decode: if first character is <b4> AND next character NOT <b4>
> > >                Then it MUST be in 7-bit form: <b4><b32>
> > >
> > >    (2) Encode: if 0x80<=diff<=0x7FFF
> > >        Decode: if first character is <b32>
> > >                Then it MUST be a 15-bit form: <b32><b32><b32>
> > >
> > >    (3) Encode: if 0x8000<=diff<=0x1FFFF
> > >        Decode: if first character is "w" AND next character is <b4>
> > >                   AND NOT "w"
> > >                Then it MUST be in 17-bit form: w<b4><b32><b32><b32>
> > >
> > >    (4) Encode: if 0x20000<=diff<=0xFFFFF
> > >        Decode: if first character is "w" AND next character is "w"
> > >                Then it MUST be in 20-bit form: ww<b32><b32><b32><b32>
> > >
> > >    (5) Encode: if 0x100000<=diff<=0x10FFFF
> > >        Decode: if first character is <b4> AND NOT "w"
> > >                   AND next character is "w"
> > >                Then it MUST be 22-bit form: <b4>w<b32><b32><b32><b32>
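The five forms can be exercised with a small Python sketch. It covers only code points after the first one (Section 4.2 changes some forms for the first code point), it takes the 22-bit range from the table (0x100000..0x10FFFF), and it rejects anything the table does not list. The names encode_diff and decode_diff are hypothetical.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"
B4 = "wxyz"
B32_VAL = {c: i for i, c in enumerate(B32)}
B4_VAL = {c: i for i, c in enumerate(B4)}


def b32(value, count):
    """Render the low 5*count bits of value as base-32 characters."""
    return "".join(B32[(value >> (5 * i)) & 0x1F] for i in reversed(range(count)))


def encode_diff(diff):
    """Encode a diff for any code point after the first one."""
    if 0 <= diff <= 0x7F:                        # rule (1): <b4><b32>
        return B4[diff >> 5] + b32(diff, 1)
    if 0x80 <= diff <= 0x7FFF:                   # rule (2): three <b32>
        return b32(diff, 3)
    if 0x8000 <= diff <= 0x1FFFF:                # rule (3): w<b4> + three <b32>
        return "w" + B4[diff >> 15] + b32(diff, 3)
    if 0x20000 <= diff <= 0xFFFFF:               # rule (4): ww + four <b32>
        return "ww" + b32(diff, 4)
    if 0x100000 <= diff <= 0x10FFFF:             # rule (5): <b4>w + four <b32>
        return B4[diff >> 20] + "w" + b32(diff, 4)
    raise ValueError("diff is outside the ranges listed in the table")


def decode_diff(s):
    """Return (diff, characters consumed) for the form at the start of s."""
    first, second = s[0:1], s[1:2]
    if first in B32_VAL:                         # rule (2)
        top, skip, count = 0, 0, 3
    elif first == "w" and second == "w":         # rule (4)
        top, skip, count = 0, 2, 4
    elif first == "w" and second in B4_VAL:      # rule (3), second is x, y or z
        top, skip, count = B4_VAL[second], 2, 3
    elif first in B4_VAL and second == "w":      # rule (5), first is x, y or z
        top, skip, count = B4_VAL[first], 2, 4
    elif first in B4_VAL and second in B32_VAL:  # rule (1)
        top, skip, count = B4_VAL[first], 1, 1
    else:
        raise ValueError("not a valid ACE37 form")
    body = s[skip:skip + count]
    if len(body) != count or any(c not in B32_VAL for c in body):
        raise ValueError("truncated or malformed ACE37 form")
    value = 0
    for c in body:
        value = (value << 5) | B32_VAL[c]
    return (top << (5 * count)) | value, skip + count


for d in (0x00, 0x7F, 0x5F49, 0x8000, 0x20000, 0x10FFFF):
    print(hex(d), encode_diff(d), decode_diff(encode_diff(d)))
```

Because the decoder only needs the first two characters to pick a form, no look-ahead beyond that is required.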
> > >
> > >
> > >    Note that the ACE37 scheme can effectively encode a diff of up to 22
> > >    significant bits or 0x3FFFFF.  The Unicode code points are expected
> > >    to range only between 0x0000..0x10FFFF, therefore ACE37 will be able
> > >    to handle any Unicode code point.
> > >
> > >    Additionally, base-4 characters (and sometimes base-32 characters)
> > >    could be used for mixed-case annotation.  This optional mixed-case
> > >    annotation mechanism is discussed in Appendix B.
> > >
> > >4.2 First Code Point Considerations
> > >
> > >    There are additional considerations for the first code point that is
> > >    encoded or decoded to ensure that if the first code point is within
> > >    the first Unicode plane (U+0000..U+FFFF), it will not occupy more
> > >    than 4 ACE37 characters.
> > >
> > >    This special consideration affects only Rules (1), (3) and (4)
> > >    explained in Section 4.1.  Rule (1) is discarded for the first code
> > >    point, therefore any diff up to 0x7FFF will be in the form
> > >    <b32><b32><b32>.  The form for Rule (3) becomes simply
> > >    <b4><b32><b32><b32> without the "w" indicator.  Similarly, the form
> > >    for Rule (4) becomes w<b32><b32><b32><b32> with one less "w".
> > >
> > >    The first code point considerations can be summarized in the
> > >    following 4 rules:
> > >
> > >    (a) Encode: if diff<=0x7FFF
> > >        Decode: if first character is <b32>
> > >                Then it MUST be in 15-bit form: <b32><b32><b32>
> > >
> > >    (b) Encode: if 0x8000<=diff<=0x1FFFF
> > >        Decode: if first character is <b4> AND NOT "w"
> > >                Then it MUST be in 17-bit form: <b4><b32><b32><b32>
> > >
> > >    (c) Encode: if 0x20000<=diff<=0xFFFFF
> > >        Decode: if first character is "w"
> > >                Then it MUST be in 20-bit form: w<b32><b32><b32><b32>
> > >
> > >    (d) Encode & Decode: same as Rule (5) in Section 4.1
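Rules (a) to (d) for the first code point can be sketched the same way. The function name encode_first_diff is hypothetical, and the 22-bit range is again taken from the table in Section 4.1; the loop at the end checks the stated goal that a first code point in the first Unicode plane never needs more than 4 ACE37 characters.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"
B4 = "wxyz"


def b32(value, count):
    """Render the low 5*count bits of value as base-32 characters."""
    return "".join(B32[(value >> (5 * i)) & 0x1F] for i in reversed(range(count)))


def encode_first_diff(diff):
    """Encode the diff of the first code point (prev = 0x00)."""
    if 0 <= diff <= 0x7FFF:                      # rule (a): three <b32>
        return b32(diff, 3)
    if 0x8000 <= diff <= 0x1FFFF:                # rule (b): <b4> + three <b32>
        return B4[diff >> 15] + b32(diff, 3)
    if 0x20000 <= diff <= 0xFFFFF:               # rule (c): w + four <b32>
        return "w" + b32(diff, 4)
    if 0x100000 <= diff <= 0x10FFFF:             # rule (d): same as rule (5)
        return B4[diff >> 20] + "w" + b32(diff, 4)
    raise ValueError("diff is outside the ranges listed in the rules")


# With prev = 0x00 the diff equals the shifted code point, so every value
# up to 0xFFFF should fit in at most 4 characters.
longest = max(len(encode_first_diff(d)) for d in range(0x10000))
print(longest)
print(encode_first_diff(0x5F49), encode_first_diff(0xFFFF))
```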
> > >
> > >    Besides special considerations for base-4 character usage, the prev
> > >    setting is also specially considered for the first code point.  As
> > >    laid out in Section 6, prev is evaluated to detect the first code
> > >    point.  If prev = 0x00, it is assumed to be the first code point,
> > >    as 0x00 SHOULD NOT be a permitted character for input.  When an LDH
> > >    character is the first code point, a special consideration is
> > >    needed.  Normally, if n = LDH is encountered (Section 5), it will
> > >    be output as "-n" and prev is not changed.  However, if the first
> > >    code point is an LDH character, after outputting "-n", prev is
> > >    updated to lowercase(n).  This ensures that only the first code
> > >    point coming in will have prev = 0x00.
> > >
> > >
> > >5. LDH Considerations
> > >
> > >    Finally, the 37th character of the entire LDH repertoire, the hyphen
> > >    will be used to indicate LDH exceptions.  Extending the hyphen
> > >    consideration of DUDE-02, ACE37 gives special consideration for the
> > >    entire LDH repertoire.  All LDH characters will be encoded "as is"
> > >    with the addition of a leading hyphen.  For example, the character
> > >    "a" will be encoded within ACE37 as "-a".  The hyphen character "-"
> > >    will be encoded as "--".
> > >
> > >    This ensures that each LDH character will take up only 2 character
> > >    spaces within an ACE37 encoded string and also allows
> > >    administrators to see the actual characters, similar to the AMC
> > >    series.  Unlike the AMC series however, the hyphen is not used to
> > >    indicate an ongoing mode change, but only affects the following
> > >    character.  This retains the simplicity of the DUDE-02 single-mode,
> > >    single-pass philosophy.
> > >
> > >6. Encoding Procedure
> > >
> > >    Similar to DUDE, all ordering of bits and quartets is big-endian.
> > >    The following describes the encoding procedure:
> > >
> > >    Set initial value for prev = 0x00
> > >    for each input code point = n
> > >       if n is an LDH {A-Z, a-z, 0-9, -}
> > >          output "-n"                   (Section 5: LDH Considerations)
> > >          if prev = 0x00                (Section 4.2: First Code Point)
> > >             let prev = lowercase(n)
> > >          continue with next n
> > >       else perform code block shifting (Section 2: Code Block Shifting)
> > >       let diff = prev XOR n            (n after code block shifting)
> > >       if diff<=0x7F --------------------------------------+
> > >          if this is the first code point (Section 4.2)    |
> > >          then output 15-bit form: <b32><b32><b32>         |
> > >          else, output 7-bit form: <b4><b32>               |
> > >       if 0x80<=diff<=0x7FFF                               +-(Section 4:
> > >          output 15-bit form: <b32><b32><b32>              |   Base-4
> > >       if 0x8000<=diff<=0x1FFFF                            | Characters)
> > >          if this is the first code point (Section 4.2)    |
> > >          then output 17-bit form: <b4><b32><b32><b32>     |
> > >          else output 17-bit form: w<b4><b32><b32><b32>    |
> > >       if 0x20000<=diff<=0xFFFFF                           |
> > >          if this is the first code point (Section 4.2)    |
> > >          then output 20-bit form: w<b32><b32><b32><b32>   |
> > >          else output 20-bit form: ww<b32><b32><b32><b32>  |
> > >       if 0x100000<=diff<=0x10FFFF                         |
> > >          output 22-bit form: <b4>w<b32><b32><b32><b32> ---+
> > >       let prev = n
> > >    end and obtain next n and return to: "for each input code point = n"
> > >
> > >    The following is a more comprehensive pseudo code:
> > >
> > >    let prev = 0x00
> > >    for each input integer n (in order) do begin
> > >       if n = "-" or "0..9" or "A..Z" or "a..z"
> > >       then begin
> > >          output "hyphen"+"char(n)"
> > >          if prev = 0x00
> > >             then let prev = lowercase(n)
> > >       end
> > >       else begin
> > >          if n = 0x00
> > >             then error and abort
> > >          if n <= 0x9FFF
> > >          then begin
> > >             n = n - 0x3000
> > >             if n < 0
> > >                then n = 0xA000 + n
> > >          end
> > >
> > >          let diff = prev XOR n
> > >
> > >          if diff <= 0x7F
> > >             if prev = 0x00
> > >             then output all 15 bits with 3 base-32 characters
> > >             else output first 2 bits with a base-4 character {wxyz}
> > >                and remaining 5 bits with 1 base-32 character
> > >
> > >          if 0x80 <= diff <= 0x7FFF
> > >          then output all 15 bits with 3 base-32 characters
> > >
> > >          if 0x8000 <= diff <= 0x1FFFF
> > >             if prev = 0x00
> > >             then output first 2 bits with a base-4 {xyz} (except w)
> > >                and output remaining 15 bits with base-32
> > >             else output "w"
> > >                and output first 2 bits with a base-4 {xyz} (except w)
> > >                and output remaining 15 bits with base-32
> > >
> > >          if 0x20000 <= diff <= 0xFFFFF
> > >             if prev = 0x00
> > >             then output "w"
> > >             else output "ww"
> > >             and output all 20 bits with base-32 characters
> > >
> > >          if 0x100000 <= diff <= 0x10FFFF
> > >          then output first 2 bits with a base-4 {xyz} (except w)
> > >             and output "w"
> > >             and output remaining 20 bits with base-32
> > >
> > >          let prev = n
> > >       end
> > >    end
> > >
> > >    Nameprep [NAMEPREP] is not discussed in this document, but it is
> > >    expected to be implemented for IDN.  Regardless of the code point
> > >    presented, an encoder MUST NOT produce an incorrect output.  The
> > >    encoder must fail if it encounters a negative input value.
> > >
> > >
> > >    The initial value used is 0x00 so that all domains beginning with a
> > >    CJK ideograph or a row 0 character (U+0000..U+0FFF) will be shorter.
> > >    Note that after the code block shifting (Section 2), the entire Han
> > >    library is within 0x0000..0x6FFF, while row 0 is fitted to
> > >    0x7000..0x7FFF.  Therefore, with an initial value of 0x00, the diff
> > >    for the first code point equals its shifted value and cannot exceed
> > >    0x7FFF for any Han or row 0 character.  The initial value is also
> > >    used as a check point for the first code point considerations
> > >    (Section 4.2).
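
The range claim in the paragraph above can be checked mechanically. The following Python sketch assumes that the forward shift is the inverse of the reversal shown in the decoding procedure, and that "Han" means the CJK Unified Ideographs block U+4E00..U+9FFF; both are assumptions, and shift and first_diff are hypothetical names.

```python
def shift(n):
    """Code block shift, taken as the inverse of the decoder's reversal."""
    return (n - 0x3000) % 0xA000 if n <= 0x9FFF else n


def first_diff(code_point):
    """Diff for a first code point: prev is 0x00, so diff is the shifted value."""
    return 0x00 ^ shift(code_point)


HAN = range(0x4E00, 0xA000)    # CJK Unified Ideographs block (assumed extent)
ROW0 = range(0x0000, 0x1000)   # "row 0" as given in the text

# Largest first-code-point diff in each range.
print(hex(max(map(first_diff, HAN))), hex(max(map(first_diff, ROW0))))
# prints: 0x6fff 0x7fff
```

Both maxima fit in 15 bits, which is why a first code point in either range can be written with three base-32 characters.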
> > >
> > >    Additionally, an optional mixed-case annotation mechanism is
> > >    discussed in Appendix B.
> > >
> > >7. Decoding Procedure
> > >
> > >    A thorough description of the decoding rules, except for the final
> > >    reversal of the code block shifting, has been presented in Sections
> > >    4.1 and 4.2.  The following is a brief outline of the decoding
> > >    procedure:
> > >
> > >    let prev = 0x00
> > >    while the input string is not exhausted
> > >       if present character = hyphen               (Section 5: LDH
> > >          discard and output next character         Considerations)
> > >       else, depending on the presented form       (Section 4)
> > >          convert into duplets and quintets        (Section 4 & 3)
> > >          and concatenate to form diff
> > >       let prev = prev XOR diff
> > >       reverse code block shifting:                (Section 2)
> > >          if prev<=0x9FFF
> > >             and if prev<=0x6FFF
> > >                    output character = prev + 0x3000
> > >             else, output character = prev - 0x7000
> > >          else output character = prev
> > >       output character
> > >    End
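
The last two steps of the outline above (undoing the XOR chain and reversing the code block shift) can be sketched in Python as follows. The sketch starts from diffs that have already been parsed, so it makes no assumption about the character forms; unshift and diffs_to_text are hypothetical names.

```python
def unshift(value):
    """Reverse the code block shifting, following the outline above."""
    if value <= 0x6FFF:
        return value + 0x3000
    if value <= 0x9FFF:
        return value - 0x7000
    return value


def diffs_to_text(diffs):
    """Rebuild the code points from a sequence of XOR diffs."""
    prev, out = 0x00, []
    for diff in diffs:
        prev ^= diff                  # let prev = prev XOR diff
        out.append(chr(unshift(prev)))
    return "".join(out)


# Diffs obtained by hand from the ACE37 string of example (F).
text = diffs_to_text([0x5D, 0x33, 0xD7, 0x6D, 0x28, 0x35, 0xAE])
print([f"U+{ord(c):04X}" for c in text])
```

Because XOR is its own inverse, the decoder needs no state other than prev.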
> > >
> > >    The following is a more comprehensive pseudo code for the decoding
> > >    procedure:
> > >
> > >    let prev = 0x00
> > >    while the input string is not exhausted do begin
> > >       if present character = hyphen    /*Section 5:LDH Considerations*/
> > >       then consume and discard hyphen
> > >          and obtain the next character
> > >          and output character
> > >          if prev = 0x00                /*Section 4.2:First Code Point*/
> > >             let prev = code block shifted lowercase output character
> > >
> > >       else,
> > >          if present character = Base-32 characters (0..v)
> > >             consume present character and next 2 characters
> > >             and convert them to quintets according to Base-32
> > >             concatenate the resulting quintets to form diff
> > >             /*15 bit form, 0x80<=diff<=0x7FFF*/
> > >
> > >          if present character = Base-4 characters {xyz} and NOT w
> > >             consume present character
> > >                and convert it to a duplet according to Base-4
> > >
> > >             if prev = 0x00
> > >                obtain and consume next 3 characters
> > >                and convert them to quintets according to Base-32
> > >                concatenate duplet with the 3 quintets to form diff
> > >                /*first code point: 17 bit form, 0x8000<=diff<=0x1FFFF*/
> > >
> > >             else, if next character = Base-32 character (0..v)
> > >                then consume and convert to quintet according to Base-32
> > >                concatenate duplet with the quintet to form diff
> > >                /*7 bit form, diff<=0x7F*/
> > >
> > >             else, obtain next character
> > >             if next character = Base-4 characters {xyz} and NOT w
> > >                then fail and indicate error
> > >
> > >             else, if next character = w
> > >                then consume and discard w and obtain next 4 characters
> > >                consume and convert characters to
> > >                   quintets according to Base-32
> > >                concatenate duplet with the 4 quintets to form diff
> > >                /*22 bit form, 0x100000<=diff<=0x10FFFF*/
> > >
> > >          if present character = w
> > >             discard "w" and obtain next character
> > >
> > >             if next character = Base-4 characters {xyz} and NOT w
> > >
> > >                and if prev = 0x00
> > >                    obtain and consume next 4 characters
> > >                    and convert characters to quintets based on Base-32
> > >                    concatenate the 4 quintets to form diff
> > >                    /*first code point: 20 bit form,*/
> > >                    /*0x20000<=diff<=0xFFFFF        */
> > >
> > >                else, consume and convert to duplet according to Base-4
> > >                   and obtain and consume next 3 characters
> > >                   and convert to quintets according to Base-32
> > >                   concatenate duplet with the 3 quintets to form diff
> > >                   /*17 bit form, 0x8000<=diff<=0x1FFFF*/
> > >
> > >             else, if next character = w
> > >                then consume and discard w
> > >                and obtain and consume next 4 characters
> > >                   and convert to quintets according to Base-32
> > >                concatenate the 4 quintets to form diff
> > >                /*20 bit form, 0x20000<=diff<=0xFFFFF*/
> > >
> > >             else, if next character = Base-32 character (0..v)
> > >                then convert to quintet according to Base-32
> > >                set quintet to diff
> > >                /*7 bit form, diff<=0x7F*/
> > >
> > >          fail upon encountering a non-ACE37 character
> > >             or end-of-input
> > >
> > >          let prev = prev XOR diff
> > >
> > >          if prev <= 0x9FFF                /*reversal of the code    */
> > >             and if prev <= 0x6FFF         /*block shifting described*/
> > >             output = prev + 0x3000        /*in Section 2            */
> > >             else, output = prev - 0x7000
> > >          else, output prev
> > >       end
> > >    end
> > >    encode the output sequence and compare it to the input string
> > >    fail if they do not match (case insensitively)
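
A partial Python sketch of the decoding loop, given as an illustration under stated assumptions and not as a complete ACE37 decoder. It assumes lowercase input, the base-32 alphabet 0-9 followed by a-v, the base-4 alphabet w, x, y, z, and it handles only the LDH, 7-bit and 15-bit forms; other forms raise an exception, and the final re-encoding check is omitted. All function names are hypothetical.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"  # assumed base-32 alphabet (0..v)
B4 = "wxyz"                               # assumed base-4 alphabet


def shift(n):
    return (n - 0x3000) % 0xA000 if n <= 0x9FFF else n


def unshift(v):
    if v <= 0x6FFF:
        return v + 0x3000
    return v - 0x7000 if v <= 0x9FFF else v


def b32_value(chars, count):
    """Convert exactly `count` base-32 characters to an integer."""
    if len(chars) != count or any(c not in B32 for c in chars):
        raise ValueError("truncated input or non-ACE37 character")
    value = 0
    for c in chars:
        value = value * 32 + B32.index(c)
    return value


def decode(s):
    out, prev, i = [], 0x00, 0
    while i < len(s):
        c, nxt = s[i], s[i + 1:i + 2]
        if c == "-":                          # LDH form: hyphen + literal
            if not nxt:
                raise ValueError("end of input after hyphen")
            out.append(nxt)
            if prev == 0x00:                  # leading LDH seeds prev
                prev = shift(ord(nxt.lower()))
            i += 2
            continue
        if c in B32:                          # 15-bit form: 3 base-32 chars
            diff = b32_value(s[i:i + 3], 3)
            i += 3
        elif c in B4 and prev != 0x00 and nxt and nxt in B32:
            diff = (B4.index(c) << 5) | B32.index(nxt)   # 7-bit form
            i += 2
        elif c in B4:
            raise NotImplementedError("longer diff forms are not sketched here")
        else:
            raise ValueError("non-ACE37 character")
        prev ^= diff
        out.append(chr(unshift(prev)))
    return "".join(out)


print(decode("06hw4zmyv-d-ewnwox3") == "\u30d1\u30d5\u30a3\u30fcde\u30eb\u30f3\u30d0")
```

The call above decodes the ACE37 string of example (E) in Section 8 and compares it with the listed code points.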
> > >
> > >8. Examples
> > >
> > >    ACE37 is likely to be implemented with an ACE prefix in the form
> > >    "xx--".  The actual prefix to be used is not discussed in this
> > >    document.  The following examples are taken from the mailing list as
> > >    well as from DUDE-02 and the AMC series.  The resulting ACE37 string
> > >    is compared with that using DUDE:
> > >
> > >    (A) JPNIC (the registry of .jp domain)
> > >
> > >    Unicode: U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3
> > >             U+30C8 U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9
> > >             U+30E1 U+30FC U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF
> > >             U+30FC
> > >      ACE37: i9urut6hm8jfaqv0m9dv1wewbx7wjyjwbynx6zsy8wtybygwky8y8ycy3
> > >             (57 char)
> > >    DUDE-02: (error: result string exceeds 59 characters*)
> > >             Note: 59 characters is the maximum allowable when the ACE
> > >             prefix "xx--" is included
> > >
> > >
> > >    (B) A health-insurance organization in Tokyo
> > >
> > >    Unicode: U+6771 U+4EAC U+90FD U+60C5 U+5831 U+30B5 U+30FC U+30D3
> > >             U+30B9 U+7523 U+696D U+5065 U+5EB7 U+4FDD U+967A U+7D44
> > >             U+5408
> > >      ACE37: drhaetvihk1o67ka44y9xfzahcqv2e6883micbaud7apuqac (48 char)
> > >    DUDE-02: (error: result string exceeds 59 characters)
> > >
> > >
> > >    (C) 6 hangul syllables
> > >
> > >    Unicode: U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC
> > >      ACE37: xg9orfsqssvfg3i8t2c (19 char)
> > >    DUDE-02: 6txiy79ny53nz79a8wizwwn (23 char)
> > >
> > >
> > >    (D) maji<de>koi<suru>5<byou><mae>  (Latin, hiragana, kanji)
> > >
> > >    Unicode: U+006D U+0061 U+006A U+0069 U+3067 U+006B U+006F U+0069
> > >             U+3059 U+308B U+0035 U+79D2 U+524D
> > >      ACE37: -m-a-j-is0a-k-o-ixu06i-5iapqsv (30 char)
> > >    DUDE-02: pnmdvssqvssnegvsva7cvs5qz38hu53r (32 char)
> > >
> > >
> > >    (E) <pafii>de<runba>  (Latin, katakana)
> > >
> > >    Unicode: U+30D1 U+30D5 U+30A3 U+30FC U+0064 U+0065 U+30EB U+30F3
> > >             U+30D0
> > >      ACE37: 06hw4zmyv-d-ewnwox3 (19 char)
> > >    DUDE-02: vs5bezgxrvs3ibvs2qtiud (22 char)
> > >
> > >
> > >    (F) <sono><supiido><de>  (hiragana, katakana)
> > >
> > >    Unicode: U+305D U+306E U+30B9 U+30D4 U+30FC U+30C9 U+3067
> > >      ACE37: 02txj06nzdx8xl05e (17 char)
> > >    DUDE-02: vsvpvd7hypuivf4q (16 char)
> > >
> > >
> > >    (G) 2 Arbitrary Plane Two Code Points
> > >
> > >    Unicode: U+261AF U+261BF
> > >      ACE37: w4odfwg (7 char)
> > >    DUDE-02: uyt6rta (7 char)
> > >
> > >
> > >    (H) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
> > >
> > >    Unicode: U+0050 U+0072 U+006F U+010D U+0070 U+0072 U+006F U+0073
> > >             U+0074 U+011B U+006E U+0065 U+006D U+006C U+0075 U+0076
> > >             U+00ED U+010D U+0065 U+0073 U+006B U+0079
> > >      ACE37: -p-r-o0bt-p-r-o-s-twm-n-e-m-l-u-v0fm0f0-e-s-k-y (47 char)
> > >    DUDE-02: vauctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (44 char)
> > >
> > >
> > >    (I) Chinese
> > >
> > >    Unicode: U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D
> > >             U+6587
> > >      ACE37: 7mmfm7oh3n7is3ts5gh57h47ata (27 char)
> > >    DUDE-02: w85gt86huuudv69c7szp7s5a6w4h6w2hu54k (36 char)
> > >
> > >9. Summary & Comparisons
> > >
> > >    In summary, ACE37 is based on the DUDE-02 process with an improved
> > >    compression scheme for code point sequences that are less likely to
> > >    cluster too closely together, such as CJK ideographs.
> > >
> > >    The design team has indicated that 30 characters should generally
> > >    be sufficient.  The Asian community has raised considerable concern
> > >    that 14-15 characters is definitely limiting, while the Latin
> > >    community has given little indication that length is really a
> > >    concern.  ACE37 therefore sets its objective to raise the possible
> > >    number of characters in a worst-case scenario closer to 20.
> > >
> > >    ACE37 is a very simple variation on the primary ACEs identified by
> > >    the design team.  It raises the worst case for CJK to 21 Han
> > >    ideographs while maintaining the simplicity of DUDE.
> > >
> > >    Key Improvements of ACE37 over DUDE-02
> > >    - much more spacious for Han characters.  Improved worst-case
> > >      scenario to 21 Han ideographs by introducing code block shifting
> > >      and fully utilizing base-32 characters
> > >    - no need to arbitrarily prepend flagging bits to identify code
> > >      point brackets.  Instead base-4 characters and diff forms are used
> > >    - base-32 and base-4 characters can be easily computed instead of
> > >      mapped using lookup tables
> > >
> > >    Key Improvements of ACE37 over the AMC series
> > >    - a simpler process, utilizing the one-pass differential
> > >      mechanism from DUDE-02
> > >    - a much simpler code block shifting process is used in ACE37 to
> > >      achieve a goal similar to that of the complex multiple reference
> > >      point system used by the AMC series
> > >    - base-32 and base-4 characters can be easily computed instead of
> > >      mapped using lookup tables
> > >
> > >    Key Improvements of ACE37 over LACE
> > >    - a simpler process, utilizing the one-pass differential
> > >      mechanism from DUDE-02
> > >    - much more spacious for Han characters.  Improved worst-case
> > >      scenario to 21 Han ideographs by introducing code block shifting
> > >      and fully utilizing base-32 characters
> > >    - base-32 and base-4 characters can be easily computed instead of
> > >      mapped using lookup tables
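
The "computed instead of mapped" point in the lists above can be illustrated with plain character arithmetic. This Python sketch assumes the base-32 alphabet is 0-9 followed by a-v and the base-4 alphabet is w, x, y, z (Section 3 is outside this excerpt); the four function names are hypothetical.

```python
def quintet_to_char(q):
    """0..9 map to '0'..'9'; 10..31 map to 'a'..'v'."""
    return chr(ord("0") + q) if q < 10 else chr(ord("a") + q - 10)


def char_to_quintet(c):
    """Inverse of quintet_to_char for lowercase input."""
    return ord(c) - ord("0") if "0" <= c <= "9" else ord(c) - ord("a") + 10


def duplet_to_char(d):
    """0..3 map to 'w'..'z'."""
    return chr(ord("w") + d)


def char_to_duplet(c):
    """Inverse of duplet_to_char for lowercase input."""
    return ord(c) - ord("w")


print("".join(quintet_to_char(q) for q in range(32)))
print("".join(duplet_to_char(d) for d in range(4)))
```

No table is needed in either direction; the two alphabets are contiguous runs of ASCII, so one comparison and one addition are enough.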
> > >
> > >    Two Excel spreadsheets for ACE37 encoding and decoding can be found
> > >    at http://www.dnsii.org/ace37/ace37-encode.xls and
> > >    http://www.dnsii.org/ace37/ace37-decode.xls respectively.  They
> > >    illustrate the simplicity of ACE37 and provide a handy tool for
> > >    checking ACE37 encoding and decoding algorithms.  The ACE37-encode
> > >    spreadsheet also includes a DUDE-encode worksheet.
> > >
> > >10. Security Considerations
> > >
> > >    This document does not address DNS security issues.  It is believed
> > >    that the proposal introduces no security problems beyond those that
> > >    already exist or are anticipated from adding multilingual characters
> > >    to DNS and/or using ACE.
> > >
> > >11. References
> > >
> > >    [AMC-W]   Adam M. Costello, "AMC-ACE-W version 0.1.0", May 31, 2001.
> > >
> > >    [AMC-V]   Adam M. Costello, "AMC-ACE-V version 0.1.0", May 31, 2001.
> > >
> > >    [DUDE-02] Mark Welter, Brian W. Spolarich & Adam M.
> > >              Costello, "Differential Unicode Domain Encoding (DUDE)",
> > >              June 7, 2001.
> > >
> > >    [LACE]    Mark Davis, IBM & Paul Hoffman, IMC & VPNC, "LACE: Length-
> > >              based ASCII Compatible Encoding for IDN", January 5, 2001.
> > >
> > >    [NAMEPREP] Paul Hoffman, IMC & VPNC & Marc Blanchet, ViaGenie,
> > >              "Preparation of Internationalized Host Names", February
> > >              24, 2001.
> > >
> > >Appendix A. Acknowledgements
> > >
> > >    The ACE37 draft is a combination of DUDE-02, the AMC series and
> > >    LACE, and takes into consideration the report of the ACE design
> > >    team.  The authors would therefore like to thank the authors of
> > >    DUDE-02 - Mark Welter, Brian W. Spolarich & Adam M. Costello; the
> > >    author of the AMC series - Adam M. Costello; the authors of LACE -
> > >    Mark Davis & Paul Hoffman; and the ACE design team and its advisors
> > >    - Adam M. Costello, Paul Hoffman, Makoto Ishisone, David Laurence,
> > >    Brian Spolarich, Rick Wesson, Marc Blanchet, Patrik Faltstrom and
> > >    Erik Nordmark for their inspiration.
> > >
> > >Appendix B. Mixed-case annotation
> > >
> > >    This section is taken from DUDE and modified for ACE37.
> > >
> > >    In order to use ACE37 to represent case-insensitive Unicode strings,
> > >    higher layers need to case-fold the Unicode strings prior to ACE37
> > >    encoding.  The encoded string can, however, use mixed-case base-4
> > >    characters as an annotation telling how to convert the folded
> > >    Unicode string into a mixed-case Unicode string for display
> > >    purposes.
> > >
> > >    Each Unicode code point (unless it is an LDH) is represented by a
> > >    sequence of base-4 and base-32 characters, the first of which is
> > >    mostly a base-4 character, which is always a letter {wxyz} (as
> > >    opposed to a digit).  If that letter is uppercase, it is a
> > >    suggestion that the Unicode character be mapped to uppercase (if
> > >    possible); if the letter is lowercase, it is a suggestion that the
> > >    Unicode character be mapped to lowercase (if possible).
> > >
> > >    If the code point is an LDH, for example "a", it will be represented
> > >    as "-a".  To mark the case for an LDH, simply set the LDH to the
> > >    desired case following the "-".  For example if an uppercase "A" is
> > >    desired, the encoded form SHOULD be "-A".
> > >
> > >    Note that there is a possibility that no base-4 character is present
> > >    for a code point representation.  That is the case for a 15-bit diff
> > >    form.  In this case, the base-32 characters will be used for case
> > >    suggestion (if possible), similar to that discussed for using a
> > >    base-4 character.  However, there is a remote possibility that all
> > >    3 base-32 characters are digits.  If this happens, case unfolding
> > >    will be aborted.  Since case annotation is an optional feature used
> > >    for display purposes only, this is not considered a major concern.
> > >    Moreover, the probability of this happening is small:
> > >    (32639/27)/1114109, or about 0.1%.
> > >
> > >    ACE37 encoders and decoders are not required to support these
> > >    annotations, and higher layers need not use them.
> > >
> > >    For example:  In order to suggest that example (H) in Section 8:
> > >    "Examples" be displayed as:
> > >    Czech: Pro<ccaron(uppercase)>prost<ecaron(uppercase)>
> > >           nemLUV<iacute(lowercase)><ccaron(lowercase)>esky
> > >
> > >    one could capitalize the ACE37 encoding as:
> > >      ACE37: -P-r-o0BT-p-r-o-s-tWM-n-e-m-L-U-V0fm0f0-e-s-k-y (47 char)
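
One way to apply and read back such annotations, sketched in Python. The sketch assumes the encoded string has already been split into one token per code point (for example "-p", "0bt", "wm"), sets the case of every letter in a token as the capitalized example above does, and reads the suggestion from the first letter found. annotate and suggested_upper are hypothetical names, not part of the draft.

```python
def annotate(token, upper):
    """Return token with its letters set to the requested case.

    Returns None when the token contains no letter (for example a 15-bit
    form made only of digits), in which case no annotation is possible.
    """
    if not any(ch.isalpha() for ch in token):
        return None
    return token.upper() if upper else token.lower()


def suggested_upper(token):
    """Read the suggestion back from the first letter in the token."""
    for ch in token:
        if ch.isalpha():
            return ch.isupper()
    return None  # all digits: no suggestion carried


tokens = ["-p", "-r", "-o", "0bt"]
wanted = [True, False, False, True]
print("".join(annotate(t, u) for t, u in zip(tokens, wanted)))
```

The printed string matches the start of the capitalized ACE37 encoding shown above.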
> > >
> > >Authors:
> > >
> > >Edmon Chung
> > >Neteka Inc.
> > >2462 Yonge St. Toronto,
> > >Ontario, Canada M4P 2H5
> > >edmon@neteka.com
> > >
> > >David Leung
> > >Neteka Inc.
> > >2462 Yonge St. Toronto,
> > >Ontario, Canada M4P 2H5
> > >david@neteka.com
> > >
> >