[idn] Re: permission <draft-ietf-idn-ace37-00.txt (attach)
At/À 11:48 2001-07-05 -0400, Edmon you wrote/vous écriviez:
>Hi all,
>
>I was unaware that the workgroup no longer accepts new drafts.
see:
Message-Id: <5.1.0.14.1.20010626000156.03d85e50@mail.viagenie.qc.ca>
Date: Tue, 26 Jun 2001 00:05:28 -0400
To: idn@ops.ietf.org
From: Marc Blanchet <Marc.Blanchet@viagenie.qc.ca>
Subject: [idn] wg next steps
and:
Message-Id: <5.1.0.14.1.20010629080012.02042a10@mail.viagenie.qc.ca>
Date: Fri, 29 Jun 2001 08:06:14 -0400
To: idn@ops.ietf.org
From: Marc Blanchet <Marc.Blanchet@viagenie.qc.ca>
Subject: [idn] document pools active
And, as I wrote in the email, you are _encouraged_ to submit it as an
individual submission. The only differences are the filename and that it is
not listed on the IETF idn wg charter web page.
Marc.
> Anyway, I
>have drafted a new ACE based on the simplicity of DUDE which has hugely
>improved compression. Worst case scenario CJK could have 21 han characters!
>Attached below is a copy of the draft (for my original submission), you can
>also find it at http://www.dnsii.org/idn-ace37-00.txt (easier to read) and
>hopefully in the i-d-n.net website soon.
>
>ACE37 is based on the one-pass one-mode scheme of DUDE (differential XOR),
>then utilizes a simple code block shifting (similar to the reference points
>in the AMC series) to hugely increase the capacity for CJK (worst case
>scenario 21 han characters!) and then utilizes base-32 for compression (as
>in LACE) (DUDE and AMC-w/v use base-32 only for flagging). In addition to
>base-32, a base-4 scheme is introduced by using the remaining characters
>{wxyz}. These contain 2 bits of character information and double as an
>indicator for codepoint brackets. All the while, the algorithm is kept as
>simple as DUDE.
>
>Hopefully you might find that it is interesting and appropriate to be
>considered as an ACE within the IETF. After all, it was intended to be an
>integrated version of the three primary ACEs: DUDE, LACE and the AMC series,
>identified by the ACE design team report.
>
>Looking forward to all your inputs.
>
>Edmon
>
>PS. I have created Excel worksheets to illustrate the Encoding and
>Decoding procedures as well; you can find them at
>http://www.dnsii.org/ace37/ace37-encode.xls and
>http://www.dnsii.org/ace37/ace37-decode.xls respectively.
>
>
>
>----- Original Message -----
>From: "Marc Blanchet" <Marc.Blanchet@viagenie.qc.ca>
>To: "Natalia Syracuse" <nsyracus@ietf.org>; <edmon@neteka.com>;
><david@neteka.com>
>Cc: <jseng@pobox.org.sg>
>Sent: Thursday, July 05, 2001 8:50 AM
>Subject: Re: permission <draft-ietf-idn-ace37-00.txt (attach)
>
>
> > I'm sorry but the new wg policy is not to accept drafts unless there is
> > demonstrated support. But drafts are _highly_ encouraged to be published
> > as individual submissions. I would recommend putting idn in the filename
> > and using this filenaming convention: draft-<yourname>-idn-ace37-00.txt.
> > After publication as an internet-draft, the author should announce it on
> > the wg mailing list and I'll put a reference to it in the wg web page.
> >
> > So please publish it as an individual submission.
> >
> > Marc.
> >
> > At/À 08:34 2001-07-05 -0400, Natalia Syracuse you wrote/vous écriviez:
> > >
> > >
> > >
> > >Internet Draft Edmon Chung, Neteka Inc.
> > ><draft-ietf-idn-ace37-00.txt> David Leung, Neteka Inc.
> > > June 2001
> > >
> > >
> > >
> > > ACE Utilizing All 37 Alphanumeric Characters (ACE37)
> > >
> > >
> > >STATUS OF THIS MEMO
> > >
> > > This document is an Internet-Draft and is in full conformance with
> > > all provisions of Section 10 of RFC2026.
> > >
> > > Internet-Drafts are working documents of the Internet Engineering
> > > Task Force (IETF), its areas, and its working groups. Note that
> > > other groups may also distribute working documents as Internet-
> > > Drafts. Internet-Drafts are draft documents valid for a maximum of
> > > six months and may be updated, replaced, or obsoleted by other
> > > documents at any time. It is inappropriate to use Internet-Drafts
> > > as reference material or to cite them other than as "work in
> > > progress."
> > >
> > > The reader is cautioned not to depend on the values that appear in
> > > examples to be current or complete, since their purpose is primarily
> > > educational. Distribution of this memo is unlimited.
> > >
> > > The list of current Internet-Drafts can be accessed at
> > > http://www.ietf.org/ietf/1id-abstracts.txt
> > > The list of Internet-Draft Shadow Directories can be accessed at
> > > http://www.ietf.org/shadow.html.
> > >
> > >Abstract
> > >
> > > ACE37 is a combination of DUDE-02, AMC-W/V and LACE. ACE37 utilizes
> > > the simple one pass algorithm of DUDE, the character block
> > > considerations of AMC-W/V and the Base-32 compression of LACE. It
> > > also fully utilizes the entire LDH set currently allowed in the DNS
> > > (A-Z, a-z, 0-9 and "-") within its character repertoire to optimize
> > > performance and compression. Even for the worst-case scenario in
> > > ACE37, any name can have 21 characters including Chinese, Japanese
> > > and Korean names. Two Excel spreadsheets for ACE37 encoding and
> > > decoding can be found at http://www.dnsii.org/ace37/ace37-encode.xls
> > > and http://www.dnsii.org/ace37/ace37-decode.xls respectively.
> > >
> > > While DUDE-02 provides a very efficient differential mechanism, its
> > > compression is inefficient as it fails to take advantage of the
> > > base-32 scheme in using all 5-bits for character information. The
> > > AMC series is highly efficient in compression but requires
> > > complicated mode changes and is therefore inefficient to process. LACE
> > > is rather moderate and requires a two-pass mechanism but utilizes
> > > base-32 for good compression.
> > >
> > >
> > >Chung & Leung [Page 1]
> > >ACE37 ACE Utilizing All 37 Alphanumeric Characters July 2001
> > >
> > > ACE37 uses simple character block shifting to achieve the
> > > compression efficiency of the AMC series, retains the one-pass and
> > > one mode XOR differential mechanism used by DUDE while embracing the
> > > base-32 compression used by LACE for efficient character bit
> > > information.
> > >
> > >Terminology
> > >
> > > The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
> > > and "MAY" in this document are to be interpreted as described in RFC
> > > 2119 [RFC2119].
> > >
> > > LDH: Letters, Digits and Hyphens: a string of characters that
> > > consists only of hyphens ("-"), English letters (A-Z, a-z) and digits (0-9),
> > > which might not be a result of an algorithm for transcoding
> > > multilingual characters. For example: whatever-you-want.example
> > >
> > > ACE - ASCII Compatible Encoding: a string of characters resulting
> > > from a particular algorithm for transforming multilingual character
> > > information into an alphanumeric form acceptable by the existing
> > > DNS. For example: bq--3bhc2zmh.tld. In essence, ACE is a subset of
> > > LDH.
> > >
> > > Hexadecimal values are shown preceded by "0x". For example, 0x60
> > > is decimal 96. Binary values are shown preceded by "0b"; for
> > > example, "0b1000" is decimal 8. As in the Unicode Standard
> > > [UNICODE], Unicode code points are denoted by "U+" followed by four
> > > to six hexadecimal digits, while a range of code points (or
> > > hexadecimal numbers) is denoted by two hexadecimal numbers separated
> > > by "..", with no prefixes.
> > >
> > > Octets: sequences of 8 bits; Quintets: sequences of 5 bits;
> > > Quartets: sequences of 4 bits; Duplets: sequences of 2 bits.
> > >
> > > XOR: bitwise exclusive or. Given 2 nonnegative integers A and B, A
> > > XOR B is the nonnegative integer value whose binary representation
> > > is 1 wherever A and B disagree, and 0 wherever they agree.
> > >
> > >Table Of Contents
> > >
> > > 1. Introduction....................................................3
> > > 2. Code Block Shifting.............................................4
> > > 3. Base-32 Characters..............................................5
> > > 4. Base-4 Characters...............................................6
> > >
> > > 5. LDH Considerations..............................................9
> > > 6. Encoding Procedure..............................................9
> > > 7. Decoding Procedure.............................................11
> > > 8. Examples.......................................................13
> > > 9. Summary & Comparisons..........................................15
> > > 10. Security Considerations.......................................16
> > > 11. References....................................................16
> > >
> > >
> > >1. Introduction
> > >
> > > ACE37 takes into account the recommendations and findings of the ACE
> > > design team to create a "super-ACE" that incorporates the key
> > > advantages of the various considered ACEs without complicated mode
> > > changes. The encoding (Section 6) and decoding (Section 7) processes
> > > are largely similar to, and as simple as, those of DUDE-02. The encoding
> > > process for ACE37 in comparison with DUDE-02 can be summarized as:
> > >
> > > ACE37 Encoding Procedure         | DUDE Encoding Procedure
> > > ---------------------------------+---------------------------------
> > > (1) let initial prev = 0x00      | (1) let initial prev = 0x60
> > > (2) if n = LDH output "-n"       | (2) if n = hyphen output "-"
> > > (3) code block shift to obtain   | (3) diff = prev XOR n
> > >     ACE37 shifted n (Section 2)  | (4) prepend "0" to the last
> > > (4) diff = prev XOR n            |     quartet and "1" to others
> > > (5) output in appropriate base-4 | (5) output a base-32 character
> > >     and base-32 form             |     for each corresponding
> > >     (Sections 3&4)               |     quintet
> > > (6) let prev = n                 | (6) let prev = n
> > >
> > > Similarly, the decoding process can be described and compared:
> > >
> > > ACE37 Decoding Procedure         | DUDE Decoding Procedure
> > > ---------------------------------+---------------------------------
> > > (1) let initial prev = 0x00      | (1) let initial prev = 0x60
> > > (2) if char = hyphen discard "-" | (2) if char = hyphen consume
> > >     and output next char         |     and output 0x002D
> > > (3) consume and convert char into| (3) consume and convert to
> > >     duplets and quintets         |     quintets until encountering
> > >     (according to Sections 3&4)  |     a quintet with "0"
> > > (4) concatenate to form diff     |     as first bit
> > >     (based on Sections 4.1&4.2)  | (4) strip all first bits off
> > > (5) let prev = prev XOR diff     | (5) concatenate to form diff
> > > (6) reverse code block shifting  | (6) let prev = prev XOR diff
> > > (7) output Unicode code point    | (7) output Unicode code point
> > >
> > > The features of ACE37 include:
> > >
> > > Unique & Reversible - the ACE37 encoding scheme yields a unique and
> > > consistent result string for a given set of Unicode code points.
> > > The encoded string could be decoded back to the original Unicode
> > > code points without loss of character data.
> > >
> > > Simple - ACE37 utilizes a one-pass system and the XOR differential
> > > function to encode and decode. Code block shifting is done by a
> > > simple calculation instead of mapping or creation of arbitrary
> > > reference points. Complex mode changes are not required.
> > >
> > > Spacious - With the code block shifting coupled with a base-32
> > > scheme, ACE37 can accommodate up to 21 unique Han characters
> > > (including CJK) within the 63 octets allowed by the DNS. Other
> > > Latin based scripts can reach up to 31 characters.
> > >
> > >
> > > Completeness - any sequence of Unicode code points
> > > (U+0000..U+10FFFF) could be encoded. Restrictions on allowed code
> > > points are not discussed, but it is expected that Nameprep [Nameprep]
> > > will be used prior to ACE37 encoding.
> > >
> > > In essence, it captures the focus criteria discussed by the
> > > workgroup ACE design team - reversibility, simplicity and
> > > compression capability. Moreover, ACE37 utilizes a very simple code
> > > block shifting (Section 2) mechanism to allow up to any 21 CJK
> > > ideographs to be encoded within the 63-octet constraint.
> > >
> > >2. Code Block Shifting
> > >
> > > While the DNS was not originally designed for multilingual
> > > characters, Unicode was not designed with the DNS in mind and
> > > therefore code points were apparently not allocated in an ACE-
> > > friendly way.
> > >
> > > The AMC series [AMC-W & AMC-V] utilizes a number of reference points
> > > to achieve better compression efficiency by anticipating and
> > > minimizing delta between characters. For ACE37, a much simpler
> > > rendering is used. More specifically, the entire character block
> > > U+3000..U+9FFF for CJK ideographs is shifted down by 0x3000. That
> > > is U+3000 will become 0x0000, U+4000 becomes 0x1000, and so on. To
> > > compensate for the downwards shift, the general script and symbol
> > > characters in U+0000..U+2FFF will be shifted upwards by 0x7000.
> > > Therefore, U+0100 will become 0x7100, U+2000 becomes 0x9000, and so
> > > on. All other code points (U+A000..U+10FFFF) are unchanged.
> > >
> > > Original Unicode Allocation         | ACE37 Code Block Shifted
> > > ------------------------------------|-------------------------------
> > > General Scripts  U+0000..U+1FFF     | 0x7000..0x8FFF
> > > Symbols          U+2000..U+2FFF     | 0x9000..0x9FFF
> > > CJK Misc         U+3000..U+3FFF     | 0x0000..0x0FFF
> > > CJK Ideographs   U+4000..U+9FFF     | 0x1000..0x6FFF
> > > Hangul, others   U+A000..U+10FFFF   | 0xA000..0x10FFFF (unchanged)
> > >
> > >
> > >
> > > This shifting effectively moves the entire Han library to within
> > > 0x6FFF and therefore could be represented in 15-bits or exactly 3
> > > base-32 characters. (details on base-32 characters in Section 3)
> > >
> > > For example, the Chinese character for <change> with the original
> > > Unicode code point at U+8F49, will be shifted to 0x5F49 and can be
> > > represented in 3 quintets, and in turn with 3 base-32 characters:
> > >
> > > Character: <change>
> > > Unicode Code Point: U+8F49
> > > ACE37 Shifted: 0x5F49
> > > Corresponding Quartets: 0101 1111 0100 1001
> > > Resulting Quintets: 10111 11010 01001
> > > Base-32: nq9 (further discussed in Section 3)
> > >
> > > This in turn means that any Chinese character could be represented
> > > with 3 base-32 characters making the total possible characters
> > > within a label, even without further compression introduced by the
> > > XOR differential process (Section 6), to be at least 21. The ACE37
> > > code block shifting process could be described as follows:
> > >
> > > for each input code point = n
> > >   if n <= 0x9FFF
> > >     n = n - 0x3000 /*downwards shifting*/
> > >     if n < 0
> > >       n = n + 0xA000 /*compensation for U+0000..U+2FFF*/
> > >
> > > The character block shifting introduced here is extremely simple and
> > > utilizes simple calculation that requires no mapping function. At
> > > the same time, it achieves the goal in adjusting the Unicode
> > > allocation so that it becomes more ACE friendly.
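A minimal Python sketch of the code block shifting described above, using the offsets given in the prose (U+3000..U+9FFF down by 0x3000, U+0000..U+2FFF up by 0x7000). The function names shift and unshift are made up for this illustration and do not come from the draft.

```python
def shift(n: int) -> int:
    """Map a Unicode code point to its ACE37 shifted value (Section 2)."""
    if not 0 <= n <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if n > 0x9FFF:
        return n            # U+A000..U+10FFFF are unchanged
    if n >= 0x3000:
        return n - 0x3000   # CJK block moves down to 0x0000..0x6FFF
    return n + 0x7000       # U+0000..U+2FFF moves up to 0x7000..0x9FFF


def unshift(s: int) -> int:
    """Reverse the shift, as in the last step of decoding."""
    if s > 0x9FFF:
        return s
    if s <= 0x6FFF:
        return s + 0x3000
    return s - 0x7000


print(hex(shift(0x8F49)))  # the <change> example from the text
```

The two functions are inverses of each other over the whole code point range, which is what makes the shift safe to apply before the XOR step.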
> > >
> > >3. Base-32 Characters
> > >
> > > Base-32 characters are used in LACE for compression, while DUDE-02
> > > and the AMC series only utilize them for quartet flagging to indicate
> > > the last quartet of each encoded code point. ACE37 utilizes base-32
> > > characters for compression while base-4 characters, which will be
> > > introduced in Section 4, determine the compressed code point
> > > brackets.
> > >
> > > The following table shows the 32 base-32 characters and their
> > > corresponding quintets:
> > >
> > > Base-32 Character =to= Corresponding Quintet
> > > 0 = 00000 8 = 01000 g = 10000 o = 11000
> > > 1 = 00001 9 = 01001 h = 10001 p = 11001
> > > 2 = 00010 a = 01010 i = 10010 q = 11010
> > > 3 = 00011 b = 01011 j = 10011 r = 11011
> > > 4 = 00100 c = 01100 k = 10100 s = 11100
> > > 5 = 00101 d = 01101 l = 10101 t = 11101
> > > 6 = 00110 e = 01110 m = 10110 u = 11110
> > > 7 = 00111 f = 01111 n = 10111 v = 11111
> > >
> > >
> > > With this layout of base-32 characters, it is also possible to
> > > implement a computation based base-32 conversion instead of having
> > > to resort to mapping and lookup tables:
> > >
> > > For each quintet = q
> > > if q <= 0x0F
> > > then hex dump q to form base-32 character
> > > if 0x10 <= q <= 0x1F
> > > then q = q - 0x10
> > > and char(q + 0x67) to form base-32 character
> > >
> > > Note that 0x67 is the code value for the letter "g". Therefore, for
> > > example if the quintet is 0b10001 its base-32 character can be
> > > obtained by:
> > >
> > > 0x10 <= q=0b10001=0x11 <= 0x1F
> > > therefore q = q - 0x10 = 0x11 - 0x10 = 0x01
> > > and base-32 character = char(0x01 + 0x67)
> > > char(0x68) = "h"
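The computed conversion above can be sketched in Python as follows. This is only an illustration; b32_char, b32_value and ALPHABET are hypothetical names, not part of the draft.

```python
def b32_char(q: int) -> str:
    """Convert one quintet (0..31) to its base-32 character without a table."""
    if not 0 <= q <= 0x1F:
        raise ValueError("a quintet must be in 0..31")
    if q <= 0x0F:
        return format(q, "x")        # hex dump: 0-9, a-f
    return chr(q - 0x10 + 0x67)      # 0x67 is "g", so 0x10..0x1F map to g..v


# The full alphabet, derived from the computation rather than typed in.
ALPHABET = "".join(b32_char(q) for q in range(32))


def b32_value(c: str) -> int:
    """Inverse mapping; base-32 characters are treated case-insensitively."""
    c = c.lower()
    if len(c) != 1 or c not in ALPHABET:
        raise ValueError("not a base-32 character")
    return ALPHABET.index(c)


# The three quintets of the Section 2 example (10111 11010 01001).
print("".join(b32_char(q) for q in (0b10111, 0b11010, 0b01001)))
```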
> > >
> > >4. Base-4 Characters
> > >
> > > ACE37 goes beyond the 32 characters (base-32) to include the
> > > remaining 4 characters {w,x,y,z} in the alphabet. These base-4
> > > characters enable ACE37 to better utilize the existing "resources"
> > > (the allowed characters) to represent IDN character information,
> > > therefore making its encoding more efficient.
> > >
> > > The set of base-4 characters are {w,x,y,z} and will be used to
> > > represent the following duplets (duplets are groups containing 2
> > > bits):
> > >
> > > Base-4 Character =to= Corresponding Duplet
> > > w = 00
> > > x = 01
> > > y = 10
> > > z = 11
> > >
> > >4.1 Base-4 Indicators
> > >
> > > Base-4 characters, while carrying character information, also double
> > > as an indicator for code point brackets. In DUDE-02, an extra bit
> > > was pre-pended to each quartet. The last quartet of each encoded
> > > code point will be pre-pended with "0", marking the end of the code
> > > point. In ACE37, base-4 characters will determine the length
> > > (number of ACE37 characters) of the encoded code point. Actually,
> > > to be more precise, the encoded bits are in fact the "diff" and not
> > > the code point itself (diff carries the same meaning as in DUDE-02
> > > and is further discussed in Sections 6 & 7)
> > >
> > >
> > >
> > >
> > > The following table explains how base-4 characters are combined with
> > > base-32 characters to form a representation of a diff (key: b4=base-
> > > 4, b32=base-32):
> > >
> > > diff value |bits| ACE37 Form
> > > -------------------------|----|----------------------------
> > > diff<=0x7F | 7 | <b4><b32>
> > > 0x80<=diff<=0x7FFF | 15 | <b32><b32><b32>
> > > 0x8000<=diff<=0x1FFFF | 17 | w<b4><b32><b32><b32>
> > > 0x20000<=diff<=0xFFFFF | 20 | ww<b32><b32><b32><b32>
> > > 0x100000<=diff<=0x10FFFF | 22 | <b4>w<b32><b32><b32><b32>
> > >
> > > Note that the "bits" column represents the maximum number of
> > > significant bits for the given diff value. For example when
> > > diff<=0x7F, the maximum value is 0b1111111, therefore the number of
> > > significant bits is 7.
> > >
> > > Note also that to encode a 17-bit diff, the letter "w" is used as an
> > > indicator to distinguish the sequence from the 7 bit diff where a
> > > base-32 character is expected to follow a base-4 character. Since
> > > "w" represents "00" that has no value, it will not be used in the
> > > base-4 representation for a 17-bit diff (if a "00" is used, it means
> > > that there are only 15 significant bits and therefore should use the
> > > 15 bit diff form). This is the case for the 20-bit form as well.
> > > The "w" is used as an arbitrary indicator in the 22-bit form and
> > > MUST be discarded during decoding.
> > >
> > > By analyzing the ACE37 form, an encoded string could be successfully
> > > returned to its original form. There is no overlap and the form can
> > > be determined precisely. The following 5 rules dictate the 5
> > > different ACE37 forms:
> > >
> > > (1) Encode: if diff<=0x7F
> > > Decode: if first character is <b4> AND next character NOT <b4>
> > > Then it MUST be in 7-bit form: <b4><b32>
> > >
> > > (2) Encode: if 0x80<=diff<=0x7FFF
> > > Decode: if first character is <b32>
> > > Then it MUST be a 15-bit form: <b32><b32><b32>
> > >
> > > (3) Encode: if 0x8000<=diff<=0x1FFFF
> > > Decode: if first character is "w" AND next character is <b4>
> > > AND NOT "w"
> > > Then it MUST be in 17-bit form: w<b4><b32><b32><b32>
> > >
> > > (4) Encode: if 0x20000<=diff<=0xFFFFF
> > > Decode: if first character is "w" AND next character is "w"
> > > Then it MUST be in 20-bit form: ww<b32><b32><b32><b32>
> > >
> > > (5) Encode: if 0x100000<=diff<=0x10FFFF
> > > Decode: if first character is <b4> AND NOT "w"
> > > AND next character is "w"
> > > Then it MUST be 22-bit form: <b4>w<b32><b32><b32><b32>
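As a rough illustration of the five general forms (not the first-code-point variant), here is a Python sketch. The names encode_diff, detect_form and quintets are invented for this example. It assumes the 22-bit form is used for every diff from 0x100000 up to the 22-bit limit 0x3FFFFF, since the XOR of two code points can exceed 0x10FFFF.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"  # quintet values 0..31
B4 = "wxyz"                                # duplet values 0..3


def quintets(value: int, count: int) -> str:
    """Render the low 5*count bits of value as base-32 characters."""
    return "".join(B32[(value >> (5 * i)) & 0x1F] for i in reversed(range(count)))


def encode_diff(diff: int) -> str:
    """Encode one diff using general rules (1) to (5)."""
    if not 0 <= diff <= 0x3FFFFF:
        raise ValueError("diff does not fit in 22 bits")
    if diff <= 0x7F:
        return B4[diff >> 5] + quintets(diff, 1)         # <b4><b32>
    if diff <= 0x7FFF:
        return quintets(diff, 3)                         # <b32><b32><b32>
    if diff <= 0x1FFFF:
        return "w" + B4[diff >> 15] + quintets(diff, 3)  # w<b4> + 3 quintets
    if diff <= 0xFFFFF:
        return "ww" + quintets(diff, 4)                  # ww + 4 quintets
    return B4[diff >> 20] + "w" + quintets(diff, 4)      # <b4>w + 4 quintets


def detect_form(s: str) -> tuple[int, int]:
    """Return (bit width, character count) of the form that s starts with."""
    first, second = s[0], s[1:2]
    if first in B32:
        return 15, 3
    if first == "w":
        if second == "w":
            return 20, 6
        if second in ("x", "y", "z"):
            return 17, 5
        return 7, 2
    if first in ("x", "y", "z"):
        return (22, 6) if second == "w" else (7, 2)
    raise ValueError("not an ACE37 character")


for d in (0x11, 0x5F49, 0x8000, 0x20000, 0x100000):
    print(hex(d), encode_diff(d), detect_form(encode_diff(d)))
```

Looking only at the first two characters is enough to pick the form, which is the "no overlap" property claimed above.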
> > >
> > >
> > > Note that the ACE37 scheme can effectively encode a diff of up to 22
> > > significant bits or 0x3FFFFF. The Unicode code points are expected
> > > to range only between 0x0000..0x10FFFF, therefore ACE37 will be able
> > > to handle any Unicode code point.
> > >
> > > Additionally, base-4 characters (and sometimes base-32 characters)
> > > could be used for mixed-case annotation. This optional mixed-case
> > > annotation mechanism is discussed in Appendix B.
> > >
> > >4.2 First Code Point Considerations
> > >
> > > There are additional considerations for the first code point that is
> > > encoded or decoded to ensure that if the first code point is within
> > > the first Unicode plane (U+0000..U+FFFF), it will not occupy more
> > > than 4 ACE37 characters.
> > >
> > > This special consideration affects only Rules (1), (3) and (4)
> > > explained in Section 4.1. Rule (1) is discarded for the first code
> > > point, therefore any diff up to 0x7FFF will be in the form
> > > <b32><b32><b32>. The form for Rule (3) becomes simply
> > > <b4><b32><b32><b32> without the "w" indicator. Similarly, the form
> > > for Rule (4) becomes w<b32><b32><b32><b32> with one less "w".
> > >
> > > The first code point considerations can be summarized in the
> > > following 4 rules:
> > >
> > > (a) Encode: if diff<=0x7FFF
> > > Decode: if first character is <b32>
> > > Then it MUST be in 15-bit form: <b32><b32><b32>
> > >
> > > (b) Encode: if 0x8000<=diff<=0x1FFFF
> > > Decode: if first character is <b4> AND NOT "w"
> > > Then it MUST be in 17-bit form: <b4><b32><b32><b32>
> > >
> > > (c) Encode: if 0x20000<=diff<=0xFFFFF
> > > Decode: if first character is "w"
> > > Then it MUST be in 20-bit form: w<b32><b32><b32><b32>
> > >
> > > (d) Encode & Decode: same as Rule (5) in Section 4.1
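A small Python sketch of rules (a) to (d) for the first code point. encode_first_diff and quintets are hypothetical names used only here; as in the text above, rule (d) reuses the 22-bit form unchanged.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"  # quintet values 0..31
B4 = "wxyz"                                # duplet values 0..3


def quintets(value: int, count: int) -> str:
    """Render the low 5*count bits of value as base-32 characters."""
    return "".join(B32[(value >> (5 * i)) & 0x1F] for i in reversed(range(count)))


def encode_first_diff(diff: int) -> str:
    """Encode the diff of the first code point (prev = 0x00)."""
    if not 0 <= diff <= 0x3FFFFF:
        raise ValueError("diff does not fit in 22 bits")
    if diff <= 0x7FFF:
        return quintets(diff, 3)                     # (a) no 7-bit form here
    if diff <= 0x1FFFF:
        return B4[diff >> 15] + quintets(diff, 3)    # (b) no leading "w"
    if diff <= 0xFFFFF:
        return "w" + quintets(diff, 4)               # (c) one "w" instead of two
    return B4[diff >> 20] + "w" + quintets(diff, 4)  # (d) same as rule (5)


# A first code point in the first plane never needs more than 4 characters.
print(encode_first_diff(0x5F49), encode_first_diff(0xAC00))
```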
> > >
> > > Besides special considerations for base-4 character usage, prev
> > > setting is also specially considered for the first code point. As
> > > laid out in Section 6, in order to detect the first code point,
> > > the prev is evaluated. If prev = 0x00, it is assumed that it is the
> > > first code point as 0x00 SHOULD NOT be a permitted character for
> > > input. When an LDH is the first code point, there is a need to make
> > > a special consideration. Regularly, if n = LDH is encountered
> > > (Section 5), it will be output as "-n" and prev is not changed.
> > > However, if the first code point is an LDH, after outputting "-n",
> > > prev is updated to = lowercase(n). This is to ensure and maintain
> > > that only the first code point coming in will have a prev = 0x00.
> > >
> > >
> > >5. LDH Considerations
> > >
> > > Finally, the 37th character of the entire LDH repertoire, the hyphen
> > > will be used to indicate LDH exceptions. Extending the hyphen
> > > consideration of DUDE-02, ACE37 gives special consideration for the
> > > entire LDH repertoire. All LDH characters will be encoded "as is"
> > > with the addition of a leading hyphen. For example, the character
> > > "a" will be encoded within ACE37 as "-a". The hyphen character "-"
> > > will be encoded as "--".
> > >
> > > This ensures that each LDH character will only take up 2 character
> > > spaces within an ACE37 encoded string and also will allow
> > > administrators to see the actual characters, similar to the AMC
> > > series. Unlike the AMC series however, the hyphen is not used to
> > > indicate an ongoing mode change, but only the following character.
> > > Therefore retaining the simplicity of the DUDE-02 single-mode,
> > > single-pass philosophy.
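The LDH rule is simple enough to show in a few lines of Python. This is a sketch only; LDH and encode_ldh are names chosen for the example.

```python
import string

# The LDH repertoire: ASCII letters (either case), digits and the hyphen.
LDH = set(string.ascii_letters + string.digits + "-")


def encode_ldh(ch: str) -> str:
    """Encode one LDH character "as is" behind a leading hyphen."""
    if ch not in LDH:
        raise ValueError("not an LDH character")
    return "-" + ch


print("".join(encode_ldh(c) for c in "a-1"))
```

Each LDH character costs exactly 2 characters of output, so an all-LDH name of 31 characters still fits in a 63-octet label.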
> > >
> > >6. Encoding Procedure
> > >
> > > Similar to DUDE, all ordering of bits and quartets is big-endian.
> > > The following describes the encoding procedure:
> > >
> > > Set initial value for prev = 0x00
> > > for each input code point = n
> > >   if n is an LDH {A-Z, a-z, 0-9, -}
> > >     output "-n"                    (Section 5: LDH Considerations)
> > >     if prev = 0x00                 (Section 4.2: First Code Point)
> > >       let prev = lowercase(n)
> > >   else perform code block shifting (Section 2: Code Block Shifting)
> > >     let diff = prev XOR n          (n after code block shifting)
> > >     (output forms below: Section 4, Base-4 Characters)
> > >     if diff<=0x7F
> > >       if this is the first code point (Section 4.2)
> > >         output 15-bit form: <b32><b32><b32>
> > >       else, output 7-bit form: <b4><b32>
> > >     if 0x80<=diff<=0x7FFF
> > >       output 15-bit form: <b32><b32><b32>
> > >     if 0x8000<=diff<=0x1FFFF
> > >       if this is the first code point (Section 4.2)
> > >         output 17-bit form: <b4><b32><b32><b32>
> > >       else, output 17-bit form: w<b4><b32><b32><b32>
> > >     if 0x20000<=diff<=0xFFFFF
> > >       if this is the first code point (Section 4.2)
> > >         output 20-bit form: w<b32><b32><b32><b32>
> > >       else, output 20-bit form: ww<b32><b32><b32><b32>
> > >     if 0x100000<=diff<=0x10FFFF
> > >       output 22-bit form: <b4>w<b32><b32><b32><b32>
> > >     let prev = n
> > > end and obtain next n and return to: "for each input code point = n"
> > >
> > > The following is a more comprehensive pseudo code:
> > >
> > > let prev = 0x00
> > > for each input integer n (in order) do begin
> > > if n = "-" or "0..9" or "A..Z" or "a..z"
> > > then output "hyphen"+"char(n)"
> > > if prev = 0x00
> > > let prev = lowercase(n)
> > >
> > > else begin
> > > if n = 0x00
> > > then error and abort
> > > if n <= 0x9FFF
> > > n = n - 0x3000
> > > if n < 0
> > > then n = 0xA000 + n
> > >
> > > let diff = prev XOR n
> > >
> > > if diff <= 0x7F
> > > if prev = 0x00
> > > then output with 3 base-32 characters
> > > else, output first 2 bits with a base-4 character {wxyz}
> > > and remaining 5 bits with 1 base-32 character
> > >
> > > if 0x80 <= diff <= 0x7FFF
> > > then output all 15 bits with base-32 characters
> > >
> > > if 0x8000 <= diff <= 0xFFFF
> > > if prev = 0x00
> > > then output first 2 bits with a base-4 {xyz} (except w)
> > > and output remaining 15 bits with base-32
> > > else, output "w"
> > > and output first 2 bits with a base-4 {xyz} (except w)
> > > and output remaining 15 bits with base-32
> > >
> > > if 0x10000 <= diff <= 0x1FFFF
> > > then output "w"
> > > and output first 2 bits with a base-4 {xyz} (except w)
> > > and output remaining 15 bits with base-32
> > >
> > > if 0x20000 <= diff <= 0xFFFFF
> > > if prev = 0x00
> > > then output "w"
> > > else, output "ww"
> > > and output all 20 bits with base-32 characters
> > >
> > > if 0x100000 <= diff <= 0x10FFFF
> > > then output first 2 bits with a base-4 {xyz} (except w)
> > > and output "w"
> > > and output remaining 20 bits with base-32
> > >
> > > let prev = n
> > > end
> > > end
> > >
> > > Nameprep [NAMEPREP] is not discussed in this document, but it is
> > > expected to be implemented for IDN. Regardless of the code point
> > > presented, an encoder MUST NOT produce an incorrect output. The
> > > encoder MUST fail if it encounters a negative input value.
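Putting the pieces together, a rough Python sketch of the encoding procedure could look like the following. It is an illustration under stated assumptions, not the draft's reference code: ace37_encode and its helpers are invented names, no ACE prefix or Nameprep step is applied, the shift uses the offsets from the Section 2 prose, and for a leading LDH character prev is set to the unshifted code point of lowercase(n).

```python
import string

B32 = "0123456789abcdefghijklmnopqrstuv"  # quintet values 0..31
B4 = "wxyz"                                # duplet values 0..3
LDH = set(string.ascii_letters + string.digits + "-")


def shift(n: int) -> int:
    """Code block shifting (Section 2)."""
    if n > 0x9FFF:
        return n
    return n - 0x3000 if n >= 0x3000 else n + 0x7000


def quintets(value: int, count: int) -> str:
    """Render the low 5*count bits of value as base-32 characters."""
    return "".join(B32[(value >> (5 * i)) & 0x1F] for i in reversed(range(count)))


def encode_diff(diff: int, first: bool) -> str:
    """Pick the output form (Sections 4.1 and 4.2)."""
    if diff <= 0x7FFF:
        if diff <= 0x7F and not first:
            return B4[diff >> 5] + quintets(diff, 1)
        return quintets(diff, 3)
    if diff <= 0x1FFFF:
        return ("" if first else "w") + B4[diff >> 15] + quintets(diff, 3)
    if diff <= 0xFFFFF:
        return ("w" if first else "ww") + quintets(diff, 4)
    return B4[diff >> 20] + "w" + quintets(diff, 4)


def ace37_encode(text: str) -> str:
    prev = 0x00
    out = []
    for ch in text:
        if ch in LDH:
            out.append("-" + ch)            # Section 5: literal after a hyphen
            if prev == 0x00:
                prev = ord(ch.lower())      # assumption: unshifted value
            continue
        if ord(ch) == 0x00:
            raise ValueError("U+0000 is not permitted as input")
        n = shift(ord(ch))
        out.append(encode_diff(prev ^ n, first=(prev == 0x00)))
        prev = n
    return "".join(out)


print(ace37_encode("\u4e2d\u6587"))  # two Han characters, 3 characters each
```

Two consecutive Han characters that differ only in their low 7 bits after shifting take 2 characters for the second one, which is where the XOR differential helps beyond the 21-character worst case.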
> > >
> > >
> > > The initial value used is 0x00 so that all domains beginning with a
> > > CJK ideograph or within row 0 (U+0000..U+0FFF) will be shorter.
> > > Note that after the code block shifting (Section 2), the entire Han
> > > library is within 0x0000..0x6FFF, while row 0 is fitted to
> > > 0x7000..0x7FFF. Therefore by using an initial value of 0x00 the
> > > diff for all Han and row 0 characters will be at most 0x7FFF. The
> > > initial value is also used as a check point for the first code point
> > > considerations (Section 4.2).
> > >
> > > Additionally, an optional mixed-case annotation mechanism is
> > > discussed in Appendix B.
> > >
> > >7. Decoding Procedure
> > >
> > > A thorough description of the decoding rules, except for the final
> > > reversal of the code block shifting has been presented in Sections
> > > 4.1 and 4.2. The following description is a brief representation of
> > > the decoding procedure:
> > >
> > > let prev = 0x00
> > > while the input string is not exhausted
> > > if present character = hyphen (Section 5: LDH
> > > discard and output next character Considerations)
> > > else, depending on the presented form (Section 4)
> > > convert into duplets and quintets (Section 4 & 3)
> > > and concatenate to form diff
> > > let prev = prev XOR diff
> > > reverse code block shifting: (Section 2)
> > > if prev<=0x9FFF
> > > and if prev<=0x6FFF
> > > output character = prev + 0x3000
> > > else, output character = prev - 0x7000
> > > else output character = prev
> > > output character
> > > End
> > >
> > > The following is a more comprehensive pseudo code for the decoding
> > > procedure:
> > >
> > > let prev = 0x00
> > > while the input string is not exhausted do begin
> > > if present character = hyphen /*Section 5:LDH Considerations*/
> > > then consume and discard hyphen
> > > and obtain the next character
> > > and output character
> > > if prev = 0x00 /*Section 4.2:First Code Point*/
> > > let prev = code block shifted lowercase output character
> > >
> > > else,
> > > if present character = Base-32 characters (0..v)
> > > consume present character and next 2 characters
> > > and convert them to quintets according to Base-32
> > > concatenate the resulting quintets to form diff
> > > /*15 bit form, 0x80<=diff<=0x7FFF*/
> > >
> > > if present character = Base-4 characters {xyz} and NOT w
> > > consume present character
> > > and convert it to a duplet according to Base-4
> > >
> > > if prev = 0x00
> > > obtain and consume next 3 characters
> > > and convert them to quintets according to Base-32
> > > concatenate duplet with the 3 quintets to form diff
> > > /*first code point: 17 bit form, 0x8000<=diff<=0x1FFFF*/
> > >
> > > else, if next character = Base-32 character (0..v)
> > > then consume and convert to quintet according to Base-32
> > > concatenate duplet with the quintet to form diff
> > > /*7 bit form, diff<=0x7F*/
> > >
> > > else, obtain next character
> > > if next character = Base-4 characters {xyz} and NOT w
> > > then fail and indicate error
> > >
> > > else, if next character = w
> > > then consume and discard w and obtain next 4 characters
> > > consume and convert characters to
> > > quintets according to Base-32
> > > concatenate duplet with the 4 quintets to form diff
> > > /*22 bit form, 0x100000<=diff<=0x10FFFF*/
> > >
> > > if present character = w
> > > discard "w" and obtain next character
> > >
> > > if next character = Base-4 characters {xyz} and NOT w
> > >
> > > and if prev = 0x00
> > > obtain and consume next 4 characters
> > > and convert characters to quintets based on Base-32
> > > concatenate the 4 quintets to form diff
> > > /*first code point: 20 bit form,*/
> > > /*0x20000<=diff<=0xFFFFF */
> > >
> > > else, consume and convert to duplet according to Base-4
> > > and obtain and consume next 3 characters
> > > and convert to quintets according to Base-32
> > > concatenate duplet with the 3 quintets to form diff
> > > /*17 bit form, 0x8000<=diff<=0x1FFFF*/
> > >
> > > else, if next character = w
> > > then consume and discard w
> > > and obtain and consume next 4 characters
> > > and convert to quintets according to Base-32
> > > concatenate the 4 quintets to form diff
> > > /*20 bit form, 0x20000<=diff<=0xFFFFF*/
> > >
> > > else, if next character = Base-32 character (0..v)
> > > then convert to quintet according to Base-32
> > > set quintet to diff
> > > /*7 bit form, diff<=0x7F*/
> > >
> > > fail upon encountering a non-ACE37 character
> > > or end-of-input
> > >
> > > let prev = prev XOR diff
> > >
> > > if prev <= 0x9FFF /*reversal of the code */
> > > and if prev <= 0x6FFF /*block shifting described*/
> > > output = prev + 0x3000 /*in Section 2 */
> > > else, output = prev - 0x7000
> > > else, output prev
> > > end
> > > end
> > > encode the output sequence and compare it to the input string
> > > fail if they do not match (case insensitively)
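
As a minimal sketch (not part of the draft), the reversal of the code block shifting at the end of the decoding loop can be written in Python as below. The function names are hypothetical. `shift_block` is an assumption: it is derived here as the inverse of the reversal shown in the pseudo code, not copied from the encoding procedure.

```python
def unshift_block(prev: int) -> int:
    """Reverse the code block shift, following the decoding pseudo code."""
    if prev <= 0x9FFF:
        if prev <= 0x6FFF:
            return prev + 0x3000   # 0x0000..0x6FFF -> U+3000..U+9FFF
        return prev - 0x7000       # 0x7000..0x9FFF -> U+0000..U+2FFF
    return prev                    # values above 0x9FFF are not shifted


def shift_block(code_point: int) -> int:
    """Assumed forward shift: simply the inverse of unshift_block."""
    if code_point > 0x9FFF:
        return code_point
    if code_point >= 0x3000:
        return code_point - 0x3000
    return code_point + 0x7000


if __name__ == "__main__":
    for cp in (0x0070, 0x30D1, 0x4E2D, 0xC138):
        print(f"U+{cp:04X} -> 0x{shift_block(cp):04X}")
```

Under this reading, U+3000..U+9FFF occupy 0x0000..0x6FFF in the shifted space, so the XOR of any two code points in that range fits in 15 bits.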
> > >
> > >8. Examples
> > >
> > > ACE37 is likely to be implemented with an ACE prefix in the form
> > > "xx--". The actual prefix to be used is not discussed in this
> > > document. The following examples are taken from the mailing list as
> > > well as from DUDE-02 and the AMC series. The resulting ACE37 string
> > > is compared with that using DUDE:
> > >
> > > (A) JPNIC (the registry of .jp domain)
> > >
> > > Unicode: U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3
> > > U+30C8 U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9
> > > U+30E1 U+30FC U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF
> > > U+30FC
> > > ACE37: i9urut6hm8jfaqv0m9dv1wewbx7wjyjwbynx6zsy8wtybygwky8y8ycy3
> > > (57 char)
> > > DUDE-02: (error: result string exceeds 59 characters*)
> > > Note: 59 characters is the maximum allowable when the ACE
> > > prefix "xx--" is included
> > >
> > >
> > > (B) A health-insurance organization in Tokyo
> > >
> > > Unicode: U+6771 U+4EAC U+90FD U+60C5 U+5831 U+30B5 U+30FC U+30D3
> > > U+30B9 U+7523 U+696D U+5065 U+5EB7 U+4FDD U+967A U+7D44
> > > U+5408
> > > ACE37: drhaetvihk1o67ka44y9xfzahcqv2e6883micbaud7apuqac (48 char)
> > > DUDE-02: (error: result string exceeds 59 characters)
> > >
> > >
> > > (C) 6 hangul syllables
> > >
> > > Unicode: U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC
> > > ACE37: xg9orfsqssvfg3i8t2c (19 char)
> > > DUDE-02: 6txiy79ny53nz79a8wizwwn (23 char)
> > >
> > >
> > > (D) maji<de>koi<suru>5<byou><mae> (Latin, hiragana, kanji)
> > >
> > > Unicode: U+006D U+0061 U+006A U+0069 U+3067 U+006B U+006F U+0069
> > > U+3059 U+308B U+0035 U+79D2 U+524D
> > > ACE37: -m-a-j-is0a-k-o-ixu06i-5iapqsv (30 char)
> > > DUDE-02: pnmdvssqvssnegvsva7cvs5qz38hu53r (32 char)
> > >
> > >
> > > (E) <pafii>de<runba> (Latin, katakana)
> > >
> > > Unicode: U+30D1 U+30D5 U+30A3 U+30FC U+0064 U+0065 U+30EB U+30F3
> > > U+30D0
> > > ACE37: 06hw4zmyv-d-ewnwox3 (19 char)
> > > DUDE-02: vs5bezgxrvs3ibvs2qtiud (22 char)
> > >
> > >
> > > (F) <sono><supiido><de> (hiragana, katakana)
> > >
> > > Unicode: U+305D U+306E U+30B9 U+30D4 U+30FC U+30C9 U+3067
> > > ACE37: 02txj06nzdx8xl05e (17 char)
> > > DUDE-02: vsvpvd7hypuivf4q (16 char)
> > >
> > >
> > > (G) 2 Arbitrary Plane Two Code Points
> > >
> > > Unicode: U+261AF U+261BF
> > > ACE37: w4odfwg (7 char)
> > > DUDE-02: uyt6rta (7 char)
> > >
> > >
> > > (H) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
> > >
> > > Unicode: U+0050 U+0072 U+006F U+010D U+0070 U+0072 U+006F U+0073
> > > U+0074 U+011B U+006E U+0065 U+006D U+006C U+0075 U+0076
> > > U+00ED U+010D U+0065 U+0073 U+006B U+0079
> > > ACE37: -p-r-o0bt-p-r-o-s-twm-n-e-m-l-u-v0fm0f0-e-s-k-y (47 char)
> > > DUDE-02: vauctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (44 char)
> > >
> > >
> > > (I) Chinese
> > >
> > > Unicode: U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D
> > > U+6587
> > > ACE37: 7mmfm7oh3n7is3ts5gh57h47ata (27 char)
> > > DUDE-02: w85gt86huuudv69c7szp7s5a6w4h6w2hu54k (36 char)
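
The examples above can be checked mechanically. The following Python sketch is a reduced decoder, not the full procedure of the draft. It handles only the LDH escape, the 15-bit form, the 7-bit form, and the 17-bit and 20-bit first-code-point forms, and it skips the final re-encode check. The value assignments (0-9 then a-v for quintets 0..31, w/x/y/z for duplets 0..3), the forward shift, and all function names are assumptions inferred from the example strings themselves.

```python
BASE32 = "0123456789abcdefghijklmnopqrstuv"  # assumed quintet values 0..31
BASE4 = "wxyz"                                # assumed duplet values 0..3


def shift(cp):
    """Assumed forward block shift (inverse of unshift)."""
    if cp > 0x9FFF:
        return cp
    return cp - 0x3000 if cp >= 0x3000 else cp + 0x7000


def unshift(value):
    """Reversal of the code block shifting, as in the pseudo code."""
    if value > 0x9FFF:
        return value
    return value + 0x3000 if value <= 0x6FFF else value - 0x7000


def quintets(chars, count):
    """Convert exactly `count` base-32 characters to an integer."""
    if len(chars) != count:
        raise ValueError("unexpected end of input")
    value = 0
    for ch in chars:
        value = (value << 5) | BASE32.index(ch)  # ValueError if not 0..v
    return value


def decode(ace):
    s = ace.lower()                     # case annotation is ignored here
    out, prev, i = [], 0, 0
    while i < len(s):
        ch = s[i]
        if ch == "-":                   # LDH escape: a literal follows
            if i + 1 >= len(s):
                raise ValueError("unexpected end of input")
            literal = ord(s[i + 1])
            out.append(literal)
            if prev == 0:               # first code point seeds prev
                prev = shift(literal)
            i += 2
            continue
        if ch in BASE32:                # 15-bit form: three quintets
            diff, i = quintets(s[i:i + 3], 3), i + 3
        elif ch in BASE4 and prev == 0:
            if ch == "w":               # first code point, 20-bit form
                diff, i = quintets(s[i + 1:i + 5], 4), i + 5
            else:                       # first code point, 17-bit form
                diff = (BASE4.index(ch) << 15) | quintets(s[i + 1:i + 4], 3)
                i += 4
        elif ch in BASE4:               # 7-bit form: duplet plus quintet
            diff = (BASE4.index(ch) << 5) | quintets(s[i + 1:i + 2], 1)
            i += 2
        else:
            raise ValueError("non-ACE37 character: " + ch)
        prev ^= diff
        out.append(unshift(prev))
    return out


if __name__ == "__main__":
    print(" ".join(f"U+{cp:04X}" for cp in decode("xg9orfsqssvfg3i8t2c")))
```

With these assumptions, decode("xg9orfsqssvfg3i8t2c") returns the six hangul code points of example (C), and decode("w4odfwg") returns U+261AF U+261BF as in example (G). Forms longer than those listed raise ValueError rather than being guessed at.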
> > >
> > >9. Summary & Comparisons
> > >
> > > In summary, ACE37 is based on the DUDE-02 process with an improved
> > > compression scheme for code point sequences that are less likely to
> > > cluster too closely together, such as CJK ideographs.
> > >
> > > The design team has indicated that 30 characters should generally
> > > be good enough. The Asian community has voiced considerable
> > > concern that 14-15 characters is definitely limiting, while there
> > > is little indication from the Latin community that length is
> > > really a concern. ACE37 therefore set as its objective to increase
> > > the number of characters possible in the worst-case scenario to
> > > closer to 20.
> > >
> > > ACE37 succeeds in this with a very simple variation on the primary
> > > ACEs identified by the design team, achieving dramatically better
> > > performance for CJK characters while maintaining the simplicity of
> > > DUDE.
> > >
> > > Key Improvements of ACE37 over DUDE-02
> > > - much more room for Han characters: the worst case improves to
> > > 21 Han ideographs through code block shifting and full use of
> > > base-32 characters
> > > - no need to arbitrarily prepend flagging bits to identify code
> > > point brackets; base-4 characters and diff forms are used instead
> > > - base-32 and base-4 characters can be easily computed instead of
> > > mapped using lookup tables
> > >
> > > Key Improvements of ACE37 over the AMC series
> > > - a simpler process, utilizing the one-pass differential
> > > mechanism from DUDE-02
> > > - a much simpler code block shifting process achieves in ACE37 a
> > > goal similar to that of the complex multiple reference point
> > > system used by the AMC series
> > > - base-32 and base-4 characters can be easily computed instead of
> > > mapped using lookup tables
> > >
> > > Key Improvements of ACE37 over LACE
> > > - a simpler process, utilizing the one-pass differential
> > > mechanism from DUDE-02
> > > - much more room for Han characters: the worst case improves to
> > > 21 Han ideographs through code block shifting and full use of
> > > base-32 characters
> > > - base-32 and base-4 characters can be easily computed instead of
> > > mapped using lookup tables
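
To illustrate the point about computing characters rather than looking them up, here is a small Python sketch. It assumes the base-32 alphabet is 0-9 followed by a-v (values 0..31) and that w, x, y, z carry the duplets 0..3. The function names are hypothetical and not taken from the draft.

```python
def base32_value(ch: str) -> int:
    """Quintet value of one base-32 character, computed arithmetically."""
    c = ch.lower()
    if "0" <= c <= "9":
        return ord(c) - ord("0")
    if "a" <= c <= "v":
        return ord(c) - ord("a") + 10
    raise ValueError("not a base-32 character: " + ch)


def base32_char(value: int) -> str:
    """Inverse of base32_value for 0 <= value <= 31."""
    if not 0 <= value <= 31:
        raise ValueError("quintet out of range")
    if value < 10:
        return chr(ord("0") + value)
    return chr(ord("a") + value - 10)


def base4_value(ch: str) -> int:
    """Duplet value of one base-4 character (w, x, y or z)."""
    c = ch.lower()
    if "w" <= c <= "z":
        return ord(c) - ord("w")
    raise ValueError("not a base-4 character: " + ch)


if __name__ == "__main__":
    print([base32_value(c) for c in "09av"], [base4_value(c) for c in "wxyz"])
```

Because both alphabets are contiguous runs of ASCII, each conversion is one range test and one subtraction.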
> > >
> > > Two Excel spreadsheets for ACE37 encoding and decoding can be found
> > > at http://www.dnsii.org/ace37/ace37-encode.xls and
> > > http://www.dnsii.org/ace37/ace37-decode.xls respectively. They
> > > illustrate the simplicity of ACE37 and provide a handy tool for
> > > checking ACE37 encoding and decoding algorithms. The ACE37-encode
> > > spreadsheet also includes a DUDE-encode worksheet.
> > >
> > >10. Security Considerations
> > >
> > > This document does not discuss DNS security issues. The proposal
> > > is believed not to introduce security problems beyond those that
> > > already exist or are anticipated from adding multilingual
> > > characters to DNS and/or using ACE.
> > >
> > >11. References
> > >
> > > [AMC-W] Adam M. Costello, "AMC-ACE-W version 0.1.0", May 31, 2001.
> > >
> > > [AMC-V] Adam M. Costello, "AMC-ACE-V version 0.1.0", May 31, 2001.
> > >
> > > [DUDE-02] Mark Welter, Brian W. Spolarich & Adam M.
> > > Costello, "Differential Unicode Domain Encoding (DUDE)",
> > > June 7, 2001.
> > >
> > > [LACE] Mark Davis, IBM & Paul Hoffman, IMC & VPNC, "LACE: Length-
> > > based ASCII Compatible Encoding for IDN", January 5, 2001.
> > >
> > > [Nameprep] Paul Hoffman, IMC & VPNC & Marc Blanchet, ViaGenie,
> > > "Preparation of Internationalized Host Names", February
> > > 24, 2001.
> > >
> > >Appendix A. Acknowledgements
> > >
> > > The ACE37 draft is a combination of DUDE-02, the AMC series and
> > > LACE, and takes into consideration the report of the ACE design
> > > team. The authors would therefore like to thank the authors of
> > > DUDE-02 - Mark Welter, Brian W. Spolarich & Adam M. Costello; the
> > > author of the AMC series - Adam M. Costello; the authors of LACE -
> > > Mark Davis & Paul Hoffman; and the ACE design team and its advisors
> > > - Adam M. Costello, Paul Hoffman, Makoto Ishisone, David Laurence,
> > > Brian Spolarich, Rick Wesson, Marc Blanchet, Patrik Faltstrom and
> > > Erik Nordmark for their inspiration.
> > >
> > >Appendix B. Mixed-case annotation
> > >
> > > This section is taken from DUDE and modified for ACE37
> > >
> > > In order to use ACE37 to represent case-insensitive Unicode strings,
> > > higher layers need to case-fold the Unicode strings prior to ACE37
> > > encoding. The encoded string can, however, use mixed-case base-4
> > > characters as an annotation telling how to convert the folded
> > > Unicode string into a mixed-case Unicode string for display
> > > purposes.
> > >
> > > Each Unicode code point (unless it is an LDH) is represented by a
> > > sequence of base-4 and base-32 characters. The first of these is
> > > usually a base-4 character, which is always a letter {wxyz} (as
> > > opposed to a digit). If that letter is uppercase, it is a
> > > suggestion that the Unicode character be mapped to uppercase (if
> > > possible); if the letter is lowercase, it is a suggestion that the
> > > Unicode character be mapped to lowercase (if possible).
> > >
> > > If the code point is an LDH, for example "a", it will be represented
> > > as "-a". To mark the case for an LDH, simply set the LDH to the
> > > desired case following the "-". For example, if an uppercase "A" is
> > > desired, the encoded form SHOULD be "-A".
> > >
> > > Note that a code point representation may contain no base-4
> > > character at all, as in the 15-bit diff form. In this case the
> > > base-32 characters are used for the case suggestion (if possible),
> > > in the same way as a base-4 character. There is, however, a very
> > > remote possibility that all 3 base-32 characters are digits. If
> > > this happens, case unfolding will be aborted. Since case
> > > annotation is an optional feature used for display purposes only,
> > > this is not considered to be a major concern. Moreover, the
> > > chance of this happening is truly remote: (32639/27)/1114109, or
> > > about 0.1%.
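
As a quick arithmetic check of the figure quoted above, the expression can be evaluated directly. This Python snippet only evaluates the numbers as given in the draft. It does not derive where 32639, 27 or 1114109 come from, and the variable name is hypothetical.

```python
# Evaluate the estimate exactly as written: (32639/27)/1114109
chance = (32639 / 27) / 1114109

print(f"{chance:.4%}")
```

The result is a little under 0.11%, consistent with the rounded 0.1% stated in the text.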
> > >
> > > ACE37 encoders and decoders are not required to support these
> > > annotations, and higher layers need not use them.
> > >
> > > For example: In order to suggest that example (H) in Section 8:
> > > "Examples" be displayed as:
> > > Czech: Pro<ccaron(uppercase)>prost<ecaron(uppercase)>
> > > nemLUV<iacute(lowercase)><ccaron(lowercase)>esky
> > >
> > > one could capitalize the ACE37 encoding as:
> > > ACE37: -P-r-o0BT-p-r-o-s-tWM-n-e-m-L-U-V0fm0f0-e-s-k-y (47 char)
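
A display layer could read the annotation back with something like the following Python sketch. It is a simplification under stated assumptions: it recognizes only the LDH escape, the 15-bit form (three base-32 characters) and the 7-bit form (one base-4 plus one base-32 character), so it suits strings that begin with an LDH, and it takes the first letter of each group as the case suggestion. The function name and the True/False/None convention are hypothetical.

```python
BASE32 = "0123456789abcdefghijklmnopqrstuv"
BASE4 = "wxyz"


def case_hints(ace):
    """One hint per code point: True = uppercase suggested,
    False = lowercase suggested, None = group contains no letter."""
    hints = []
    i = 0
    while i < len(ace):
        ch = ace[i].lower()
        if ch == "-":                 # LDH escape: the literal carries the case
            group, i = ace[i + 1:i + 2], i + 2
        elif ch in BASE32:            # 15-bit form: three base-32 characters
            group, i = ace[i:i + 3], i + 3
        elif ch in BASE4:             # 7-bit form: base-4 plus base-32
            group, i = ace[i:i + 2], i + 2
        else:
            raise ValueError("non-ACE37 character: " + ace[i])
        letters = [c for c in group if c.isalpha()]
        hints.append(letters[0].isupper() if letters else None)
    return hints


if __name__ == "__main__":
    annotated = "-P-r-o0BT-p-r-o-s-tWM-n-e-m-L-U-V0fm0f0-e-s-k-y"
    hints = case_hints(annotated)
    print([n + 1 for n, h in enumerate(hints) if h])  # uppercase positions
```

For the annotated string above, the uppercase suggestions fall on code points 1, 4, 10, 14, 15 and 16, which are P, the first c-caron, e-caron, and L, U, V. Lowercasing the annotated string gives back the unannotated encoding, which is why a case-insensitive comparison in the decoder still succeeds.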
> > >
> > >Authors:
> > >
> > >Edmon Chung
> > >Neteka Inc.
> > >2462 Yonge St. Toronto,
> > >Ontario, Canada M4P 2H5
> > >edmon@neteka.com
> > >
> > >David Leung
> > >Neteka Inc.
> > >2462 Yonge St. Toronto,
> > >Ontario, Canada M4P 2H5
> > >david@neteka.com
> >