[idn] Re: permission <draft-ietf-idn-ace37-00.txt (attach)
Hi all,
I was unaware that the workgroup no longer accepts new drafts. Anyway, I
have drafted a new ACE that builds on the simplicity of DUDE but compresses
far better: even in the worst case, a CJK label can hold 21 Han characters.
Attached below is a copy of the draft (from my original submission). You can
also find it at http://www.dnsii.org/idn-ace37-00.txt (easier to read) and,
hopefully, on the i-d-n.net website soon.
ACE37 is based on the one-pass, one-mode scheme of DUDE (differential XOR).
It then uses simple code block shifting (similar to the reference points
in the AMC series) to greatly increase the capacity for CJK (21 Han
characters in the worst case) and uses base-32 for compression, as in LACE
(DUDE and AMC-W/V use base-32 only for flagging). In addition to base-32,
a base-4 scheme is introduced using the remaining characters {w,x,y,z}.
Each of these carries 2 bits of character information and doubles as an
indicator for code point brackets. All the while, the algorithm is kept as
simple as DUDE.
Hopefully you will find it interesting and appropriate to be considered as
an ACE within the IETF. After all, it is intended to be an integrated
version of the three primary ACEs identified by the ACE design team report:
DUDE, LACE and the AMC series.
Looking forward to all your input.
Edmon
PS. I have also created Excel worksheets to illustrate the encoding and
decoding procedures. You can find them at
http://www.dnsii.org/ace37/ace37-encode.xls and
http://www.dnsii.org/ace37/ace37-decode.xls respectively.
----- Original Message -----
From: "Marc Blanchet" <Marc.Blanchet@viagenie.qc.ca>
To: "Natalia Syracuse" <nsyracus@ietf.org>; <edmon@neteka.com>;
<david@neteka.com>
Cc: <jseng@pobox.org.sg>
Sent: Thursday, July 05, 2001 8:50 AM
Subject: Re: permission <draft-ietf-idn-ace37-00.txt (attach)
> I'm sorry but the new wg policy is to not accept draft unless there is a
> demonstrated support. But drafts are _highly_ encouraged to be published as
> individual submissions. I would recommend to put idn in the filename and
> use this filenaming convention: draft-<yourname>-idn-ace37-00.txt. After
> publication in the internet-draft, the author should announce it in the wg
> mailing list and I'll put a reference to it in the wg web page.
>
> So please publish it as individual submission.
>
> Marc.
>
> At/À 08:34 2001-07-05 -0400, Natalia Syracuse you wrote/vous écriviez:
> >
> >
> >
> >Internet Draft Edmon Chung, Neteka Inc.
> ><draft-ietf-idn-ace37-00.txt> David Leung, Neteka Inc.
> > June 2001
> >
> >
> >
> > ACE Utilizing All 37 Alphanumeric Characters (ACE37)
> >
> >
> >STATUS OF THIS MEMO
> >
> > This document is an Internet-Draft and is in full conformance with
> > all provisions of Section 10 of RFC2026.
> >
> > Internet-Drafts are working documents of the Internet Engineering
> > Task Force (IETF), its areas, and its working groups. Note that
> > other groups may also distribute working documents as Internet-
> > Drafts. Internet-Drafts are draft documents valid for a maximum of
> > six months and may be updated, replaced, or obsoleted by other
> > documents at any time. It is inappropriate to use Internet-Drafts
> > as reference material or to cite them other than as "work in
> > progress."
> >
> > The reader is cautioned not to depend on the values that appear in
> > examples to be current or complete, since their purpose is primarily
> > educational. Distribution of this memo is unlimited.
> >
> > The list of current Internet-Drafts can be accessed at
> > http://www.ietf.org/ietf/1id-abstracts.txt
> > The list of Internet-Draft Shadow Directories can be accessed at
> > http://www.ietf.org/shadow.html.
> >
> >Abstract
> >
> > ACE37 is a combination of DUDE-02, AMC-W/V and LACE. ACE37 utilizes
> > the simple one pass algorithm of DUDE, the character block
> > considerations of AMC-W/V and the Base-32 compression of LACE. It
> > also fully utilizes the entire LDH set currently allowed in the DNS
> > (A-Z, a-z, 0-9 and "-") within its character repertoire to optimize
> > performance and compression. Even for the worst-case scenario in
> > ACE37, any name can have 21 characters including Chinese, Japanese
> > and Korean names. Two Excel spreadsheets for ACE37 encoding and
> > decoding can be found at http://www.dnsii.org/ace37/ace37-encode.xls
> > and http://www.dnsii.org/ace37/ace37-decode.xls respectively.
> >
> > While DUDE-02 provides a very efficient differential mechanism, its
> > compression is inefficient as it fails to take advantage of the
> > base-32 scheme in using all 5-bits for character information. The
> > AMC series is highly efficient in compression but requires
> > complicated mode changes and is therefore inefficient to process.
> > LACE sits in between: it requires a two-pass mechanism but utilizes
> > base-32 for good compression.
> >
> >
> >Chung & Leung [Page 1]
> >ACE37 ACE Utilizing All 37 Alphanumeric Characters July 2001
> >
> > ACE37 uses simple character block shifting to achieve the
> > compression efficiency of the AMC series, retains the one-pass and
> > one mode XOR differential mechanism used by DUDE while embracing the
> > base-32 compression used by LACE for efficient character bit
> > information.
> >
> >Terminology
> >
> > The key words "MUST", "SHALL", "REQUIRED", "SHOULD", "RECOMMENDED",
> > and "MAY" in this document are to be interpreted as described in RFC
> > 2119 [RFC2119].
> >
> > LDH: Letters, Digits and Hyphens: a string of characters that
> > consists only of hyphens ("-"), English letters (A-Z, a-z) and
> > digits (0-9), which might not be a result of an algorithm for
> > transcoding multilingual characters. For example:
> > whatever-you-want.example
> >
> > ACE - ASCII Compatible Encoding: a string of characters resulting
> > from a particular algorithm for transforming multilingual character
> > information into an alphanumeric form acceptable by the existing
> > DNS. For example: bq--3bhc2zmh.tld. In essence, ACE is a subset of
> > LDH.
> >
> > Hexadecimal values are shown preceded by "0x". For example, 0x60
> > is decimal 96. Binary values are shown preceded by "0b". For
> > example, "0b1000" is decimal 8. As in the Unicode Standard
> > [UNICODE], Unicode code points are denoted by "U+" followed by four
> > to six hexadecimal digits, while a range of code points (or
> > hexadecimal numbers) is denoted by two hexadecimal numbers separated
> > by "..", with no prefixes.
> >
> > Octets: sequences of 8 bits; Quintets: sequences of 5 bits;
> > Quartets: sequences of 4 bits; Duplets: sequences of 2 bits.
> >
> > XOR: bitwise exclusive or. Given 2 nonnegative integers A and B, A
> > XOR B is the nonnegative integer whose binary representation has a
> > 1 bit wherever A and B disagree, and a 0 bit wherever they agree.
> >
> >Table Of Contents
> >
> > 1. Introduction....................................................3
> > 2. Code Block Shifting.............................................4
> > 3. Base-32 Characters..............................................5
> > 4. Base-4 Characters...............................................6
> >
> > 5. LDH Considerations..............................................9
> > 6. Encoding Procedure..............................................9
> > 7. Decoding Procedure.............................................11
> > 8. Examples.......................................................13
> > 9. Summary & Comparisons..........................................15
> > 10. Security Considerations.......................................16
> > 11. References....................................................16
> >
> >1. Introduction
> >
> > ACE37 takes into account the recommendations and findings of the ACE
> > design team to create a "super-ACE" that incorporates the key
> > advantages of the various considered ACEs without complicated mode
> > changes. The encoding (Section 6) and decoding (Section 7) process
> > is largely similar to and as simple as DUDE-02. The encoding
> > processes for ACE37 in comparison with DUDE-02 could be summarized:
> >
> > ACE37 Encoding Procedure | DUDE Encoding Procedure
> > ---------------------------------+---------------------------------
> > (1) let initial prev = 0x00 | (1) let initial prev = 0x60
> > (2) if n = LDH output "-n" | (2) if n = hyphen output "-"
> > (3) code block shift to obtain | (3) diff = prev XOR n
> > ACE37 shifted n (Section 2)| (4) prepend "0" to the last
> > (4) diff = prev XOR n | quartet and "1" to others
> > (5) output in appropriate base-4 | (5) output a base-32 character
> > and base-32 form | for each corresponding
> > (Sections 3&4) | quintet
> > (6) let prev = n | (6) let prev = n
> >
> > Similarly, the decoding process can be described and compared:
> >
> > ACE37 Decoding Procedure | DUDE Decoding Procedure
> > ---------------------------------+---------------------------------
> > (1) let initial prev = 0x00 | (1) let initial prev = 0x60
> > (2) if char = hyphen discard "-" | (2) if char = hyphen consume
> > and output next char | and output 0x002D
> > (3) consume and convert char into| (3) consume and convert to
> > duplets and quintets | quintets until encountering
> > (according to Sections 3&4)| a quintet with "0" as
> > (4) concatenate to form diff | first bit
> > (based on Sections 4.1&4.2)| (4) strip all first bits off
> > (5) let prev = prev XOR diff | (5) concatenate to form diff
> > (6) reverse code block shifting | (6) let prev = prev XOR diff
> > (7) output Unicode code point | (7) output Unicode code point
> >
> > The features of ACE37 include:
> >
> > Unique & Reversible - the ACE37 encoding scheme yields a unique and
> > consistent result string for a given set of Unicode code points.
> > The encoded string could be decoded back to the original Unicode
> > code points without loss of character data.
> >
> > Simple - ACE37 utilizes a one-pass system and the XOR differential
> > function to encode and decode. Code block shifting is done by a
> > simple calculation instead of mapping or creation of arbitrary
> > reference points. Complex mode changes are not required.
> >
> > Spacious - With the code block shifting coupled with a base-32
> > scheme, ACE37 can accommodate up to 21 unique Han characters
> > (including CJK) within the 63 octets allowed by the DNS. Other
> > Latin based scripts can reach up to 31 characters.
> >
> > Completeness - any sequence of Unicode code points
> > (U+0000..U+10FFFF) can be encoded. Restrictions on allowed code
> > points are not discussed, but it is expected that Nameprep
> > [Nameprep] will be used prior to ACE37 encoding.
> >
> > In essence, ACE37 captures the key criteria discussed by the
> > workgroup ACE design team: reversibility, simplicity and
> > compression capability. Moreover, ACE37 utilizes a very simple code
> > block shifting (Section 2) mechanism to allow up to any 21 CJK
> > ideographs to be encoded within the 63-octet constraint.
> >
> >2. Code Block Shifting
> >
> > While the DNS was not originally designed for multilingual
> > characters, Unicode was not designed with the DNS in mind and
> > therefore code points were apparently not allocated in an ACE-
> > friendly way.
> >
> > The AMC series [AMC-W & AMC-V] utilizes a number of reference points
> > to achieve better compression efficiency by anticipating and
> > minimizing delta between characters. For ACE37, a much simpler
> > rendering is used. More specifically, the entire character block
> > U+3000..U+9FFF for CJK ideographs is shifted down by 0x3000. That
> > is U+3000 will become 0x0000, U+4000 becomes 0x1000, and so on. To
> > compensate for the downwards shift, the general script and symbol
> > characters in U+0000..U+2FFF will be shifted upwards by 0x7000.
> > Therefore, U+0100 will become 0x7100, U+2000 becomes 0x9000, and so
> > on. All other code points (U+A000..U+10FFFF) are unchanged.
> >
> > Original Unicode Allocation          | ACE37 Code Block Shifted
> > -------------------------------------|------------------------------
> > U+0000..U+2FFF    General Scripts,   | 0x7000..0x9FFF (up 0x7000)
> >                   Symbols            |
> > U+3000..U+9FFF    CJK Misc,          | 0x0000..0x6FFF (down 0x3000)
> >                   CJK Ideographs     |
> > U+A000..U+10FFFF  Hangul, etc.       | 0xA000..0x10FFFF (unchanged)
> >
> > This shifting effectively moves the entire Han library to within
> > 0x0000..0x6FFF, so each such code point can be represented in 15
> > bits, or exactly 3 base-32 characters (details on base-32
> > characters in Section 3).
> >
> > For example, the Chinese character for <change> with the original
> > Unicode code point at U+8F49, will be shifted to 0x5F49 and can be
> > represented in 3 quintets, and in turn with 3 base-32 characters:
> >
> > Character: <change>
> > Unicode Code Point: U+8F49
> > ACE37 Shifted: 0x5F49
> > Corresponding Quartets: 0101 1111 0100 1001
> > Resulting Quintets: 10111 11010 01001
> > Base-32: nq9 (further discussed in Section 3)
> >
> > This in turn means that any Chinese character could be represented
> > with 3 base-32 characters making the total possible characters
> > within a label, even without further compression introduced by the
> > XOR differential process (Section 6), to be at least 21. The ACE37
> > code block shifting process could be described as follows:
> >
> > for each input code point = n
> >   if n <= 0x9FFF
> >     n = n - 0x3000 /*downwards shifting*/
> >     if n < 0
> >       n = 0xA000 + n /*compensation for U+0000..U+2FFF*/
> >
> > The character block shifting introduced here is extremely simple and
> > utilizes simple calculation that requires no mapping function. At
> > the same time, it achieves the goal in adjusting the Unicode
> > allocation so that it becomes more ACE friendly.
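
As an illustration of the shifting rule above, here is a minimal Python
sketch. It is not part of the draft: the names shift and unshift are made
up for this example, and the upward compensation is taken to be 0x7000, as
stated in the prose (U+0100 becomes 0x7100).

```python
def shift(n: int) -> int:
    """Apply the ACE37 code block shift to a Unicode code point."""
    if n > 0x9FFF:
        return n              # U+A000..U+10FFFF are left unchanged
    if n >= 0x3000:
        return n - 0x3000     # CJK block moves down to 0x0000..0x6FFF
    return n + 0x7000         # U+0000..U+2FFF moves up to 0x7000..0x9FFF


def unshift(m: int) -> int:
    """Reverse the shift (the last step of decoding, Section 7)."""
    if m > 0x9FFF:
        return m
    if m <= 0x6FFF:
        return m + 0x3000
    return m - 0x7000


print(hex(shift(0x8F49)))  # 0x5f49
print(hex(shift(0x0100)))  # 0x7100
```

Because the two shifted ranges simply swap places, the mapping is a
bijection on 0x0000..0x9FFF and needs no lookup table.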
> >
> >3. Base-32 Characters
> >
> > Base-32 characters are used in LACE for compression, while DUDE-02
> > and the AMC series utilize them only for quartet flagging, to
> > indicate the last quartet of each encoded code point. ACE37
> > utilizes base-32 characters for compression, while base-4
> > characters, which are introduced in Section 4, determine the
> > compressed code point brackets.
> >
> > The following table shows the 32 base-32 characters and their
> > corresponding quintets:
> >
> > Base-32 Character =to= Corresponding Quintet
> > 0 = 00000 8 = 01000 g = 10000 o = 11000
> > 1 = 00001 9 = 01001 h = 10001 p = 11001
> > 2 = 00010 a = 01010 i = 10010 q = 11010
> > 3 = 00011 b = 01011 j = 10011 r = 11011
> > 4 = 00100 c = 01100 k = 10100 s = 11100
> > 5 = 00101 d = 01101 l = 10101 t = 11101
> > 6 = 00110 e = 01110 m = 10110 u = 11110
> > 7 = 00111 f = 01111 n = 10111 v = 11111
> >
> > With this layout of base-32 characters, it is also possible to
> > implement a computation based base-32 conversion instead of having
> > to resort to mapping and lookup tables:
> >
> > For each quintet = q
> > if q <= 0x0F
> > then hex dump q to form base-32 character
> > if 0x10 <= q <= 0x1F
> > then q = q - 0x10
> > and char(q + 0x67) to form base-32 character
> >
> > Note that 0x67 is the code value for the letter "g". Therefore, for
> > example if the quintet is 0b10001 its base-32 character can be
> > obtained by:
> >
> > 0x10 <= q=0b10001=0x11 <= 0x1F
> > therefore q = q - 0x10 = 0x11 - 0x10 = 0x01
> > and base-32 character = char(0x01 + 0x67)
> > char(0x68) = "h"
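
The computation above can be sketched in Python as follows. This is only
an illustration; the helper names quintet_to_char and to_base32 are
assumptions of this example and do not appear in the draft.

```python
def quintet_to_char(q: int) -> str:
    """Map one quintet (0..31) to its base-32 character without a table."""
    if not 0 <= q <= 0x1F:
        raise ValueError("a quintet must be in 0..31")
    if q <= 0x0F:
        return format(q, "x")        # "hex dump": 0-9, a-f
    return chr(q - 0x10 + 0x67)      # 0x67 is "g", so 0x10..0x1F -> g..v


def to_base32(value: int, quintets: int) -> str:
    """Write value as a fixed number of base-32 characters, big-endian."""
    return "".join(
        quintet_to_char((value >> (5 * i)) & 0x1F)
        for i in reversed(range(quintets))
    )


print(quintet_to_char(0b10001))  # h
print(to_base32(0x5F49, 3))      # nq9
```

The second call reproduces the U+8F49 example from Section 2, where the
shifted value 0x5F49 is written as three quintets.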
> >
> >4. Base-4 Characters
> >
> > ACE37 goes beyond the 32 characters (base-32) to include the
> > remaining 4 characters {w,x,y,z} in the alphabet. These base-4
> > characters enable ACE37 to better utilize the existing "resources"
> > (the allowed characters) to represent IDN character information,
> > therefore making its encoding more efficient.
> >
> > The set of base-4 characters is {w,x,y,z}; they are used to
> > represent the following duplets (groups of 2 bits):
> >
> > Base-4 Character =to= Corresponding Duplet
> > w = 00
> > x = 01
> > y = 10
> > z = 11
> >
> >4.1 Base-4 Indicators
> >
> > Base-4 characters, while carrying character information, also
> > double as indicators for code point brackets. In DUDE-02, an extra
> > bit is prepended to each quartet: the last quartet of each encoded
> > code point is prepended with "0", marking the end of the code
> > point. In ACE37, base-4 characters determine the length (number of
> > ACE37 characters) of the encoded code point. To be more precise,
> > the encoded bits are in fact the "diff" and not the code point
> > itself (diff carries the same meaning as in DUDE-02 and is further
> > discussed in Sections 6 & 7).
> >
> > The following table explains how base-4 characters are combined with
> > base-32 characters to form a representation of a diff (key: b4=base-
> > 4, b32=base-32):
> >
> > diff value |bits| ACE37 Form
> > -------------------------|----|----------------------------
> > diff<=0x7F | 7 | <b4><b32>
> > 0x80<=diff<=0x7FFF | 15 | <b32><b32><b32>
> > 0x8000<=diff<=0x1FFFF | 17 | w<b4><b32><b32><b32>
> > 0x20000<=diff<=0xFFFFF | 20 | ww<b32><b32><b32><b32>
> > 0x100000<=diff<=0x10FFFF | 22 | <b4>w<b32><b32><b32><b32>
> >
> > Note that the "bits" column represents the maximum number of
> > significant bits for the given diff value. For example when
> > diff<=0x7F, the maximum value is 0b1111111, therefore the number of
> > significant bits is 7.
> >
> > Note also that to encode a 17-bit diff, the letter "w" is used as an
> > indicator to distinguish the sequence from the 7 bit diff where a
> > base-32 character is expected to follow a base-4 character. Since
> > "w" represents "00" that has no value, it will not be used in the
> > base-4 representation for a 17-bit diff (if a "00" is used, it means
> > that there are only 15 significant bits and therefore should use the
> > 15 bit diff form). This is the case for the 20-bit form as well.
> > The "w" is used as an arbitrary indicator in the 22-bit form and
> > MUST be discarded during decoding.
> >
> > By analyzing the ACE37 form, an encoded string could be successfully
> > returned to its original form. There is no overlap and the form can
> > be determined precisely. The following 5 rules dictate the 5
> > different ACE37 forms:
> >
> > (1) Encode: if diff<=0x7F
> > Decode: if first character is <b4> AND next character NOT <b4>
> > Then it MUST be in 7-bit form: <b4><b32>
> >
> > (2) Encode: if 0x80<=diff<=0x7FFF
> > Decode: if first character is <b32>
> > Then it MUST be a 15-bit form: <b32><b32><b32>
> >
> > (3) Encode: if 0x8000<=diff<=0x1FFFF
> > Decode: if first character is "w" AND next character is <b4>
> > AND NOT "w"
> > Then it MUST be in 17-bit form: w<b4><b32><b32><b32>
> >
> > (4) Encode: if 0x20000<=diff<=0xFFFFF
> > Decode: if first character is "w" AND next character is "w"
> > Then it MUST be in 20-bit form: ww<b32><b32><b32><b32>
> >
> > (5) Encode: if 0x100000<=diff<=0x10FFFF
> > Decode: if first character is <b4> AND NOT "w"
> > AND next character is "w"
> > Then it MUST be 22-bit form: <b4>w<b32><b32><b32><b32>
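
The encode side of the five rules can be sketched as one Python function.
This is a hedged illustration rather than the authors' code: encode_diff
and b32 are names invented here, the function covers only code points
after the first one (Section 4.2 changes the forms for the first), and it
accepts any diff of up to 22 bits rather than stopping at 0x10FFFF.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"
B4 = "wxyz"


def b32(value: int, quintets: int) -> str:
    """Low 5*quintets bits of value as base-32 characters, big-endian."""
    return "".join(B32[(value >> (5 * i)) & 0x1F]
                   for i in reversed(range(quintets)))


def encode_diff(diff: int) -> str:
    """Pick the ACE37 form for a diff that is not the first code point."""
    if not 0 <= diff <= 0x3FFFFF:
        raise ValueError("diff does not fit in 22 bits")
    if diff <= 0x7F:                         # rule (1): <b4><b32>
        return B4[diff >> 5] + b32(diff, 1)
    if diff <= 0x7FFF:                       # rule (2): <b32><b32><b32>
        return b32(diff, 3)
    if diff <= 0x1FFFF:                      # rule (3): w<b4><b32><b32><b32>
        return "w" + B4[diff >> 15] + b32(diff, 3)
    if diff <= 0xFFFFF:                      # rule (4): ww + four <b32>
        return "ww" + b32(diff, 4)
    return B4[diff >> 20] + "w" + b32(diff, 4)   # rule (5): <b4>w + four <b32>


print(encode_diff(0x10))    # wg
print(encode_diff(0x6DFC))  # rfs
```

In rules (3) and (5) the leading duplet can never be "w", because the
range checks guarantee that at least one of its two bits is set. That is
what keeps the forms distinguishable when decoding.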
> >
> > Note that the ACE37 scheme can effectively encode a diff of up to 22
> > significant bits or 0x3FFFFF. The Unicode code points are expected
> > to range only between 0x0000..0x10FFFF, therefore ACE37 will be able
> > to handle any Unicode code point.
> >
> > Additionally, base-4 characters (and sometimes base-32 characters)
> > could be used for mixed-case annotation. This optional mixed-case
> > annotation mechanism is discussed in Appendix B.
> >
> >4.2 First Code Point Considerations
> >
> > There are additional considerations for the first code point that is
> > encoded or decoded to ensure that if the first code point is within
> > the first Unicode plane (U+0000..U+FFFF), it will not occupy more
> > than 4 ACE37 characters.
> >
> > This special consideration affects only Rules (1), (3) and (4)
> > explained in Section 4.1. Rule (1) is discarded for the first code
> > point, therefore any diff up to 0x7FFF will be in the form
> > <b32><b32><b32>. The form for Rule (3) becomes simply
> > <b4><b32><b32><b32> without the "w" indicator. Similarly, the form
> > for Rule (4) becomes w<b32><b32><b32><b32> with one less "w".
> >
> > The first code point considerations can be summarized in the
> > following 4 rules:
> >
> > (a) Encode: if diff<=0x7FFF
> > Decode: if first character is <b32>
> > Then it MUST be in 15-bit form: <b32><b32><b32>
> >
> > (b) Encode: if 0x8000<=diff<=0x1FFFF
> > Decode: if first character is <b4> AND NOT "w"
> > Then it MUST be in 17-bit form: <b4><b32><b32><b32>
> >
> > (c) Encode: if 0x20000<=diff<=0xFFFFF
> > Decode: if first character is "w"
> > Then it MUST be in 20-bit form: w<b32><b32><b32><b32>
> >
> > (d) Encode & Decode: same as Rule (5) in Section 4.1
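
Seen from the decoder, rules (a) to (d) and rules (1) to (5) amount to
looking at the first one or two characters of what remains. The Python
sketch below shows one way to do that; classify is a hypothetical helper,
it assumes lowercase input, and it ignores the hyphen-escaped LDH
characters of Section 5.

```python
B32 = set("0123456789abcdefghijklmnopqrstuv")
XYZ = set("xyz")


def classify(s: str, first: bool):
    """Return (diff bits, characters consumed) for the form that starts s."""
    c0 = s[0]
    c1 = s[1] if len(s) > 1 else ""
    if c0 in B32:
        return (15, 3)              # rule (a) / rule (2)
    if c0 in XYZ and c1 == "w":
        return (22, 6)              # rule (d) / rule (5)
    if first:
        if c0 == "w":
            return (20, 5)          # rule (c): one "w" instead of two
        if c0 in XYZ:
            return (17, 4)          # rule (b): no leading "w"
    else:
        if c0 == "w" and c1 == "w":
            return (20, 6)          # rule (4)
        if c0 == "w" and c1 in XYZ:
            return (17, 5)          # rule (3)
        if (c0 == "w" or c0 in XYZ) and c1 in B32:
            return (7, 2)           # rule (1)
    raise ValueError("not a valid ACE37 form")


print(classify("xg9orfs", first=True))   # (17, 4)
print(classify("wg", first=False))       # (7, 2)
```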
> >
> > Besides special considerations for base-4 character usage, prev
> > setting is also specially considered for the first code point. As
> > laid out in Section 6, the first code point is detected by
> > evaluating prev. If prev = 0x00, the code point is assumed to be
> > the first, as 0x00 SHOULD NOT be a permitted character for input.
> > Special consideration is needed when an LDH character is the first
> > code point. Normally, if n = LDH is encountered (Section 5), it is
> > output as "-n" and prev is not changed. However, if the first code
> > point is an LDH character, prev is updated to lowercase(n) (after
> > code block shifting, as in Section 7) once "-n" has been output.
> > This ensures that only the first incoming code point sees
> > prev = 0x00.
> >
> >5. LDH Considerations
> >
> > Finally, the 37th character of the entire LDH repertoire, the hyphen
> > will be used to indicate LDH exceptions. Extending the hyphen
> > consideration of DUDE-02, ACE37 gives special consideration for the
> > entire LDH repertoire. All LDH characters will be encoded "as is"
> > with the addition of a leading hyphen. For example, the character
> > "a" will be encoded within ACE37 as "-a". The hyphen character "-"
> > will be encoded as "--".
> >
> > This ensures that each LDH character will only take up 2 character
> > spaces within an ACE37 encoded string and also will allow
> > administrators to see the actual characters, similar to the AMC
> > series. Unlike the AMC series however, the hyphen is not used to
> > indicate an ongoing mode change, but only the following character.
> > This retains the simplicity of the DUDE-02 single-mode,
> > single-pass philosophy.
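
The LDH escape is simple enough to show in a few lines of Python. The name
escape_ldh is an assumption of this sketch, which only covers the output
side and leaves the prev handling of Section 4.2 aside.

```python
import string

# Letters, digits and the hyphen: the LDH repertoire described above.
LDH = set(string.ascii_letters + string.digits + "-")


def escape_ldh(ch: str) -> str:
    """Encode one LDH character "as is" behind a leading hyphen."""
    if ch not in LDH:
        raise ValueError("not an LDH character")
    return "-" + ch


print("".join(escape_ldh(c) for c in "a-1"))  # -a---1
```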
> >
> >6. Encoding Procedure
> >
> > Similar to DUDE, all ordering of bits and quartets is big-endian.
> > The following describes the encoding procedure:
> >
> > Set initial value for prev = 0x00
> > for each input code point = n
> >   if n is an LDH {A-Z, a-z, 0-9, -}
> >     output "-n"                      (Section 5: LDH Considerations)
> >     if prev = 0x00                   (Section 4.2: First Code Point)
> >       let prev = lowercase(n) after code block shifting
> >   else perform code block shifting   (Section 2: Code Block Shifting)
> >     let diff = prev XOR n            (n after code block shifting)
> >     select the output form           (Section 4: Base-4 Characters)
> >     if diff<=0x7F
> >       if this is the first code point (Section 4.2)
> >         output 15-bit form: <b32><b32><b32>
> >       else, output 7-bit form: <b4><b32>
> >     if 0x80<=diff<=0x7FFF
> >       output 15-bit form: <b32><b32><b32>
> >     if 0x8000<=diff<=0x1FFFF
> >       if this is the first code point (Section 4.2)
> >         output 17-bit form: <b4><b32><b32><b32>
> >       else, output 17-bit form: w<b4><b32><b32><b32>
> >     if 0x20000<=diff<=0xFFFFF
> >       if this is the first code point (Section 4.2)
> >         output 20-bit form: w<b32><b32><b32><b32>
> >       else, output 20-bit form: ww<b32><b32><b32><b32>
> >     if 0x100000<=diff<=0x10FFFF
> >       output 22-bit form: <b4>w<b32><b32><b32><b32>
> >     let prev = n
> > end and obtain next n and return to: "for each input code point = n"
> >
> > The following is a more comprehensive pseudo code:
> >
> > let prev = 0x00
> > for each input integer n (in order) do begin
> >   if n = "-" or "0..9" or "A..Z" or "a..z"
> >     then output "hyphen"+"char(n)"
> >     if prev = 0x00
> >       let prev = lowercase(n) after code block shifting
> >
> >   else begin
> >     if n = 0x00
> >       then error and abort
> >     if n <= 0x9FFF
> >       n = n - 0x3000
> >       if n < 0
> >         then n = 0xA000 + n
> >
> >     let diff = prev XOR n
> >
> >     if diff <= 0x7F
> >       if prev = 0x00
> >         then output all 15 bits with 3 base-32 characters
> >       else, output first 2 bits with a base-4 character {wxyz}
> >         and remaining 5 bits with 1 base-32 character
> >
> >     if 0x80 <= diff <= 0x7FFF
> >       then output all 15 bits with 3 base-32 characters
> >
> >     if 0x8000 <= diff <= 0x1FFFF
> >       if prev = 0x00
> >         then output first 2 bits with a base-4 {xyz} (except w)
> >         and output remaining 15 bits with base-32
> >       else, output "w"
> >         and output first 2 bits with a base-4 {xyz} (except w)
> >         and output remaining 15 bits with base-32
> >
> >     if 0x20000 <= diff <= 0xFFFFF
> >       if prev = 0x00
> >         then output "w"
> >       else, output "ww"
> >       and output all 20 bits with base-32 characters
> >
> >     if 0x100000 <= diff <= 0x10FFFF
> >       then output first 2 bits with a base-4 {xyz} (except w)
> >       and output "w"
> >       and output remaining 20 bits with base-32
> >
> >     let prev = n
> >   end
> > end
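
For readers who prefer running code, the procedure can be sketched in
Python as below. This is an illustrative reading of the draft and not a
reference implementation. The function names are invented here, the
compensation for U+0000..U+2FFF is taken as +0x7000 from Section 2, and
prev for a leading LDH character is taken as the shifted lowercase value,
as in the decoding pseudo code of Section 7.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"
B4 = "wxyz"
LDH = set("abcdefghijklmnopqrstuvwxyz"
          "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-")


def shift(n):
    """Code block shifting (Section 2)."""
    if n > 0x9FFF:
        return n
    return n - 0x3000 if n >= 0x3000 else n + 0x7000


def b32(value, quintets):
    """Low 5*quintets bits of value as base-32 characters, big-endian."""
    return "".join(B32[(value >> (5 * i)) & 0x1F]
                   for i in reversed(range(quintets)))


def encode(code_points):
    """Encode a sequence of Unicode code points (no ACE prefix added)."""
    out = []
    prev = 0x00
    for n in code_points:
        if not 0 < n <= 0x10FFFF:
            raise ValueError("code point out of range")
        ch = chr(n)
        if ch in LDH:                        # Section 5: output "-n"
            out.append("-" + ch)
            if prev == 0x00:                 # Section 4.2: leading LDH
                prev = shift(ord(ch.lower()))
            continue
        first = prev == 0x00                 # Section 4.2
        n = shift(n)
        diff = prev ^ n
        if diff <= 0x7F and not first:       # 7-bit form
            out.append(B4[diff >> 5] + b32(diff, 1))
        elif diff <= 0x7FFF:                 # 15-bit form
            out.append(b32(diff, 3))
        elif diff <= 0x1FFFF:                # 17-bit form
            out.append(("" if first else "w") + B4[diff >> 15] + b32(diff, 3))
        elif diff <= 0xFFFFF:                # 20-bit form
            out.append(("w" if first else "ww") + b32(diff, 4))
        else:                                # 22-bit form
            out.append(B4[diff >> 20] + "w" + b32(diff, 4))
        prev = n
    return "".join(out)


print(encode([0x261AF, 0x261BF]))  # w4odfwg
```

The printed value matches example (G) in Section 8. The sketch inherits
one quirk of the draft: a first code point of U+3000 shifts to 0x0000, so
the following code point is also treated as "first".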
> >
> > Nameprep [NAMEPREP] is not discussed in this document, but it is
> > expected to be applied for IDN. Regardless of the code point
> > presented, an encoder MUST NOT produce an incorrect output. The
> > encoder must fail if it encounters a negative input value.
> >
> > The initial value used is 0x00 so that all domains beginning with a
> > CJK ideograph or within row 0 (U+0000..U+0FFF) will be shorter.
> > Note that after the code block shifting (Section 2), the entire Han
> > library is within 0x0000..0x6FFF, while row 0 is fitted to
> > 0x7000..0x7FFF. Therefore by using an initial value of 0x00 the
> > diff for all Han and row 0 characters will be at most 0x7FFF. The
> > initial value is also used as a check point for the first code point
> > considerations (Section 4.2).
> >
> > Additionally, an optional mixed-case annotation mechanism is
> > discussed in Appendix B.
> >
> >7. Decoding Procedure
> >
> > A thorough description of the decoding rules, except for the final
> > reversal of the code block shifting has been presented in Sections
> > 4.1 and 4.2. The following description is a brief representation of
> > the decoding procedure:
> >
> > let prev = 0x00
> > while the input string is not exhausted
> >   if present character = hyphen    (Section 5: LDH Considerations)
> >     discard and output next character
> >   else, depending on the presented form       (Section 4)
> >     convert into duplets and quintets         (Sections 3 & 4)
> >     and concatenate to form diff
> >     let prev = prev XOR diff
> >     reverse code block shifting:              (Section 2)
> >     if prev<=0x9FFF
> >       and if prev<=0x6FFF
> >         character = prev + 0x3000
> >       else, character = prev - 0x7000
> >     else character = prev
> >     output character
> > End
> >
> > The following is a more comprehensive pseudo code for the decoding
> > procedure:
> >
> > let prev = 0x00
> > while the input string is not exhausted do begin
> > if present character = hyphen /*Section 5:LDH Considerations*/
> > then consume and discard hyphen
> > and obtain the next character
> > and output character
> > if prev = 0x00 /*Section 4.2:First Code Point*/
> > let prev = code block shifted lowercase output character
> >
> > else,
> > if present character = Base-32 characters (0..v)
> > consume present character and next 2 characters
> > and convert them to quintets according to Base-32
> > concatenate the resulting quintets to form diff
> > /*15 bit form, 0x80<=diff<=0x7FFF*/
> >
> >   if present character = Base-4 characters {xyz} and NOT w
> >     consume present character
> >     and convert it to a duplet according to Base-4
> >     and obtain next character
> >
> >     if next character = w
> >       then consume and discard w and obtain next 4 characters
> >       consume and convert characters to
> >       quintets according to Base-32
> >       concatenate duplet with the 4 quintets to form diff
> >       /*22 bit form, 0x100000<=diff<=0x10FFFF*/
> >
> >     else, if prev = 0x00
> >       obtain and consume next 3 characters
> >       and convert them to quintets according to Base-32
> >       concatenate duplet with the 3 quintets to form diff
> >       /*first code point: 17 bit form, 0x8000<=diff<=0x1FFFF*/
> >
> >     else, if next character = Base-32 character (0..v)
> >       then consume and convert to quintet according to Base-32
> >       concatenate duplet with the quintet to form diff
> >       /*7 bit form, diff<=0x7F*/
> >
> >     else, if next character = Base-4 characters {xyz} and NOT w
> >       then fail and indicate error
> >
> >   if present character = w
> >     discard "w" and obtain next character
> >
> >     if prev = 0x00
> >       obtain and consume next 4 characters
> >       and convert characters to quintets based on Base-32
> >       concatenate the 4 quintets to form diff
> >       /*first code point: 20 bit form,*/
> >       /*0x20000<=diff<=0xFFFFF        */
> >
> >     else, if next character = Base-4 characters {xyz} and NOT w
> >       consume and convert to duplet according to Base-4
> >       and obtain and consume next 3 characters
> >       and convert to quintets according to Base-32
> >       concatenate duplet with the 3 quintets to form diff
> >       /*17 bit form, 0x8000<=diff<=0x1FFFF*/
> >
> >     else, if next character = w
> >       then consume and discard w
> >       and obtain and consume next 4 characters
> >       and convert to quintets according to Base-32
> >       concatenate the 4 quintets to form diff
> >       /*20 bit form, 0x20000<=diff<=0xFFFFF*/
> >
> >     else, if next character = Base-32 character (0..v)
> >       then convert to quintet according to Base-32
> >       set quintet to diff
> >       /*7 bit form, diff<=0x7F*/
> >
> > fail upon encountering a non-ACE37 character
> > or end-of-input
> >
> > let prev = prev XOR diff
> >
> > if prev <= 0x9FFF /*reversal of the code */
> > and if prev <= 0x6FFF /*block shifting described*/
> > output = prev + 0x3000 /*in Section 2 */
> > else, output = prev - 0x7000
> > else, output prev
> > end
> > end
> > encode the output sequence and compare it to the input string
> > fail if they do not match (case insensitively)
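
A matching decoder can be sketched in Python along the same lines. Again
this is an illustration under assumptions, not the authors' code: the
helper names are invented, input is assumed to be lowercase with the ACE
prefix already removed, and the final re-encode-and-compare check
described above is left out for brevity.

```python
B32 = "0123456789abcdefghijklmnopqrstuv"
B32_SET = set(B32)
B4 = "wxyz"
XYZ = set("xyz")


def shift(n):
    """Code block shifting (Section 2), needed for a leading LDH."""
    if n > 0x9FFF:
        return n
    return n - 0x3000 if n >= 0x3000 else n + 0x7000


def unshift(m):
    """Reverse the code block shifting."""
    if m > 0x9FFF:
        return m
    return m + 0x3000 if m <= 0x6FFF else m - 0x7000


def take_b32(s, pos, count):
    """Read count base-32 characters at pos and return their value."""
    chunk = s[pos:pos + count]
    if len(chunk) != count or not set(chunk) <= B32_SET:
        raise ValueError("truncated or malformed ACE37 string")
    value = 0
    for c in chunk:
        value = (value << 5) | B32.index(c)
    return value


def decode(s):
    """Decode an ACE37 string into a list of Unicode code points."""
    out, prev, pos = [], 0x00, 0
    while pos < len(s):
        c, nxt = s[pos], s[pos + 1:pos + 2]
        if c == "-":                              # Section 5: LDH escape
            if not nxt:
                raise ValueError("dangling hyphen")
            out.append(ord(nxt))
            if prev == 0x00:                      # Section 4.2
                prev = shift(ord(nxt.lower()))
            pos += 2
            continue
        first = prev == 0x00
        if c in B32_SET:                          # 15-bit form
            diff, used = take_b32(s, pos, 3), 3
        elif c in XYZ and nxt == "w":             # 22-bit form
            diff, used = (B4.index(c) << 20) | take_b32(s, pos + 2, 4), 6
        elif first and c == "w":                  # first: 20-bit form
            diff, used = take_b32(s, pos + 1, 4), 5
        elif first and c in XYZ:                  # first: 17-bit form
            diff, used = (B4.index(c) << 15) | take_b32(s, pos + 1, 3), 4
        elif c == "w" and nxt == "w":             # 20-bit form
            diff, used = take_b32(s, pos + 2, 4), 6
        elif c == "w" and nxt in XYZ:             # 17-bit form
            diff, used = (B4.index(nxt) << 15) | take_b32(s, pos + 2, 3), 5
        elif c in B4 and nxt in B32_SET:          # 7-bit form
            diff, used = (B4.index(c) << 5) | B32.index(nxt), 2
        else:
            raise ValueError("not a valid ACE37 form")
        prev ^= diff
        out.append(unshift(prev))
        pos += used
    return out


print([hex(n) for n in decode("w4odfwg")])  # ['0x261af', '0x261bf']
```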
> >
> >8. Examples
> >
> > ACE37 is likely to be implemented with an ACE prefix in the form
> > "xx--". The actual prefix to be used is not discussed in this
> > document. The following examples are taken from the mailing list as
> > well as from DUDE-02 and the AMC series. The resulting ACE37 string
> > is compared with that using DUDE:
> >
> > (A) JPNIC (the registry of .jp domain)
> >
> > Unicode: U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3
> > U+30C8 U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9
> > U+30E1 U+30FC U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF
> > U+30FC
> > ACE37: i9urut6hm8jfaqv0m9dv1wewbx7wjyjwbynx6zsy8wtybygwky8y8ycy3
> > (57 char)
> > DUDE-02: (error: result string exceeds 59 characters*)
> > Note: 59 characters is the maximum allowable when the ACE
> > prefix "xx--" is included
> >
> >
> > (B) A health-insurance organization in Tokyo
> >
> > Unicode: U+6771 U+4EAC U+90FD U+60C5 U+5831 U+30B5 U+30FC U+30D3
> > U+30B9 U+7523 U+696D U+5065 U+5EB7 U+4FDD U+967A U+7D44
> > U+5408
> > ACE37: drhaetvihk1o67ka44y9xfzahcqv2e6883micbaud7apuqac (48 char)
> > DUDE-02: (error: result string exceeds 59 characters)
> >
> >
> > (C) 6 hangul syllables
> >
> > Unicode: U+C138 U+ACC4 U+C758 U+BAA8 U+B4E0 U+C0AC
> > ACE37: xg9orfsqssvfg3i8t2c (19 char)
> > DUDE-02: 6txiy79ny53nz79a8wizwwn (23 char)
> >
> >
> > (D) maji<de>koi<suru>5<byou><mae> (Latin, hiragana, kanji)
> >
> > Unicode: U+006D U+0061 U+006A U+0069 U+3067 U+006B U+006F U+0069
> > U+3059 U+308B U+0035 U+79D2 U+524D
> > ACE37: -m-a-j-is0a-k-o-xu06i-5iapqsv (30 char)
> > DUDE-02: pnmdvssqvssnegvsva7cvs5qz38hu53r (32 char)
> >
> >
> > (E) <pafii>de<runba> (Latin, katakana)
> >
> > Unicode: U+30D1 U+30D5 U+30A3 U+30FC U+0064 U+0065 U+30EB U+30F3
> > U+30D0
> > ACE37: 06hw4zmyv-d-ewnwox3 (19 char)
> > DUDE-02: vs5bezgxrvs3ibvs2qtiud (22 char)
> >
> >
> > (F) <sono><supiido><de> (hiragana, katakana)
> >
> > Unicode: U+305D U+306E U+30B9 U+30D4 U+30FC U+30C9 U+3067
> > ACE37: 02txj06nzdx8xl05e (17 char)
> > DUDE-02: vsvpvd7hypuivf4q (16 char)
> >
> >
> > (G) 2 Arbitrary Plane Two Code Points
> >
> > Unicode: U+261AF U+261BF
> > ACE37: w4odfwg (7 char)
> > DUDE-02: uyt6rta (7 char)
> >
> >
> > (H) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
> >
> > Unicode: U+0050 U+0072 U+006F U+010D U+0070 U+0072 U+006F U+0073
> > U+0074 U+011B U+006E U+0065 U+006D U+006C U+0075 U+0076
> > U+00ED U+010D U+0065 U+0073 U+006B U+0079
> > ACE37: -p-r-o0bt-p-r-o-s-twm-n-e-m-l-u-v0fm0f0-e-s-k-y (47 char)
> > DUDE-02: vauctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc (44 char)
> >
> >
> > (I) Chinese
> >
> > Unicode: U+4ED6 U+5011 U+7232 U+4EC0 U+9EBD U+4E0D U+8AAA U+4E2D
> > U+6587
> > ACE37: 7mmfm7oh3n7is3ts5gh57h47ata (27 char)
> > DUDE-02: w85gt86huuudv69c7szp7s5a6w4h6w2hu54k (36 char)
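The 59-character ceiling quoted in example (A) is what remains of the 63-octet DNS label limit once a 4-character prefix is added. A small Python sketch of that budget check follows; the function name is made up for illustration, and "xx--" is only the placeholder form mentioned above, since the draft does not fix the real prefix.

```python
MAX_LABEL_OCTETS = 63        # DNS limit on the length of a single label
PLACEHOLDER_PREFIX = "xx--"  # stand-in only; the actual prefix is undecided


def fits_in_label(ace_body, prefix=PLACEHOLDER_PREFIX):
    """Return True if prefix plus encoded body fits in one DNS label."""
    return len(prefix) + len(ace_body) <= MAX_LABEL_OCTETS


# Encoded strings copied from examples (A), (C) and (I) above.
examples = {
    "(A) ACE37": "i9urut6hm8jfaqv0m9dv1wewbx7wjyjwbynx6zsy8wtybygwky8y8ycy3",
    "(C) ACE37": "xg9orfsqssvfg3i8t2c",
    "(C) DUDE-02": "6txiy79ny53nz79a8wizwwn",
    "(I) ACE37": "7mmfm7oh3n7is3ts5gh57h47ata",
    "(I) DUDE-02": "w85gt86huuudv69c7szp7s5a6w4h6w2hu54k",
}

for name, body in examples.items():
    print(f"{name}: {len(body)} char, fits: {fits_in_label(body)}")
```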
> >
> >9. Summary & Comparisons
> >
> > In summary, ACE37 is based on the DUDE-02 process, with an improved
> > compression scheme for code point sequences that are unlikely to
> > cluster closely together, such as CJK ideographs.
> >
> > The design team indicated that about 30 characters should generally
> > be sufficient. The Asian community has raised considerable concern
> > that 14-15 characters is too limiting, while the Latin community has
> > given little indication that length is a real concern. ACE37
> > therefore aims to raise the worst-case capacity to around 20
> > characters.
> >
> > ACE37 is a simple variation on the primary ACEs identified by the
> > design team. It performs markedly better for CJK characters (27
> > versus 36 characters for DUDE-02 in example (I) of Section 8) while
> > keeping the simplicity of DUDE.
> >
> > Key Improvements of ACE37 over DUDE-02
> > - much more room for Han characters: the worst case improves to 21
> > Han ideographs through code block shifting and full use of the
> > base-32 characters
> > - no need to prepend arbitrary flag bits to identify code point
> > brackets; base-4 characters and diff forms are used instead
> > - base-32 and base-4 characters can be computed directly instead of
> > being mapped through lookup tables
> >
> > Key Improvements of ACE37 over the AMC series
> > - a simpler process, using the one-pass differential mechanism from
> > DUDE-02
> > - a much simpler code block shifting process achieves a goal similar
> > to that of the complex multiple reference point system used by the
> > AMC series
> > - base-32 and base-4 characters can be computed directly instead of
> > being mapped through lookup tables
> >
> > Key Improvements of ACE37 over LACE
> > - a simpler process, using the one-pass differential mechanism from
> > DUDE-02
> > - much more room for Han characters: the worst case improves to 21
> > Han ideographs through code block shifting and full use of the
> > base-32 characters
> > - base-32 and base-4 characters can be computed directly instead of
> > being mapped through lookup tables
> >
> > Two Excel spreadsheets for ACE37 encoding and decoding can be found
> > at http://www.dnsii.org/ace37/ace37-encode.xls and
> > http://www.dnsii.org/ace37/ace37-decode.xls respectively. They
> > illustrate the simplicity of ACE37 and provide a handy tool for
> > checking ACE37 encoding and decoding algorithms. The ACE37-encode
> > spreadsheet also includes a DUDE-encode worksheet.
> >
> >10. Security Considerations
> >
> > This document does not address DNS security issues. The proposal is
> > not believed to introduce security problems beyond those that
> > already exist or are anticipated from adding multilingual characters
> > to the DNS or from using an ACE.
> >
> >11. References
> >
> > [AMC-W] Adam M. Costello, "AMC-ACE-W version 0.1.0", May 31, 2001.
> >
> > [AMC-V] Adam M. Costello, "AMC-ACE-V version 0.1.0", May 31, 2001.
> >
> > [DUDE-02] Mark Welter, Brian W. Spolarich & Adam M.
> > Costello, "Differential Unicode Domain Encoding (DUDE)",
> > June 7, 2001.
> >
> > [LACE] Mark Davis, IBM & Paul Hoffman, IMC & VPNC, "LACE: Length-
> > based ASCII Compatible Encoding for IDN", January 5, 2001.
> >
> > [Nameprep] Paul Hoffman, IMC & VPNC & Marc Blanchet, ViaGenie,
> > "Preparation of Internationalized Host Names", February
> > 24, 2001.
> >
> >Appendix A. Acknowledgements
> >
> > The ACE37 draft combines DUDE-02, the AMC series and LACE, and takes
> > into consideration the report of the ACE design team. The authors
> > would therefore like to thank, for their inspiration: the authors of
> > DUDE-02, Mark Welter, Brian W. Spolarich and Adam M. Costello; the
> > author of the AMC series, Adam M. Costello; the authors of LACE,
> > Mark Davis and Paul Hoffman; and the ACE design team and its
> > advisors, Adam M. Costello, Paul Hoffman, Makoto Ishisone, David
> > Laurence, Brian Spolarich, Rick Wesson, Marc Blanchet, Patrik
> > Faltstrom and Erik Nordmark.
> >
> >Appendix B. Mixed-case annotation
> >
> > This section is taken from DUDE and modified for ACE37.
> >
> > In order to use ACE37 to represent case-insensitive Unicode strings,
> > higher layers need to case-fold the Unicode strings prior to ACE37
> > encoding. The encoded string can, however, use mixed-case base-4
> > characters as an annotation telling how to convert the folded
> > Unicode string into a mixed-case Unicode string for display
> > purposes.
> >
> > Each Unicode code point (unless it is an LDH) is represented by a
> > sequence of base-4 and base-32 characters. The first of these is
> > usually a base-4 character, which is always a letter {wxyz} (as
> > opposed to a digit). If that letter is uppercase, it is a
> > suggestion that the Unicode character be mapped to uppercase (if
> > possible); if the letter is lowercase, it is a suggestion that the
> > Unicode character be mapped to lowercase (if possible).
> >
> > If the code point is an LDH, for example "a", it is represented as
> > "-a". To mark the case of an LDH, simply set the character following
> > the "-" to the desired case. For example, if an uppercase "A" is
> > desired, the encoded form SHOULD be "-A".
> >
> > Note that a code point representation may contain no base-4
> > character at all. That is the case for a 15-bit diff form, where the
> > base-32 characters are used for the case suggestion (if possible) in
> > the same way as described for a base-4 character. There is also a
> > very remote possibility that all 3 base-32 characters are digits. If
> > this happens, case unfolding is aborted. Since case annotation is an
> > optional feature used for display purposes only, this is not
> > considered a major concern. The chance of it happening is only
> > (32639/27)/1114109, or about 0.1%.
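The quoted figure can be checked with a few lines of Python. This merely reproduces the arithmetic of the expression as written in the draft; where the counts 32639, 27 and 1114109 come from is not derived here, and the variable names are illustrative only.

```python
# Numbers copied verbatim from the expression in the draft text.
all_digit_cases = 32639 / 27
code_space = 1114109

probability = all_digit_cases / code_space
print(f"{probability:.4%}")  # a little over one tenth of a percent
```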
> >
> > ACE37 encoders and decoders are not required to support these
> > annotations, and higher layers need not use them.
> >
> > For example, to suggest that example (H) in Section 8, "Examples",
> > be displayed as:
> > Czech: Pro<ccaron(uppercase)>prost<ecaron(uppercase)>
> > nemLUV<iacute(lowercase)><ccaron(lowercase)>esky
> >
> > one could capitalize the ACE37 encoding as:
> > ACE37: -P-r-o0BT-p-r-o-s-tWM-n-e-m-L-U-V0fm0f0-e-s-k-y (47 char)
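How a display layer might read the LDH part of this annotation can be sketched in Python. This is a rough illustration only: ldh_case_hints() is a hypothetical helper, it treats every "-" as introducing exactly one literal LDH character, and it ignores the hints carried by the base-4 and base-32 characters of non-LDH code points.

```python
def ldh_case_hints(encoded):
    """Collect (folded character, uppercase suggested) for each "-X" pair."""
    hints = []
    i = 0
    while i < len(encoded):
        if encoded[i] == "-" and i + 1 < len(encoded):
            literal = encoded[i + 1]
            hints.append((literal.lower(), literal.isupper()))
            i += 2  # skip the marker and the literal it introduces
        else:
            i += 1  # character of a non-LDH code point; not handled here
    return hints


annotated = "-P-r-o0BT-p-r-o-s-tWM-n-e-m-L-U-V0fm0f0-e-s-k-y"
plain = "-p-r-o0bt-p-r-o-s-twm-n-e-m-l-u-v0fm0f0-e-s-k-y"

# The annotation changes only the case, so both forms compare equal
# when matched case-insensitively.
print(annotated.lower() == plain)

# LDH characters that the annotation suggests displaying in uppercase.
print("".join(ch.upper() for ch, upper in ldh_case_hints(annotated) if upper))
```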
> >
> >Authors:
> >
> >Edmon Chung
> >Neteka Inc.
> >2462 Yonge St. Toronto,
> >Ontario, Canada M4P 2H5
> >edmon@neteka.com
> >
> >David Leung
> >Neteka Inc.
> >2462 Yonge St. Toronto,
> >Ontario, Canada M4P 2H5
> >david@neteka.com
> >
>