
Re: [idn] summary of reordering discussion



The re-ordering discussion has been a little difficult to follow,
and I am sure that many wg members could use some help
sorting through some of the complex issues involved so they
can participate in the straw poll.

I have prepared a formal analysis of the discussion so far,
which will at least serve as a starting point for those wanting
to research it in more detail. It is attached.

I have tried to be as neutral as possible, but of course
both sides will likely dispute this. :)

Enjoy!

Bruce

                                                            Bruce Thomson
                                                            Future Media Network
                                                            bthomson@fm-net.ne.jp
                                                            Nov. 12, 2001

                An Analysis of idn Re-ordering

Purpose

  To condense the idn workgroup's lengthy discussion on
  re-ordering down to readable length, clarify the arguments,
  and focus on the most significant issues.

  The draft itself is at:

  http://www.ietf.org/internet-drafts/draft-ietf-idn-lsb-ace-02.txt

Method

  I have started with the arguments presented in the
  re-ordering draft and on the mailing list, and
  presented them in a more ordered fashion. I found that
  some of the arguments, while sound, were not explained
  well, so I have in many cases re-formulated them while
  trying to preserve the intent of the authors. I have
  added my own conclusions to the arguments as to their
  significance, while attempting to be relatively neutral.
  My intent here was not to sway the reader to a different
  point of view, but rather to focus his attention on the
  points that are genuinely in doubt so that a consensus
  can be reached on them.

  The discussion and conclusions are based on the
  published draft but also include the effects of
  improvements suggested by the author in response to
  comments by others. "Could" and "probably" are used in
  the text where an unpublished improvement is being
  discussed.

  I have not mentioned the names of the various authors,
  because I felt that a more impersonal approach was
  preferable.

Advantages and Disadvantages of Re-ordering that were Argued

- Advantages

1) Re-ordering provides up to a 31% improvement in encoding
efficiency over the non-re-ordered case.

  True.

  Re-ordering significantly reduces the ACE-encoded length of
  scripts with large numbers of code points. The main beneficiaries
  are Hangeul and Han scripts, which see improvements of 31.07%
  and 27.47% respectively. Other scripts see improvements of 5% to 14%.

  Note: The above figures are from the I-D and are derived from
  Verisign testbed registrations. Other measurements have given
  results such as 14% and 20%, while certain names with frequently
  used characters are improved by up to 40%. The testbed figures
  are probably reasonable for our purposes.

2) Non-re-ordered ACE encodings are "unfair" to non-Latin script blocks,
particularly Asian ones. Re-ordering rectifies this.

  Untrue.  There is nothing particularly "unfair" about the ACE
  encoding of Asian scripts.

  These script blocks are larger and each character carries
  more information, so their characters will naturally
  require longer encodings. Viewed quantitatively,
  Han characters require 3.09 octets in AMC-ACE-Z compared
  to only 1 octet for Latin-1, but remember that each Han also
  requires 2 octets in JIS or in UCS-2. There are 20,902 Han in
  Unicode, requiring about 1.8 octets/Han in theory for random
  strings with optimal packing.
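
  As a rough check of the 1.8-octet figure, the sketch below
  assumes that all 20,902 Han are equally likely (the "random
  strings" case) and that each octet carries a full 8 bits. The
  function name is illustrative only and does not come from any
  draft.

```python
import math

def optimal_octets_per_char(repertoire_size: int) -> float:
    """Information-theoretic minimum octets per character, assuming
    every character in the repertoire is equally likely."""
    bits = math.log2(repertoire_size)  # bits needed to pick one character
    return bits / 8                    # 8 bits per octet

han = optimal_octets_per_char(20902)
print(f"{han:.2f} octets/Han")  # prints "1.79 octets/Han"
```

  This is a lower bound for uniformly random text only; real names
  favor common characters, which is the property re-ordering tries
  to exploit.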

  The Book of Genesis represented in various languages and
  converted into AMC-ACE-Z without re-ordering would require
  the following number of octets:

     English  3,088 * 1    = 3,088
     Han        778 * 3.09 = 2,402 (22% better than English)
     Hangeul  1,201 * 3.04 = 3,651 (18% worse than English)

  So the most we can say about un-re-ordered ACE encodings is
  that we are not taking advantage of a possible optimization
  that would make Asian languages more efficient than English.
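
  The comparison above can be recomputed with a short sketch.
  The character counts and per-character ratios are the ones
  quoted in the table; the variable and function names are
  illustrative. Because the ratios are rounded to two decimal
  places, the Han total comes out a few octets away from the
  tabulated 2,402, but the percentages agree.

```python
# (character count, octets per character) as quoted in the table above
texts = {
    "English": (3088, 1.00),
    "Han": (778, 3.09),
    "Hangeul": (1201, 3.04),
}

def encoded_octets(chars: int, ratio: float) -> float:
    """Total encoded length for a text of `chars` characters."""
    return chars * ratio

baseline = encoded_octets(*texts["English"])
relative = {}  # percent difference from English; negative means shorter
for name, (chars, ratio) in texts.items():
    octets = encoded_octets(chars, ratio)
    relative[name] = (octets - baseline) / baseline * 100
    print(f"{name:8s} {octets:8.0f} octets  {relative[name]:+.0f}% vs English")
```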

3) Reduced encoding lengths will make it less likely that an
encoded name will be too long.

  True, but perhaps not of practical significance.

  The maximum length of a Hangeul domain name is limited by the
  restriction that a DNS label name be 63 characters or less.
  Therefore, without re-ordering and assuming a label of the
  form:

  www.bq--<ACE-encoded Hangeul>.com.

  we can calculate the maximum number of Hangeul that can be
  used in a domain name as:

  maxHangeul = (maxLabel - prefLength - suffixLength) / encodeRatio
             = (63 - 8 - 5) / 3.04
             = 16

      Note: The legal maximum length of domain names could end up
            being defined a little differently. The above is a
            practical limit.

  16 Hangeul is a huge amount of information, however. Using the
  language-efficiency figures from the Book of Genesis translation
  above, this domain name is equivalent to

  16 * (3,088 / 1,201) = 41 characters

  in English. If this is an accurate representation of the
  information-carrying power of a hostname, it means that we can
  represent the equivalent of such names as:

  www.all-the-spaghetti-you-could-ever-want-to-eat.com

  in Hangeul even without re-ordering.
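
  The two calculations in this item can be reproduced as follows.
  This is only a sketch of the arithmetic in the text: it applies
  the 63-character limit to the whole "www.bq--...com." form, as
  the text does for its practical limit, and the function names
  are illustrative.

```python
import math

MAX_LENGTH = 63        # length limit used in the text
PREFIX = "www.bq--"    # 8 characters
SUFFIX = ".com."       # 5 characters
HANGEUL_RATIO = 3.04   # octets per Hangeul without re-ordering

def max_chars(ratio: float, limit: int = MAX_LENGTH) -> int:
    """Whole characters that fit once the prefix and suffix are deducted."""
    return math.floor((limit - len(PREFIX) - len(SUFFIX)) / ratio)

def english_equivalent(hangeul_count: int,
                       english_chars: int = 3088,
                       hangeul_chars: int = 1201) -> int:
    """Scale a Hangeul count by the Genesis character-count ratio."""
    return round(hangeul_count * english_chars / hangeul_chars)

n = max_chars(HANGEUL_RATIO)
print(n, english_equivalent(n))  # prints "16 41"
```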

  In a discussion of long names, the I-D author states that
  9-letter Han/Hangeul domains are very common (based on
  his research of testbed-registered domains), implying that
  names much longer than 9 are not so common.

4) Shorter encoded names are better for users who actually type
in the ACE encoding.

  True to some extent.

  It is not known to what extent users will actually type
  in ACE names; for example, reading them off a business
  card or an e-mail. Domains that need to be accessed by
  users unfamiliar with the script would be likely to have
  an alternate ASCII name. There will be cases where an
  e-mail return address will appear as ACE, but the user
  could always press the "Reply" button. There could also
  be cases where someone e-mails the ACE-encoded mail
  address of a third party to a recipient who cannot
  read the script.

  Most users who type or view ACE names will likely
  be performing maintenance functions: renewing the
  domain name, performing a WHOIS lookup, or entering
  the name by hand into a zone file. So the question is
  whether making the job of the people involved in
  these tasks for Hangeul or Han names a little easier
  by making the names 30% shorter is worth it.

  Cut-and-paste is commonly used to ease the burden of
  dealing with ACE-encoded names, and that method would
  be preferred by the user where possible.

- Disadvantages

1) Might cause problems if new code points were added later.

  Not a problem.

  The algorithm only re-orders existing code points, and will not cause
  conflicts with new definitions.

2) Would not be able to be improved if new scripts were added or sub-optimalities
discovered.

  Not a problem.

  The algorithm would probably be frozen for all time. It would significantly
  improve Hangeul and Han names, and like all heuristics, would be good but
  not perfect. Probably no new scripts would ever be added. The compression
  achieved for scripts other than Han and Hangeul is marginal anyway, and
  versioning of the algorithm cannot be justified in terms of cost/benefits.
  No new scripts with 10,000+ code points are likely to spring into being.

3) Doesn't reference an outside authority.

  Significant issue for many wg members, and perhaps the deciding one.

  Originally, the idn concept was that details of code points, etc. were
  best left to other standards organizations that specialize in such
  things. Why, after all, re-invent the wheel unnecessarily, infringing
  on other people's charters and making our own task more difficult?
  An idn document that says "as defined in the standards organization XXX
  document YYY" is so much more elegant than one that has pages of numbers.

  The re-ordering algorithm could be made to incorporate character
  usage statistics from reputable authorities, and testing has shown the
  re-ordering to be effective. However, those numbers do not have
  the status of a "standard"; they are just statistics, published like
  census data. They would have to be elevated to the status of a standard
  by being incorporated into an idn-related RFC, and for stability of the
  encoding it would really be necessary to avoid a "pointer reference"
  to the statistics and to physically insert all the desired codepoints
  into the idn document. This would make idn much more complex than
  it has to be. In the ideal world, another organization would publish
  an improved re-ordered Unicode which could then be incorporated by
  reference. However, this has not happened yet.

  In order to implement re-ordering without the help of an outside
  organization, it is apparently necessary to move all the code points
  around by fiat, one at a time, to get them in the order needed.
  A bit messy?

3-A) Could the tables be handled by IANA?

  Not so useful.

  The principal benefit of having definitions handled by IANA
  is that they have established procedures for updating them.
  However, in this case it is likely that the tables would be
  frozen anyway; there is not a significant benefit in updating
  them in the future. And IANA has no intrinsic expertise in this
  area, any more than the idn wg.

4) Isn't necessarily optimal for certain scripts; could actually be worse
in some cases.

  Not an important problem.

  As with any compression algorithm, there will always be some inputs that
  actually get longer when the algorithm is applied. However, the only
  case which is really important is where a name becomes too long to be
  registered. A name would have to have the equivalent of almost 40 English
  characters in information-carrying content, and have an unusual concentration
  of rare characters to go over the limit. This is not a case that will cause
  users to complain about the algorithm.

5) Too much complexity for not enough benefits.

  Controversial.

  We have to decide if the benefits are worth the effort.

5-A) Nameprep is complicated too.

  Not relevant.

  Some wg members feel that nameprep complexity is not
  justified either; others feel that it has big benefits
  that make it all worthwhile. In any case, complexity vs.
  benefits is an engineering tradeoff that needs to be weighed
  separately for each technology that is considered.