Re: [idn] summary of reordering discussion
The re-ordering discussion has been a little difficult to follow,
and I am sure that many wg members could use some help
sorting through some of the complex issues involved so they
can participate in the straw poll.
I have prepared a formal analysis of the discussion so far,
which will at least serve as a starting point for those wanting
to research it in more detail. It is attached.
I have tried to be as neutral as possible, but of course
both sides will likely dispute this. :)
Enjoy!
Bruce
Bruce Thomson
Future Media Network
bthomson@fm-net.ne.jp
Nov. 12, 2001
An Analysis of idn Re-ordering
Purpose
To condense the idn workgroup's lengthy discussion on
re-ordering down to readable length, clarify the arguments,
and focus on the most significant issues.
The draft itself is at:
http://www.ietf.org/internet-drafts/draft-ietf-idn-lsb-ace-02.txt
Method
I have started with the arguments presented in the
re-ordering draft and on the mailing list, and
presented them in a more ordered fashion. I found that
some of the arguments, while sound, were not explained
well, so I have in many cases re-formulated them while
trying to preserve the intent of the authors. I have
added my own conclusions to the arguments as to their
significance, while attempting to be relatively neutral.
My intent here was not to sway the reader to a different
point of view, but rather to focus his attention on the
points that are genuinely in doubt so that a consensus
can be reached on them.
The discussion and conclusions are based on the
published draft but also include the effects of
improvements suggested by the author in response to
comments by others. "Could" and "probably" are used in
the text where an unpublished improvement is being
discussed.
I have not mentioned the names of the various authors,
because I felt that a more impersonal approach was
preferable.
Advantages and Disadvantages of Re-ordering that were Argued
- Advantages
1) Re-ordering provides up to a 31% improvement in encoding
efficiency over the non-re-ordered case.
True.
Re-ordering significantly reduces the ACE-encoded length of
scripts with large numbers of code points. The main beneficiaries
are Hangeul and Han scripts, which see improvements of 31.07%
and 27.47% respectively. Other scripts see improvements of 5% to 14%.
Note: The above figures are from the I-D and are derived from
Verisign testbed registrations. Other measurements have given
results such as 14% and 20%, while certain names with frequently
used characters are improved by up to 40%. The testbed figures
are probably reasonable for our purposes.
2) Non-re-ordered ACE encodings are "unfair" to non-Latin script blocks,
particularly Asian ones. Re-ordering rectifies this.
Untrue. There is nothing particularly "unfair" about the ACE
encoding of Asian scripts.
These script blocks are larger and each character contains
more information, and so the characters will naturally
require longer encodings. Viewed quantitatively,
Han characters require 3.09 octets in AMC-ACE-Z compared
to only 1 octet for Latin-1, but remember that each Han also
requires 2 octets in JIS or in UCS-2. There are 20,902 Han in
Unicode, requiring about 1.8 octets/Han in theory for random
strings with optimal packing.
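The 1.8 octets/Han figure can be checked with a short calculation.
This is only a sketch: it assumes every one of the 20,902 Han is
equally likely and that the packing is ideal, so the cost per
character is log2(20,902) bits. The variable names are illustrative.

```python
import math

HAN_CODE_POINTS = 20_902  # number of Han in Unicode, as cited in the text

# Ideal cost of one symbol drawn uniformly from N possibilities is log2(N) bits.
bits_per_han = math.log2(HAN_CODE_POINTS)
octets_per_han = bits_per_han / 8

print(f"{bits_per_han:.2f} bits = {octets_per_han:.2f} octets per Han")
```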
The Book of Genesis represented in various languages and
converted into AMC-ACE-Z without re-ordering would require
the following number of octets:
English   3,088 * 1    = 3,088
Han         778 * 3.09 = 2,402  (22% better than English)
Hangeul   1,201 * 3.04 = 3,651  (18% worse than English)
So the most we can say about un-re-ordered ACE encodings is
that we are not taking advantage of a possible optimization
that would make Asian languages more efficient than English.
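The Genesis comparison can be reproduced from the character counts
and per-character ratios quoted above. This is a sketch only; the
dictionary and function names are made up for illustration. Using the
rounded ratio 3.09, the Han total comes out near 2,404 rather than
2,402, presumably because the published ratio is itself rounded. The
percentages are unaffected.

```python
# (characters in Genesis, octets per character in AMC-ACE-Z without re-ordering)
GENESIS = {
    "English": (3088, 1.00),
    "Han": (778, 3.09),
    "Hangeul": (1201, 3.04),
}

def encoded_octets(chars, octets_per_char):
    """Total encoded size for a text of `chars` characters."""
    return chars * octets_per_char

baseline = encoded_octets(*GENESIS["English"])

relative = {}  # percent difference from English; negative means shorter
for language, (chars, ratio) in GENESIS.items():
    total = encoded_octets(chars, ratio)
    relative[language] = (total - baseline) / baseline * 100
    print(f"{language:8s} {total:8.0f} octets  {relative[language]:+.0f}% vs English")
```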
3) Reduced encoding lengths will make it less likely that an
encoded name will be too long.
True, but perhaps not of practical significance.
The maximum length of a Hangeul domain name is limited by the
restriction that a DNS label name be 63 characters or less.
Therefore, without re-ordering and assuming a label of the
form:
www.bq--<ACE-encoded Hangeul>.com.
we can calculate the maximum number of Hangeul that can be
used in a domain name as:
maxHangeul = (maxLabel - prefLength - suffixLength) / encodeRatio
           = (63 - 8 - 5) / 3.04
           = 16 (rounded down)
Note: The legal maximum length of domain names could end up
being defined a little differently. The above is a
practical limit.
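The same calculation as a small sketch, so that other limits, prefixes
or encoding ratios can be tried. It follows the simplification used
above of charging the prefix and suffix against the 63-character
limit. The function name is hypothetical.

```python
import math

MAX_LABEL = 63        # length limit used in the text
PREFIX = "www.bq--"   # 8 characters
SUFFIX = ".com."      # 5 characters
HANGEUL_RATIO = 3.04  # octets per Hangeul without re-ordering

def max_chars(max_len, prefix, suffix, octets_per_char):
    """Whole characters that fit once prefix and suffix are subtracted."""
    budget = max_len - len(prefix) - len(suffix)
    return math.floor(budget / octets_per_char)

max_hangeul = max_chars(MAX_LABEL, PREFIX, SUFFIX, HANGEUL_RATIO)
print(max_hangeul)  # prints 16
```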
16 Hangeul is a huge amount of information, however. Using the
language-efficiency figures from the Book of Genesis translation
above, this domain name is equivalent to
16 * (3,088 / 1,201) = 41 characters
in English. If this is an accurate representation of the
information-carrying power of a hostname, it means that we can
represent the equivalent of such names as:
www.all-the-spaghetti-you-could-ever-want-to-eat.com
in Hangeul even without re-ordering.
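The English-equivalent figure follows directly from the Genesis
character counts. As a sketch, assuming the Genesis ratio is a fair
proxy for the information carried by hostnames (which is the
assumption made above):

```python
ENGLISH_CHARS = 3088  # Genesis in English
HANGEUL_CHARS = 1201  # Genesis in Hangeul
MAX_HANGEUL = 16      # limit derived above, without re-ordering

# English characters needed to carry what one Hangeul carries
english_per_hangeul = ENGLISH_CHARS / HANGEUL_CHARS
english_equivalent = MAX_HANGEUL * english_per_hangeul

print(f"{english_equivalent:.1f}")  # prints 41.1
```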
In a discussion of long names, the I-D author states that
9-letter Han/Hangeul domains are very common (based on
his research of testbed-registered domains), implying that
names much longer than 9 are not so common.
4) Shorter encoded names are better for users who actually type
in the ACE encoding.
True to some extent.
It is not known to what extent users will actually type
in ACE names; for example, reading them off a business
card or an e-mail. Domains that need to be accessed by
users unfamiliar with the script would be likely to have
an alternate ASCII name. There will be cases where an
e-mail return address will appear as ACE, but the user
could always press the "Reply" button. There could also
be cases where someone e-mails the ACE-encoded mail
address of a third party to a recipient who cannot
read the script.
Most users that type or view ACE names will likely
be performing maintenance functions: renewing the
domain name, performing a WHOIS lookup, or entering
the name by hand into a zone file. So the question is
whether making the job of the people involved in
these tasks for Hangeul or Han names a little easier
by making the names 30% shorter is worth it.
Cut-and-paste is commonly used to ease the burden of
dealing with ACE-encoded names, and that method would
be preferred by the user where possible.
- Disadvantages
1) Might cause problems if new code points were added later.
Not a problem.
The algorithm only re-orders existing code points, and will not cause
conflicts with new definitions.
2) Would not be able to be improved if new scripts were added or sub-optimalities
discovered.
Not a problem.
The algorithm would probably be frozen for all time. It would significantly
improve Hangeul and Han names, and like all heuristics, would be good but
not perfect. Probably no new scripts would ever be added. The compression
achieved for scripts other than Han and Hangeul is marginal anyway, and
versioning of the algorithm cannot be justified in terms of cost/benefits.
No new scripts with 10,000+ code points are likely to spring into being.
3) Doesn't reference an outside authority.
Significant issue for many wg members, and perhaps the deciding one.
Originally, the idn concept was that details of code points, etc. were
best left to other standards organizations that specialize in such
things. Why, after all, re-invent the wheel unnecessarily, infringing
on other people's charters and making our own task more difficult?
An idn document that says "as defined in the standards organization XXX
document YYY" is so much more elegant than one that has pages of numbers.
The re-ordering algorithm could be made to incorporate character
usage statistics from reputable authorities, and testing has shown the
re-ordering to be effective. However, those numbers do not have
the status of a "standard"; they are just statistics, published like
census data. They would have to be elevated to the status of a standard
by being incorporated into an idn-related RFC, and for stability of the
encoding it would really be necessary to avoid a "pointer reference"
to the statistics and to physically insert all the desired codepoints
into the idn document. This would make idn much more complex than
it has to be. In the ideal world, another organization would publish
an improved re-ordered Unicode which could then be incorporated by
reference. However, this has not happened yet.
In order to implement re-ordering without the help of an outside
organization, it is apparently necessary to move all the code points
around by fiat, one at a time, to get them in the order needed.
A bit messy?
3-A) Could the tables be handled by IANA?
Not so useful.
The principal benefit of having definitions handled by IANA
is that they have established procedures for updating them.
However, in this case it is likely that the tables would be
frozen anyway; there is not a significant benefit in updating
them in the future. And IANA has no intrinsic expertise in this
area, any more than the idn wg.
4) Isn't necessarily optimal for certain scripts; could actually be worse
in some cases.
Not an important problem.
As with any compression algorithm, there will always be some inputs that
actually get longer when the algorithm is applied. However, the only
case which is really important is where a name becomes too long to be
registered. A name would have to have the equivalent of almost 40 English
characters in information-carrying content, and have an unusual concentration
of rare characters to go over the limit. This is not a case that will cause
users to complain about the algorithm.
5) Too much complexity for not enough benefits.
Controversial.
We have to decide if the benefits are worth the effort.
5-A) Nameprep is complicated too.
Not relevant.
Some wg members feel that nameprep complexity is not
justified either; others feel that it has big benefits
that make it all worthwhile. In any case, complexity vs.
benefits is an engineering tradeoff that needs to be weighed
separately for each technology that is considered.