
Re: [idn] Summary of TS-SC discussion

To: Paul Hoffman / IMC <phoffman@imc.org>, idn@ops.ietf.org
Subject: Re: [idn] Summary of TS-SC discussion
From: Martin Duerst <duerst@w3.org>
Date: Tue, 04 Sep 2001 22:35:55 +0900

Hello Paul,

I mostly agree with your summary. However, there is one important
point that is not explicit enough.

At 21:08 01/09/02 -0700, Paul Hoffman / IMC wrote:
>We have now rehashed every part of the question many times; maybe it is a 
>good time to summarize.
>
>1) Doing traditional-to-simplified conversion must be done from a table; it 
>cannot be done by algorithm.
>
>2) No widely-accepted official standard table yet exists. Multiple 
>different tables exist.

This can easily be misunderstood. The issue is not that different
(standards) bodies or different individuals disagree about the mappings.

The problem is that the structure of the mappings is complex.
Even if Korean and Japanese are completely ignored, there are
some traditional characters that can be mapped to more than one
simplified character, and some simplified characters that can
be mapped to more than one traditional character.
(These are one-to-many mappings; for examples, see Table 1 in
Jack Halpern's paper at http://cjk.org/cjk/c2c/c2cbasis.htm).
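
To make the structure concrete, here is a minimal sketch in Python.
The table is a tiny illustrative sample that I am assuming for the
sake of the example (it is not taken from tsconv or from any standard
table), and the names S2T and classify are made up. It shows the
simplified-to-traditional direction and separates the one-to-one
entries from the one-to-many entries.

```python
# Illustrative sample only: simplified character -> possible traditional forms.
# This is an assumed toy table, not data from tsconv or any official source.
S2T = {
    "国": ["國"],              # one-to-one
    "学": ["學"],              # one-to-one
    "发": ["發", "髮"],        # one-to-many
    "后": ["後", "后"],        # one-to-many
    "干": ["乾", "幹", "干"],  # one-to-many
}


def classify(table):
    """Split a mapping table into one-to-one and one-to-many entries."""
    one_to_one = {}
    one_to_many = {}
    for source, targets in table.items():
        if len(targets) == 1:
            one_to_one[source] = targets[0]
        else:
            one_to_many[source] = targets
    return one_to_one, one_to_many


one_to_one, one_to_many = classify(S2T)
print("one-to-one:", sorted(one_to_one))
print("one-to-many:", sorted(one_to_many))
```

Characters that are the same in both variants simply do not appear in
such a table. The same kind of split exists in the
traditional-to-simplified direction.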

To repeat, these one-to-many mappings are not the result of disagreements
between organizations or individuals; everybody agrees on them.
Experts both from China (including of course CNNIC), the UTC, as well
as many others can easily agree on which characters can be mapped
one-to-one, which characters don't need mappings (because they are
the same in both traditional and simplified Chinese), and for which
characters there are one-to-many mappings, and where these mappings
go to.

In order to do the conversion correctly for all cases, you need
some*body* who knows Chinese in both writing variants. There will
be little if any disagreement among such people. Having a machine
do the conversion will in many cases give incomplete results.

Both the Unicode Technical Committee and draft-ietf-idn-tsconv-00.txt
clearly agree on this point (see in particular chapter 4 of tsconv).
I haven't looked in detail at the data in tsconv, and I don't think
the UTC has looked at it in detail. And if we decide to actually
use such a list, we definitely have to make quite a few more checks
to make sure there are no accidental errors. But as far as the job
is 'list all the one-to-one equivalences between the simplified
Chinese characters (as covered by GB2312, I assume) and the traditional
Chinese characters (as covered by Big5, I assume)', the variation
between any two experts, or bodies of experts, will be negligibly
small, and solvable.

The main point of disagreement (as long as we stay with Chinese only)
is what to do to address the one-to-many problem. Is it better to
address the one-to-one problem in nameprep, or in the DNS itself,
and only leave the one-to-many problem to registration policies and
actual registrations (as tsconv proposes, as far as I understand),
or is it better to handle everything on the registration side
(as, for example, the UTC proposes)?
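
As a rough sketch of the first option, the following Python fragment
maps only the unambiguous characters and reports the one-to-many
characters so that they can be left to registration. This is my own
illustration under assumed names (T2S, fold_unambiguous) and an
assumed toy table in the traditional-to-simplified direction; it is
not the actual algorithm or data from tsconv or nameprep.

```python
# Illustrative sample only: traditional character -> possible simplified forms.
# This is an assumed toy table, not data from tsconv or any official source.
T2S = {
    "國": ["国"],        # one-to-one
    "學": ["学"],        # one-to-one
    "乾": ["干", "乾"],  # one-to-many
    "著": ["着", "著"],  # one-to-many
}


def fold_unambiguous(label, table):
    """Map one-to-one characters; leave one-to-many characters unchanged.

    Returns the partially mapped label and the list of characters that
    could not be mapped without knowing the intended meaning.
    """
    folded = []
    ambiguous = []
    for ch in label:
        targets = table.get(ch, [ch])  # characters not in the table map to themselves
        if len(targets) == 1:
            folded.append(targets[0])
        else:
            folded.append(ch)
            ambiguous.append(ch)
    return "".join(folded), ambiguous


print(fold_unambiguous("國學", T2S))
print(fold_unambiguous("乾國", T2S))
```

In the second option, nothing like this would run in nameprep or the
DNS, and both kinds of characters would be handled when names are
registered.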

Regards,   Martin.

>3) The same code points that would be converted in some domain name parts 
>would not be converted in other domain name parts because the conversion 
>is only appropriate for Chinese, not Korean or Japanese, which use the 
>same code points. Thus, the language (not the script) must be known in 
>order to do the conversion correctly.
>
>4) There is no way to flag the language in name parts in a consistent way 
>that end users would understand. Heuristics such as the TLD could be used, 
>but doing so would cause some names to not be converted when they should 
>be (such as a Chinese name part under .com) and would cause other names to 
>be converted when they should not be (such as a Korean name part under a 
>Chinese SLD). Heuristics such as language tagging would require that the 
>end user tag each Chinese name part, and would not work at all with names 
>that were cut and pasted.
>
>5) The problem can be definitively solved in zone files with multiple 
>records. Some claim that this takes 2^length records, while others claim 
>this takes 2 records.
>
>6) Until the system is deployed, we cannot tell how well the users will 
>adapt. We (the "experts" on domain names) can make predictions about what 
>typical users will and won't be able to do, but we simply don't know, and 
>our past track record at predictions is not all that good.
>
>My conclusion from this is that we cannot standardize on 
>traditional-to-simplified at this time with the protocols that are under 
>discussion in the WG. We might be able to add this later after both a 
>widely-accepted official standard table exists and a language tagging 
>mechanism that makes sense to users is implemented.
>
>Neither IDNA nor nameprep preclude this potential future change, so we can 
>move forward with them now, and leave the door open to this change, and 
>other similar changes, in the future.
>
>--Paul Hoffman, Director
>--Internet Mail Consortium
