Re: [idn] Zone rules (was: wg milestones update)
Hi Guonian,
> for chinese user, the TC-SC equivalence rule is just alike the
> case-folding rule.
> I'd like to show an example followed using the case-folding rule.
>
> under zone .COM, if manager defines upper case characters equal to lower
> case characters. users could access IBM' domain with ibm.com, IBm.com, ...
> in any case.
...
> So I think the TC-SC equivalence rule SHOULD be consistent anywhere .
Since we're on the topic of TC-SC equivalence yet again, I'd like to point
out two issues that I think we should all consider before we go further, in
case they have not been reiterated enough in previous postings by
individuals more qualified than myself:
1) The ruleset for TC-SC equivalence is not a 1-n or n-1 mapping of
abstract characters, as pointed out by Harald Alvestrand. It can
be hideously complex (see draft-ietf-idn-cjk-01.txt for details), with
numerous lexical and contextual considerations. I am sure that you would
know about the "头发" = "頭髮" but "发财" != "髮財" problem.
2) Case-folding is a simple canonical process, and the folding rules are
the same, I believe (someone please correct me if I'm wrong), for most
languages that can be written in ASCII (e.g. English, Swahili, Hawaiian,
Malay).
For example, there is no debate as to whether "CAT", "cat" and "cAt"
refer to the same thing (in English), or whether "KUCHING", "kuching" and
"kUchiNG" refer to the same thing ("cat" in Malay).
However, Han character canonicalization does not follow the same rules. Han
characters used in Chinese, Japanese and Korean have different equivalence
rules. This is also pointed out in draft-ietf-idn-cjk-01.txt.
For example, "日產" = "日产" in Chinese, but "日產" != "日产" in Japanese.
In both cases, the exact same Unicode code points are used. As has also
been pointed out, Unicode does not recognize the difference between
languages, only the difference between scripts. Hence, it will be difficult
for any Unicode-based system to apply language-based equivalence rules.
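To make the contrast in the two points above concrete, here is a minimal
sketch in Python. The function names, the tiny mapping tables and the "zh" /
"ja" language tags are all hypothetical and exist only for illustration; they
are not taken from draft-ietf-idn-cjk-01.txt or from any IDN proposal.

```python
def ascii_fold(label: str) -> str:
    """Context-free ASCII case folding: A-Z -> a-z, one character at a time."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in label)


# Hypothetical per-character SC -> TC table. It is forced to pick a single
# traditional form for "发", which is exactly where it goes wrong.
SC_TO_TC_CHARS = {"头": "頭", "发": "髮", "财": "財", "产": "產"}

# Hypothetical word-level table: the correct form depends on the word.
SC_TO_TC_WORDS = {"头发": "頭髮", "发财": "發財"}


def naive_sc_to_tc(text: str) -> str:
    """Per-character mapping with no lexical or contextual knowledge."""
    return "".join(SC_TO_TC_CHARS.get(ch, ch) for ch in text)


def han_equivalent(a: str, b: str, lang: str) -> bool:
    """Assumption: only labels tagged as Chinese get TC-SC folding."""
    if lang == "zh":
        return naive_sc_to_tc(a) == naive_sc_to_tc(b)
    return a == b  # e.g. Japanese: compare code points as they are


print(ascii_fold("IBm.com"))                 # case needs no context
print(naive_sc_to_tc("头发"))                 # happens to be right
print(naive_sc_to_tc("发财"))                 # wrong: should be 發財
print(han_equivalent("日產", "日产", "zh"))   # same code points...
print(han_equivalent("日產", "日产", "ja"))   # ...different answer
```

Note that han_equivalent() needs a language argument that a bare Unicode
label does not carry, which is the difficulty described above.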
I do agree with you that the problem that you have pointed out is indeed a
real problem and there is a need for TC-SC equivalence for locating
resources on the Internet.
However, it is a problem that cannot be solved by a Unicode-based DNS
solution within the parameters of the IDN WG and hence this may not be the
right place to address these issues.
maynard