Re: [idn] opting out of SC/TC equivalence
Hi, James and all:
After some sleep, I am ready to type more.
I am proposing a [nameprep] case folding table for Chinese,
where
+ means concatenation,
Px means the xth radical, written in Latin letters.
zho-
Unicode   Unicode   Big5   GB   StepCode
TC        SC        TC     SC   Pinyin+tone+P1+P2
So the TC to SC mapping is embedded in the table, and IDNA can
decide how long a StepCode is needed to register a unique name,
while the pre-registered form is kept at the zonefile level. The
name can therefore exceed the 63-octet limit without burdening
other protocols.
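
To make the table idea concrete, here is a minimal sketch of how a TC and an SC code point in the same row would fold to one StepCode. The two rows, the radical codes, and the names `Row`, `TABLE`, and `fold` are illustrative assumptions, not part of any agreed table.

```python
from typing import NamedTuple, Tuple

class Row(NamedTuple):
    tc: str                    # Traditional Chinese code point
    sc: str                    # Simplified Chinese code point
    pinyin: str                # Pinyin followed by a tone digit
    radicals: Tuple[str, ...]  # placeholder Latin-letter radical codes (P1, P2)

# Illustrative rows only; the radical codes are made-up placeholders.
TABLE = [
    Row("\u570b", "\u56fd", "guo2", ("wei", "yu")),
    Row("\u4e2d", "\u4e2d", "zhong1", ("gun", "kou")),
]

# Both columns of a row point at the same StepCode, so the TC to SC
# mapping is implicit in the table.
FOLD = {}
for row in TABLE:
    step = row.pinyin + "".join(row.radicals)
    FOLD[row.tc] = step
    FOLD[row.sc] = step

def fold(label: str) -> str:
    """Replace each known character by its StepCode; leave others as-is."""
    return "".join(FOLD.get(ch, ch) for ch in label)
```

With these placeholder rows, `fold` gives the same string for the TC and the SC spelling of a label, which is the property the table is meant to provide.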
For Japanese, replace Pinyin with Romaji. In some cases one
Unicode Kanji will have 3-4 Romaji entries.
(This is allowed for the zho- table too.) Thus:
jap-
Unicode   JIS     StepCode
Kana      Kana    Kunrei
Kanji     Kanji   Kunrei +P1 Kunrei +P2 Kunrei
Romaji shall use the Kunrei system, since it is the system
adopted by Unicode naming.
For IDNA processing: use the radical codes in the pre-registered
form, and register the short form (obtained by truncating the
radical codes) to keep the name under the 63-octet limit.
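
The truncation step can be sketched as follows. This is only an illustration under stated assumptions: the function names and the input shape (a list of Pinyin-plus-tone strings, each with its radical codes) are hypothetical, and the check that the short form is still unique within the zone is left out.

```python
MAX_LABEL_OCTETS = 63  # DNS label length limit

def step_code(chars, depth):
    """Build a StepCode keeping at most `depth` radical codes per character.

    `chars` is a list of (pinyin_with_tone, [P1, P2, ...]) pairs.
    """
    return "".join(p + "".join(rads[:depth]) for p, rads in chars)

def register_forms(chars):
    """Return (pre_registered_form, short_form).

    The pre-registered form keeps every radical code and may exceed
    63 octets. The short form drops radical codes, last position first,
    until the label fits.
    """
    max_depth = max((len(rads) for _, rads in chars), default=0)
    full = step_code(chars, max_depth)
    for depth in range(max_depth, -1, -1):
        short = step_code(chars, depth)
        if len(short.encode("ascii")) <= MAX_LABEL_OCTETS:
            return full, short
    raise ValueError("label exceeds 63 octets even without radical codes")
```

For example, eight characters that each contribute `guo2` plus the placeholder radicals `wei` and `yu` give a 72-octet pre-registered form and a 56-octet short form that keeps only P1.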
I do not suggest Han characters for Vietnamese, since they seem
happy with a script that uses many tones, and tone mapping can
be done in a way similar to Arabic. But the option is open
for Vietnamese, since they can create a similar mapping table if
they can show official adoption of a Romanization mapping.
Liana Ye
On Fri, 17 Aug 2001 17:05:34 +0800 "James Seng/Personal" <James@Seng.cc>
writes:
> Dear Prof Tseng,
>
> > Hi James,
> > I think you can display these Chinese characters in your
> > system, so you can give the explanation and tell me how to
> > treat them?
> > TC(? , SC(?
> > TC(?, SC(?
>
> These examples are first-level simplifications of Chinese by
> radical.
> They are equivalent in most contexts of the Chinese language, so I
> think we can both agree on this. And yes, this is not handled in the
> current normalization or nameprep.
>
> How can we solve this? Many ways, each one with its pros and cons. I
> will provide some suggestions but I am sure there are other ways:
>
> 1. Do it inside Normalization Form KC (Standard Track)
>
> Speak to the Unicode Consortium, convince them that these two
> ideographs are equivalent, and have the equivalence put into NFKC.
> This will go directly into Nameprep so long as the Unicode Consortium
> agrees with it, since Nameprep just uses the code points from the
> Unicode Consortium.
>
> The people at the Unicode Consortium would probably question whether
> these ideographs are equivalent in Kanji, Hanja, or older Vietnamese,
> so we need to prepare for that.
>
> Pros: it would be part of the Nameprep standard. And if NFKC accepts
> this, it would also solve the problem for other I18N efforts in
> future, not just IDN.
>
> Cons: we need to go through the review process in the Unicode
> Consortium.
>
> 2. Do this "optional folding" pre-Nameprep (Informational based)
>
> We would define these mappings within the IETF, but publish them as
> Informational, as an optional folding for Chinese systems only.
>
> Pros: We do this within the IETF, probably with assistance from
> other groups for review. It also opens Nameprep to localized foldings
> depending on other sets.
>
> Cons: It may be difficult to determine which optional folding rules
> should apply to a name. A Japanese (or Cyrillic) name could be
> entered using GBK, for example, and which rules do we apply? And who
> has priority to decide the folding mechanism: the registrant of the
> name or the user of the name? Is 中国.com a Chinese domain name
> "zhongguo.com" referring to China.com, or is it a Japanese domain
> name "chugoku.com" referring to another place in Japan?
>
> 3. Do this in the zonefile (Best Current Practice?)
>
> We would define these mappings in the zonefile for DNS, and hence
> regardless of how the user types the name in, they will end up with
> the same resource records.
>
> Pros: It is an operational issue for Chinese domain names.
> Registrants of names would control what is equivalent and what is
> not, and that may be defined as a policy on a per-zone basis.
>
> Cons: There would be multiple entries in the zonefiles, but this can
> be solved by a software implementation that generates these entries
> on loading.
>
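
As a rough illustration of option 3, the entries could be generated at load time from a per-zone variant table. Everything named here (`VARIANTS`, `expand`, `zone_entries`, the `example.` origin, and the 192.0.2.1 documentation address) is an assumption for the sketch, and the ACE encoding of the labels is omitted.

```python
from itertools import product
from typing import List

# Hypothetical per-zone policy table: each character maps to the
# characters the zone treats as equivalent to it.
VARIANTS = {
    "\u56fd": ("\u56fd", "\u570b"),  # SC form
    "\u570b": ("\u56fd", "\u570b"),  # TC form
}

def expand(label: str) -> List[str]:
    """Return every variant spelling of `label`, sorted by code point."""
    choices = [VARIANTS.get(ch, (ch,)) for ch in label]
    return sorted({"".join(combo) for combo in product(*choices)})

def zone_entries(label: str, origin: str, address: str) -> List[str]:
    """Emit one A record per variant, all with the same address."""
    return [f"{v}.{origin} IN A {address}" for v in expand(label)]

for line in zone_entries("\u4e2d\u570b", "example.", "192.0.2.1"):
    print(line)
```

The number of generated entries multiplies with every character that has variants, which is the cost named under "Cons" above.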
> Therefore, there are many solutions to the TC/SC problem. Which path
> to take would depend on the tsconv author's decision and the WG
> consensus. No solution is perfect and it is all engineering
> trade-offs.
>
> Speaking for myself, I would love to see this get done in (1)
> because it means it will solve the problem for other protocols in
> future, not just domain names. But I am not sure how to address the
> Unicode Consortium's concerns. I am strongly against approach (2)
> because it will solve the problem by creating other problems.
> Implementation experience has proven that optional foldings are a
> headache to maintain and that the folding to apply has to be
> 'guessed'. I believe (3) is a reasonable approach, although not a
> perfect solution either.
>
> > They are the same Chinese characters in pairs, but they are
> > coded with different Unicode code points.
> > Are they like the problem of " fi "?
> > And tell me why a A should be mapped to ASCII "a" or "A"?
>
> Problems like "fi" and "a" vs "A" are handled in Nameprep not because
> the IETF decided so, but rather because the code points from the
> Unicode Consortium have these mappings/normalizations.
>
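
The difference can be seen with the Unicode data shipped in Python's standard `unicodedata` module. The helper `approx_fold` is only a rough stand-in for Nameprep's map-then-normalize step, not the actual profile.

```python
import unicodedata

def approx_fold(s: str) -> str:
    """NFKC-normalize, case-fold, then NFKC-normalize again.

    A rough approximation of Nameprep-style folding, for illustration only.
    """
    folded = unicodedata.normalize("NFKC", s).casefold()
    return unicodedata.normalize("NFKC", folded)

# U+FB01 (fi ligature) and U+FF21 (fullwidth A) carry compatibility
# mappings in the Unicode data, so they fold to plain ASCII.
print(approx_fold("\ufb01"))
print(approx_fold("\uff21"))

# U+570B (TC) and U+56FD (SC) carry no such mapping, so they stay distinct.
print(approx_fold("\u570b") == approx_fold("\u56fd"))
```

The ligature and the fullwidth letter fold because the Unicode data says so. The TC/SC pair stays distinct for the same reason, which is why the question belongs with the Unicode Consortium.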
> The IETF is not in the business of defining code points because we
> are not script or language experts. We leave it to other groups who
> have more expertise and we reference their work. Thus, this question
> is most appropriately asked of the Unicode Consortium, not this WG.
>
> > I don't expect this WG to solve all the equivalence of TC/SC. I
> > just want to know what the guideline is to reduce the confusing
> > troubles in nameprep.
> > Why is such a small set of PRC simplified quick-written scripts
> > not a case folding problem?
>
> God knows I agree with you. :-)
>
> But this is a question which this WG has no answer for, since it
> references its code points from another place.
>
> -James Seng
>