Re: [idn] Chinese Domain Name Consortium (CDNC) Declaration
Dear Kenneth Whistler:
This is a typical example: everybody is right, and each
individual case is no problem, but when these are combined and mixed,
the result is unpredictable.
>
> > My friend gave me an example about CJK Unicode. It is
> > so ambiguous to me: how do I tell which one is a correct Chinese
> > character and which is not? In our handwriting, both forms of each
> > pair are used and mixed.
> >
> > 淸眞敎 U+6DF8 U+771E U+654E
> > 淸眞教 U+6DF8 U+771E U+6559
> > 淸真敎 U+6DF8 U+771F U+654E
> > 淸真教 U+6DF8 U+771F U+6559
> > 清眞敎 U+6E05 U+771E U+654E
> > 清眞教 U+6E05 U+771E U+6559
> > 清真敎 U+6E05 U+771F U+654E
> > 清真教 U+6E05 U+771F U+6559
>
> Huh? How is this contributing to closure on Last Call on
> the IDNA documents? And why is it cc'd to IESG and IAB?
>
> For those who may be mystified, this is the Chinese word for
> "Islam", qing1zhen1jiao4.
>
What is worst? When I view these eight records in IE,
Notepad, Word, and a telnet terminal, each of them displays the records
with a different script's font. Someone may say that this is a Microsoft
bug. Unfortunately, Win2K simply uses Unicode internally, and the local
phonetic IME can now input and select a mixture of the character forms of
Japan, China, and Taiwan together.
The basic assumption that Unicode characters are unified and
orthogonal is not totally correct.
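
As a small illustration of why these records stay distinct, the sketch below rebuilds the eight spellings from the three two-way choices quoted above and checks whether Unicode normalization merges any of them. It assumes NFKC is the relevant normalization form (the one Nameprep applies); the names `VARIANT_PAIRS`, `all_spellings`, and `code_points` are made up for this example.

```python
from itertools import product
import unicodedata

# The three variant pairs from the quoted list (first form, second form).
VARIANT_PAIRS = [
    ("\u6DF8", "\u6E05"),
    ("\u771E", "\u771F"),
    ("\u654E", "\u6559"),
]

def all_spellings(pairs):
    """Return every string formed by picking one form from each pair."""
    return ["".join(choice) for choice in product(*pairs)]

def code_points(text):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

spellings = all_spellings(VARIANT_PAIRS)

# Assumption: NFKC is the normalization of interest. These are unified
# ideographs with no decomposition mappings, so NFKC leaves them as they are.
normalized = {unicodedata.normalize("NFKC", s) for s in spellings}

for s in spellings:
    print(code_points(s))
print(len(spellings), "spellings,", len(normalized), "distinct after NFKC")
```

Normalization alone therefore does not reduce the eight spellings to one; any equivalence between them has to come from somewhere else.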
> The ordinary way this would appear in a PRC dictionary is:
>
> U+6E05 U+771F U+6559
>
> and not any of the other 7 permutations.
>
> In a more traditional dictionary as might be seen in Taiwan
> or Hong Kong, it might be printed:
>
> U+6DF8 U+771E U+6559
>
> and not any of the other 7 permutations.
>
> However, if you were using a Big-5 computer in Taiwan,
> you would use the same characters as for the PRC for
> this:
>
> U+6E05 U+771F U+6559
>
> and not any of the other 7 permutations. (though the
> fonts might vary in which glyph they show, in any case)
>
The restriction comes from the native-code input methods, which
mask out these characters. Even though, as in your example, GBK and SJIS
each contain both characters of a pair, one of the two is forbidden in
newspapers and in popular input methods. Pushing more characters into the
Domain Name System to support registration is not the correct approach.
> U+6E05 and U+771F, by the way, are examples of "traditional
> simplifications" reflecting handwritten forms, that
> predate the PRC systematic simplifications. The same two
> forms are also used in Japan.
>
> U+654E is another handwriting alternative for U+6559, but
> it is seldom seen in printed material. U+654E is used in
> the PRC, Taiwan, and in Japan alike.
>
> All 6 characters have G, T, and K sources in 10646, and
> 4 of them have J sources as well. So for this kind of
> overlap of forms, any suggestion to delete G-source-only
> characters from the allowed set does nothing at all.
>
> And lest this example be taken on its face value
> as indicating a problem in "CJK UNICODE", it should be noted
> that the presence of these alternate forms of the "same character"
> in Unicode is due to the same distinctions being made in
> legacy CJK character encodings in Asia. In particular,
> note the following mappings:
>
> For "GBK", Code Page 936 Simplified Chinese:
>
> 0x9C5B 0x6DF8 #CJK UNIFIED IDEOGRAPH
> 0xC7E5 0x6E05 #CJK UNIFIED IDEOGRAPH
> 0xB177 0x771E #CJK UNIFIED IDEOGRAPH
> 0xD5E6 0x771F #CJK UNIFIED IDEOGRAPH
> 0x949C 0x654E #CJK UNIFIED IDEOGRAPH
> 0xBDCC 0x6559 #CJK UNIFIED IDEOGRAPH
>
> And for "Shift-JIS", Code Page 932 Japanese:
>
> 0xEDE4 0x6DF8 #CJK UNIFIED IDEOGRAPH
> 0xFB43 0x6DF8 #CJK UNIFIED IDEOGRAPH
> 0x90B4 0x6E05 #CJK UNIFIED IDEOGRAPH
> 0xE1C1 0x771E #CJK UNIFIED IDEOGRAPH
> 0x905E 0x771F #CJK UNIFIED IDEOGRAPH
> 0xEDB1 0x654E #CJK UNIFIED IDEOGRAPH
> 0xFACD 0x654E #CJK UNIFIED IDEOGRAPH
> 0x8BB3 0x6559 #CJK UNIFIED IDEOGRAPH
>
> So if you are working on a Windows system in either of
> these legacy code pages, in China or Japan, you
> already have the same options for representational
> ambiguity, without invoking Unicode at all.
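
The point of the two tables can be checked mechanically. The sketch below copies the rows quoted above into plain Python lists (no codec is consulted, so it only reflects the quoted data) and reports which variant pairs are fully present in each code page, and which code points are reachable from more than one legacy code. The names `CP936`, `CP932`, `pairs_with_both_forms`, and `duplicate_targets` are invented for the example.

```python
from collections import Counter

# Rows copied from the quoted tables: (legacy code, Unicode code point).
CP936 = [
    (0x9C5B, 0x6DF8), (0xC7E5, 0x6E05),
    (0xB177, 0x771E), (0xD5E6, 0x771F),
    (0x949C, 0x654E), (0xBDCC, 0x6559),
]
CP932 = [
    (0xEDE4, 0x6DF8), (0xFB43, 0x6DF8), (0x90B4, 0x6E05),
    (0xE1C1, 0x771E), (0x905E, 0x771F),
    (0xEDB1, 0x654E), (0xFACD, 0x654E), (0x8BB3, 0x6559),
]

# The three variant pairs discussed in this thread.
VARIANT_PAIRS = [(0x6DF8, 0x6E05), (0x771E, 0x771F), (0x654E, 0x6559)]

def pairs_with_both_forms(table):
    """Variant pairs for which the table encodes both forms."""
    covered = {ucs for _, ucs in table}
    return [p for p in VARIANT_PAIRS if p[0] in covered and p[1] in covered]

def duplicate_targets(table):
    """Code points that more than one legacy code maps to."""
    counts = Counter(ucs for _, ucs in table)
    return sorted(cp for cp, n in counts.items() if n > 1)

for name, table in (("CP936", CP936), ("CP932", CP932)):
    print(name, len(pairs_with_both_forms(table)), "pairs with both forms,",
          [f"U+{cp:04X}" for cp in duplicate_targets(table)], "duplicated")
```

On the quoted data, both code pages carry both forms of all three pairs, and the Shift-JIS table additionally maps two different legacy codes to each of U+6DF8 and U+654E.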
Right, that is a problem of these legacy code pages, but there the
characters are used in printed publications or electronic books, which do
not need to differentiate them precisely. That does not work for a name
identifier.
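
To make the identifier concern concrete, here is a minimal sketch of what happens when names are compared code point by code point versus after folding variants together. Everything in it is an assumption for illustration: the table `PREFERRED` (which arbitrarily picks the second form of each pair in this thread), the function names, and the idea of a registry comparing folded labels are hypothetical and are not taken from IDNA or from any registry's policy.

```python
# Hypothetical folding table: variant form -> preferred form.
# Built only from the three pairs in this thread; a real table would be far larger.
PREFERRED = {
    "\u6DF8": "\u6E05",
    "\u771E": "\u771F",
    "\u654E": "\u6559",
}

def fold(label):
    """Replace each variant character by its preferred form (illustrative)."""
    return "".join(PREFERRED.get(ch, ch) for ch in label)

def collides_raw(candidate, registered):
    """Exact code point comparison, as in a plain string lookup."""
    return candidate in registered

def collides_folded(candidate, registered):
    """Comparison after folding both sides through the hypothetical table."""
    return fold(candidate) in {fold(name) for name in registered}

registered = {"\u6E05\u771F\u6559"}   # U+6E05 U+771F U+6559
candidate = "\u6DF8\u771E\u654E"      # U+6DF8 U+771E U+654E

print(collides_raw(candidate, registered))     # False
print(collides_folded(candidate, registered))  # True
```

With exact comparison the two labels are unrelated names that different parties could register; only with some agreed folding are they treated as the same identifier.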
Even now, the CJK region does not have enough experience in using
its characters in multilingual domain names under ccTLDs; using mixed
characters in gTLDs will be a big step in DNS history.
L.M.Tseng