Re: [idn] opting out of SC/TC equivalence
Hi, James,
Thank you for the implementation analysis. It is very
helpful in bringing this discussion down to a more
concrete level. I have several thoughts on this.
The SC character set has been used for decades and has gone
through extensive nationwide testing in China. SC is stable and is
properly reflected in the Unicode standard. The question is one of
definition: is TC/SC a case folding? It seems that this WG has no
consensus on this definition yet.
The primary reason, I guess, is that everyone knows CJK is a
very hot sweet potato and quite easy to get burned by, but it
smells so good that none of us wants to give it up at this
moment. :-))
One solution is to push the hot sweet potato to the zone file, but as
you have pointed out, only the Chinese zones will do this. If I am in
the US, I may never get to see Chinese characters in my domain name.
However, the implementation you have described can be
done in [nameprep] with Unicode as the primary reference
code, provided the character mapping issue has been settled with
your option 1. And so we are back to the case folding definition.
I offer my definition: case folding takes a key, does one lookup
in a data table, and obtains another key from that search.
Example:

    Unicode points    folds to
    col-1             col-2
    A                 a           (case 1)
    TC-1              SC-1        (case 2)
    IPA letter ts     t s         (case 3)
    Kanji-1           Kana-1      (case 4)
    Kanji-1           Kana-2
    Kanji-1           Kana-3
    Hanja-1           Hangul-1    (case 5)
    Hanja-2           Hangul-1
    Hanja-3           Hangul-1

Case 1 is the current [nameprep],
Case 2 is the long-lasting discussion of TC/SC,
Case 3 is a possible mnemonic ACE assignment,
Cases 4 and 5 show the extent of what follows from allowing
TC/SC as case folding in [nameprep].
With the above list, it is questionable whether all of these are
case folding, because from the col-1 key Kanji-1 it takes
three lookups to get Kana-3 back in Case 4, and three
lookups from Hangul-1 to reach Hanja-3 in Case 5.
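
To make the one-lookup test concrete, here is a minimal sketch in
Python. It assumes the example table is held as plain (col-1, col-2)
pairs; the row labels are the placeholders from the table, not real
code points, and the helper name ambiguous_keys is only my own
illustration. It reports every key that would need more than one
lookup, in either direction.

```python
from collections import defaultdict

# Placeholder rows copied from the example table (labels, not real code points).
ROWS = [
    ("A", "a"),
    ("TC-1", "SC-1"),
    ("IPA letter ts", "t s"),
    ("Kanji-1", "Kana-1"),
    ("Kanji-1", "Kana-2"),
    ("Kanji-1", "Kana-3"),
    ("Hanja-1", "Hangul-1"),
    ("Hanja-2", "Hangul-1"),
    ("Hanja-3", "Hangul-1"),
]


def ambiguous_keys(rows, reverse=False):
    """Return the keys that map to more than one value.

    With reverse=False the key is col-1 and the value is col-2;
    with reverse=True the table is searched from col-2 back to col-1.
    """
    table = defaultdict(list)
    for col1, col2 in rows:
        key, value = (col2, col1) if reverse else (col1, col2)
        if value not in table[key]:
            table[key].append(value)
    return {key: values for key, values in table.items() if len(values) > 1}


forward = ambiguous_keys(ROWS)                 # col-1 -> col-2
backward = ambiguous_keys(ROWS, reverse=True)  # col-2 -> col-1
print("forward :", forward)
print("backward:", backward)
```

Under these assumptions only Kanji-1 is ambiguous going forward, and
only Hangul-1 going backward; cases 1 to 3 pass the one-lookup test
in both directions.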
The problem is not so much with TC/SC, even if they are a
one-to-one mapping; it is with Kanji and Hanja. Using
transliteration only, as in Case 3, is obviously not good enough,
as you have pointed out in your option 3. Since what IDN
really wants is an ACE to go into DNS, we can obtain
such a mnemonic ACE by adding another column to
this table, thus:

    col-1            col-2       col-3 ACE
    A                a           a                            (case 1)
    White Space      nil         nil
    SC-1                         Pinyin-1+part1+part2+part3   (case 2)
    TC-1             SC-1        Pinyin-1+part1+part2+part3
    IPA letter ts    t s         ts                           (case 3)
    Kana-1                       Romaji-1
    Kanji-1          Kana-1      Romaji-1+part1+part2         (case 4)
    Kanji-1          Kana-2      Romaji-2+part1+part2         (ruled out)
    Kanji-1          Kana-3      Romaji-3+part1+part2         (ruled out)
    Hangul-1                     Hangul-1
    Hanja-1          Hangul-1    Hangul-1+part-1.1            (case 5)
    Hanja-2          Hangul-1    Hangul-1+part-2.1
    Hanja-3          Hangul-1    Hangul-1+part-3.1

Now any key search from col-1 will get a unique ACE from col-3,
except that the Kanji-1 mappings to Kana-2 and Kana-3 have to be
ruled out, because case folding allows one search only. If we are
to assign an identifier to it, the user has to accept one sound
per Kanji within its Romaji system. This is the "language" aspect
that case folding cannot solve. Well, the alternative is a
compressed ACE. But from col-3 to col-1 there is always one
successful search, so any key from col-3 will get a display code
from col-1. This table and search stay within the purely technical
"case folding" definition given above.
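
As a rough sketch of the two searches over the three-column table,
again in Python: the rows are the placeholder labels from the table,
the ruled_out flag and the names build_tables, to_ace and to_display
are my own illustration, and I am assuming that a blank col-2 and a
nil ACE can both be represented as None. The reverse table keeps a
list per ACE, because SC-1 and TC-1 carry the same ACE in the table
as written.

```python
# Rows are (col-1, col-2, col-3 ACE, ruled_out); all names are placeholders.
ROWS = [
    ("A", "a", "a", False),
    ("White Space", None, None, False),  # "nil" in the table
    ("SC-1", None, "Pinyin-1+part1+part2+part3", False),
    ("TC-1", "SC-1", "Pinyin-1+part1+part2+part3", False),
    ("IPA letter ts", "t s", "ts", False),
    ("Kana-1", None, "Romaji-1", False),
    ("Kanji-1", "Kana-1", "Romaji-1+part1+part2", False),
    ("Kanji-1", "Kana-2", "Romaji-2+part1+part2", True),
    ("Kanji-1", "Kana-3", "Romaji-3+part1+part2", True),
    ("Hangul-1", None, "Hangul-1", False),
    ("Hanja-1", "Hangul-1", "Hangul-1+part-1.1", False),
    ("Hanja-2", "Hangul-1", "Hangul-1+part-2.1", False),
    ("Hanja-3", "Hangul-1", "Hangul-1+part-3.1", False),
]


def build_tables(rows):
    """Build col-1 -> ACE and ACE -> [col-1, ...] from the rows not ruled out.

    Raises ValueError if a col-1 key would still need more than one lookup.
    """
    to_ace = {}
    to_display = {}
    for col1, _col2, ace, ruled_out in rows:
        if ruled_out:
            continue
        if col1 in to_ace:
            raise ValueError(f"{col1!r} has more than one ACE")
        to_ace[col1] = ace
        if ace is not None:
            to_display.setdefault(ace, []).append(col1)
    return to_ace, to_display


to_ace, to_display = build_tables(ROWS)
print(to_ace["Kanji-1"])                # one search from col-1 to col-3
print(to_display["Hangul-1+part-3.1"])  # one search from col-3 back to col-1
```

If the two ruled-out rows are put back in, build_tables raises an
error for Kanji-1, which is the one-search limit showing up again.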
What if Hanja-1 = Kanji-1? This can be checked at registration
time, in addition to its ACE, which is checked with DNS. I am
not certain whether IDN registration is part of this WG's charter
goals, so I shall stop discussing its relation with
[nameprep] or any related issues, and hope that no one gets a
hot sweet potato burn.
One issue with mnemonic ACE is its value assignment. Unicode
has given each symbol a Latin name, but the name
is not intended to be used in case folding, and as a purely
technical WG we can assign any value to an identifier. How
much freedom do we have in such an assignment?
In any case, I agree with you that this WG needs a list to collect
input from the users and engineers of each SCRIPT regarding the
users' wish list. I've come up with the following questionnaire
for comments.
1. Your script name (refer to a name defined in Unicode):
2. How familiar are you with the script? How often do you
use the script?
3. What do you expect your IDN hostname to look like?
4. Is your script used interchangeably with another script?
If yes, then which ones? Are they used as a mixed string?
5. How does your script deal with foreign concepts?
6. How does your script deal with foreign sounds?
7. Do you wish your script could mix with other scripts?
If you do, then which script do you think is most useful
for your script to accommodate?
8. As an engineer, what part of the use of your script is
relatively universal, such that there will be no serious
controversy against it being fixed by IDN implementation?
9. As an engineer, what part of the use of your script is
often controversial, such that IDN shall avoid it in
IDN implementation?
Liana Ye