Re: [idn] Re: Unicode is not usable in international context.
> People have posted cases where the number of TC characters that make
> up a word is different from the number of SC characters that make up
> the same word. People have also posted cases where the number of
> characters remains the same but the mapping depends on context.
You are talking about combinations of glyphs, which we call
words, or labels in the IDN context. A label may sometimes contain
only one character, and it is easier for us to discuss the issue
without having to show the glyphs on our screens here on this list.
This may be the reason for the above confusion.
Let me try to elaborate a little on Chinese character processing,
taking one glyph at a time, as in the Unicode table.
There are 20,000+ commonly used symbols worldwide, as collected
in Plane 0 of the UCS. Mainland China uses about 7,000 regularly,
and Taiwan uses about 13,000 regularly. We call these the
frequently used characters.
Among these frequently used characters, there are always
semantic differences between any two given characters, due to
the history and locality of their usage, as you can imagine.
This creates the need to organize the characters, and thus for
dictionary editing and standardization work, throughout the history
of the written Chinese language, especially once the computer
comes along.
The first classification is certainly those characters that are
distinct both semantically and in written form. Many characters
that are semantically distinct but similar in form fall into a
carefully explained category in the education sector, as well as in
written-language criticism. This is analogous to Latin spell
checking as an educational activity, but it is not an
equivalent-symbol-set concern of the kind discussed here on this
list. As far as input is concerned, this is a spell-checking feature.
The second classification is those characters whose meanings
overlap but are not the same. We translate this category as
synonyms, or "same meaning characters". But they are not a
thesaurus; thesauri were introduced into China only in recent
years. These characters are not the subject of any unification or mixing.
Their correct usage in a text has to be differentiated by context,
so word dictionaries are used to help; this is an AI feature in
editor software.
The third classification of characters is those that are
semantically "identical" but have many different forms, accumulated
over the long history of preserving these characters. The majority
of the characters in the UCS beyond the above 20,000 frequently used
characters belong to this category, and the majority of TC/SC pairs
also belong to this category. As I have said, if you look for some
detail that differs, you will always succeed among these characters.
This third classification is what concerns us regarding display
preference, and regarding the possible inclusion of more character
forms beyond the frequently used character set in, or their exclusion
from, such an equivalence set, which has been addressed by Japanese users.
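To make the third category concrete, here is a minimal sketch in
Python of such an equivalence table, keyed by code point. The TC/SC
pairs are real examples; the table itself is only illustrative, not
any standard mapping:

    # Hypothetical per-character table: many written forms, one meaning.
    VARIANT_FORMS = {
        "\u570B": "\u56FD",  # TC "country" -> SC form
        "\u9F8D": "\u9F99",  # TC "dragon"  -> SC form
        "\u99AC": "\u9A6C",  # TC "horse"   -> SC form
        "\u9580": "\u95E8",  # TC "gate"    -> SC form
    }

    def canonical_form(ch: str) -> str:
        # Map any listed variant to one representative form.
        return VARIANT_FORMS.get(ch, ch)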
Notice that I said the majority of TC/SC pairs belong to this
category. This is why you have heard that some users do not agree
with this "identical" classification. It is a fact of life that the
Han user community has to be precise about which symbol is in which
set, so that they can have a standard to work from.
Excluding this equivalence set from the basic [nameprep] profile
would definitely mark the failure of IDN and cause more "trademark"
conflicts down the road.
> People have stated that conversion between TC and SC requires a
> dictionary of words, rather than a table of characters. All these
> show that TC/SC is analogous to a spelling difference.
Correct. That deals with the small number (on the scale of
10 vs. 2000) of TC/SC cases at the input and display level, which
should not overtake the fact that TC/SC pairs have to be equivalent
identifiers in [nameprep].
As for using characters as identifiers in IDN, the job we have
to be concerned with is to reduce these semantically "identical"
characters from whatever number down to a "no trademark conflict"
level of clearance, yielding a viable symbol set which we can permit
in IDN for identifier matching. In this sense, it is like the
case-insensitive treatment of Latin symbols. Yes, we do want
uppercase too, but at the identifier level they are the same!
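A minimal sketch of this analogy, reusing the hypothetical
canonical_form() table from above: just as case folding makes "A"
and "a" match as identifiers, a TC/SC equivalence table would make
variant Han forms match:

    def fold_label(label: str) -> str:
        # Fold Latin case, then hypothetical TC/SC variants.
        return "".join(canonical_form(ch) for ch in label.casefold())

    assert fold_label("Example") == fold_label("EXAMPLE")  # Latin case
    assert fold_label("\u570B") == fold_label("\u56FD")    # TC matches SC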
> I don't claim that TC/SC conversion or equivalence is not a problem.
I hope that the above explanation has shown a feasible solution to
this problem, to your satisfaction.
> Neither do I claim that the potential confusion between <GREEK
> CAPITAL LETTER ALPHA> and <LATIN CAPITAL LETTER A> is
> not a problem.
This is a problem for IDN. It is the opposite of the TC/SC
equivalence problem. Because these symbols are picked up, pasted, or
typed from a mixture of applications and user interfaces, any one of
them can be the bad guy hidden from someone's eye, and the
machines only know about bits. If you add more forms of encoding,
such as UTF-8/16 or input keystroke sequences, the problem
can escalate quickly.
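This is easy to demonstrate in Python; the two letters render
identically in most fonts, yet compare as different bits:

    greek_alpha = "\u0391"  # GREEK CAPITAL LETTER ALPHA
    latin_a = "\u0041"      # LATIN CAPITAL LETTER A
    print(greek_alpha == latin_a)        # False: machines only see bits
    print(greek_alpha.encode("utf-8"))   # b'\xce\x91'
    print(latin_a.encode("utf-8"))       # b'A'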
The solutions that I can think of at this moment are two:
1. Unification of symbols, like CJK unification, with an equivalent
symbol set defined;
2. A transparent language tag to enforce that each label is
consistent with its tag throughout the system, including DNS (see
the sketch below).
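Here is a hedged sketch of solution 2; the tag names and the
name-based script test are illustrative only, not any standardized
mechanism:

    import unicodedata

    def consistent_with_tag(label: str, tag: str) -> bool:
        # Check every character of a label against its declared tag.
        if tag == "el":   # Greek
            return all("GREEK" in unicodedata.name(ch, "") for ch in label)
        if tag == "zh":   # Han
            return all("CJK UNIFIED" in unicodedata.name(ch, "")
                       for ch in label)
        return True       # other tags not modeled in this sketch

    print(consistent_with_tag("\u0391\u0392", "el"))  # True: all Greek
    print(consistent_with_tag("A\u0392", "el"))       # False: Latin A mixed in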
If we work out the CJK-in-IDN problem, then this will be a piece
of cake at the end of our IDN banquet :-)
> Neither
> do I claim that the potential confusion between English "theatre"
> and American "theater" or English "lift" and American "elevator"
> are not problems. But I believe that all these problems are
> outside the scope of IDN.
Correct, this problem is outside the scope of IDN.
Regards,
Liana Ye