[idn] Re: Characters, scripts and words (was: Re: nameprep and others: hangeul char)
--On Tuesday, August 28, 2001 12:48 PM -0700
liana.ydisg@juno.com wrote:
>> > nothing is going to prevent labels of ECHAK, or any of its
>> > permutations.
>> > ...
>> Why is it the job of an iDNS standard to prevent users from
>> defining silly strings, as long as the strings do not break
>> the DNS?
>>...
>> At 09:35 AM 8/28/2001, Mark Davis wrote:
>> > I believe, as I have said before, that it is too difficult
>> > to separate out the legitimate mixes of scripts and symbols
>> > from the 'questionable' ones for us to do anything. And
>> > adding language tags would simply make it more complicated,
>> > not less.
>>
>> exactly correct.
>>
>> Too difficult and, to paraphrase Jeff Case, simply not our
>> problem.
>...
> Do you mean nobody else can come up with a feasible
> solution within the list?
Liana,
Let me risk answering for Dave as well as for myself.
The working group (and sometimes its various design teams) keeps
falling into the trap of believing that one can make
language-specific rules for what characters can be handled in
the DNS and how they can (or should) be handled. By and large,
the DNS will work well as long as we can examine, and match, one
character at a time, rather than trying to do things by words.
One of the advantages of Unicode is that it is one character
set, and not a collection of independent different ones with
per-script or per-language tagging (identification) information.
Despite the considerable deployment of character encoding
systems that use coding system identification methods (several
of the ISO 2022-based techniques for representing Japanese come
to mind), we have found that such systems have very poor
interchange properties among disparate systems. But Unicode's
weakness is also that it is one character set: the designers
have made many decisions to not include some characters in areas
belonging to a particular script because those characters appear
elsewhere. Conversely, they have permitted duplication in other
areas. As far as I can tell, all of those decisions were
rational, in the sense that there were good reasons for them.
Whether they were right or wrong at the time is a separate
question, but one that is largely irrelevant to our needs: there
is simply no alternative to Unicode on the table that could
serve as the basis for a multilingual (actually multi-script)
interchange format.
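(The unification and duplication decisions I am referring to can be
seen directly in the code charts. Here is a minimal sketch using
Python's standard unicodedata module; the particular code points are
my own illustrative choices, not ones anyone has raised in this
thread.)

```python
# Illustrative sketch of Unicode as "one character set":
# unification in some areas, duplication in others.
import unicodedata

# Unification: a single code point serves Chinese, Japanese, and
# Korean text alike; the string carries no language identification.
han = "\u4e2d"
print(unicodedata.name(han))  # CJK UNIFIED IDEOGRAPH-4E2D

# Duplication: two code points for what renders as the same glyph.
angstrom = "\u212b"   # ANGSTROM SIGN
a_ring = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
print(angstrom == a_ring)                                # False
print(unicodedata.normalize("NFC", angstrom) == a_ring)  # True
```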
So we need to keep reminding ourselves that we are dealing with
character sequences, with the characters taken one at a time.
Whether the results have any linguistic meaning, whether a label
can be pronounced, whether the string makes any cognitive sense
at all, are all irrelevant. And, if we try to ignore that
irrelevance, we end up, not with "silly strings", but in
extremely silly states in which we would need to worry about,
e.g., the "meaning" of an Arabic non-joining zero-width
separator in the middle of a string of Chinese characters, or
of a Korean filler in the middle of a string of Roman-based
ones. It
also implies that a single label might contain a mix of Chinese,
Korean, and Japanese "words", using characters (code points)
that are not differentiated among the three languages. Trying
to translate or transcode such a string on the assumption that
it is homogeneous in one of the languages could easily be the
beginning of big trouble.
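(To make that concrete with a hypothetical example of my own: a
label with a zero-width non-joiner dropped between two Han
ideographs is a perfectly well-formed character sequence to an
exact, character-at-a-time matcher, even though it is linguistic
nonsense.)

```python
# Hypothetical example: character-at-a-time matching treats a ZWNJ
# between Han ideographs as just another code point.
import unicodedata

plain = "\u4e2d\u6587"       # two CJK unified ideographs
odd = "\u4e2d\u200c\u6587"   # same ideographs, U+200C ZWNJ between

# Exact matching sees two distinct labels; nothing in the matching
# process flags the second as linguistically meaningless.
print(plain == odd)                # False
print(unicodedata.name("\u200c"))  # ZERO WIDTH NON-JOINER
```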
We've had a lot of names for this problem over the last year or
so. Sometimes we call it the "identifier versus word" problem,
sometimes a "script versus language" one, sometimes other
things. Each of these problems is a bit different, but all have
the fundamental problem outlined above at their core.
> Or you do not care about any solution?
I can't speak for others, but I care, and care deeply. The
difficulty is that, for all of these problems that are really
language-specific (or even specific to the use of particular
scripts in particular languages), it seems to me that there are
four types of things we can do.
(i) We can decide the problems are not worth trying to
solve. I can't accept that, I don't think you can either,
but I can imagine that some sensible people would take that
position.
(ii) We can try to trick the DNS into doing these things.
We've seen many solutions of this type suggested to the WG
at one time or another, and I am sure there are many others
out there. Different approaches involve clever heuristics,
naming restrictions that bind the use of particular
languages to particular TLDs, tricks for sneaking in
language tags, and so on. My belief is that none of them
will work very well because the DNS still requires exact
matching and natural languages are an inherently fuzzy
business.
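(A hedged sketch of that exact-matching point, with a
similar-looking pair of my own invention: DNS comparison is
octet-by-octet, with only ASCII case folding, so two labels a human
would read identically are simply different keys.)

```python
# Sketch: exact matching has no notion of "close enough".
latin = "bank"        # all Latin letters
mixed = "b\u0430nk"   # U+0430 CYRILLIC SMALL LETTER A in slot 2

print(latin == mixed)                        # False
print(latin.casefold() == mixed.casefold())  # still False
```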
(iii) We can try to localize the problem somehow --
preprocessors that are installed in particular environments,
keyword systems hooked into browsers, search mechanisms that
are semi-invisible to the users and that use non-DNS
databases.
(iv) Or we can try to layer something above the DNS that
deals with the "language" problems while the DNS deals with
the "character" ones. Many systems like this are possible,
and we call them by many names: keywords (although some of
the systems in the third group are also keyword ones),
word-mapping and translation, searching, and so on. The
difference between these approaches and the third is that it
appears that we can make them work globally. For example,
if you travel to a far away place, and ask the same question
as you would at home -- admittedly a more complex question
than you would ask the DNS -- you will get the same answer.
And, given appropriate input and rendering software and
hardware, these systems should be able to access the same
naming structures from all over the world and regardless of
where those names appear in the DNS. The systems in the
third group typically don't have these properties.
Now, both the way I've presented the above and the things I've
been saying for the last many months probably make it clear what
I believe about this. I don't think anyone who takes the
problem seriously is giving up; we are just trying to recast and
relocate it technically into a layer and environment in which we
have better tools and more information than the DNS offers.
john