Re: [idn] Re: Characters, scripts and words (was: Re: nameprep and others: hangeulchar)
Thanks for the overview of the wg history. I'd
like to put the script tagging idea up for discussion.
In addition to a script tag definition,
a tag has tagged range, tagged conversion,
tagged reversion, and tagged display modules
associated with it. Tagged input belongs to the
registration part of IDN, on which I have only heard
discussion of font confusion. The following is an excerpt
from my to-be-submitted draft. I'd like to hear critiques.
Liana
2.1 Tagged Range
Name tagging is the primary method in IDN to police illegitimate use
of [ISO 10646] codepoints. Trade names are created by people who
may need to use several scripts in one trade name, for example
Japanese users. However, it is unlikely that any legitimate user
needs symbols from outside their native scripts mixed into their
native names. With tagged range, the IDN local developer MUST define
its supported scripts in terms of code blocks of [ISO 10646], and
exclude any codepoint outside its tagged range, as well as any
non-IDN character inside its tagged range, in order to catch unintended
codepoints. Some developers may provide a friendlier user interface
for certain user groups than others, while tagged range defines the
limited scope within which each tagged block is policed.
A tagged range MUST have at least one non-zero code block as its primary
range, and it is RECOMMENDED to test for operational complexity before
increasing the number of associated blocks.
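To make the check concrete, here is a minimal sketch of how a tagged-range
validation might look. The block boundaries are taken from [ISO 10646]/Unicode
for a hypothetical Japanese registry; the function names and the choice of
blocks are illustrative only, not part of the draft:

```python
# Illustrative sketch of a tagged-range check (names are hypothetical).
# A "tagged range" is the set of [ISO 10646] code blocks a registry supports.

# Example primary range for a hypothetical Japanese registry
# (block boundaries per ISO 10646 / Unicode):
TAGGED_RANGE = [
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]

def in_tagged_range(cp: int) -> bool:
    """True if the codepoint falls inside one of the tagged blocks."""
    return any(lo <= cp <= hi for lo, hi in TAGGED_RANGE)

def validate_label(label: str) -> bool:
    """Reject any codepoint outside the tagged range.

    A full implementation would also have to reject non-IDN characters
    (unassigned or prohibited codepoints) *inside* the range, which this
    sketch omits.
    """
    return all(in_tagged_range(ord(ch)) for ch in label)
```

For example, `validate_label("とうきょう")` passes (all Hiragana), while
`validate_label("tokyo")` is rejected because Basic Latin lies outside
this registry's tagged range.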
On Tue, 28 Aug 2001 19:36:08 -0400 John C Klensin <klensin@jck.com>
writes:
> --On Tuesday, August 28, 2001 12:48 PM -0700
> liana.ydisg@juno.com wrote:
>
> >> > nothing is going to prevent labels of ECHAK, or any of its
> >> > permutations.
> >> > ...
> >> Why is it the job of an iDNS standard to prevent users from
> >> defining silly
> >> strings, as long as the strings do not break the DNS?
> >>...
> >> At 09:35 AM 8/28/2001, Mark Davis wrote:
> >> > I believe, as I have said before, that it is too difficult
> >> > to
> >> separate out
> >> > the legitimate mixes of scripts and symbols from the
> >> > 'questionable'
> >> ones for
> >> > us to do anything. And adding language tags would simply
> >> > make it
> >> more
> >> > complicated, not less.
> >>
> >> exactly correct.
> >>
> >> Too difficult and, to paraphrase Jeff Case, simply not our
> >> problem.
> >...
> > Do you mean nobody else can come up with a feasible
> > solution within the list?
>
> Liana,
>
> Let me risk answering for Dave as well as for myself.
>
> The working group (and sometimes its various design teams) keeps
> falling into the trap of believing that one can make
> language-specific rules for what characters can be handled in
> the DNS and how they can (or should) be handled. By and large,
> the DNS will work well as long as we can examine, and match, one
> character at a time, rather than trying to do things by words.
>
> One of the advantages of Unicode is that it is one character
> set, and not a collection of independent different ones with
> per-script or per-language tagging (identification) information.
> Despite the considerable deployment of character encoding
> systems that use coding system identification methods (several
> of the ISO 2022-based techniques for representing Japanese come
> to mind), we have found that such systems have very poor
> interchange properties among disparate systems. But Unicode's
> weakness is also that it is one character set: the designers
> have made many decisions to not include some characters in areas
> belonging to a particular script because those characters appear
> elsewhere. Conversely, they have permitted duplication in other
> areas. As far as I can tell, all of those decisions were
> rational, in the sense that there were good reasons for them.
> Whether they were right or wrong at the time is a separate
> question, but one that is largely irrelevant to our needs: there
> is simply no alternative to Unicode on the table that could
> serve as the basis for a multilingual (actually multi-script)
> interchange format.
>
> So we need to keep reminding ourselves that we are dealing with
> character sequences, with the characters taken one at a time.
> Whether the results have any linguistic meaning, whether a label
> can be pronounced, whether the string makes any cognitive sense
> at all, are all irrelevant. And, if we try to ignore that
> irrelevancy, we get ourselves, not into "silly strings" but to
> extremely silly states in which we would need to worry about,
> e.g., the "meaning" of an Arabic non-joining zero-width
> separator in the middle of a string of Chinese characters or a
> Korean filler in the middle of a string of Roman-based ones.  It
> also implies that a single label might contain a mix of Chinese,
> Korean, and Japanese "words", using characters (code points)
> that are not differentiated among the three languages. Trying
> to translate or transcode such a string on the assumption that
> it is homogeneous in one of the languages could easily be the
> beginning of big trouble.
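
[The unification described above is easy to see in practice: Unicode names
Han codepoints without committing to any one language. A small Python
illustration, not part of the original mail:]

```python
import unicodedata

# U+4E2D is used in Chinese, Japanese, and Korean words, but Unicode
# assigns it a single, language-neutral name -- nothing in the codepoint
# says which language a label containing it is "in".
print(unicodedata.name('\u4e2d'))  # CJK UNIFIED IDEOGRAPH-4E2D
```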
>
> We've had a lot of names for this problem over the last year or
> so. Sometimes we call it the "identifier versus word" problem,
> sometimes a "script versus language" one, sometimes other
> things. Each of these problems is a bit different, but all have
> the fundamental problem outlined above at their core.
>
> > Or you do not care about any solution?
>
> I can't speak for others, but I care, and care deeply. The
> difficulty is that, for all of these problems that are really
> language-specific (or even specific to the use of particular
> scripts in particular languages), it seems to me that there are
> four types of things we can do.
>
> (i) We can decide the problems are not worth trying to
> solve. I can't accept that, I don't think you can either,
> but I can imagine that some sensible people would take that
> position.
>
> (ii) We can try to trick the DNS into doing these things.
> We've seen many solutions of this type suggested to the WG
> at one time or another, and I am sure there are many others
> out there. Different approaches involve clever heuristics,
> naming restrictions that bind the use of particular
> languages to particular TLDs, tricks for sneaking in
> language tags, and so on. My belief is that none of them
> will work very well because the DNS still requires exact
> matching and natural languages are an inherently fuzzy
> business.
>
> (iii) We can try to localize the problem somehow --
> preprocessors that are installed in particular environments,
> keyword systems hooked into browsers, search mechanisms that
> are semi-invisible to the users and that use non-DNS
> databases.
>
> (iv) Or we can try to layer something above the DNS that
> deals with the "language" problems while the DNS deals with
> the "character" ones. Many systems like this are possible,
> and we call them by many names: keywords (although some of
> the systems in the third group are also keyword ones),
> word-mapping and translation, searching, and so on.  The
> difference between these approaches and the third is that it
> appears that we can make them work globally. For example,
> if you travel to a far away place, and ask the same question
> as you would at home -- admittedly a more complex question
> than you would ask the DNS -- you will get the same answer.
> And, given appropriate input and rendering software and
> hardware, these systems should be able to access the same
> naming structures from all over the world and regardless of
> where those names appear in the DNS. The systems in the
> third group typically don't have these properties.
>
> Now, both the way I've presented the above and the things I've
> been saying for the last many months probably make it clear what
> I believe about this. I don't think anyone who takes the
> problem seriously is giving up; we are just trying to recast and
> relocate it technically into a layer and environment in which we
> have better tools and more information than the DNS offers.
>
> john