
[idn] The "script" fallacy



Hi

I've been trying to figure out how to formulate this for some
time, and only understood the full problem recently.  My
apologies for both.

Many of the discussions in this working group in recent weeks
and months have focused on the processing of one script or
another.  We've discussed languages written with more than one
character set (Serbo-Croatian and maybe Chinese), rather
specific matching rules for case and things that might be
deemed analogous to case (French in different countries,
perhaps Chinese, etc.), and the need for special break
characters to make writing particular languages reasonable.
Each of these discussions, taken by itself, is reasonable and
helpful.  But I think we have, in the process, lost track of
some principles.

(i) When the Hostname rules were written, they were set up to
permit mnemonic strings to be represented.  Yes, those
strings were limited to ASCII alphabetic characters, plus
digits, plus exactly one special character (hyphen), plus,
later, the period (".") special character used as a DNS label
delimiter.  No control characters, no spacing characters, no
non-alphanumeric symbols other than that hyphen.  As we sweep
in the alphabets and ideograms from Unicode, we need to
review, I believe, the spirit and implications of these
original rules and see if we wish to retain them.  If we do,
then we probably need to take another look at stringprep
(nameprep) to see if, e.g., additional punctuation should be
excluded from being used in names.  If we do not, then we
should review whether the punctuation characters and other
symbols in the "ASCII" subset of Unicode should continue to be
excluded.
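
To make the original rules concrete, here is a minimal sketch of a
check for a single traditional hostname label: ASCII letters, digits,
and hyphen, with hyphens only in the interior.  The function name
is_ldh_label is hypothetical, and the 63-octet limit comes from the
general DNS label length rule rather than from anything said above.

```python
import re

# One label under the traditional hostname rules (as relaxed by RFC 1123):
# ASCII letters and digits, with hyphens allowed only in the interior.
# The {0,61} bound keeps the whole label within the 63-octet DNS limit.
_LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def is_ldh_label(label: str) -> bool:
    """Return True if `label` is a single letter-digit-hyphen label.

    Hypothetical helper for illustration only; it checks one label,
    not a full dotted name.
    """
    return _LDH_LABEL.fullmatch(label) is not None
```

Under this sketch "a-b" and "3com" pass, while "-ab", "a_b", and
"a b" are rejected, which is the "no other punctuation, no spacing
characters" spirit described above.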

(ii) While we have been talking quite a bit about scripts, we
chose Unicode partially because it constituted a _single_
coded character set.  The Hostname and label rules (as updated
by RFC 1123) permitted the legitimate characters to appear in
any sequence and combination as long as hyphen(s), if they
appear, are embedded.  Nothing in our existing documents --
the requirements draft, nameprep, or any of the coding
proposals now or recently before the committee -- imposes any
stronger rule.  So the "scripts" that may be thought of as
composing Unicode are just a convenient way to think about
things: they are not a restriction on labels.

We have no prohibition, at present, on a label that consists,
in any order, of

   * A Han character (any Han character)
   * An ASCII character (any ASCII character)
   * An Arabic character (any...)
   * A Hangul character
   * A Hiragana character
   * A Devanagari (e.g., Hindi) character
   * A Cyrillic character
   * A Thai character
   * A non-ASCII Roman-derived ("Latin-N") character

Consequently, whatever rules we adopt for a given "script" or
language must be reasonable in a mixed-script label.  If we
introduce a non-spacing break to accommodate a problem in
Arabic or Persian, we must be prepared to have that character
appear between a pair of, e.g., ASCII characters.
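
As a rough illustration of such a label, the sketch below builds one
string with a character from each group listed above and then places
a break character between two ASCII letters.  Everything here is an
assumption made for illustration: the specific characters chosen, the
helper name script_words, the use of the first word of the Unicode
character name as a stand-in for "script", and the use of U+200C
(ZERO WIDTH NON-JOINER) as the non-spacing break.

```python
import unicodedata


def script_words(label: str) -> list[str]:
    """First word of each character's Unicode name.

    This is only a rough stand-in for "script"; it is not a real
    script property lookup.
    """
    return [unicodedata.name(ch).split()[0] for ch in label]


# One arbitrarily chosen character per group: Han, ASCII, Arabic, Hangul,
# Hiragana, Devanagari, Cyrillic, Thai, non-ASCII Latin.
MIXED = "\u4e2da\u0628\ud55c\u3042\u0915\u0434\u0e01\u00e9"

# A break character (assumed here to be U+200C) between two ASCII letters.
WITH_BREAK = "a\u200cb"

if __name__ == "__main__":
    print(script_words(MIXED))
    print(len(WITH_BREAK), WITH_BREAK == "ab")
```

Nothing at the string level rejects either value, and "a", break,
"b" is a different three-character string from "ab", so any rule for
the break character has to say what happens in that case too.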

Similarly, if we are going to do any mapping from one
character to another, that mapping should depend _only_ on
the character itself.  If it must depend on context in any
way, the contextual assumptions must work well even when the
character appears in a completely abnormal context, such as
surrounded by characters with which it would never appear in
a word or phrase of the language in which it normally
appears.
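
The difference between the two kinds of mapping can be seen with
Greek capital sigma.  This sketch uses Python's built-in str.casefold
and str.lower purely as stand-ins; neither is nameprep, and the
helper names context_free and context_sensitive are hypothetical.
In CPython 3, lower() applies the Unicode final-sigma rule, so its
output for sigma depends on the neighbouring characters, while
casefold() maps each character without looking at its neighbours.

```python
SIGMA = "\u03a3"        # GREEK CAPITAL LETTER SIGMA
SMALL_SIGMA = "\u03c3"  # GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA


def context_free(label: str) -> str:
    """Map each character on its own (stand-in: Unicode case folding)."""
    return label.casefold()


def context_sensitive(label: str) -> str:
    """Map with a contextual rule (stand-in: lowercasing with final sigma)."""
    return label.lower()


if __name__ == "__main__":
    # Sigma surrounded by ASCII letters, a context Greek text never has.
    for label in ("a" + SIGMA, "a" + SIGMA + "b"):
        print(repr(context_free(label)), repr(context_sensitive(label)))
```

With the context-free mapping, sigma becomes the same character in
both labels.  With the contextual one, the result after an ASCII "a"
depends on whether another letter follows, which is exactly the kind
of behavior that has to be defined for abnormal contexts.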

Note that arguments about how often something has occurred in
existing testbeds, or how often it is likely to occur, are
irrelevant here: either these strange cases are to be permitted or
they are not. If they are, all of the cases must work and
yield predictable results.  If they are not, we must figure
out a way to write rules that prohibit them.

    john