
Re: Inputting mixed SC/TC (Re: [idn] A question...)

To: "D. J. Bernstein" <djb@cr.yp.to>,<idn@ops.ietf.org>
Subject: Re: Inputting mixed SC/TC (Re: [idn] A question...)
From: "Mark Davis" <mark@macchiato.com>
Date: Mon, 11 Feb 2002 21:15:04 -0800
References: <20020211.093423.-412797.0.liana.ydisg@juno.com> <77460165.1013438681@localhost> <20020211211343.24889.qmail@cr.yp.to> <2279566.1013466535@localhost> <20020211214855.11426.qmail@cr.yp.to> <20020211221340.GE23357@nicemice.net> <20020212025005.26938.qmail@cr.yp.to>
Reply-to: "Mark Davis" <mark@macchiato.com>

>    (3) American-biased equivalences according to Mark Davis's UTR 21,
>        which is _not_ part of the Unicode standard.

(a) These are not American-biased equivalences.
(b) It is hardly "my" UTR. I'm the author, but the content is produced
under direction of the UTC and is approved at every stage by the UTC.
(c) UTR 21 was approved by the UTC for incorporation into Unicode 3.2.
Its new status will be reflected in Unicode 3.2, which will be final
very shortly.

Mark
—————

Πόλλ’ ἠπίστατο ἔργα, κακῶς δ’ ἠπίστατο πάντα — Ὁμήρου Μαργίτῃ
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "D. J. Bernstein" <djb@cr.yp.to>
To: <idn@ops.ietf.org>
Sent: Monday, February 11, 2002 18:50
Subject: Re: Inputting mixed SC/TC (Re: [idn] A question...)


> Adam M. Costello writes:
> > The reason IDNA does case-folding is to be consistent with the existing
> > standard for domain names, which says they are case-insensitive.
>
> What the existing standard actually says is ``domain name comparisons
> for all present domain functions are done in a case-insensitive manner,
> assuming an ASCII character set, and a high order zero bit.''
>
> Similarly, the Internet mail standards specifically require that bytes
> in message headers---including domain names---be interpreted as ASCII
> characters.
>
> Complete consistency with the existing standards would mean continuing
> to use only bytes 0-127, continuing to interpret those bytes as ASCII,
> and continuing to compare names as case-insensitive ASCII names.
>
> But we don't _want_ to follow those rules. We want to see glyphs that
> simply aren't available in the ASCII character set.
>
> Of course, we have to maintain INTEROPERABILITY with all strings used
> today, so we'll have to continue accepting A-Z and a-z as equivalent.
> But there are many possible equivalence rules for non-ASCII strings.
> Here are several examples---certainly not a complete list:
>
>    (1) Exactly what software uses now: no equivalences outside ASCII.
>
>    (2) Equivalence of characters that have duplicate glyphs but that
>        were kept separate by Unicode for one of the reasons described
>        in http://www.unicode.org/unicode/standard/where.
>
>    (3) American-biased equivalences according to Mark Davis's UTR 21,
>        which is _not_ part of the Unicode standard.
>
>    (4) German equivalences: for example, o-umlaut equivalent to oe,
>        and the German ss equivalent to the two-byte Latin sequence SS,
>        which in turn is equivalent to the two-byte Latin sequence ss.
>
>    (5) Hebrew equivalences: for example, aleph-bar equivalent to aleph.
>
>    (6) Various Chinese equivalences for the benefit of Chinese users.
>
>    (7) Some combination of the above.
>
> All of these are INTEROPERABLE with the existing use of ASCII. None of
> them are CONSISTENT with the existing standards. One of them, #1, has
> the advantage of being by far the easiest to implement---but provides
> the most opportunities for confusion and fraud.
>
> What exactly is the rational line between, for example, #3 and #4? For
> ASCII characters they both boil down to A-Z matching a-z. Why is #3 a
> better extension of the current situation than #4, or #3+#4?
>
> James Seng states that #6 is pointless because ``domain names are
> identifier ... should enter into the computer exactly as they seen it or
> reference it.'' Under exactly the same principle, #3 and #4 and #5 are
> all pointless, so IDNA has no excuse for the costs of #3.
>
> Another approach, allowing the software simplicity of #1 but eliminating
> user confusion, is to allow _selected_ non-ASCII characters. We don't
> have to map all characters to the selected set; we simply have to make
> sure that the selected characters won't be confused by the users. This
> neatly dodges the difficulty of defining a broad equivalence rule.
>
> The decisions here have to be based on rational assessments of costs and
> benefits. Costello's notion of ``consistency'' is obviously not helpful:
> it leads to such huge costs for Chinese users that it has already drawn
> objections from _three hundred_ people.
>
> ---D. J. Bernstein, Associate Professor, Department of Mathematics,
> Statistics, and Computer Science, University of Illinois at Chicago
>
>

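For readers who want rule (1) from the quoted list in concrete terms, here
is a rough Python sketch of "case-insensitive ASCII, no equivalences
outside ASCII". The function names ascii_fold and names_equal are
hypothetical, chosen for illustration only; this is not taken from IDNA,
nameprep, or any DNS implementation.

```python
def ascii_fold(label: str) -> str:
    """Lowercase only the ASCII letters A-Z; leave every other character as is."""
    return "".join(
        chr(ord(c) + 32) if "A" <= c <= "Z" else c
        for c in label
    )


def names_equal(a: str, b: str) -> bool:
    """Compare two names under rule (1): A-Z matches a-z, nothing else is folded."""
    return ascii_fold(a) == ascii_fold(b)


if __name__ == "__main__":
    # ASCII letters compare case-insensitively.
    print(names_equal("Example.COM", "example.com"))
    # U-umlaut in upper and lower case stay distinct: no folding outside ASCII.
    print(names_equal("B\u00dcCHER.example", "b\u00fccher.example"))
```

Under this sketch the first comparison succeeds and the second fails. That
is the trade-off described in the quoted message: the rule is trivial to
implement, but names a user would consider the same remain distinct.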