RE: [idn] Re: 7 bits forever!
> Let's take as an example the "native language" encoding of my name:
>
> From: Valdis Kl=?iso8859-4?Q?=BA?=tnieks <Valdis.Kletnieks@vt.edu>
>
> (That's a "small e with macron", Unicode 0113).
I don't know what you mean by "native language encoding". The encoding
used here is 8859-4 with QP, but that is no more "native", nor more
"language", than, e.g., the following (most e-mail programs put the
encoding outermost):
=?utf-16be?Q?Valdis_Kl=01=13tnieks?=
(With QP, the named encoding matters only for the QP'd parts; the rest is
7-bit ASCII. Technically, that's ok, even though UTF-16 is not an
ASCII-compatible extension: after decoding, the entire string should be
represented in one of the Unicode encodings. Indeed, technically, the
named encoding may even be an EBCDIC one, though the non-QP parts are
ASCII. By 'technically', I here mean registration policy aside.)
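
To make the point concrete, here is a minimal Python sketch of just the QP'd octets from the two spellings above. The helper q_decode is a hypothetical name of mine, not a library call, and it covers only the "=XX" and "_" rules; it says nothing about how a full RFC 2047 decoder treats the rest of the string.

```python
import re


def q_decode(text: str) -> bytes:
    """Undo "Q" encoding of a payload: '_' means space, '=XX' means one octet."""
    raw = text.replace("_", " ").encode("ascii")
    return re.sub(
        rb"=([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        raw,
    )


# The QP'd octets from the two header spellings discussed above.
latin4_octets = q_decode("=BA")      # as in =?iso8859-4?Q?=BA?=
utf16_octets = q_decode("=01=13")    # as in =?utf-16be?Q?...=01=13...?=

# Each octet sequence is interpreted in the encoding named next to it.
e_macron_latin4 = latin4_octets.decode("iso8859_4")
e_macron_utf16 = utf16_octets.decode("utf_16_be")

print(hex(ord(e_macron_latin4)), hex(ord(e_macron_utf16)))  # 0x113 0x113
```

Both spellings come out as the same character, U+0113, which is why neither is any more "native" than the other.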
> If you have a *better* suggestion than 2047-encoding of how to pass that
> character in an e-mail header *that will survive passing through an
> intermediary system that enforces strict RFC822*, please clue us in....
Well, an encoding-independent *and* universal method would have
been better (like character references in HTML and XML). But MIME
slightly predates 10646/Unicode...
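
As an illustration of what "encoding independent" means here, the following short Python sketch resolves HTML-style numeric character references with the standard html.unescape function. The sample strings are my own, and nothing here implies that mail headers define such references.

```python
import html

# A numeric character reference names the code point itself, so the
# carrier text can stay pure 7-bit ASCII whatever charset is in use.
name_hex = html.unescape("Valdis Kl&#x113;tnieks")   # hexadecimal form
name_dec = html.unescape("Valdis Kl&#275;tnieks")    # decimal form, 0x113 == 275

print(name_hex == name_dec, hex(ord(name_hex[9])))  # True 0x113
```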
> 1) RFC1035 says this in section 3.1:
>
> Although labels can contain any 8 bit values in octets that make up a
> label, it is strongly recommended that labels follow the preferred
> syntax described elsewhere in this memo, which is compatible with
> existing host naming conventions. Name servers and resolvers must
> compare labels in a case-insensitive manner (i.e., A=a), assuming ASCII
> with zero parity. Non-alphabetic codes must match exactly.
One would like to think that the (implicit!) restriction to ASCII was
in order to reserve the rest for some then unknown "international"
extension. But maybe thinking so would be too kind...
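
The comparison rule in the quoted passage can be sketched in Python roughly as follows. This is my reading of that one paragraph, not a resolver implementation, and the names ascii_fold and labels_equal are hypothetical.

```python
def ascii_fold(label: bytes) -> bytes:
    """Lower-case only the ASCII letters A-Z; leave every other octet alone."""
    return bytes(b + 0x20 if 0x41 <= b <= 0x5A else b for b in label)


def labels_equal(a: bytes, b: bytes) -> bool:
    """Compare two labels case-insensitively for ASCII letters only."""
    return ascii_fold(a) == ascii_fold(b)


print(labels_equal(b"VT", b"vt"))        # True: A=a style folding
print(labels_equal(b"\xc9", b"\xe9"))    # False: octets above 0x7F must match exactly
```

Nothing above 0x7F is folded or rejected, which is what leaves those octet values open for a later extension.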
> 2) Why does it get restricted? Consider the parsing issues involved if you
> have a domain name that uses raw Unicode and embeds the character
> known as "Malayalam Letter UU". A hint why this is Very Bad are available
> at http://www.unicode.org/charts/PDF/U0D00.pdf
No, I don't see what you mean. In particular, I don't know what you
mean by "raw Unicode" (do you mean "in UTF-8"? UTF-16 is not
compatible with 7-bit ASCII, so you cannot mean that). And even more
in particular, I don't see why you have singled out U+0D0A at all.
Do you mean that the glyph is similar to another glyph? Parsing by
whom? A program (that sees the codes) or a human (preferably one
who reads Malayalam fluently)? Neither should have any trouble, as
far as I can see. (That most people on this planet cannot read
Malayalam is beside the point, so you cannot mean that, can you?) That
0x0D 0x0A is CR LF in ASCII is of course totally irrelevant (see above),
unless you commit the biggest blunder imaginable ;-) (directly mixing
ASCII and UTF-16), and you don't do THAT, do you...
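
For the record, a small Python check of the octets in question. It only shows how U+0D0A serializes in the two encodings mentioned above; the variable names are mine.

```python
malayalam_uu = "\u0d0a"  # MALAYALAM LETTER UU

as_utf16be = malayalam_uu.encode("utf_16_be")
as_utf8 = malayalam_uu.encode("utf_8")

# Only the UTF-16 form happens to coincide with the ASCII octets CR LF.
print(as_utf16be == b"\r\n")   # True
print(as_utf8.hex())           # e0b48a

# In UTF-8 every octet of a non-ASCII character has the high bit set,
# so none of them can be mistaken for an ASCII control character.
print(all(octet >= 0x80 for octet in as_utf8))  # True
```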
/kent k