Re: [idn] idn-uri



-----BEGIN PGP SIGNED MESSAGE-----

Bruce Thomson wrote:
> Martin,
> 
> I'd like to ask for a clarification of the idn-uri draft. As I understand
> it, URIs should be able to contain %-escaped UTF-8 characters
> in their domain name portions.

RFC 2396 doesn't currently allow that. Although it might seem as though
it should be allowed for consistency, actually I don't think it is very
useful. A %-escaped hostname won't be resolvable by most existing
browsers, and the %-escaped characters aren't readable either. There is
no significant advantage in having two different methods of ASCIIfying a
<host> part, and having both invites awkward interactions between them
(such as %-encoded characters in an ACE label). Also, control characters
(whether ASCII or Unicode) and other obscure characters won't be allowed
in hostnames, so there is no advantage in being able to escape those.

These are the rules I think should be used:
 - To convert an IRI to an RFC 2396-compliant URI, convert any
   hostname-like parts to ACE, and %-escape the rest.
 - To convert a URI to the form in which it should be shown to a user
   (displayed, printed, etc.), %-unescape any unreserved characters [*],
   and then convert any hostname-like parts to Unicode.

A "hostname-like part" is the <host> part of a generic URI in the
RFC 2396 syntax, or the address of a mailto: URI.
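
As a rough illustration of the two rules above, here is a minimal Python
sketch. It rests on several assumptions that are not part of this message:
it uses Python's built-in "idna" codec (Punycode-based) as a stand-in for
whatever ACE is finally chosen, it only handles the <host> of a generic
URI (no mailto: addresses, userinfo or IPv6 literals), and the names
iri_to_uri and uri_to_display are made up for the example.

```python
import re
from urllib.parse import quote, urlsplit, urlunsplit

# RFC 2396 "unreserved" ASCII characters: alphanumerics plus the mark set.
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_.!~*'()"
)
# Left alone when escaping: reserved characters, marks, and existing "%".
KEEP = ";/?:@&=+$,-_.!~*'()%"
ESCAPED_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _netloc(host, port):
    return host if port is None else f"{host}:{port}"


def iri_to_uri(iri):
    """Rule 1: hostname-like part to ACE, %-escape everything else."""
    parts = urlsplit(iri)
    # Assumption: the "idna" codec stands in for the eventual ACE.
    host = (parts.hostname or "").encode("idna").decode("ascii")
    return urlunsplit((
        parts.scheme,
        _netloc(host, parts.port),
        quote(parts.path, safe=KEEP),      # non-ASCII becomes %-escaped UTF-8
        quote(parts.query, safe=KEEP),
        quote(parts.fragment, safe=KEEP),
    ))


def _unescape_run(match):
    """Decode one run of %XX escapes, re-escaping anything not unreserved."""
    run = match.group(0)
    try:
        text = bytes.fromhex(run.replace("%", "")).decode("utf-8")
    except UnicodeDecodeError:
        return run  # not valid UTF-8: leave the escapes untouched
    return "".join(
        ch if ord(ch) >= 0x80 or ch in UNRESERVED else f"%{ord(ch):02X}"
        for ch in text
    )


def uri_to_display(uri):
    """Rule 2: %-unescape unreserved characters, then host to Unicode."""
    parts = urlsplit(uri)
    host = (parts.hostname or "").encode("ascii").decode("idna")
    return urlunsplit((
        parts.scheme,
        _netloc(host, parts.port),
        ESCAPED_RUN.sub(_unescape_run, parts.path),
        ESCAPED_RUN.sub(_unescape_run, parts.query),
        ESCAPED_RUN.sub(_unescape_run, parts.fragment),
    ))


print(iri_to_uri("http://bücher.example/straße?q=é"))
print(uri_to_display("http://xn--bcher-kva.example/stra%C3%9Fe%2Fx"))
```

Note that an escaped reserved character such as %2F stays escaped in the
display form, while the escaped non-ASCII characters are shown directly.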

With these rules there is no need to permit processes that *generate*
URIs to %-escape the <host> part.

Also note that for https: URIs, %-escaping should definitely not be
used in the hostname, since that will break the comparison with the
name in the X.509 certificate. New SSL/TLS clients can and should use a
modified comparison algorithm, but the only way to make https: URIs work
with old clients is to ACE-encode both the name in the URI, and the name
in the certificate.
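
To make the certificate point concrete, here is a small Python sketch of
what a modified comparison might look like: both names are reduced to ACE
before being compared. This is only an illustration under assumptions of
my own: Python's "idna" codec stands in for the eventual ACE, wildcard
certificates are ignored, and to_ace and host_matches_cert are
hypothetical names, not part of any TLS library.

```python
def to_ace(name):
    """Reduce a hostname to lower-case ACE form (assumed: the 'idna' codec)."""
    return name.encode("idna").decode("ascii").lower()


def host_matches_cert(uri_host, cert_name):
    """Compare the URI host with the certificate name after ACE-encoding both."""
    return to_ace(uri_host) == to_ace(cert_name)


# A Unicode host matches a certificate issued for the ACE name ...
print(host_matches_cert("bücher.example", "xn--bcher-kva.example"))
# ... but a %-escaped host does not, which is the breakage described above.
print(host_matches_cert("b%C3%BCcher.example", "xn--bcher-kva.example"))
```

An old client that compares the strings byte-for-byte only succeeds when
the URI and the certificate already carry the same ACE form.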

[*] Unescaping reserved characters would be incorrect, but no characters
    >= U+0080 are reserved.

> My question is, since most people write HTML using various local
> code pages, do you envision that it would also be possible to write
> a link in HTML using the local encoding?

You mean "using the document charset". I certainly think this should
be possible (also for XML).

> The browser would then translate to the appropriate escape sequence
> before putting it on the wire. Or, do you want the HTML author to
> handle this?

Initially, the only way to guarantee compatibility when generating HTML
will be to encode hostnames as ACE, and the rest of a URI (the <path>
part) using %-escaping. However, browsers that implement the specification
should be REQUIRED, from the outset, to accept URIs where the <host>,
<path> or both are encoded using the document charset, so that after a
transition period, it will be possible to start using such URIs.

Martin Duerst wrote:
> Here is the behavior according to RFCs/drafts:
> 
> RFC 2396 only: only ASCII in domain name part
> 
> IRI draft only: only ASCII in domain name part
> (http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-08.txt)
> 
> RFC 2396 + idn-uri draft: ASCII and %-escaping (based on UTF-8)
> (http://www.ietf.org/internet-drafts/draft-ietf-idn-uri-01.txt)
> 
> IRI draft + idn-uri draft: ASCII, %-escaping (based on UTF-8)
> and characters encoded based on the encoding of the page

I think there should be a single document that describes internationalisation
of both the domain name part and the rest of an IRI, and that Updates or
Obsoletes RFC 2396. Having some implementations support only one or the
other will just cause confusion, with users expecting things to work that
don't.

- -- 
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPBUnpDkCAxeYt5gVAQHrlQf9H4dLgYsYQbHzquvoldlF98dPeHCcne2U
HgVA7N37XzA1uGsgyQBdTaMKFDYixA6nojhBbBxDjRK+xnYCwk67L+uBrSRJKsJL
O2Dq1TlRxyQSwaxFgSqbgmD0yWu+mJZdg0qLwMew0j7ZvF+MEUjq+ESXta3J3t6G
dKtoIzSKTt2T0CvAaHkoyiGjaWXtLd+UpLFwVUlzR69KdOibzpefjbGCxY7mO3bs
98Uj1fvPiAyzYwIyHWQ69b6SELxjSL6+RMAdfwBC/cPwzd4N6ZunpZE/abS/8qdQ
GJKkSsO/xIie1NB/mVzZ5rTHmHMMOZaboDDAMM+VOZ5uIh9TozbiGA==
=GZcA
-----END PGP SIGNATURE-----