Re: [idn] idn-uri
David,
Well, my thought is that we do have to nail down how IDNs will appear in
HTTP headers and responses. Escaped UTF-8 is one way; ACE is another.
I don't see that we need to allow both, but I don't think we have any definition
at all at the moment.
By the way, currently Internet Explorer 6.0 does support %-escaped hostnames
in redirect HTTP responses from the server, but they are interpreted in the
local encoding, not UTF-8.
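To illustrate why the choice of encoding matters, here is a minimal Python
sketch. The hostname and the use of Windows-1252 as the "local encoding" are
assumptions made for illustration only; the observation above does not say
which code page was involved.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical %-escaped hostname as it might appear in a redirect response.
escaped_host = "caf%C3%A9.example"

# %-unescaping only yields bytes; the protocol does not say what they mean.
raw = unquote_to_bytes(escaped_host)

# A client that assumes UTF-8 recovers the intended name.
as_utf8 = raw.decode("utf-8")

# A client that assumes a local code page (Windows-1252 is an assumption here)
# reads the same two bytes as two unrelated characters.
as_local = raw.decode("cp1252")

print(as_utf8)
print(as_local)
```

The same bytes yield two different hostnames, so server and browser must agree
on the encoding in advance.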
RFCs 2396 and 2616 actually "allow" escaped UTF-8 in host names, but just say
that they don't have a method to specify that particular encoding as part of the
protocol; i.e. it would have to be understood from context, or possibly through
an extension to the RFCs that permitted the encoding to be specified each time.
ACE is also "allowed" in that sense.
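For comparison, here is a rough sketch of what the two candidate forms of the
same hostname look like. It uses Python's built-in "idna" codec, which produces
Punycode-style ACE labels; that is only a stand-in, since nothing in this
thread fixes which ACE would be used, and the hostname is a made-up example.

```python
from urllib.parse import quote

# Hypothetical internationalized hostname.
host = "café.example"

# Option 1: %-escaped UTF-8 (each non-ASCII byte becomes %XX).
escaped = quote(host.encode("utf-8"))

# Option 2: ACE. The "idna" codec is an assumed stand-in for the eventual ACE.
ace = host.encode("idna").decode("ascii")

print(escaped)  # caf%C3%A9.example
print(ace)      # xn--caf-dma.example

# Both forms are pure ASCII, and the ACE form round-trips back to Unicode.
assert ace.encode("ascii").decode("idna") == host
```

Either form fits within an ASCII-only header; the open question is which one
existing software handles more gracefully.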
Using ACE for the hostname portion does seem like a reasonable approach,
but then so does %-escaped UTF-8. Can you present any reasons why one is
better than the other? I assume we should think about:
* Does either way break existing software such as WWW proxies?
* Is either one easier for the web server to interpret or generate?
* Is either one easier for the browser to interpret or generate?
* Is either one more "elegant" for some reason?
* Is either one required or implied by existing RFCs or I-Ds with a solid consensus?
Whether ACE or escaped UTF-8 is used here, it seems likely that the browser should
transparently handle the translation to/from the local encoding. There is no reason to
burden the user.
In any case, I agree that a document is needed for this.
Bruce
David Hopwood wrote:
> >
> > I'd like to ask for a clarification of the idn-uri draft. As I understand
> > it, URIs should be able to contain %-escaped UTF-8 characters
> > in their domain name portions.
>
> RFC 2396 doesn't currently allow that. Although it might seem as though
> it should be allowed for consistency, actually I don't think it is very
> useful. A %-escaped hostname won't be resolvable by most existing
> browsers, and the %-escaped characters aren't readable either. There is
> no significant advantage in having two different methods of ASCIIfying a
> <host> part (and interactions between them, such as %-encoded characters
> in an ACE label). Also, control characters (whether ASCII or Unicode) and
> other obscure characters won't be allowed in hostnames, so there is no
> advantage in being able to escape those.
>
> These are the rules I think should be used:
> - To convert an IRI to an RFC 2396-compliant URI, convert any
> hostname-like parts to ACE, and %-escape the rest.
> - To convert an URI to the form in which it should be shown to a user
> (displayed, printed, etc.), %-unescape any unreserved characters [*],
> and then convert any hostname-like parts to Unicode.
>
> A "hostname-like part" is the <host> part of a generic URI in the
> RFC 2396 syntax, or the address of a mailto: URI.
>
> With these rules there is no need to permit processes that *generate*
> URIs to %-escape the <host> part.
>
> Also note that for https: URIs, %-escaping should definitely not be
> used in the hostname, since that will break the comparison with the
> name in the X.509 certificate. New SSL/TLS clients can and should use a
> modified comparison algorithm, but the only way to make https: URIs work
> with old clients is to ACE-encode both the name in the URI, and the name
> in the certificate.
>
> [*] Unescaping reserved characters would be incorrect, but no characters
> >= U+0080 are reserved.
>
> > My question is, since most people write HTML using various local
> > code pages, do you envision that it would also be possible to write
> > a link in HTML using the local encoding?
>
> You mean "using the document charset". I certainly think this should
> be possible (also for XML).
>
> > The browser would then translate to the appropriate escape sequence
> > before putting it on the wire. Or, do you want the HTML author to
> > handle this?
>
> Initially, the only way to guarantee compatibility when generating HTML
> will be to encode hostnames as ACE, and the rest of an URI (the <path>
> part) using %-escaping. However, browsers that implement the specification
> should be immediately REQUIRED to accept URIs where the <host>, <path>
> or both are encoded using the document charset, so that after a
> transition period, it will be possible to start using such URIs.
>
> Martin Duerst wrote:
> > Here is the behavior according to RFCs/drafts:
> >
> > RFC 2396 only: only ASCII in domain name part
> >
> > IRI draft only: only ASCII in domain name part
> > (http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-08.txt)
> >
> > RFC 2396 + idn-uri draft: ASCII and %-escaping (based on UTF-8)
> > (http://www.ietf.org/internet-drafts/draft-ietf-idn-uri-01.txt)
> >
> > IRI draft + idn-uri draft: ASCII, %-escaping (based on UTF-8)
> > and characters encoded based on the encoding of the page
>
> I think there should be a single document that describes internationalisation
> of both the domain name part and the rest of an IRI, and that Updates or
> Obsoletes RFC 2396. Having some implementations support only one or the
> other will just cause confusion, with users expecting things to work that
> don't.
>
> --
> David Hopwood <david.hopwood@zetnet.co.uk>
>
> Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
> RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
> Nothing in this message is intended to be legally binding. If I revoke a
> public key but refuse to specify why, it is because the private key has been
> seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip
>
>
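As a rough illustration of the two conversion rules David proposes above
(hostname to ACE and %-escape the rest; then the reverse for display), here is
a minimal Python sketch. The function names are hypothetical, the "idna" codec
is an assumed stand-in for whatever ACE is eventually chosen, and only the
scheme, host, port and path are handled. For display it unescapes only
%-sequences for bytes >= 0x80, which is a narrower reading of "unreserved"
that never touches reserved characters such as %2F.

```python
import re
from urllib.parse import quote, unquote_to_bytes, urlsplit, urlunsplit

# Runs of %XX escapes whose byte value is >= 0x80 (never reserved characters).
_NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def iri_to_uri(iri: str) -> str:
    """Hostname-like part to ACE, path to %-escaped UTF-8 (sketch only)."""
    parts = urlsplit(iri)
    host = parts.hostname.encode("idna").decode("ascii")  # assumed ACE
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = quote(parts.path, safe="/%")  # keep existing escapes intact
    return urlunsplit((parts.scheme, netloc, path, "", ""))


def _unescape_non_ascii(match: re.Match) -> str:
    try:
        return unquote_to_bytes(match.group(0)).decode("utf-8")
    except UnicodeDecodeError:
        return match.group(0)  # not valid UTF-8: leave the escapes alone


def uri_for_display(uri: str) -> str:
    """Unescape non-ASCII path bytes, then convert the host to Unicode."""
    parts = urlsplit(uri)
    host = parts.hostname.encode("ascii").decode("idna")
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    path = _NON_ASCII_RUN.sub(_unescape_non_ascii, parts.path)
    return urlunsplit((parts.scheme, netloc, path, "", ""))


uri = iri_to_uri("http://café.example/résumé")
print(uri)
print(uri_for_display(uri))
```

Note that the generating side never %-escapes the host part, which matches the
point that such escaping need not be permitted for processes that generate
URIs.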