Re: [idn] URL encoding in html page
----- Original Message -----
From: "Soobok Lee" <lsb@postel.co.kr>
> > Not necessary, since the HTML and URI specs already limit the host to
> > ASCII letters, digits, hyphens, and dots.
>
> We experts already know this, but many ML.com registrants are unaware of this
> limitation of ML.com. They want to use native ML.com names in their HTML home pages.
>
> If we want an interoperable URI that supports native IDN, we should revise
> BOTH the URI spec and the HTTP spec. But native IDN support brings potential
> legacy-encoding versioning and interoperability problems.
> Could anyone provide an in-depth analysis of this caveat?
>
Even if we stay with the current HTTP/1.1, which allows only ASCII values in the Host: header,
we could still revise the URI spec to allow native (UTF-8 or legacy-encoded) IDNs in URIs.
1) With IDNA and HTTP/1.1, the web browser can encode the native IDN in the URI into its ACE form and
then open an HTTP/1.1 session to the ACE hostname with an ACE Host: value.
2) With IDNA and a revised HTTP that supports UTF-8 hosts, the web browser can encode
the UTF-8 IDN in the URI into its ACE form and then open an HTTP session to the ACE hostname with a UTF-8 Host: value.
3) With UTF-8-based IDN and a revised HTTP that supports UTF-8 hosts, the browser can check whether
the native IDN is in UTF-8 and, if not, convert it to UTF-8, and then open
an HTTP session to the UTF-8 web host with a UTF-8 Host: value.
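
As a rough sketch of option 1), the conversion a browser would do before sending the request could look like the following. It assumes Python's built-in "idna" codec (Punycode-based) as a stand-in for whatever ACE is finally chosen; the function names are hypothetical.

```python
def host_header_for(native_host: str) -> str:
    """Return the ASCII (ACE) host value a browser could send under option 1)."""
    # Assumption: Python's "idna" codec stands in for the final ACE algorithm.
    # It prepares and encodes each label, leaving pure-ASCII labels unchanged.
    return native_host.encode("idna").decode("ascii")


def build_request(native_host: str, path: str = "/") -> str:
    """Build a minimal HTTP/1.1 request whose Host: value is ASCII only."""
    ace_host = host_header_for(native_host)
    return f"GET {path} HTTP/1.1\r\nHost: {ace_host}\r\n\r\n"
```

The same ACE string would be used both for the DNS lookup and for the Host: value, so nothing non-ASCII ever reaches the server.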
2) and 3) may be infeasible because HTTP lacks a capability negotiation feature like that of ESMTP:
a new web browser with native IDN URI support cannot tell whether the web server supports
native IDN or only ASCII (ACE) hosts in the Host: value without trying the request twice, once with each form
of the Host: value (UTF-8 first, then ACE if needed). Using an ACE Host: value is always safe in 1) and 2).
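
To make the cost of the missing negotiation concrete, here is a minimal sketch of the "try twice" behaviour. The `send` callable is a hypothetical transport that takes the raw Host: value and returns an HTTP status code; treating any non-200 status as "server did not understand this host form" is an assumption made only for illustration.

```python
from typing import Callable, Optional

# Hypothetical transport: receives the raw Host: header value, returns a status code.
SendFn = Callable[[bytes], int]


def fetch_with_fallback(native_host: str, send: SendFn) -> Optional[str]:
    """Try a UTF-8 Host: value first, then fall back to the ACE form.

    Returns "utf8" or "ace" for the form that worked, or None if both failed.
    """
    candidates = [
        ("utf8", native_host.encode("utf-8")),
        # Assumption: Python's "idna" codec stands in for the final ACE.
        ("ace", native_host.encode("idna")),
    ]
    for form, host_value in candidates:
        if send(host_value) == 200:
            return form
    return None
```

Against an ASCII-only server, every request to a native IDN host costs one wasted round trip before the ACE form succeeds.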
BTW, in 1) and 2) we cannot avoid legacy versioning problems, because
most ACE conversion would be done as "ACE(NFKC(CaseFold(legacy-to-Unicode(native label))))".
Most home pages in East Asia are in legacy encodings, and that near-monopoly (close to 100%) won't change
in the foreseeable future.
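
The conversion chain quoted above can be sketched as follows. This is only an approximation: it assumes Python's codecs for the legacy-to-Unicode step, `str.casefold` for CaseFold, and the "punycode" codec with an "xn--" prefix as the ACE; a real nameprep profile does more than these three steps. The function name is hypothetical.

```python
import unicodedata


def legacy_label_to_ace(raw: bytes, legacy_encoding: str) -> str:
    """Approximate ACE(NFKC(CaseFold(legacy-to-Unicode(native label))))."""
    text = raw.decode(legacy_encoding)                   # legacy-to-Unicode
    folded = text.casefold()                             # CaseFold
    normalized = unicodedata.normalize("NFKC", folded)   # NFKC
    if all(ord(ch) < 128 for ch in normalized):
        return normalized                                # plain LDH label, no ACE needed
    # Assumption: Punycode with the "xn--" prefix stands in for the ACE.
    return "xn--" + normalized.encode("punycode").decode("ascii")
```

Note that the very first step depends on the browser's legacy-to-Unicode table, which is where the versioning problem enters.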
New legacy encodings may be created after IDN-aware browsers are distributed,
and existing legacy encodings may gain new code points for newly added characters.
If IDN-aware browsers and applications are not upgraded with the new legacy-to-Unicode mappings,
they will occasionally fail to convert a legacy-encoded IDN into its Unicode form.
That kind of failure has never been seen in LDH DNS.
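
A small sketch of that failure mode, using Python's "euc_kr" and "cp949" codecs as stand-ins for an old mapping table and a later, extended one (cp949 adds Hangul syllables such as 똠 that plain EUC-KR cannot express as a two-byte code). The function name is hypothetical.

```python
from typing import Optional


def legacy_to_unicode(raw: bytes, known_encoding: str) -> Optional[str]:
    """Decode a legacy-encoded label with the mapping table the application ships.

    Returns None when the bytes are not covered by that table, which is
    the point where an IDN-aware application would fail to build the ACE name.
    """
    try:
        return raw.decode(known_encoding)
    except UnicodeDecodeError:
        return None
```

A label written with the extended encoding, such as `"똠".encode("cp949")`, decodes only when the application knows the newer table; with the older "euc_kr" table the function returns None, while a label like 한글 works with either.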
Soobok Lee