
Re: [idn] URL encoding in html page




----- Original Message ----- 
From: "Soobok Lee" <lsb@postel.co.kr>
> > Not necessary, since the HTML and URI specs already limit the host to
> > ASCII letters, digits, hyphens, and dots.
> 
> We experts already know this, but many ML.com registrants are unaware of this
> unfortunate fate of ML.com. They want to use native ML.com names in their HTML homepages.
> 
> If we want interoperable URIs that support native IDN, we should revise
> BOTH the URI spec and the HTTP spec. But native IDN support brings potential
> legacy code versioning and code interoperability problems.
> Could anyone provide an in-depth analysis of this caveat?
> 

 
 Even if we stay with the current HTTP/1.1, which allows only ASCII values in the Host: header,
 we could still revise the URI spec to allow native (UTF-8 or legacy-encoded) IDNs in URIs.

 1) With IDNA and HTTP/1.1, the web browser can encode the native IDN in the URI into its ACE form
 and then open an HTTP/1.1 session to the ACE hostname with an ACE Host: value.

 2) With IDNA and a revised HTTP that supports UTF-8 hosts, the web browser can encode the
 UTF-8 IDN in the URI into its ACE form and then open an HTTP session to the ACE hostname with a UTF-8 Host: value.

 3) With UTF-8-based IDN and a revised HTTP that supports UTF-8 hosts, the browser can check whether
 the native IDN is in UTF-8, convert it to UTF-8 if it is not, and then open
 an HTTP session to the UTF-8 web host with a UTF-8 Host: value.
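
 As a rough sketch of what the browser side of 1) and 2) might look like,
 assuming Python and its standard "idna" codec (which implements the later
 IDNA 2003 encoding with the "xn--" prefix, not anything fixed at the time
 of this message). The function name and the utf8_host flag are
 hypothetical and only serve the illustration.

```python
def build_request_target(native_host: str, utf8_host: bool = False):
    """Return (hostname to resolve and connect to, Host header line).

    The hostname is always the ACE form, as in options 1) and 2).
    utf8_host=False gives option 1) (ACE Host: value);
    utf8_host=True gives option 2) (UTF-8 Host: value).
    """
    # Standard-library IDNA codec: nameprep plus ACE encoding per label.
    ace = native_host.encode("idna").decode("ascii")
    if utf8_host:
        header = b"Host: " + native_host.encode("utf-8")
    else:
        header = b"Host: " + ace.encode("ascii")
    return ace, header


print(build_request_target("bücher.example"))
# ('xn--bcher-kva.example', b'Host: xn--bcher-kva.example')
```

 In both cases the DNS lookup uses the ACE name; only the Host: value differs.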


 2) and 3) may be infeasible because HTTP lacks a capability negotiation feature like that of ESMTP.
 A new web browser with native IDN URI support cannot tell whether the web server supports
 native IDN or only an ASCII (ACE) host in the Host: value without trying the request twice, once with each form
 of the Host: value (UTF-8 first, then ACE if needed). Using an ACE Host: value is always safe in 1) and 2).
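
 The "try twice" cost can be sketched as follows, again assuming Python's
 standard "idna" codec. The send callable and the ascii_only_server stand-in
 are hypothetical; no real HTTP traffic is involved, and the connection is
 assumed to go to the ACE hostname as in 2).

```python
from typing import Callable, Optional, Tuple


def fetch_with_fallback(
    native_host: str,
    send: Callable[[str, bytes], Optional[bytes]],
) -> Tuple[int, Optional[bytes]]:
    """Try a UTF-8 Host: value first, then the ACE form if that is refused.

    Returns (number of attempts made, response or None).
    """
    ace = native_host.encode("idna").decode("ascii")
    candidates = [native_host.encode("utf-8"), ace.encode("ascii")]
    for attempt, host_value in enumerate(candidates, start=1):
        response = send(ace, host_value)
        if response is not None:
            return attempt, response
    return len(candidates), None


def ascii_only_server(connect_to: str, host_value: bytes) -> Optional[bytes]:
    """Hypothetical legacy server: refuses any non-ASCII Host: value."""
    try:
        host_value.decode("ascii")
    except UnicodeDecodeError:
        return None
    return b"200 OK for " + host_value


print(fetch_with_fallback("bücher.example", ascii_only_server))
# (2, b'200 OK for xn--bcher-kva.example')
```

 Against a legacy server, every request to a native IDN costs two round trips.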

 BTW, in 1) and 2) we cannot avoid legacy versioning problems, because
 most ACE conversion would be done as "ACE(NFKC(CaseFold(legacy-to-Unicode(native label))))".
 Most homepages in East Asia are in legacy encodings, and that near-monopoly (nearly 100%) will not change
 in the foreseeable future.
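
 The conversion chain above could be sketched like this for a single label,
 assuming Python. Its "punycode" codec and the "xn--" prefix stand in for
 whichever ACE is finally chosen, and plain casefold() plus NFKC is only an
 approximation of nameprep. The function name is hypothetical.

```python
import unicodedata


def legacy_label_to_ace(raw: bytes, legacy_codec: str) -> str:
    """ACE(NFKC(CaseFold(legacy-to-Unicode(native label)))) for one label."""
    text = raw.decode(legacy_codec)                      # legacy-to-Unicode
    folded = text.casefold()                             # CaseFold
    normalized = unicodedata.normalize("NFKC", folded)   # NFKC
    if normalized.isascii():
        return normalized                                # plain LDH label
    # ACE step: punycode with the "xn--" prefix (an assumption, see above).
    return "xn--" + normalized.encode("punycode").decode("ascii")


print(legacy_label_to_ace("BÜCHER".encode("latin-1"), "latin-1"))
# xn--bcher-kva
```

 The first step is the only one that depends on the page's legacy encoding,
 so it is where versioning problems would enter.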

 New legacy codes may be created after IDN-aware browsers are distributed.
 Old legacy codes may get new code points for newly added characters.
 If IDN-aware browsers/applications are not upgraded with the new legacy-to-Unicode mappings,
 they will occasionally fail to convert a legacy-encoded IDN into its Unicode form.
 That kind of failure has never been seen in LDH DNS.
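
 A small illustration of that failure mode, assuming Python's codecs as
 stand-ins: "euc_kr" plays the application that only knows the older
 mapping table, and "cp949" plays a later extension that added code points
 for more Hangul syllables. The helper name is hypothetical.

```python
from typing import Optional


def legacy_to_unicode(raw: bytes, known_codec: str) -> Optional[str]:
    """Return the decoded label, or None if the mapping table lacks a code point."""
    try:
        return raw.decode(known_codec)
    except UnicodeDecodeError:
        return None


# A syllable with a two-byte code in cp949 but none in the KS X 1001 table.
raw_label = "똠".encode("cp949")

print(legacy_to_unicode(raw_label, "cp949"))    # upgraded application: decodes
print(legacy_to_unicode(raw_label, "euc_kr"))   # old mapping table: None
```

 An application that returns None here cannot even begin the ACE conversion
 for that label.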

Soobok Lee