
Re: [idn] An experiment with UTF-8 domain names



> At 01/01/05 15:21 -0500, Keith Moore wrote:
> > > Do you also think that putting up web pages and reading them are the
> > > ``wrong things''?
> >
> >yes. try putting up web pages with links to URLs containing IDNs
> >which are encoded in UTF-8 (using various URL prefixes and various
> >protocols with servers on a variety of platforms) and seeing whether
> >those links work in various browsers.  Then try putting URLs
> >containing IDNs into text files, mailing them around, and using
> >cut-and-paste to enter them into a browser's "get URL" dialog box.
> >Then try printing URLs containing IDNs on business cards, and typing
> >them into browsers' dialog boxes.
> 
> Keith, are you assuming that we are doing all this work so
> that people can put 'ace--blah' into their URIs, mail them
> around, put them in their 'location:' fields, and so on?

no.  you're correct in saying that - at least in email - all of the
necessary standards are in place.  whether the deployed implementations
support those standards is a completely different question.  

my point is that IDNs (and URIs containing IDNs) will be
passed around from one program to another via a variety of means -
including protocols, cut-and-paste, and human transcription.
because some protocols lack support for IDNs, and because
different components get upgraded at different times,
some of those programs will be able to input or display IDNs
natively, and some will not.

the primary reason for using ACE is within protocols  - i.e.
in contexts that most humans do not see.  but we of course
realize that ACE-encoded IDNs will leak into contexts that
humans do see, just as native IDNs will leak into protocols
that cannot deal with UTF-8.  and those IDNs will in turn get
passed on to other protocols.
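
to make the two forms concrete, here is a minimal sketch in Python.  it
assumes Punycode with the "xn--" prefix as the ACE, which is the encoding
that was later standardized for IDNA; no particular ACE is chosen in this
thread, so treat that choice as an illustration only.  the helper names
to_ace and from_ace, and the host name, are hypothetical.

```python
# Sketch only: uses Python's built-in "idna" codec (Punycode, "xn--" prefix)
# as a stand-in ACE. The thread itself does not pick an ACE.

def to_ace(hostname: str) -> str:
    """Encode a native (Unicode) host name into its ASCII-only ACE form."""
    return hostname.encode("idna").decode("ascii")

def from_ace(hostname: str) -> str:
    """Decode an ACE host name back into its native (Unicode) form."""
    return hostname.encode("ascii").decode("idna")

native = "bücher.example"   # hypothetical IDN
ace = to_ace(native)        # what an ASCII-only protocol would carry

# The ACE form is plain ASCII, so software that knows nothing about IDNs
# can store and forward it unchanged.
assert ace.isascii()

# An upgraded application can turn the leaked ACE form back into native text.
round_trip = from_ace(ace)
```

a program that has not been upgraded just sees an odd-looking ASCII
name; an upgraded one can convert in either direction at its edges.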

the point I was making in my message to Dan is that the success 
or failure of whatever scheme we choose for IDNs will be determined 
by the degree to which things fail when IDNs get passed around 
in a wide variety of ways.  we therefore cannot predict success 
or failure by only looking at one or two test cases with one or 
two implementations of a single protocol.

> >a strategy for minimal disruption is:
> >
> >- affect as few components as possible  (since the effort required to
> >   deal with breakage is on a per-component basis rather than a
> >   per-line-of-code basis)
> 
> If it's by component, I doubt that ACE is better than UTF-8.
> ACE needs a lot of special considerations, scripts that have
> to be fixed, and so on, in order to work.

we need to look at this more carefully then, and choose an ACE 
which is unlikely to break things that currently deal 
successfully with domain names.  this is of paramount concern.

> >- put most of the burden of upgrading on those who benefit most
> 
> I'm not sure I agree with this. First of all, it implies that
> internationalized domain names have to be usable everywhere,
> immediately. I don't think that's really the demand we are
> facing.

depends on what you mean by 'usable'.  I think IDNs need to 
cause minimum disruption everywhere, immediately.  but my 
point is that those who benefit most from being able to input 
or display IDNs are the ones who are most likely to be willing
to upgrade their applications.  if we adopted a scheme that
required *everyone* to upgrade to new applications - regardless
of whether they saw any benefit from IDNs - it would be more 
likely to result in pushback against IDNs.

> >I think it's the other way around.  people will not give up their
> >favorite tools en masse in favor of unfamiliar tools that support UTF-8,
> 
> They don't have to. On unix, the general thing you will have to
> do is to create aliases or wrapper shell scripts for your editor
> to either set the locale to work with UTF-8 or to convert the
> file to the encoding your editor can handle and back.

I suspect that most UNIX users are unwilling to do even this much - 
they'd have to remember to run a different command when editing 
a UTF-8 file than when editing a normal file.  though perhaps emacs 
would acquire a UTF-8 mode that would automagically detect UTF-8 -
that's quite doable.
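
as a rough sketch of what such detection could look like (in Python, not
emacs lisp, and not how any particular editor does it): try to decode the
file as UTF-8 and fall back to a legacy encoding if that fails.  the
function name guess_encoding and the latin-1 fallback are assumptions
made for the example.

```python
def guess_encoding(data: bytes, fallback: str = "latin-1") -> str:
    """Heuristically pick an encoding for a file's raw octets.

    Pure ASCII is reported as "ascii". Otherwise, if the octets form valid
    UTF-8 we assume UTF-8; if not, we return the fallback encoding
    (an assumption: the user's legacy 8-bit charset).
    """
    if data.isascii():
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return fallback
    return "utf-8"

utf8_file = "café".encode("utf-8")       # b'caf\xc3\xa9'
latin1_file = "café".encode("latin-1")   # b'caf\xe9'
```

this is only a heuristic: some short legacy-encoded inputs happen to be
valid UTF-8, so a real tool would want a way for the user to override
the guess.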

> Do you expect editors that can handle ACE transparently to become prominent?
> Do you think it is sufficient to handle ACE as ASCII?

Depends on what you mean by 'sufficient'.  But a lot of people, if faced
with the choice between using a new tool (or a new script) and handling
IDNs in ACE format, would choose the latter.
 
> >especially when their existing files are in other formats.  And changing
> >to a new xterm won't automatically make the old tools work (actually
> >it will probably break some tools that expect each character to take
> >one octet).
> 
> First, UTF-8 is *extremely* well designed to work in most cases even
> for such tools (shell scripts, filter programs,...). If you can give
> a concrete example of such a breakage in such kinds of programs,
> I would appreciate it.

I agree that UTF-8 is quite well designed, but it does break the common
assumption that every character is a single octet.  anything that uses 
'.' in a byte-oriented regular expression to match a single character 
will go wrong on UTF-8 input containing non-ASCII characters, since '.' 
then matches only one octet of a multi-octet character.  the same goes 
for anything that tries to parse input by columns.  
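
a small Python sketch of both failure modes, assuming a tool that works
on raw octets (the sample string and variable names are made up for the
example):

```python
import re

text = "café"                  # 4 characters
utf8 = text.encode("utf-8")    # 5 octets: 'é' becomes 0xC3 0xA9

# '.' in a byte-oriented regex matches one octet, not one character,
# so the octet-level tool sees one more "character" than there is.
octet_matches = re.findall(rb".", utf8)
char_matches = re.findall(r".", text)

# Column-based parsing: take the first 4 "columns" of the line.
octet_cut = utf8[:4]   # cuts 'é' in half, leaving a dangling lead octet
char_cut = text[:4]    # the whole word

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the octets decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

here the octet-level cut is no longer valid UTF-8, so whatever consumes
it downstream gets garbage at the end of the field.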

Keith