Re: [idn] UTF-8 / RACE



On Mon, 28 May 2001, Keith Moore wrote:

> > Furthermore, a computer can't always recognize what is a domain name
> > and what is not. I think it's pretty darn ugly that the same domain
> > e.g. in the header of a mail message and in the body of the same
> > message should be displayed in two completely different ways.
> 
> I agree that it's ugly.  However this is a natural result of the fact
> that header information is structured according to one standard
> (RFC 822, now RFC 2822), and textual message bodies are encoded
> with any of a variety of character encoding standards.
> 
> Currently, message headers are specified to be ASCII (with non-ASCII
> portions encoded per RFC 2047), though in reality a variety of
> character sets are used without labelling.  Textual body parts may be in
> any of several different character encodings, and these are sometimes
> labelled correctly.  UTF-8 is allowed as one of the encodings for
> textual message bodies, but it is not widely used.

No, it's a natural result of Mr. Costello's suggestion that domains with
characters that can't be displayed should be displayed as ACE. I fail
to see how your comment is relevant to what I wrote. Perhaps you
should read what I wrote, and what I was replying to, one more time?


Let's say John Luser sends the following mail to you (where <oe> is o
with umlaut).

  From: John Luser <luser@f<oe>reningen.org>
  To: Keith Moore <moore@cs.utk.edu>
  Subject: Hello
  Content-Type: TEXT/PLAIN; charset=ISO-8859-1
  Content-Transfer-Encoding: 8BIT
  
  Hello Keith,
  
  Why don't you take a look at our new website on f<oe>reningen.org?
  
  Have a nice day!
  
  John


Now, let's say your local system knows about ISO 8859-1 and is able to
convert it to the internal encoding but unable to display <oe>. Thus
it displays it as [] (square box - your system's default character).


With Mr. Costello's suggestion this would be displayed to you as:

  From: John Luser <luser@px--sdn3fnfuwy4rn5wutn.org>
  To: Keith Moore <moore@cs.utk.edu>
  Subject: Hello
  
  Hello Keith,
  
  Why don't you take a look at our new website on f[]reningen.org?
  
  Have a nice day!
  
  John


(Or whatever the chosen ACE encoding turns it into.)

It's hard to recognize px--sdn3fnfuwy4rn5wutn.org as the same domain
as f[]reningen.org. It's a lot uglier as well.

A less ugly way of displaying it would be:


  From: John Luser <luser@f[]reningen.org>
  To: Keith Moore <moore@cs.utk.edu>
  Subject: Hello
  
  Hello Keith,
  
  Why don't you take a look at our new website on f[]reningen.org?
  
  Have a nice day!
  
  John


You don't know what character [] is, but at least it's displayed the
same way in both places. Since there's no way for your computer to know
that f[]reningen.org in the body is a domain name rather than a number,
a missed space or whatever, it can't turn it into ACE.
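
To make the consistency point concrete, here is a minimal sketch in
Python. It assumes a hypothetical font that can only draw printable
ASCII; the names DISPLAYABLE and render are made up for this
illustration and are not taken from any real mail reader. The same
fallback is applied to header and body, so the domain comes out
identical in both places.

```python
# Assumption: the local font can draw printable ASCII only.
DISPLAYABLE = set(map(chr, range(0x20, 0x7F)))


def render(text, displayable=DISPLAYABLE, placeholder="[]"):
    """Replace every character the font cannot draw with a placeholder."""
    return "".join(ch if ch in displayable else placeholder for ch in text)


# <oe> from the example is o with umlaut, U+00F6.
header = "From: John Luser <luser@f\u00f6reningen.org>"
body = "Why don't you take a look at our new website on f\u00f6reningen.org?"

print(render(header))
print(render(body))
```

Nothing here needs to know whether a given string is a domain name,
which is exactly why it also works for the body.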

It doesn't matter that the header domain is encoded in ACE/UTF-8 and
the body in ISO 8859-1, since your system can convert them both to its
internal format. The set of characters available in your system's
font(s) is still the same no matter what encoding is used.
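
A small Python sketch of this, under some loud assumptions: Python's
built-in "idna" codec (the xn-- form) is used only as a stand-in for
whatever ACE ends up being chosen, so its output will not match the
px-- strings in this mail, and the UTF-8 header is hypothetical.

```python
# The domain as the system holds it internally (<oe> is U+00F6).
internal = "f\u00f6reningen.org"

# Three possible wire forms of the same name.
wire_latin1 = internal.encode("iso-8859-1")  # as in the ISO 8859-1 body
wire_utf8 = internal.encode("utf-8")         # a UTF-8 header (hypothetical)
wire_ace = internal.encode("idna")           # stand-in ACE, not the px-- form

# The bytes differ on the wire ...
assert len({wire_latin1, wire_utf8, wire_ace}) == 3

# ... but each one decodes to the same internal string.
decoded = [
    wire_latin1.decode("iso-8859-1"),
    wire_utf8.decode("utf-8"),
    wire_ace.decode("idna"),
]
print(decoded)
```

Once decoded, what can be drawn depends only on the font, not on which
of the three forms arrived.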

ACE is always the worse and uglier thing to display when a system
encounters unknown characters. Thus the display problem is hardly an
argument for using ACE. I can't agree that displaying the domain in the
header as ACE, i.e. as "px--f+xg-reningen.org",
"px--sdn3fnfuwy4rn5wutn.org" or whatever, is "no uglier than anything
else they might display".

The same goes for everything else. Let's say we have a web page where
the location is "http://px--sdn3fnfuwy4rn5wutn.org" and the text on the
page says "Welcome to f[]reningen.org!". And so on...

Do you understand what I'm saying now?


> In order to reduce this "ugliness" of having different encodings
> for message header and message body it would be necessary not only
> to extend message headers to allow UTF-8 but also for UTF-8 to become
> a popular method of encoding textual message bodies.  The latter 
> seems unlikely to happen for a great many years, because even if many
> user agents support the ability to display UTF-8, the number of user 
> agents that can display other common character sets will still be greater.
> Thus, for many languages, it will be safer to send text in some other
> character set besides UTF-8. 

See above. That is not the ugliness I was referring to. What I was
referring to was the suggestion that applications wanting "to display
IDNs containing unsupported characters [...] should display the ACE".


> > ACE has its advantages but the display problem is not a reason to
> > choose ACE. I think it's obvious that it should eventually be phased
> > out. Once again, in the long perspective we need a common encoding
> > like ASCII: The Unified Encoding that is used for every single piece
> > of text that is transmitted on the Internet.
> 
> I am not sure that we will find that Holy Grail anytime soon.
> Even if we adopt UTF-8 we will still have to deal with various ways
> of encoding "rich" text.

True, but we have to start somewhere. Every journey begins with a
single step. (Although, this journey started with BCP 18. This is just
a very large and important step on the way.)


> But this is all beside the point.  The IDN WG cannot legislate the 
> encodings that are used by other applications; and it cannot legislate
> that existing applications change their behavior.  It can only recommend
> how to solve the I18N problem for domain names.  If the solution that
> IDN recommends doesn't work well for some applications, it will not get
> adopted for those applications - even if the recommended solution gets
> approved as a standard.
> 
> It's very important to be realistic about what can be achieved.

Sure, but that is hardly a reason or excuse for this WG to suggest a bad
solution. And I fail to see what it has to do with displaying domains
as ACE when the application encounters unknown characters.

/Magnus