
Re: An argument against multiple character sets



At 22:08 00/01/23 +0100, Harald Tveit Alvestrand wrote:
> At 12:01 23.01.00 -0800, Paul Hoffman / IMC wrote:
> >There has been some discussion on this list about whether or not we should 
> >allow domain names to be created in different character sets. I believe 
> >that there is a simple argument that shows that we can't.
> >
> >Let's say I want to register a domain name that is two letters: LATIN 
> >SMALL LETTER F followed by LATIN SMALL LETTER U WITH OGONEK. If I use ISO 
> >8859-4, that would be encoded as 0x66F9. So far so good. You see a billboard 
> >with my domain name on it, and you enter it into a browser. That browser 
> >uses a different character set, let's say Unicode. The browser sends to 
> >the resolver 0x00660173.
> 
> Note: This is the UTF-16 (or UCS-2) representation of Unicode.
> 
> .....
> 
> >In short, I don't see how a solution that allows more than one character 
> >set, or even more than one encoding, will work. If others have 
> >counter-examples, I'm open to hearing them.

Not really counterexamples, but things that have come up in the
discussions about URI internationalization:

- One idea is to add the 'charset' to the actual name, e.g.
  www.foo[iso-8859-1].com. It would have to be separate for
  each label, and I hope nobody would want to consider
  that seriously.

- Another idea is to hope that names in different encodings
  don't collide. This can work in controlled situations, where
  you want to open an upgrade path from a single legacy
  encoding (legacy meaning any encoding not based on Unicode)
  to e.g. UTF-8. That's what was done in FTP
  internationalization. But I don't think it's a very good
  idea for DNS: labels can be extremely short, there can be
  more than one legacy encoding (and distinguishing e.g.
  between iso-8859-1 and iso-8859-4 won't work), and the
  stability requirements are much higher. And it would be
  strange to have to tell somebody 'sorry, you can't register
  U WITH OGONEK here, somebody else registered U WITH GRAVE
  from iso-8859-1'.
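
To make the collision concrete, here is a minimal Python sketch. It
assumes Python's standard codecs for iso-8859-1, iso-8859-4, UTF-16
and UTF-8, and the variable names are only illustrative. It shows
that the same two-letter label becomes different octets in each
encoding, and that the single octet 0xF9 is a different letter
depending on which legacy charset you assume:

```python
import unicodedata

# Illustrative label: LATIN SMALL LETTER F + LATIN SMALL LETTER U WITH OGONEK
label = "f\u0173"

# The same two characters as octets in three encodings.
as_8859_4 = label.encode("iso8859_4")
as_utf16 = label.encode("utf-16-be")
as_utf8 = label.encode("utf-8")

print(as_8859_4.hex(), as_utf16.hex(), as_utf8.hex())
# 66f9 00660173 66c5b3

# Without a charset tag, the octet 0xF9 is ambiguous.
ambiguous = b"\xf9"
in_latin1 = ambiguous.decode("iso8859_1")
in_latin4 = ambiguous.decode("iso8859_4")

print(unicodedata.name(in_latin1))  # LATIN SMALL LETTER U WITH GRAVE
print(unicodedata.name(in_latin4))  # LATIN SMALL LETTER U WITH OGONEK
```

A resolver that sees only the octets has nothing to tell the last two
cases apart, which is why "hoping they don't collide" fails once there
is more than one legacy encoding.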


> Your argument indicates that adding character sets to a list after initial 
> implementation is impossible. It doesn't mean that the initial set needs to 
> be just one, although a server has to be able to compare strings between 
> all the initial character sets - which is clearly a bit simpler if there is 
> just one of them.

And not only the server. Domain names are compared in other places
too, I guess.
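
As a rough sketch of what any such comparing party would have to do
with a fixed initial set of charsets: decode each name with its known
charset and compare the Unicode results. The names `INITIAL_CHARSETS`,
`canonical` and `same_name` are hypothetical, and the NFC normalization
plus lowercasing is only my assumption for illustration, not a proposed
matching rule:

```python
import unicodedata

# Hypothetical fixed set of charsets agreed on at initial deployment.
INITIAL_CHARSETS = {"iso8859_1", "iso8859_4", "utf-8"}

def canonical(octets: bytes, charset: str) -> str:
    """Decode octets with a known charset into one comparable Unicode form."""
    if charset not in INITIAL_CHARSETS:
        raise ValueError(f"charset not in the initial set: {charset}")
    text = octets.decode(charset)
    # Assumed matching rule: NFC normalization, then simple lowercasing.
    return unicodedata.normalize("NFC", text).lower()

def same_name(a: bytes, a_charset: str, b: bytes, b_charset: str) -> bool:
    """Compare two names that may arrive in different charsets."""
    return canonical(a, a_charset) == canonical(b, b_charset)

# Same label, different encodings: equal.
print(same_name(b"f\xf9", "iso8859_4", b"f\xc5\xb3", "utf-8"))      # True
# Same octets, different charsets: not equal.
print(same_name(b"f\xf9", "iso8859_1", b"f\xf9", "iso8859_4"))      # False
```

Note that this only works because the charset of each name is known
to the comparing party; a charset added later would be rejected, which
is the point made in the quoted paragraph above.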

> However, I think the *requirement* you are trying to state is that when a 
> domain name is represented as text on paper, the user who thinks he has 
> access to suitable input devices for that text should be able to query on 
> that string and have returned information about the domain that the text on 
> paper was intended to represent.
> 
> It's clear by now that we probably can't find a solution that accomplishes 
> this for all cases,

Yes, there are some edge cases. We have to work them out, but basically
they are similar to the existing edge cases (e.g. 0/O) in the current
DNS.

> and we probably can never solve it for the case where 
> the producer of the paper version intended to be obfuscating (see the 
> argument about C-Omicron-M or C0M versus COM),

That obfuscating is difficult should be considered a feature.

> but the closer we come, the 
> better off the users are likely to be.

Yes.


Regards,   Martin.


#-#-#  Martin J. Du"rst, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org