Re: An argument against multiple character sets
At 22:08 00/01/23 +0100, Harald Tveit Alvestrand wrote:
> At 12:01 23.01.00 -0800, Paul Hoffman / IMC wrote:
> >There has been some discussion on this list about whether or not we should
> >allow domain names to be created in different character sets. I believe
> >that there is a simple argument that shows that we can't.
> >
> >Let's say I want to register a domain name that is two letters: LATIN
> >SMALL LETTER F followed by LATIN SMALL LETTER U WITH OGONEK. If I use ISO
> >8859-4, that would be encoded as 0x66F9. So far so good. You see a billboard
> >with my domain name on it, and you enter it into a browser. That browser
> >uses a different character set, let's say Unicode. The browser sends to
> >the resolver 0x00660173.
>
> Note: This is the UTF-16 (or UCS-2) representation of Unicode.
>
> .....
>
> >In short, I don't see how a solution that allows more than one character
> >set, or even more than one encoding, will work. If others have
> >counter-examples, I'm open to hearing them.
Not really counterexamples, but things that have come up in the
discussions about URI internationalization:
- One idea is to add the 'charset' to the actual name, e.g.
  www.foo[iso-8859-1].com. It would have to be given separately
  for each label, and I hope nobody would want to consider that
  seriously.
- Another idea is to hope that names in different encodings
  don't collide. This can work in controlled situations, where
  you want to open an upgrade path, e.g. from a single legacy
  encoding (legacy meaning any encoding not based on Unicode)
  to UTF-8. That's what was done in FTP internationalization.
  But I don't think it's a very good idea for DNS: labels can
  be extremely short, there can be more than one legacy encoding
  (and distinguishing e.g. between iso-8859-1 and iso-8859-4
  won't work), and the stability requirements are much higher.
  And it would be strange to have to tell somebody 'sorry, you
  can't register U WITH OGONEK here, somebody else registered
  U WITH GRAVE from iso-8859-1'.
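
To make the iso-8859-1 / iso-8859-4 point concrete, here is a minimal
Python sketch. It is only an illustration, not part of any proposal; the
variable names are made up, and it assumes Python's standard codecs for
the two legacy encodings. It shows that the single octet 0xF9 is a
different letter depending on which charset the reader assumes, and that
the same letter yields different octets in each encoding.

```python
import unicodedata

# One untagged octet, as it might appear in a label (hypothetical input).
raw_label_octet = bytes([0xF9])

# The same octet read under two different legacy charsets.
as_latin4 = raw_label_octet.decode("iso8859_4")  # ISO 8859-4
as_latin1 = raw_label_octet.decode("iso8859_1")  # ISO 8859-1

print(unicodedata.name(as_latin4))  # LATIN SMALL LETTER U WITH OGONEK
print(unicodedata.name(as_latin1))  # LATIN SMALL LETTER U WITH GRAVE

# Conversely, one character gives different octets in each encoding.
u_ogonek = "\u0173"
for codec in ("iso8859_4", "utf_16_be", "utf_8"):
    print(codec, u_ogonek.encode(codec).hex())
```

Nothing in the octets themselves says which reading is intended, so a
registry accepting both charsets would have to treat the two letters as
the same name.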
> Your argument indicates that adding character sets to a list after initial
> implementation is impossible. It doesn't mean that the initial set needs to
> be just one, although a server has to be able to compare strings between
> all the initial character sets - which is clearly a bit simpler if there is
> just one of them.
And not only the server. Domain names are compared in other places
as well, I guess.
> However, I think the *requirement* you are trying to state is that when a
> domain name is represented as text on paper, the user who thinks he has
> access to suitable input devices for that text should be able to query on
> that string and have returned information about the domain that the text on
> paper was intended to represent.
>
> It's clear by now that we probably can't find a solution that accomplishes
> this for all cases,
Yes, there are some edge cases. We have to work them out, but basically
they are similar to the edge cases (e.g. 0/O) in the current DNS.
> and we probably can never solve it for the case where
> the producer of the paper version intended to be obfuscating (see the
> argument about C-Omicron-M or C0M versus COM),
That obfuscating is difficult should be considered a feature.
> but the closer we come, the
> better off the users are likely to be.
Yes.
Regards, Martin.
#-#-# Martin J. Du"rst, World Wide Web Consortium
#-#-# mailto:duerst@w3.org http://www.w3.org