Re: [idn] What's wrong with skwan-utf8?



At 00/12/26 09:52 +0100, Patrik Fältström wrote:
>At 16.19 -0800 00-12-25, D. J. Bernstein wrote:
>>  > Also, there is a question whether UTF-8 is really what we should use.
>>
>>Many systems use UTF-8 internally. It takes less work for them to read
>>and write UTF-8 than for them to handle text in other character sets.
>
>UTF-8 is not a character set. It is an encoding of the Unicode/10646 
>character set.

Please stop using the term 'character set'. The MIME documents have
made that term unusable.


>  I was not questioning Unicode/10646, but if the encoding is right. I 
> personally feel an ACE encoding is easier for the short term, and a 32 
> bit solution for the longterm (but with a dictionary approach).

Why a 32 bit solution? Both Unicode and ISO 10646 are now just 17*(256**2)
codepoints, i.e. 1,114,112 or roughly a million, which fits in 21 bits.
For ISO 10646, that has changed rather recently, so that fact may not yet
have propagated everywhere.
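
The arithmetic behind that figure can be checked with a few lines of
Python. This is only a sketch of the calculation; the variable names are
made up for illustration.

```python
# Size of the Unicode / ISO 10646 code space: 17 planes of 256*256 codepoints.
PLANES = 17
CODEPOINTS_PER_PLANE = 256 ** 2

total_codepoints = PLANES * CODEPOINTS_PER_PLANE

# The highest codepoint is one less than the total, since numbering starts at 0.
highest_codepoint = total_codepoints - 1

# Number of bits needed to represent the highest codepoint.
bits_needed = highest_codepoint.bit_length()

print(total_codepoints)        # 1114112
print(hex(highest_codepoint))  # 0x10ffff
print(bits_needed)             # 21
```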


>I don't feel a system based on a weird encoding such as UTF-8 where we 
>"penalize" some characters in the character set is the right way of going 
>if we are to find _THE_RIGHT_ solution.

Well, to use an analogy, maybe we should all become as poor as possible,
to make sure there is no injustice between richer and poorer people
anymore?

Seriously, I think that 32 bits is overengineering by throwing resources
out of the window. ACE is overengineering by bit-fiddlers. The number
of different ACE proposals is a good measure of the arbitrariness of
the ACE approach.

Also, the research I know about typical Internet traffic (i.e. Web pages)
shows that UTF-8 is not less efficient than UTF-16 for e.g. Japanese.
Even if we make everything multilingual, there is still a lot of
ASCII or binary protocol overhead. And going to 32 bits for the
occasional Egyptian hieroglyph or Klingon character that may benefit
from 32 bits when it will be encoded in 5 or 10 years isn't really
that attractive.
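
The effect of ASCII overhead can be illustrated with a small Python
sketch. The sample strings below are invented for this illustration and
are not taken from the research mentioned above; `encoded_sizes` is a
hypothetical helper name.

```python
# Compare encoded sizes of Japanese text with and without ASCII markup.
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# Eight Japanese characters, all from the Basic Multilingual Plane.
plain = "日本語こんにちは"

# The same characters wrapped in typical ASCII markup (made-up sample).
page = ("<html><head><title>日本語</title></head>"
        "<body><p>こんにちは</p></body></html>")

plain_sizes = encoded_sizes(plain)
page_sizes = encoded_sizes(page)

print(plain_sizes)  # bare text: UTF-16 is smaller than UTF-8
print(page_sizes)   # with markup: UTF-8 is smaller than UTF-16
```

For the bare text, UTF-8 needs three bytes per character and UTF-16 two.
Once the ASCII markup is counted, each ASCII character costs one byte in
UTF-8 but two in UTF-16 and four in UTF-32, so the balance tips the
other way.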


>>Quite a few programs will Just Work(tm) if IDNs are defined as UTF-8,
>>while they'll have to be upgraded if IDNs are defined any other way.
>
>This is completely false because the big thing regarding IDN is definitely 
>not what charset to use, or what encoding of Unicode (if it is UTF-8 or 
>ACE encoded) but the need for the nameprep algorithm.

This is very much true on the registration side, and in theory on
the end user side. But in practice, on the end user side it's rather
irrelevant. A typical example is an 'fi' ligature, which should
not be allowed, but mapped to two separate characters. If we help
to make sure nobody registers a name with an 'fi' ligature, then
we are fine: nobody will take the pains to input an 'fi' ligature
when typing in an 'f', 'i' letter sequence seen on a billboard or
name card.
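
The ligature mapping can be sketched in Python. This assumes the mapping
step behaves like Unicode compatibility normalization (NFKC) followed by
lowercasing; that is an assumption made for this illustration, not a
statement of what the nameprep algorithm specifies, and `fold_name` is a
hypothetical helper name.

```python
import unicodedata

def fold_name(name):
    """Map compatibility characters (such as ligatures) to plain letters.

    Assumption: NFKC plus lowercasing stands in for the real mapping step.
    """
    return unicodedata.normalize("NFKC", name).lower()

# U+FB01 is the single-character 'fi' ligature.
with_ligature = "\ufb01le.example"
folded = fold_name(with_ligature)

print(len(with_ligature), len(folded))  # the folded name is one character longer
print(folded)                           # file.example
```

A registry could run such a mapping before accepting a name, so that
the ligature form never ends up registered in the first place.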

And it's true that it's easier for many programs to work with
UTF-8, because they already support it, rather than with some
as yet unselected (and maybe not even thought of) ACE.
This applies in particular to web browsers, the place where
people are looking most for multilingual domain names.
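
One reason byte-oriented programs tend to cope with UTF-8 is that every
byte of a multi-byte sequence has its high bit set, so ASCII delimiters
such as '.' never appear inside an encoded character. A minimal Python
sketch, using a made-up example name:

```python
# Split a UTF-8 encoded name on the ASCII dot, working purely on bytes.
name = "bücher.example"          # made-up name for illustration
raw = name.encode("utf-8")

labels = raw.split(b".")

# Every byte belonging to a non-ASCII character is >= 0x80,
# so it can never be mistaken for an ASCII delimiter.
non_ascii_bytes = [b for b in raw if b >= 0x80]

# Decoding each label separately still yields valid text.
decoded = [label.decode("utf-8") for label in labels]

print(labels)
print(decoded)  # ['bücher', 'example']
```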


Regards,   Martin.