
Re: [idn] UTF-8 / RACE



> Using utf-8 from the system point of view is the same as using cp1256.
> They are both 8bit encoding schemes, and most applications people use
> to publish on the net, use unicode in the backend.

Are we talking about 8-bit vs 7-bit now? If we are, can I raise my hand
to say 8-bit is not sufficient for I18N?

CP1256 and UTF-8 are both 8-bit encodings, no doubt. I can also claim
BIG5 and UTF-8 are both 8-bit. Thus, a system which works for BIG5
*should* work for UTF-8 (or be easily converted). Unfortunately, that is
not true.

While an 8-bit application is already 8-bit clean, which makes it
easier, this does not mean the other infrastructure which the
application relies upon is able to deal with 8-bit characters. Sure, we
can upgrade the infrastructure and go through the pain. The question
here is whether the WG can decide that we want to do this... and I am
hearing mixed messages from the members.

> So I am assuming this is what Sherine is referring to.  It is easier to
> convert applications using 8bit encoding to use utf8 than doing RACE.
> While RACE will be good for 7bit encoding applications.
>
> I have successfully converted some apps we use locally to work with utf8
> and we had to account for the double bytes compared to one when working
> with cp1256 or iso88596, and it was not a lot of work. I actually did
> this about two years ago.

Actually, comparing UTF-8 strings is much more complex than handling
double-byte comparison. This is why Unicode Normalization and its
various forms exist. This is especially true for Arabic, where there are
presentation forms outside the U+0600 to U+06FF range which you can
only compare after normalization. Bit-wise comparison won't work in
these cases.
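
To make the point concrete, here is a minimal sketch in Python. It
assumes NFKC is the normalization form being applied (nothing above
says which form the WG would pick), and the helper name compat_equal is
hypothetical. It compares the base letter BEH (U+0628) with its
isolated presentation form (U+FE8F):

```python
import unicodedata

BEH = "\u0628"           # ARABIC LETTER BEH, inside U+0600..U+06FF
BEH_ISOLATED = "\uFE8F"  # ARABIC LETTER BEH ISOLATED FORM, a presentation form


def compat_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFKC normalization.

    NFKC is an assumption made for this sketch, not a form chosen by the WG.
    """
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


# The UTF-8 byte sequences differ, so a bit-wise compare says "not equal".
bytes_equal = BEH.encode("utf-8") == BEH_ISOLATED.encode("utf-8")

# Canonical normalization (NFC) leaves the presentation form untouched.
nfc_equal = unicodedata.normalize("NFC", BEH_ISOLATED) == BEH

# Compatibility normalization folds the presentation form to the base letter.
nfkc_equal = compat_equal(BEH, BEH_ISOLATED)

print(bytes_equal, nfc_equal, nfkc_equal)
```

Only the compatibility form treats the two as the same letter here, so
"just normalize" still leaves the question of which form to use.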

And if you like double-byte comparison, then try some of the Chinese
double-byte encodings, especially industry-standard BIG5. You could weep
as you code.
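
One example of why, as a rough sketch in Python (it relies on Python's
built-in "big5" codec, and the helper name has_ascii_range_byte is
hypothetical): the second byte of a BIG5 character can fall in the
ASCII range, so byte-oriented code can mistake half of a character for
a backslash. UTF-8 multi-byte sequences never contain bytes below 0x80.

```python
def has_ascii_range_byte(encoded: bytes) -> bool:
    """Hypothetical helper: True if any byte falls in the ASCII range."""
    return any(b < 0x80 for b in encoded)


ch = "\u8a31"  # a common Chinese character (a surname)

big5 = ch.encode("big5")   # two bytes; the trail byte is 0x5C, ASCII backslash
utf8 = ch.encode("utf-8")  # three bytes, all with the high bit set

# Naive byte-level splitting on backslash cuts the BIG5 character in half.
big5_parts = big5.split(b"\\")
utf8_parts = utf8.split(b"\\")

print(big5.hex(), has_ascii_range_byte(big5), len(big5_parts))
print(utf8.hex(), has_ascii_range_byte(utf8), len(utf8_parts))
```

So code that is "8-bit clean" for a single-byte code page still needs
to know where character boundaries are before it compares or splits
BIG5 data.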

-James Seng