
Re: [idn] Comments on protocol drafts



At 00:32 08-02-00 , Martin J. Duerst wrote:

>Please don't use CJK as the main example. They use two bytes
>all the time anyway, so using 3 (UTF-8) or 4 (UTF-5) or so
>isn't that a big hit. And label lengths, in terms of characters,
>are going to be much smaller for CJK than for alphabetic
>scripts. The main problem cases are scripts such as Devanagari,
>Bengali, Tamil, Georgian,... which are alphabetic but require
>3 bytes in UTF-8. 

Please consider Vietnamese as another case:
         - the official form ("Quoc Ngu") is Romanised
         - the common form (== the official form) is Romanised
         - the Romanised form has been used for centuries,
           while the older form has been dead (for non-historical uses) for centuries
         - the alphabet could fit into 8 bits by itself, but UTF-8 needs
           2 or 3 bytes for each accented letter
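
To make the space cost concrete, here is a rough sketch in Python that counts
characters and UTF-8 bytes for a short Vietnamese sample. The helper name
utf8_profile and the sample strings are my own illustration, not anything
taken from the drafts; the Devanagari sample is included only for comparison
with the scripts mentioned above.

```python
# Illustrative sketch only: utf8_profile and the samples are hypothetical,
# chosen to show how UTF-8 size grows for accented Vietnamese letters.

def utf8_profile(text):
    """Return (number of characters, number of UTF-8 bytes) for text."""
    sizes = [len(ch.encode("utf-8")) for ch in text]
    return len(text), sum(sizes)

# Escapes are used so the precomposed code points are unambiguous.
samples = {
    "Vietnamese (Tieng Viet)": "Ti\u1ebfng Vi\u1ec7t",
    "Devanagari (Hindi)": "\u0939\u093f\u0928\u094d\u0926\u0940",
    "Plain ASCII": "Tieng Viet",
}

for name, text in samples.items():
    chars, octets = utf8_profile(text)
    # An 8-bit character set would need exactly one byte per character.
    print(f"{name}: {chars} chars, {octets} UTF-8 bytes, {chars} bytes in an 8-bit set")
```

In this sample only two of the ten characters carry diacritics, yet they
account for six of the fourteen UTF-8 bytes; text with denser accents
grows accordingly.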

For more reading on Vietnamese, see RFC 1456, which defines VIQR, a
quoted-readable encoding for Vietnamese, as well as an 8-bit character
set encoding (VISCII). VIQR is widely used, e.g. it is common in the
Vietnamese culture group on USENET.

Ran
rja@inet.org