
Re: [idn] reordering strawpoll



In a message dated 2001-11-13 4:22:59 Pacific Standard Time, lsb@postel.co.kr 
writes:

>> However, your argument that it is important to reduce the *average* length
>> of encoded names, certainly doesn't apply to UTF-8 (even if it's accepted
>> that it applies to ACE, which I don't accept).
>
> Yes,
> That argument is just about justifying adding hangul syllable code block
> in addition to hangul jamo (alphabet) block: 9 octets -> 3 octets
> "compaction".
>
>> Users will never see (much less type in) UTF-8 octet string encodings
>> except in obscure debugging situations.
>
> But, without hangul syllable block, users will suffer from
> 3 times more resource consumption for a unicode hangul syllable.
> 6 hangul syllables ( 6 * 3 * 3 = 54 octets ) are allowed within utf8 63
> octets limit !!!!

I missed Soobok's point earlier: he was talking about the inefficient 
representation of jamos in UTF-8.  Of course, Hangul expressed in this way 
does carry a significant size penalty in UTF-8 (three octets per jamo), just 
like other alphabetic scripts in ranges at or above U+0800 (including all the 
Indic scripts, Thai, Lao, Georgian, and kana).
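
To make the octet counts concrete, here is a minimal Python sketch.  The 
sample characters are my own picks for illustration (one per script) and 
the helper name utf8_len is hypothetical; the only thing relied on is 
Python's built-in UTF-8 encoder.

```python
# One sample character per script; the choices are illustrative only.
samples = {
    "Latin A": "\u0041",
    "Greek alpha": "\u03B1",
    "Cyrillic zhe": "\u0436",
    "Hebrew alef": "\u05D0",
    "Arabic alef": "\u0627",
    "Devanagari ka": "\u0915",
    "Thai ko kai": "\u0E01",
    "Lao ko": "\u0E81",
    "Georgian an": "\u10D0",
    "Hangul jamo h": "\u1112",
    "Hiragana a": "\u3042",
}


def utf8_len(ch: str) -> int:
    """Return the number of octets UTF-8 uses for the given string."""
    return len(ch.encode("utf-8"))


for name, ch in samples.items():
    # Code points below U+0080 take 1 octet, below U+0800 take 2,
    # and the rest of the BMP takes 3.
    print(f"U+{ord(ch):04X}  {name:<14} {utf8_len(ch)} octet(s)")
```

The boundary at U+0800 is what separates the two-octet scripts from the 
three-octet ones.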

I have been carefully avoiding the UTF-8 vs. ACE debate, and have no 
intention of entering it now.  Both approaches have advantages and 
disadvantages, and certainly one disadvantage of the UTF-8 approach is its 
non-optimal compaction of such scripts.  However, this is not a shortcoming 
of UTF-8 in general, just of its use in this specific situation where space 
is at a premium.  Remember that the original design goal of UTF-8 (as 
specified by Ken Thompson in 1992) simply stated that "the transformation 
format should not be extravagant in terms of number of bytes used for 
encoding."  It would be difficult indeed to claim that this goal has not been 
met.

The solution for better compaction of Hangul would appear to be to allow 
precomposed syllables, not merely jamos.
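
As a rough illustration of the difference, the following Python sketch 
compares one syllable in precomposed form with the same syllable written 
as conjoining jamos.  The sample syllable (U+D55C), the helper names, and 
the simplification that all 63 octets of a label are available for text 
are my assumptions for the example, not anything proposed in this thread.

```python
import unicodedata

LABEL_LIMIT = 63  # octet limit per label, as cited earlier in the thread

# U+D55C is a precomposed Hangul syllable; NFD decomposes it into
# three conjoining jamos (U+1112, U+1161, U+11AB).
precomposed = "\uD55C"
decomposed = unicodedata.normalize("NFD", precomposed)


def utf8_octets(text: str) -> int:
    """Return the UTF-8 encoded length of text in octets."""
    return len(text.encode("utf-8"))


def syllables_that_fit(octets_per_syllable: int, limit: int = LABEL_LIMIT) -> int:
    """Floor division, assuming the whole limit is available for text."""
    return limit // octets_per_syllable


for label, text in (("precomposed", precomposed), ("jamos", decomposed)):
    size = utf8_octets(text)
    points = " ".join(f"U+{ord(c):04X}" for c in text)
    print(f"{label:<12} {points:<22} {size} octets, "
          f"{syllables_that_fit(size)} syllables per label")
```

Not every syllable decomposes into three jamos (those without a final 
consonant decompose into two), so the threefold difference is the worst 
case rather than a constant.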

That said, I still feel it is unproductive to claim that Hangul is 
"disadvantaged" or "disfavored" by UTF-8 and/or ACE as though it were the 
result of some kind of linguistic apartheid.  ASCII, for better or worse, 
makes up and will continue to make up the lion's share of encoded text.  
Other small alphabetic scripts in common use, such as Greek, Cyrillic, 
Arabic, and Hebrew, were assigned codes that put them in the two-byte range 
of UTF-8.  Small alphabetic scripts compress more easily than large 
syllabaries or logographic scripts.  These are engineering facts, not 
political decisions.

-Doug Ewell
 Fullerton, California