Re: [idn] reordering strawpoll
In a message dated 2001-11-13 4:22:59 Pacific Standard Time, lsb@postel.co.kr
writes:
>> However, your argument that it is important to reduce the *average* length
>> of encoded names, certainly doesn't apply to UTF-8 (even if it's accepted
>> that it applies to ACE, which I don't accept).
>
> Yes,
> That argument is just about justifying adding hangul syllable code block
> in addition to hangul jamo (alphabet) block : 9 octets -> 3 octets
> "compaction".
>
>> Users will never see (much less type in) UTF-8 octet string encodings
>> except in obscure debugging situations.
>
> But, without hangul syllable block, users will suffer from
> 3 times more resource consumption for a unicode hangul syllable.
> 6 hangul syllables ( 6 * 3 * 3 = 54 octets ) are allowed within utf8 63
> octets limit !!!!
I missed Soobok's point earlier, that he was talking about inefficient
representation of jamos in UTF-8. Of course, Hangul expressed in this way
does carry a significant UTF-8 size penalty (three octets per jamo), just
like other alphabetic scripts encoded at U+0800 and above (including all the
Indic scripts, Thai, Lao, Georgian, and kana).
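
For anyone who wants to check the "9 octets -> 3 octets" figure, here is a
minimal sketch in Python. The sample syllable (U+D55C) and the helper name
utf8_len are my own choices for illustration, and I am assuming that the
jamo form in question matches Unicode canonical decomposition (NFD).

```python
import unicodedata

def utf8_len(text: str) -> int:
    """Number of octets in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))

# One precomposed Hangul syllable, U+D55C (chosen only as an example).
syllable = "\ud55c"

# Canonical decomposition splits it into its conjoining jamos
# (U+1112, U+1161, U+11AB for this particular syllable).
jamos = unicodedata.normalize("NFD", syllable)

syllable_octets = utf8_len(syllable)   # one code point, three octets
jamo_octets = utf8_len(jamos)          # three code points, three octets each

print(len(jamos), "jamos:", jamo_octets, "octets")
print("1 syllable:", syllable_octets, "octets")
```

Syllables with only two jamos (no final consonant) decompose to 6 octets
rather than 9, so the factor of three is the worst case per syllable.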
I have been carefully avoiding the UTF-8 vs. ACE debate, and have no
intention of entering it now. Both approaches have advantages and
disadvantages, and certainly one disadvantage of the UTF-8 approach is its
non-optimal compaction of such scripts. However, this is not a shortcoming
of UTF-8 in general, just of its use in this specific situation where space
is at a premium. Remember that the original design goal of UTF-8 (as
specified by Ken Thompson in 1992) simply stated that "the transformation
format should not be extravagant in terms of number of bytes used for
encoding." It would be difficult indeed to claim that this goal has not been
met.
The solution for better compaction of Hangul would appear to be to allow
precomposed syllables, not merely jamos.
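
As a rough illustration of what allowing precomposed syllables buys, the
sketch below composes a jamo-only string with NFC and checks both forms
against the 63-octet limit cited above. The names LABEL_LIMIT and
fits_in_label and the sample text are hypothetical, and this says nothing
about what any actual IDN proposal would do with normalization.

```python
import unicodedata

LABEL_LIMIT = 63  # octet limit per label, as cited in the quoted message

def fits_in_label(label: str) -> bool:
    """True if the UTF-8 form of label is within LABEL_LIMIT octets."""
    return len(label.encode("utf-8")) <= LABEL_LIMIT

# Eight syllables, each made of three jamos (sample text only).
composed = "\ud55c\uae00" * 4

# The same text spelled out as conjoining jamos.
decomposed = unicodedata.normalize("NFD", composed)

# NFC puts the jamos back together into precomposed syllables.
recomposed = unicodedata.normalize("NFC", decomposed)

decomposed_octets = len(decomposed.encode("utf-8"))
recomposed_octets = len(recomposed.encode("utf-8"))

print("jamo form:     ", decomposed_octets, "octets, fits:", fits_in_label(decomposed))
print("syllable form: ", recomposed_octets, "octets, fits:", fits_in_label(recomposed))
```

The jamo spelling of this eight-syllable sample exceeds the limit, while the
precomposed spelling of the same text uses a third of the space.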
That said, I still feel it is unproductive to claim that Hangul is
"disadvantaged" or "disfavored" by UTF-8 and/or ACE as though it were the
result of some kind of linguistic apartheid. ASCII, for better or worse,
makes up and will continue to make up the lion's share of encoded text.
Other small alphabetic scripts in common use, such as Greek, Cyrillic,
Arabic, and Hebrew, were assigned codes that put them in the two-byte range
of UTF-8. Small alphabetic scripts compress more easily than large
syllabaries or logographic scripts. These are engineering facts, not
political decisions.
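
The byte ranges are easy to verify. The following sketch prints the UTF-8
length of one sample letter from each script mentioned in this message. The
particular sample characters are my own picks; any letter from the same
block should give the same count.

```python
# One sample character per script (illustrative choices only).
samples = {
    "Latin (ASCII)": "a",         # U+0061
    "Greek": "\u03b1",            # U+03B1
    "Cyrillic": "\u0434",         # U+0434
    "Arabic": "\u0628",           # U+0628
    "Hebrew": "\u05d0",           # U+05D0
    "Devanagari": "\u0915",       # U+0915
    "Thai": "\u0e01",             # U+0E01
    "Georgian": "\u10d0",         # U+10D0
    "Hangul jamo": "\u1112",      # U+1112
    "Hiragana": "\u3042",         # U+3042
    "Hangul syllable": "\ud55c",  # U+D55C
}

# UTF-8 uses 1 octet below U+0080, 2 below U+0800, and 3 below U+10000.
bytes_per_char = {
    script: len(ch.encode("utf-8")) for script, ch in samples.items()
}

for script, n in bytes_per_char.items():
    print(f"{script:16} U+{ord(samples[script]):04X}  {n} octet(s)")
```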
-Doug Ewell
Fullerton, California