[idn] information density in various languages (was Re: hi)
Martin Duerst <duerst@w3.org> wrote:
> > I think each Hangul character carries the information of only about
> > 1.5 English letters,
>
> It may be lower than Chinese, but I'm very surprised it should be that
> low.
You're right. My estimate must have been based on an anomalous sample.
Here are the counts for Genesis chapter 1:
King James: 3167 letters
Basic English: 3088 letters
Chinese Union: 778 ideographs
Korean Revised: 1201 Hangul
references:
http://www.ccim.org/bible/
http://bible.wisenet.co.kr/
So it's about 4.0 English letters per Chinese ideograph, and about 2.6
English letters per Korean Hangul.
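
The ratios can be rechecked with a few lines of Python. This is only a sketch: the counts are the ones listed above, while averaging the two English translations into a single baseline is an assumption made here for illustration, not something the counts themselves dictate.

```python
# Character counts for Genesis chapter 1, copied from the list above.
counts = {
    "King James": 3167,      # English letters
    "Basic English": 3088,   # English letters
    "Chinese Union": 778,    # ideographs
    "Korean Revised": 1201,  # Hangul
}

# Assumption: use the mean of the two English translations as the baseline.
english_avg = (counts["King James"] + counts["Basic English"]) / 2

# English letters needed to carry what one character of each script carries.
chinese_ratio = english_avg / counts["Chinese Union"]
korean_ratio = english_avg / counts["Korean Revised"]

print(f"Chinese: {chinese_ratio:.1f} English letters per ideograph")
print(f"Korean:  {korean_ratio:.1f} English letters per Hangul")
```

Using either English translation alone instead of the average gives the same figures after rounding to one decimal place.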
Each Korean Hangul takes about 2.9 octets in AMC-ACE-Z, which means a
maximal Korean domain label (20 Hangul) holds about as much information
as a 52-letter English string, which is about 17% less information than
a maximal English domain label (63 letters), and about 38% less
information than a maximal Chinese domain label (19 ideographs).
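
The Korean-versus-English comparison works out as follows. This is a rough sketch that takes the 20-Hangul and 63-letter label limits and the 2.6 letters-per-Hangul ratio as given in this message; the variable names are made up for illustration.

```python
# Figures taken from this message; names are illustrative only.
LETTERS_PER_HANGUL = 2.6   # English letters of information per Hangul
MAX_KOREAN_LABEL = 20      # Hangul in a maximal Korean label
MAX_ENGLISH_LABEL = 63     # letters in a maximal English label

# Information in a maximal Korean label, in English-letter equivalents.
korean_equiv = MAX_KOREAN_LABEL * LETTERS_PER_HANGUL

# Fraction by which that falls short of a maximal English label.
shortfall = 1 - korean_equiv / MAX_ENGLISH_LABEL

print(f"{korean_equiv:.0f} letters, {shortfall:.0%} less than English")
```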
I now retract this statement:
> Of all the languages I've looked at, Korean is by far the least dense
> when encoded using AMC-ACE-Z.
In light of the new data, I doubt that Korean is the least dense.
AMC