[idn] information density in various languages (was Re: hi)



Martin Duerst <duerst@w3.org> wrote:

> > I think each Hangul character carries the information of only about
> > 1.5 English letters,
>
> It may be lower than Chinese, but I'm very surprised it should be that
> low.

You're right.  My estimate must have been based on an anomalous sample.
Here are the counts for Genesis chapter 1:

King James:     3167 letters
Basic English:  3088 letters
Chinese Union:   778 ideographs
Korean Revised: 1201 Hangul

references:
http://www.ccim.org/bible/
http://bible.wisenet.co.kr/

So it's about 4.0 English letters per Chinese ideograph, and about 2.6
English letters per Korean Hangul.
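
As a quick check on those ratios, here is a minimal sketch that divides
each English letter count by each non-English character count.  The
counts are copied from the list above; the names ENGLISH, OTHER, and
letters_per_char are made up for this illustration.

```python
# Character counts for Genesis chapter 1, copied from the list above.
ENGLISH = {"King James": 3167, "Basic English": 3088}
OTHER = {"Chinese Union": 778, "Korean Revised": 1201}


def letters_per_char(english_letters: int, other_chars: int) -> float:
    """Average number of English letters carried by one character."""
    return english_letters / other_chars


# One ratio for every (English translation, other translation) pair.
ratios = {
    (eng, oth): letters_per_char(e, o)
    for eng, e in ENGLISH.items()
    for oth, o in OTHER.items()
}

for (eng, oth), r in ratios.items():
    print(f"{eng:13s} / {oth:14s}: {r:.2f}")
```

The Chinese ratios fall on either side of 4.0 and the Korean ratios on
either side of 2.6, depending on which English translation is used as
the baseline.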

Each Korean Hangul takes about 2.9 octets in AMC-ACE-Z, which means a
maximal Korean domain label (20 Hangul) holds about as much information
as a 52-letter English string.  That is about 17% less information than
a maximal English domain label (63 letters), and about 38% less
information than a maximal Chinese domain label (19 ideographs).
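
The label comparison can be sketched the same way.  This assumes the
rounded figures quoted in this message (2.6 and 4.0 English letters per
character, and maximal labels of 20 Hangul, 19 ideographs, and 63
letters); the constant names and percent_less are made up for this
illustration.

```python
# Rounded estimates taken from this message; none are exact.
LETTERS_PER_HANGUL = 2.6
LETTERS_PER_IDEOGRAPH = 4.0
MAX_HANGUL = 20        # maximal Korean label
MAX_IDEOGRAPHS = 19    # maximal Chinese label
MAX_ENGLISH = 63       # maximal English label, in letters


def percent_less(smaller: float, larger: float) -> float:
    """How much less 'smaller' is than 'larger', as a percentage."""
    return 100 * (1 - smaller / larger)


# Capacity of each maximal label, measured in English-letter equivalents.
korean = MAX_HANGUL * LETTERS_PER_HANGUL          # about 52
chinese = MAX_IDEOGRAPHS * LETTERS_PER_IDEOGRAPH  # about 76

print(f"Korean label:   {korean:.0f} English letters")
print(f"vs. English:    {percent_less(korean, MAX_ENGLISH):.1f}% less")
print(f"vs. Chinese:    {percent_less(korean, chinese):.1f}% less")
```

With these rounded inputs the English comparison comes out at about 17%,
while the Chinese comparison comes out nearer 32% than 38%, so the 38%
figure may rest on unrounded or slightly different inputs.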

I now retract this statement:

> Of all the languages I've looked at, Korean is by far the least dense
> when encoded using AMC-ACE-Z.

In light of the new data, I doubt that Korean is the least dense.

AMC