
Re: [idn] information density in various languages (was Re: hi)




----- Original Message ----- 
From: "Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord>
> You're right.  My estimate must have been based on an anomalous sample.
> Here are the counts for Genesis chapter 1:
> 
> King James:     3167 letters
> Basic English:  3088 letters
> Chinese Union:   778 ideographs
> Korean Revised: 1201 Hangul
> 
> references:
> http://www.ccim.org/bible/
> http://bible.wisenet.co.kr/
> 
> So it's about 4.0 English letters per Chinese ideograph, and about 2.6
> English letters per Korean Hangul.
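
For reference, the quoted ratios can be reproduced from the counts above.
This is only a rough sketch: it assumes the English figure is the mean of
the King James and Basic English counts, which the quoted message does
not actually say. The names below are made up for illustration.

```python
# Character counts for Genesis chapter 1, as quoted above.
counts = {
    "King James": 3167,      # letters
    "Basic English": 3088,   # letters
    "Chinese Union": 778,    # ideographs
    "Korean Revised": 1201,  # Hangul
}

# Assumption: use the mean of the two English versions as the baseline.
english_letters = (counts["King James"] + counts["Basic English"]) / 2

# English letters needed per Chinese ideograph and per Korean Hangul.
chinese_ratio = english_letters / counts["Chinese Union"]
korean_ratio = english_letters / counts["Korean Revised"]

print(f"English letters per Chinese ideograph: {chinese_ratio:.1f}")
print(f"English letters per Korean Hangul:     {korean_ratio:.1f}")
```

Using either English version alone instead of the mean moves the
figures only slightly.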

That must have been hard work. :-)

I know that natural Chinese sentences contain many single-character
verbs, pronouns, adjectives, adverbs and interrogatives, while most
Chinese nouns are two or three characters long. Compound Chinese nouns
and business names are formed by combining those nouns.

If we compare Chinese nouns with their corresponding English nouns or
transcriptions, the ratio of 4.0 may drop to 2.X, I guess.
The ratio of 2.6 for Hangul sentences may also drop to a lower 2.X, I believe.

I will suggest a new source of Chinese nouns later.
http://search.yahoo.com/bin/search?p=chinese+english will help in the meantime.

Chinese participants are welcome to join this analysis.

Thanks.

Soobok Lee
 



> 
> Each Korean Hangul takes about 2.9 octets in AMC-ACE-Z, which means a
> maximal Korean domain label (20 hangul) holds about as much information
> as a 52-letter English string, which is about 17% less information than
> a maximal English domain label (63 letters), and about 38% less
> information than a maximal Chinese domain label (19 ideographs).
> 
> I now retract this statement:
> 
> > Of all the languages I've looked at, Korean is by far the least dense
> > when encoded using AMC-ACE-Z.
> 
> In light of the new data, I doubt that Korean is the least dense.
> 
> AMC
>
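
The Korean-versus-English comparison quoted above can be checked with a
few lines of arithmetic. This is a sketch using only figures stated in
the thread (20 Hangul per maximal label, about 2.6 English letters per
Hangul, 63 letters per maximal English label); the variable names are
made up for illustration, and only that one comparison is reproduced.

```python
# Figures taken from the quoted message.
MAX_ENGLISH_LETTERS = 63    # maximal English domain label
MAX_KOREAN_HANGUL = 20      # maximal Korean label in AMC-ACE-Z, as quoted
LETTERS_PER_HANGUL = 2.6    # English letters per Korean Hangul, as quoted

# English-letter equivalent of a maximal Korean label.
korean_equivalent = MAX_KOREAN_HANGUL * LETTERS_PER_HANGUL

# Fraction by which that falls short of a maximal English label.
shortfall = 1 - korean_equivalent / MAX_ENGLISH_LETTERS

print(f"Equivalent English letters: {korean_equivalent:.0f}")
print(f"Shortfall versus English:   {shortfall:.0%}")
```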