Re: [idn] information density in various languages (was Re: hi)
- To: "IETF idn working group" <idn@ops.ietf.org>
- Subject: Re: [idn] information density in various languages (was Re: hi)
- From: "Soobok Lee" <lsb@postel.co.kr>
- Date: Fri, 26 Oct 2001 00:56:15 +0900
----- Original Message -----
From: "Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord>
> You're right. My estimate must have been based on an anomalous sample.
> Here are the counts for Genesis chapter 1:
>
> King James: 3167 letters
> Basic English: 3088 letters
> Chinese Union: 778 ideographs
> Korean Revised: 1201 Hangul
>
> references:
> http://www.ccim.org/bible/
> http://bible.wisenet.co.kr/
>
> So it's about 4.0 English letters per Chinese ideograph, and about 2.6
> English letters per Korean Hangul.
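
The two ratios quoted above can be reproduced with a few lines of Python. This is only a sketch: the counts are the ones given in the message, but averaging the two English translations before dividing is an assumption made here, since the message does not say which English count was used. The names `counts` and `letters_per_char` are illustrative.

```python
# Character counts for Genesis chapter 1, as quoted in the message.
counts = {
    "King James": 3167,      # English letters
    "Basic English": 3088,   # English letters
    "Chinese Union": 778,    # ideographs
    "Korean Revised": 1201,  # Hangul syllables
}


def letters_per_char(english_letters, other_chars):
    """English letters needed per character of the other script."""
    return english_letters / other_chars


# Assumption: use the mean of the two English translations as the baseline.
english_avg = (counts["King James"] + counts["Basic English"]) / 2

chinese_ratio = letters_per_char(english_avg, counts["Chinese Union"])
korean_ratio = letters_per_char(english_avg, counts["Korean Revised"])

print(f"English letters per Chinese ideograph: {chinese_ratio:.1f}")
print(f"English letters per Korean Hangul:     {korean_ratio:.1f}")
```

Using either English translation alone instead of the mean shifts each ratio by less than 0.1.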
That must have been hard work. :-)
I know that natural Chinese sentences contain many single-character
verbs, pronouns, adjectives, adverbs, and interrogatives, while most
Chinese nouns are two or three characters long. Compound Chinese nouns
and business names are combinations of those nouns.
If we study Chinese nouns and their corresponding English nouns or
transcriptions, the ratio of 4.0 may drop to 2.x, I guess.
The ratio of 2.6 for Hangul sentences may also drop to a lower 2.x, I believe.
I will suggest a new source of Chinese nouns later.
http://search.yahoo.com/bin/search?p=chinese+english will help in the meantime.
Chinese participants are welcome to join this analysis.
Thanks.
Soobok Lee
>
> Each Korean Hangul takes about 2.9 octets in AMC-ACE-Z, which means a
> maximal Korean domain label (20 hangul) holds about as much information
> as a 52-letter English string, which is about 17% less information than
> a maximal English domain label (63 letters), and about 38% less
> information than a maximal Chinese domain label (19 ideographs).
>
> I now retract this statement:
>
> > Of all the languages I've looked at, Korean is by far the least dense
> > when encoded using AMC-ACE-Z.
>
> In light of the new data, I doubt that Korean is the least dense.
>
> AMC
>
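
The label-capacity comparison in the quoted text can be checked the same way. The sketch below assumes the rounded ratios from the message (2.6 English letters per Hangul, 4.0 per ideograph) and the maximal label lengths it states (20 Hangul, 19 ideographs, 63 English letters). The helper names are hypothetical.

```python
def equivalent_letters(chars, letters_per_char):
    """Express a label of `chars` characters as a number of English letters."""
    return chars * letters_per_char


def shortfall(smaller, larger):
    """Fraction by which `smaller` falls short of `larger`."""
    return 1 - smaller / larger


# Maximal label sizes and rounded ratios, taken from the message.
korean_label = equivalent_letters(20, 2.6)   # about 52 English letters
chinese_label = equivalent_letters(19, 4.0)  # about 76 English letters
english_label = 63

print(f"Korean label is roughly {korean_label:.0f} English letters")
print(f"vs. English label: {shortfall(korean_label, english_label):.0%} less")
print(f"vs. Chinese label: {shortfall(korean_label, chinese_label):.0%} less")
```

With these rounded inputs the English comparison matches the quoted 17%, while the Chinese comparison comes out nearer 32% than 38%, so the 38% figure presumably rests on inputs not spelled out in the message.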