[idn] Re: ave length, best compression etc

To: "James Seng/Personal" <James@Seng.cc>, <idn@ops.ietf.org>
Subject: [idn] Re: ave length, best compression etc
From: "Soobok Lee" <lsb@postel.co.kr>
Date: Thu, 12 Jul 2001 10:31:26 +0900

----- Original Message -----
From: "James Seng/Personal" <James@Seng.cc>

> Thus, the idea is that if we do have a compression algorithm, then we
> want the more frequently occurring strings to be in the compressed
> set. Thus, a rearrangement algorithm like LSB, which basically rearranges
> certain characters so that they can be compressed better, is generally
> a good idea.
>
> OTOH, we do not know whether this rearrangement will produce better
> compression in the long run. It may turn out that the strings which
> fall in the expanded set are used more often in the future.
>

I agree that word frequency distributions could shift in the long
run, but character frequencies would not change significantly.
To coin new words (technical terms or business names), we often
combine letters or words that are easy to remember and to type.

The 256 most frequent Hangul syllables have a cumulative frequency
of 88.2%, and the top 512 reach 99.9%.

The 256 most frequent Han characters have a cumulative frequency
of 58.2%; the top 512, 1024, 2048 and 4096 reach 72.8%, 85.9%,
95.4% and 99.4%, respectively.
4096 is roughly the size of the Simplified Chinese character set.

I think this frequency distribution will not change significantly
this century.
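
For readers who want to reproduce cumulative figures of this kind on
their own corpus, here is a minimal sketch. It is not the procedure
used for the numbers above; the function name, the default cutoffs and
the choice to ignore whitespace are assumptions made for illustration.

```python
from collections import Counter


def cumulative_coverage(text, cutoffs=(256, 512, 1024, 2048, 4096)):
    """Return {cutoff: fraction of all character occurrences that are
    covered by the `cutoff` most frequent characters in `text`}.

    Hypothetical helper: whitespace is ignored, every other character
    is counted as-is.
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {k: 0.0 for k in cutoffs}
    # Occurrence counts sorted from most to least frequent character.
    ranked = [n for _, n in counts.most_common()]
    return {k: sum(ranked[:k]) / total for k in cutoffs}


if __name__ == "__main__":
    # Toy input; a real measurement needs a large, representative corpus.
    for k, share in cumulative_coverage("aaabbc", cutoffs=(1, 2, 3)).items():
        print(f"top {k}: {share:.1%}")
```

Feeding it a large body of Hangul or Han text with the default cutoffs
gives a table in the same shape as the percentages quoted above.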


> There is always a holy grail of compression. And we could spend donkey
> years arguing over it and never get to our goal, i.e., IDN. Let's not
> forget that and hopefully we can get IDN in a timely fashion.

I fully agree with you about this concern.

However, as my experiments show, LAMCZ's and LDUDE's label
efficiencies for long Han/Hangul domain names approach that of
UCS-2 (= 2*n octets for n characters).
IMHO, we are not far from the summit.
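
The UCS-2 yardstick can be made concrete with a small sketch. It does
not implement LAMCZ or LDUDE; it only computes the 2*n baseline and
expresses some other encoded length relative to it. The function names
are hypothetical, and UTF-8 is used in the usage lines purely as a
familiar comparison point.

```python
def ucs2_octets(label):
    """Octets needed to store `label` in UCS-2: 2*n for n characters.

    UCS-2 covers only the Basic Multilingual Plane, so characters
    above U+FFFF are rejected.
    """
    if any(ord(ch) > 0xFFFF for ch in label):
        raise ValueError("UCS-2 covers only the Basic Multilingual Plane")
    return 2 * len(label)


def relative_length(encoded_len, label):
    """Length of some encoding of `label` divided by the UCS-2 baseline.

    A value near 1.0 means the encoding is about as compact as UCS-2.
    """
    return encoded_len / ucs2_octets(label)


if __name__ == "__main__":
    label = "\ud55c\uae00"  # two precomposed Hangul syllables
    utf8_len = len(label.encode("utf-8"))
    print(ucs2_octets(label), utf8_len, relative_length(utf8_len, label))
```

Passing the length of an ACE-encoded label as `encoded_len` gives the
kind of ratio meant by "approaching UCS-2" above.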

Soobok Lee

>
> -James Seng
>
>
