[idn] Re: ave length, best compression etc
----- Original Message -----
From: "James Seng/Personal" <James@Seng.cc>
> Thus, the idea is that if we do have a compression algorithm, then we
> want the more frequently occurring strings to be in the compressed
> set. Thus, a rearrangement algorithm like LSB, which basically rearranges
> certain characters so that they can be compressed better, is generally
> a good idea.
>
> OTOH, we do not know whether this rearrangement will produce better
> compression in the long run. It may turn out that the strings which
> fall in the expanded set are used more often in the future.
>
I agree that word frequency distributions could undergo some shifts
in the long run, but character frequencies would not change
significantly.
To coin new words (technical terms or business names), we often
combine letters or words that are easy to remember and to type.
The 256 most frequent Hangul syllables have a cumulative frequency
of 88.2%, and the top 512 reach 99.9%.
The 256 most frequent Han characters have a cumulative frequency
of 58.2%, and the top 512, 1024, 2048 and 4096 reach
72.8%, 85.9%, 95.4% and 99.4%, respectively.
4096 is roughly the size of the Simplified Chinese character sets.
I think this frequency distribution will not change significantly
this century.
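For anyone who wants to check figures like these against their own corpus,
here is a minimal sketch of the cumulative-coverage calculation. The function
name and the toy sample are hypothetical and only for illustration; this is
not the tool used to produce the percentages above, and a real measurement
needs a large, representative corpus.

```python
from collections import Counter


def cumulative_coverage(text, sizes):
    """Return {k: share of all characters covered by the k most frequent ones}."""
    # Count every non-whitespace character in the sample.
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {k: 0.0 for k in sizes}
    # Frequencies sorted from most to least common.
    ranked = [n for _, n in counts.most_common()]
    return {k: sum(ranked[:k]) / total for k in sizes}


# Hypothetical toy sample, far too small to be meaningful.
sample = "가나다가나가 가나다라"
coverage = cumulative_coverage(sample, [1, 2, 4])
for k, share in coverage.items():
    print(f"top {k}: {share:.1%}")
```

With a real corpus, sizes such as 256, 512, 1024, 2048 and 4096 would give
numbers comparable to the ones quoted above.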
> There is always a holy grail of compression. And we could spend donkey's
> years arguing over it and never get to our goal, i.e., IDN. Let's not
> forget that, and hopefully we can get IDN in a timely fashion.
I fully agree with you about this concern.
However, as my experiments show,
the label efficiencies of LAMCZ and LDUDE for long Han/Hangul domains
approach that of UCS-2 (= 2*n).
IMHO, we are not far from the summit.
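To make the 2*n baseline concrete, here is a small sketch that counts the
octets a label needs in UCS-2 and, for comparison, in UTF-8. The helper names
and the sample label are hypothetical, and the sketch does not implement
LAMCZ or LDUDE; it only shows the reference size those encodings are being
compared against.

```python
def ucs2_octets(label):
    """Octets needed for a label in UCS-2: two per character, BMP only."""
    if any(ord(ch) > 0xFFFF for ch in label):
        raise ValueError("UCS-2 cannot represent characters outside the BMP")
    return 2 * len(label)


def utf8_octets(label):
    """Octets needed for the same label in UTF-8."""
    return len(label.encode("utf-8"))


# Hypothetical six-syllable Hangul label.
label = "한국어도메인"
print(len(label), ucs2_octets(label), utf8_octets(label))
```

Each Hangul syllable takes three octets in UTF-8 but two in UCS-2, which is
why 2*n is the natural yardstick for long Han/Hangul labels.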
Soobok Lee
>
> -James Seng
>
>