[idn] REORDERING : makes labels shorter or longer ?
----- Original Message -----
From: "Adam M. Costello" <idn.amc+0@nicemice.net.RemoveThisWord>
> Soobok Lee <lsb@postel.co.kr> wrote:
>
> > Even For GROUPS of those rare cases, we get Always SHORTER labels than
> > usual.
>
> As has already been pointed out, this is impossible. Here is one class
> of labels that are made longer by reordering: Labels that use lots of
> "uncommon" code points from the bottom 2K (or whatever) of the block.
> Without reordering, these code points are already close together, but
> with reordering they will be scattered all over the block, resulting in
> a longer ACE. Of course such labels are supposed to be very rare.
>
> All compression schemes must make some inputs shorter and some longer.
> The goal is to make the more common ones a lot shorter, and the very
> rare ones only slightly longer.
>
Exactly right!
That is what I meant, and I already said so in my previous article,
which you may have missed. Look at the uppercased "GROUPS": the "rare"
characters are ALL the ones not in the frequent set.
To make it more concrete, I ran an experiment.
For 240823 han ML.com labels, I got the following statistics.
From now on, I will focus on only one question at a time
for productive discussion.
I won't forget the other pending questions. Please be patient.
Please comment on the following result.
Soobok Lee
====================================================================
AMCZ < LAMCZ : 4764 (reordering produced longer labels for about 2.0% of han ML.com labels)
AMCZ = LAMCZ : 24553
AMCZ > LAMCZ : 211506
Total: 240823
AMCZ < LAMCZ : average # of overhead chars for each label length N
 N  labels  +overhead
 1     141      +1.00
 2    2840      +1.13
 3     948      +1.15
 4     675      +1.19
 5     107      +1.18
 6      32      +1.53
 7      12      +1.33
 8       8      +1.38
 9       1      +2.00
AMCZ > LAMCZ : average # of saved chars for each label length N
 N  labels  -saved
 1    1199   -1.00
 2   42119   -1.32
 3   35280   -1.88
 4   85233   -2.79
 5   20589   -3.59
 6   15009   -4.71
 7    5151   -5.58
 8    3069   -6.58
 9    1412   -7.39
10    1201   -8.49
11     474   -9.26
12     397  -10.28
13     164  -10.86
14     122  -11.79
15      50  -13.06
16      29  -13.55
17       8  -13.00
For N>9, reordering never produces a longer ACE label.
For each 1<=N<=9, the net gain (chars saved minus overhead chars) is positive.
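The net-gain claim for the han labels can be checked directly from the two lists above. The following is only a rough sketch: it assumes the +overhead and -saved columns are per-label averages, and the names overhead, saved and net_gain are made up for this illustration. Because the averages are rounded to two decimals, the totals are approximate.

```python
# (number of labels, average chars per label) for each label length N,
# copied from the han ML.com lists above. Assumption: the second value
# is an average over the labels counted in the first.
overhead = {1: (141, 1.00), 2: (2840, 1.13), 3: (948, 1.15),
            4: (675, 1.19), 5: (107, 1.18), 6: (32, 1.53),
            7: (12, 1.33), 8: (8, 1.38), 9: (1, 2.00)}
saved = {1: (1199, 1.00), 2: (42119, 1.32), 3: (35280, 1.88),
         4: (85233, 2.79), 5: (20589, 3.59), 6: (15009, 4.71),
         7: (5151, 5.58), 8: (3069, 6.58), 9: (1412, 7.39)}


def net_gain(n):
    """Approximate ACE chars saved minus ACE chars added for label length n."""
    lost_labels, lost_avg = overhead.get(n, (0, 0.0))
    won_labels, won_avg = saved.get(n, (0, 0.0))
    return won_labels * won_avg - lost_labels * lost_avg


for n in sorted(saved):
    print(f"N={n}: net gain of roughly {net_gain(n):.0f} chars")
```

For N=1 both averages are exactly 1.00, so the net gain is 1199 - 141 = 1058 chars.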
27. unihan-4096
| N| FREQ| N*FREQ| SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP|
| 1| 4427| 4427| 14957(3.38)| 13899(3.14)| 7.07|
| 2| 57418| 114836| 384468(3.35)| 332156(2.89)|13.61|
| 3| 41335| 124005| 401283(3.24)| 336045(2.71)|16.26|
| 4| 89296| 357184| 1139404(3.19)| 902406(2.53)|20.80|
| 5| 21091| 105455| 332420(3.15)| 258709(2.45)|22.17|
| 6| 15128| 90768| 284134(3.13)| 213522(2.35)|24.85|
| 7| 5181| 36267| 112576(3.10)| 83844(2.31)|25.52|
| 8| 3082| 24656| 76272(3.09)| 56083(2.27)|26.47|
| 9| 1417| 12753| 39319(3.08)| 28883(2.26)|26.54|
| 10| 1203| 12030| 37136(3.09)| 26935(2.24)|27.47|
| 11| 474| 5214| 16072(3.08)| 11684(2.24)|27.30|
| 12| 398| 4776| 14714(3.08)| 10632(2.23)|27.74|
| 13| 164| 2132| 6532(3.06)| 4751(2.23)|27.27|
| 14| 122| 1708| 5232(3.06)| 3794(2.22)|27.48|
| 15| 50| 750| 2283(3.04)| 1630(2.17)|28.60|
| 16| 29| 464| 1419(3.06)| 1026(2.21)|27.70|
| 17| 8| 136| 405(2.98)| 301(2.21)|25.68|
|All| 240823| 897561| 2868626(3.20)| 2286300(2.55)|20.30|
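The columns of the table above are not defined in the message, but the numbers are consistent with this reading: the value in parentheses is the ACE total divided by N*FREQ (ACE chars per code point), and COMP is the percentage by which LAMCZ is shorter than AMCZ. A small sketch under that assumption follows; the function names are hypothetical.

```python
def chars_per_code_point(ace_total, code_points):
    """Average ACE chars emitted per input code point (the value in parentheses)."""
    return ace_total / code_points


def comp_percent(amcz_total, lamcz_total):
    """Percentage by which the LAMCZ total is shorter than the AMCZ total."""
    return (amcz_total - lamcz_total) / amcz_total * 100.0


# "All" row of the unihan-4096 table: N*FREQ, SUM OF AMCZ, SUM OF LAMCZ.
code_points, amcz, lamcz = 897561, 2868626, 2286300

print(f"AMCZ  chars per code point: {chars_per_code_point(amcz, code_points):.2f}")
print(f"LAMCZ chars per code point: {chars_per_code_point(lamcz, code_points):.2f}")
print(f"COMP: {comp_percent(amcz, lamcz):.2f}")
```

With the "All" row this reproduces 3.20, 2.55 and 20.30, and with the N=1 row (4427, 14957, 13899) it gives COMP 7.07, matching the table.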
The next part is the hangul ML.com experiment.
AMCZ < LAMCZ : 697 ( 0.3 % )
AMCZ = LAMCZ : 6933
AMCZ > LAMCZ : 198674
Total: 206304
AMCZ < LAMCZ : average # of overhead chars for each label length N
 N  labels  +overhead
 2     463      +1.09
 3     175      +1.07
 4      52      +1.25
 5       5      +1.00
 6       2      +2.00
AMCZ > LAMCZ : average # of saved chars for each label length N
 N  labels  -saved
 2   13491   -1.41
 3   38086   -2.09
 4   61900   -3.14
 5   39612   -4.19
 6   23880   -5.29
 7   12447   -6.41
 8    5441   -7.40
 9    2262   -8.46
10     895   -9.45
11     373  -10.48
12     141  -11.17
13      77  -12.30
14      32  -13.12
15      20  -14.05
16      10  -12.30
17       7  -15.86
9. hangul-1024
| N| FREQ| N*FREQ| SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP|
| 1| 1953| 1953| 7812(4.00)| 7812(4.00)| 0.00|
| 2| 17149| 34298| 124782(3.64)| 106238(3.10)|14.86|
| 3| 39643| 118929| 403205(3.39)| 323801(2.72)|19.69|
| 4| 62285| 249140| 816093(3.28)| 622067(2.50)|23.77|
| 5| 39675| 198375| 636102(3.21)| 470174(2.37)|26.09|
| 6| 23891| 143346| 452483(3.16)| 326242(2.28)|27.90|
| 7| 12448| 87136| 271953(3.12)| 192139(2.21)|29.35|
| 8| 5441| 43528| 134600(3.09)| 94322(2.17)|29.92|
| 9| 2264| 20376| 62405(3.06)| 43266(2.12)|30.67|
| 10| 895| 8950| 27223(3.04)| 18764(2.10)|31.07|
| 11| 373| 4103| 12420(3.03)| 8511(2.07)|31.47|
| 12| 141| 1692| 5080(3.00)| 3505(2.07)|31.00|
| 13| 77| 1001| 2986(2.98)| 2039(2.04)|31.71|
| 14| 32| 448| 1331(2.97)| 911(2.03)|31.56|
| 15| 20| 300| 884(2.95)| 603(2.01)|31.79|
| 16| 10| 160| 460(2.88)| 337(2.11)|26.74|
| 17| 7| 119| 354(2.97)| 243(2.04)|31.36|
|All| 206304| 913854| 2960173(3.24)| 2220974(2.43)|24.97|