[idn] (new) chinese/hangul ML.com statistics with AMCZ/LAMCZ
Hi, Adams and all
AMC-ACE-Z + REORDERING is more efficient than DUDE + REORDERING,
at least for Chinese and Hangul. I am testing other scripts, too.
For long Chinese/Hangul domains, the LAMCZ label length is approximately
(1.95~2.15) * (number of code points).
In terms of label length,
LAMCZ is 11.30% more efficient than LDUDE for the 285108 Chinese ML.com samples.
LAMCZ is 4.29% more efficient than LDUDE for the 207207 Hangul ML.com samples.
LAMCZ is the most efficient encoding that I have tested so far.
(I have not tested MACE or ACE37 yet; please wait a while.)
(I excluded the Latin ranges from REORDERING because of AMCZ's literal mode.)
Cheers,
Soobok Lee
-------------------------------------------------------------------------
For Chinese ML.com samples.
N: length of a domain label (number of code points)
FREQ: number of domains of length N
N*FREQ: total number of code points in domains of length N
SUM OF AMCZ: sum of the lengths of the AMCZ labels
X: SUM OF AMCZ / (N*FREQ)
SUM OF LAMCZ: sum of the lengths of the LAMCZ labels
Y: SUM OF LAMCZ / (N*FREQ)
COMP: (SUM OF AMCZ - SUM OF LAMCZ) / SUM OF AMCZ * 100
| N| FREQ| N*FREQ| SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP|
| 1| 4642| 4642| 15804(3.40)| 14807(3.19)|6.31|
| 2| 59708| 119416| 401549(3.36)| 352022(2.95)|12.33|
| 3| 49471| 148413| 484456(3.26)| 415104(2.80)|14.32|
| 4| 99402| 397608| 1269398(3.19)| 1034646(2.60)|18.49|
| 5| 29974| 149870| 467070(3.12)| 381651(2.55)|18.29|
| 6| 20809| 124854| 384013(3.08)| 304635(2.44)|20.67|
| 7| 8860| 62020| 186347(3.00)| 147111(2.37)|21.06|
| 8| 5251| 42008| 124325(2.96)| 97303(2.32)|21.73|
| 9| 2666| 23994| 69234(2.89)| 54697(2.28)|21.00|
| 10| 2008| 20080| 57887(2.88)| 44270(2.20)|23.52|
| 11| 859| 9449| 26914(2.85)| 20836(2.21)|22.58|
| 12| 671| 8052| 22819(2.83)| 17294(2.15)|24.21|
| 13| 346| 4498| 12217(2.72)| 9581(2.13)|21.58|
| 14| 235| 3290| 9084(2.76)| 6933(2.11)|23.68|
| 15| 117| 1755| 4723(2.69)| 3721(2.12)|21.22|
| 16| 68| 1088| 2884(2.65)| 2258(2.08)|21.71|
| 17| 21| 357| 911(2.55)| 704(1.97)|22.72|
| | 285108| 1121394| 3539635(3.16)| 2907573(2.59)|17.86|
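The column definitions can be checked against any row of the table. A minimal Python sketch follows, using the N=4 row of the Chinese table; the function name `row_ratios` is purely illustrative and is not part of any AMCZ or DUDE tool. It assumes X and Y divide by the product N*FREQ and that COMP is the reduction relative to the non-reordered sum, which is what the printed values are consistent with.

```python
def row_ratios(n, freq, sum_plain, sum_reordered):
    """Return (X, Y, COMP) for one table row.

    X    = encoded characters per code point without REORDERING
    Y    = encoded characters per code point with REORDERING
    COMP = percentage reduction achieved by REORDERING (assumed definition)
    """
    code_points = n * freq
    x = sum_plain / code_points
    y = sum_reordered / code_points
    comp = (sum_plain - sum_reordered) / sum_plain * 100
    return x, y, comp


# N=4 row of the Chinese AMCZ/LAMCZ table.
x, y, comp = row_ratios(4, 99402, 1269398, 1034646)
print(f"X={x:.2f} Y={y:.2f} COMP={comp:.2f}")
```

The same function applies unchanged to the DUDE/LDUDE tables, since they use the same column layout.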
For Korean ML.com samples.
N: length of a domain label (number of code points)
FREQ: number of domains of length N
N*FREQ: total number of code points in domains of length N
SUM OF AMCZ: sum of the lengths of the AMCZ labels
X: SUM OF AMCZ / (N*FREQ)
SUM OF LAMCZ: sum of the lengths of the LAMCZ labels
Y: SUM OF LAMCZ / (N*FREQ)
COMP: (SUM OF AMCZ - SUM OF LAMCZ) / SUM OF AMCZ * 100
| N| FREQ| N*FREQ| SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP|
| 1| 1941| 1941| 7764(4.00)| 7764(4.00)|0.00|
| 2| 16978| 33956| 123248(3.63)| 105628(3.11)|14.30|
| 3| 38852| 116556| 394410(3.38)| 322373(2.77)|18.26|
| 4| 61642| 246568| 803121(3.26)| 625970(2.54)|22.06|
| 5| 40375| 201875| 639079(3.17)| 483118(2.39)|24.40|
| 6| 24561| 147366| 458978(3.11)| 337398(2.29)|26.49|
| 7| 13034| 91238| 280346(3.07)| 203406(2.23)|27.44|
| 8| 5596| 44768| 136452(3.05)| 97248(2.17)|28.73|
| 9| 2421| 21789| 65504(3.01)| 46536(2.14)|28.96|
| 10| 1033| 10330| 29964(2.90)| 21330(2.06)|28.81|
| 11| 427| 4697| 13845(2.95)| 9739(2.07)|29.66|
| 12| 173| 2076| 5905(2.84)| 4261(2.05)|27.84|
| 13| 96| 1248| 3588(2.88)| 2539(2.03)|29.24|
| 14| 32| 448| 1331(2.97)| 921(2.06)|30.80|
| 15| 22| 330| 927(2.81)| 675(2.05)|27.18|
| 16| 15| 240| 606(2.52)| 471(1.96)|22.28|
| 17| 8| 136| 378(2.78)| 267(1.96)|29.37|
| 19| 1| 19| 26(1.37)| 26(1.37)|0.00|
| | 207207| 925581| 2965472(3.20)| 2269670(2.45)|23.46|
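The LAMCZ vs. LDUDE comparison can be recomputed from the totals rows of these tables and of the DUDE/LDUDE tables in the quoted message below. The sketch assumes the saving is defined as (LDUDE - LAMCZ) / LDUDE * 100; that definition is an assumption, not something the message states. Under it, the quoted 11.30% and 4.29% figures lie closer to the values obtained from the rounded per-code-point ratios (2.59 vs. 2.92, 2.45 vs. 2.56) than to the values obtained from the raw sums, so both variants are computed.

```python
# Totals rows: (code points, SUM OF LAMCZ, SUM OF LDUDE).
TOTALS = {
    "chinese": (1121394, 2907573, 3274211),
    "hangul": (925581, 2269670, 2368904),
}


def saving_percent(new, old):
    """Relative reduction of `new` against `old`, in percent (assumed definition)."""
    return (old - new) / old * 100


results = {}
for script, (code_points, lamcz, ldude) in TOTALS.items():
    # Variant 1: compare the raw sums of label lengths.
    from_sums = saving_percent(lamcz, ldude)
    # Variant 2: compare the two-decimal Y ratios as printed in the tables.
    y_lamcz = round(lamcz / code_points, 2)
    y_ldude = round(ldude / code_points, 2)
    from_ratios = saving_percent(y_lamcz, y_ldude)
    results[script] = (from_sums, from_ratios)
    print(f"{script}: {from_sums:.2f}% from raw sums, "
          f"{from_ratios:.2f}% from rounded ratios")
```

Either way the ordering is the same: LAMCZ produces shorter labels than LDUDE on both sample sets, with a larger margin for Chinese than for Hangul.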
----- Original Message -----
From: "Soobok Lee" <lsb@postel.co.kr>
To: <idn@ops.ietf.org>
Sent: Tuesday, July 10, 2001 10:37 PM
Subject: chinese/hangul ML.com statistics with DUDE/LDUDE
>
> The next table is from
> 285108 Chinese ML.com samples (old raw data from VGRS).
>
> The "COMP" column gives the improvement ratio of LDUDE over DUDE.
> The "Y" column shows that for long Chinese domains, LDUDE's label
> length is close to (2.0~2.5)*(input domain length).
>
>
> N: length of a domain label (number of code points)
> FREQ: number of domains of length N
> N*FREQ: total number of code points in domains of length N
> SUM OF DUDE: sum of the lengths of the DUDE labels
> X: SUM OF DUDE / (N*FREQ)
> SUM OF LDUDE: sum of the lengths of the LDUDE labels
> Y: SUM OF LDUDE / (N*FREQ)
> COMP: (SUM OF DUDE - SUM OF LDUDE) / SUM OF DUDE * 100
>
> | N| FREQ| N*FREQ| SUM OF DUDE(X)| SUM OF LDUDE(Y)| COMP|
>
> | 1| 4642| 4642| 18568(4.00)| 18568(4.00)| 0.00|
> | 2| 59708| 119416| 462031(3.87)| 415599(3.48)|10.05|
> | 3| 49471| 148413| 566440(3.82)| 477649(3.22)|15.68|
> | 4| 99402| 397608| 1509929(3.80)| 1168378(2.94)|22.62|
> | 5| 29974| 149870| 554237(3.70)| 426226(2.84)|23.10|
> | 6| 20809| 124854| 457412(3.66)| 333416(2.67)|27.11|
> | 7| 8860| 62020| 220880(3.56)| 160563(2.59)|27.31|
> | 8| 5251| 42008| 146822(3.50)| 103903(2.47)|29.23|
> | 9| 2666| 23994| 81433(3.39)| 58657(2.44)|27.97|
> | 10| 2008| 20080| 68385(3.41)| 46708(2.33)|31.70|
> | 11| 859| 9449| 31596(3.34)| 22111(2.34)|30.02|
> | 12| 671| 8052| 27039(3.36)| 18135(2.25)|32.93|
> | 13| 346| 4498| 14306(3.18)| 10088(2.24)|29.48|
> | 14| 235| 3290| 10676(3.24)| 7230(2.20)|32.28|
> | 15| 117| 1755| 5568(3.17)| 3854(2.20)|30.78|
> | 16| 68| 1088| 3383(3.11)| 2376(2.18)|29.77|
> | 17| 21| 357| 1075(3.01)| 750(2.10)|30.23|
>
> | | 285108| 1121394| 4179780(3.73)| 3274211(2.92)|21.67|
>
>
>
> The next table is from
> 207207 Hangul ML.com samples (old raw data from VGRS).
>
> | N| FREQ| N*FREQ| SUM OF DUDE(X)| SUM OF LDUDE(Y)| COMP|
>
> | 1| 1941| 1941| 7764(4.00)| 7764(4.00)|0.00|
> | 2| 16978| 33956| 129239(3.81)| 111308(3.28)|13.87|
> | 3| 38852| 116556| 436845(3.75)| 341333(2.93)|21.86|
> | 4| 61642| 246568| 915355(3.71)| 653736(2.65)|28.58|
> | 5| 40375| 201875| 743090(3.68)| 502097(2.49)|32.43|
> | 6| 24561| 147366| 540245(3.67)| 349710(2.37)|35.27|
> | 7| 13034| 91238| 332964(3.65)| 211206(2.31)|36.57|
> | 8| 5596| 44768| 162833(3.64)| 100618(2.25)|38.21|
> | 9| 2421| 21789| 78945(3.62)| 48633(2.23)|38.40|
> | 10| 1033| 10330| 36144(3.50)| 22323(2.16)|38.24|
> | 11| 427| 4697| 16744(3.56)| 10259(2.18)|38.73|
> | 12| 173| 2076| 7178(3.46)| 4578(2.21)|36.22|
> | 13| 96| 1248| 4386(3.51)| 2725(2.18)|37.87|
> | 14| 32| 448| 1656(3.70)| 1006(2.25)|39.25|
> | 15| 22| 330| 1168(3.54)| 750(2.27)|35.79|
> | 16| 15| 240| 757(3.15)| 529(2.20)|30.12|
> | 17| 8| 136| 470(3.46)| 299(2.20)|36.38|
> | 19| 1| 19| 30(1.58)| 30(1.58)|0.00|
>
> | | 207207| 925581| 3415813(3.69)| 2368904(2.56)|30.65|