
[idn] (new) chinese/hangul ML.com statistics with AMCZ/LAMCZ



Hi, Adams and all

AMC-ACE-Z + REORDERING is more efficient than DUDE + REORDERING,
  at least for Chinese/Hangul. I am testing other scripts, too.

For long Chinese/Hangul domains, the LAMCZ label length is approximately
(1.95~2.15) * (number of code points).
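
As a rough check of that range, the sketch below recomputes the LAMCZ
characters-per-code-point ratio (the Y column) for the longest Chinese rows
of the table in this message. Treating N >= 12 as "long" is an assumption
made here for illustration, and the helper name is hypothetical.

```python
# (N, N*FREQ, SUM OF LAMCZ) for the Chinese samples with N >= 12.
# The N >= 12 cutoff for "long" is an assumption, not from the original post.
LONG_CHINESE_ROWS = [
    (12, 8052, 17294),
    (13, 4498, 9581),
    (14, 3290, 6933),
    (15, 1755, 3721),
    (16, 1088, 2258),
    (17, 357, 704),
]


def chars_per_code_point(total_code_points, total_label_chars):
    """Average encoded label characters per input code point."""
    return total_label_chars / total_code_points


ratios = {n: chars_per_code_point(cp, chars) for n, cp, chars in LONG_CHINESE_ROWS}
for n, ratio in ratios.items():
    print(f"N={n}: {ratio:.2f} LAMCZ characters per code point")
```

Under that cutoff, every ratio stays inside the 1.95~2.15 band.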

As for label length efficiency:

LAMCZ is 11.30% more efficient than LDUDE for the 285108 Chinese ML.com samples.
LAMCZ is  4.29% more efficient than LDUDE for the 207207 Hangul ML.com samples.
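
The comparison can be recomputed from the totals rows of the tables: the
LAMCZ sums in this message and the LDUDE sums in the quoted message below.
This is a minimal sketch; the function name is hypothetical. The 11.30% and
4.29% figures appear to come from the rounded per-code-point averages
(2.92 vs 2.59 and 2.56 vs 2.45), so the unrounded totals give slightly
smaller values.

```python
# Totals copied from the tables: (SUM OF LDUDE, SUM OF LAMCZ) in ACE characters.
TOTALS = {
    "chinese": (3274211, 2907573),
    "hangul": (2368904, 2269670),
}


def percent_shorter(baseline, candidate):
    """Relative reduction of candidate versus baseline, in percent."""
    return (baseline - candidate) / baseline * 100


savings = {
    script: percent_shorter(ldude, lamcz)
    for script, (ldude, lamcz) in TOTALS.items()
}
for script, pct in savings.items():
    print(f"{script}: LAMCZ is {pct:.2f}% shorter than LDUDE")
```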

LAMCZ is the most efficient encoding I have tested so far.
(I have not tested MACE or ACE37 yet; please wait a while.)
(I excluded the Latin ranges from REORDERING because of AMCZ's literal mode.)

Cheers,

Soobok Lee
-------------------------------------------------------------------------


For Chinese ML.com samples.

N:            length of a domain label (# of code points)
FREQ:         number of domains of length N
N*FREQ:       total # of code points in domains of length N
SUM OF AMCZ:  sum of lengths of AMCZ labels
X:            SUM OF AMCZ / (N * FREQ)
SUM OF LAMCZ: sum of lengths of LAMCZ labels
Y:            SUM OF LAMCZ / (N * FREQ)
COMP:         (SUM OF AMCZ - SUM OF LAMCZ) / SUM OF AMCZ * 100

|  N|    FREQ|    N*FREQ|  SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP|

|  1|    4642|      4642|     15804(3.40)|     14807(3.19)| 6.31|
|  2|   59708|    119416|    401549(3.36)|    352022(2.95)|12.33|
|  3|   49471|    148413|    484456(3.26)|    415104(2.80)|14.32|
|  4|   99402|    397608|   1269398(3.19)|   1034646(2.60)|18.49|
|  5|   29974|    149870|    467070(3.12)|    381651(2.55)|18.29|
|  6|   20809|    124854|    384013(3.08)|    304635(2.44)|20.67|
|  7|    8860|     62020|    186347(3.00)|    147111(2.37)|21.06|
|  8|    5251|     42008|    124325(2.96)|     97303(2.32)|21.73|
|  9|    2666|     23994|     69234(2.89)|     54697(2.28)|21.00|
| 10|    2008|     20080|     57887(2.88)|     44270(2.20)|23.52|
| 11|     859|      9449|     26914(2.85)|     20836(2.21)|22.58|
| 12|     671|      8052|     22819(2.83)|     17294(2.15)|24.21|
| 13|     346|      4498|     12217(2.72)|      9581(2.13)|21.58|
| 14|     235|      3290|      9084(2.76)|      6933(2.11)|23.68|
| 15|     117|      1755|      4723(2.69)|      3721(2.12)|21.22|
| 16|      68|      1088|      2884(2.65)|      2258(2.08)|21.71|
| 17|      21|       357|       911(2.55)|       704(1.97)|22.72|

|   |  285108|   1121394|   3539635(3.16)|   2907573(2.59)|17.86|
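
The X, Y, and COMP columns above can be recomputed from the raw sums. The
sketch below does so for the N=4 row. COMP is taken as the relative
reduction (AMCZ minus LAMCZ, over AMCZ), which is what the positive values
in the table correspond to. The function and parameter names are
hypothetical.

```python
def row_stats(n, freq, sum_plain, sum_reordered):
    """Return (X, Y, COMP) for one table row.

    X and Y are encoded characters per code point without and with
    REORDERING; COMP is the percentage saved by REORDERING.
    """
    code_points = n * freq
    x = sum_plain / code_points
    y = sum_reordered / code_points
    comp = (sum_plain - sum_reordered) / sum_plain * 100
    return x, y, comp


# N=4 row of the Chinese table: FREQ, SUM OF AMCZ, SUM OF LAMCZ.
x, y, comp = row_stats(4, 99402, 1269398, 1034646)
print(f"X={x:.2f} Y={y:.2f} COMP={comp:.2f}")
```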





For Korean ML.com samples.

N:            length of a domain label (# of code points)
FREQ:         number of domains of length N
N*FREQ:       total # of code points in domains of length N
SUM OF AMCZ:  sum of lengths of AMCZ labels
X:            SUM OF AMCZ / (N * FREQ)
SUM OF LAMCZ: sum of lengths of LAMCZ labels
Y:            SUM OF LAMCZ / (N * FREQ)
COMP:         (SUM OF AMCZ - SUM OF LAMCZ) / SUM OF AMCZ * 100

|  N|    FREQ|    N*FREQ|  SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP|

|  1|    1941|      1941|      7764(4.00)|      7764(4.00)| 0.00|
|  2|   16978|     33956|    123248(3.63)|    105628(3.11)|14.30|
|  3|   38852|    116556|    394410(3.38)|    322373(2.77)|18.26|
|  4|   61642|    246568|    803121(3.26)|    625970(2.54)|22.06|
|  5|   40375|    201875|    639079(3.17)|    483118(2.39)|24.40|
|  6|   24561|    147366|    458978(3.11)|    337398(2.29)|26.49|
|  7|   13034|     91238|    280346(3.07)|    203406(2.23)|27.44|
|  8|    5596|     44768|    136452(3.05)|     97248(2.17)|28.73|
|  9|    2421|     21789|     65504(3.01)|     46536(2.14)|28.96|
| 10|    1033|     10330|     29964(2.90)|     21330(2.06)|28.81|
| 11|     427|      4697|     13845(2.95)|      9739(2.07)|29.66|
| 12|     173|      2076|      5905(2.84)|      4261(2.05)|27.84|
| 13|      96|      1248|      3588(2.88)|      2539(2.03)|29.24|
| 14|      32|       448|      1331(2.97)|       921(2.06)|30.80|
| 15|      22|       330|       927(2.81)|       675(2.05)|27.18|
| 16|      15|       240|       606(2.52)|       471(1.96)|22.28|
| 17|       8|       136|       378(2.78)|       267(1.96)|29.37|
| 19|       1|        19|        26(1.37)|        26(1.37)| 0.00|

|   |  207207|    925581|   2965472(3.20)|   2269670(2.45)|23.46|

----- Original Message ----- 
From: "Soobok Lee" <lsb@postel.co.kr>
To: <idn@ops.ietf.org>
Sent: Tuesday, July 10, 2001 10:37 PM
Subject: chinese/hangul ML.com statistics with DUDE/LDUDE


>   
> The next table is from
> 285108 Chinese ML.com samples (old raw data from VGRS).
>
> The "COMP" column gives the improvement ratio of LDUDE over DUDE.
> The "Y" column shows that, for long Chinese domains, LDUDE's label
> length is close to (2.0~2.5) * (input domain length).
> 
> 
> N:            length of a domain label (# of code points)
> FREQ:         number of domains of length N
> SUM OF DUDE:  sum of lengths of DUDE labels
> X:            SUM OF DUDE / (N * FREQ)
> SUM OF LDUDE: sum of lengths of LDUDE labels
> Y:            SUM OF LDUDE / (N * FREQ)
> COMP:         (SUM OF DUDE - SUM OF LDUDE) / SUM OF DUDE * 100
> 
> |  N|    FREQ|    N*FREQ|  SUM OF DUDE(X)| SUM OF LDUDE(Y)| COMP|
> 
> |  1|    4642|      4642|     18568(4.00)|     18568(4.00)| 0.00|
> |  2|   59708|    119416|    462031(3.87)|    415599(3.48)|10.05|
> |  3|   49471|    148413|    566440(3.82)|    477649(3.22)|15.68|
> |  4|   99402|    397608|   1509929(3.80)|   1168378(2.94)|22.62|
> |  5|   29974|    149870|    554237(3.70)|    426226(2.84)|23.10|
> |  6|   20809|    124854|    457412(3.66)|    333416(2.67)|27.11|
> |  7|    8860|     62020|    220880(3.56)|    160563(2.59)|27.31|
> |  8|    5251|     42008|    146822(3.50)|    103903(2.47)|29.23|
> |  9|    2666|     23994|     81433(3.39)|     58657(2.44)|27.97|
> | 10|    2008|     20080|     68385(3.41)|     46708(2.33)|31.70|
> | 11|     859|      9449|     31596(3.34)|     22111(2.34)|30.02|
> | 12|     671|      8052|     27039(3.36)|     18135(2.25)|32.93|
> | 13|     346|      4498|     14306(3.18)|     10088(2.24)|29.48|
> | 14|     235|      3290|     10676(3.24)|      7230(2.20)|32.28|
> | 15|     117|      1755|      5568(3.17)|      3854(2.20)|30.78|
> | 16|      68|      1088|      3383(3.11)|      2376(2.18)|29.77|
> | 17|      21|       357|      1075(3.01)|       750(2.10)|30.23|
> 
> |   |  285108|   1121394|   4179780(3.73)|   3274211(2.92)|21.67|
> 
> 
> 
> The next table is from
> 207207 Hangul ML.com samples (old raw data from VGRS).
> 
> |  N|    FREQ|    N*FREQ|  SUM OF DUDE(X)| SUM OF LDUDE(Y)| COMP|
> 
> |  1|    1941|      1941|      7764(4.00)|      7764(4.00)| 0.00|
> |  2|   16978|     33956|    129239(3.81)|    111308(3.28)|13.87|
> |  3|   38852|    116556|    436845(3.75)|    341333(2.93)|21.86|
> |  4|   61642|    246568|    915355(3.71)|    653736(2.65)|28.58|
> |  5|   40375|    201875|    743090(3.68)|    502097(2.49)|32.43|
> |  6|   24561|    147366|    540245(3.67)|    349710(2.37)|35.27|
> |  7|   13034|     91238|    332964(3.65)|    211206(2.31)|36.57|
> |  8|    5596|     44768|    162833(3.64)|    100618(2.25)|38.21|
> |  9|    2421|     21789|     78945(3.62)|     48633(2.23)|38.40|
> | 10|    1033|     10330|     36144(3.50)|     22323(2.16)|38.24|
> | 11|     427|      4697|     16744(3.56)|     10259(2.18)|38.73|
> | 12|     173|      2076|      7178(3.46)|      4578(2.21)|36.22|
> | 13|      96|      1248|      4386(3.51)|      2725(2.18)|37.87|
> | 14|      32|       448|      1656(3.70)|      1006(2.25)|39.25|
> | 15|      22|       330|      1168(3.54)|       750(2.27)|35.79|
> | 16|      15|       240|       757(3.15)|       529(2.20)|30.12|
> | 17|       8|       136|       470(3.46)|       299(2.20)|36.38|
> | 19|       1|        19|        30(1.58)|        30(1.58)| 0.00|
> 
> |   |  207207|    925581|   3415813(3.69)|   2368904(2.56)|30.65|
> 