[idn] chinese/hangul ML.com statistics with DUDE/LDUDE



  
The following table is computed from
285108 Chinese ML.com samples (old raw data from VGRS).

The "COMP" column gives the improvement of LDUDE over DUDE, as the
percentage reduction in total encoded label length.
The "Y" column shows that for long Chinese domains, the LDUDE label
length approaches (2.0~2.5)*(input domain length in code points).


N:            length of a domain label (number of code points)
FREQ:         number of domains of length N
N*FREQ:       total number of code points over all domains of length N
SUM OF DUDE:  sum of the lengths of the DUDE labels
X:            SUM OF DUDE / (N * FREQ)
SUM OF LDUDE: sum of the lengths of the LDUDE labels
Y:            SUM OF LDUDE / (N * FREQ)
COMP:         (SUM OF DUDE - SUM OF LDUDE) / SUM OF DUDE * 100
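
As a rough illustration of how the derived columns relate to the raw
sums, here is a minimal Python sketch. The function name
derived_columns is hypothetical (it is not part of DUDE or LDUDE), and
COMP is taken here as (SUM OF DUDE - SUM OF LDUDE) / SUM OF DUDE * 100,
which is the sign that reproduces the positive values printed in the
tables.

```python
def derived_columns(n, freq, sum_dude, sum_ldude):
    """Return (X, Y, COMP) for one table row, rounded to two decimals.

    Hypothetical helper for illustration only.
    """
    code_points = n * freq                      # the N*FREQ column
    x = sum_dude / code_points                  # DUDE output chars per input code point
    y = sum_ldude / code_points                 # LDUDE output chars per input code point
    comp = (sum_dude - sum_ldude) / sum_dude * 100  # percent saved by LDUDE
    return round(x, 2), round(y, 2), round(comp, 2)


# Row N=2 of the Chinese table: FREQ=59708, DUDE=462031, LDUDE=415599
print(derived_columns(2, 59708, 462031, 415599))  # (3.87, 3.48, 10.05)
```

The X and Y values are averages weighted by label count, so they can be
compared directly across rows with very different FREQ.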

|  N|    FREQ|    N*FREQ|  SUM OF DUDE(X)| SUM OF LDUDE(Y)| COMP|

|  1|    4642|      4642|     18568(4.00)|     18568(4.00)| 0.00|
|  2|   59708|    119416|    462031(3.87)|    415599(3.48)|10.05|
|  3|   49471|    148413|    566440(3.82)|    477649(3.22)|15.68|
|  4|   99402|    397608|   1509929(3.80)|   1168378(2.94)|22.62|
|  5|   29974|    149870|    554237(3.70)|    426226(2.84)|23.10|
|  6|   20809|    124854|    457412(3.66)|    333416(2.67)|27.11|
|  7|    8860|     62020|    220880(3.56)|    160563(2.59)|27.31|
|  8|    5251|     42008|    146822(3.50)|    103903(2.47)|29.23|
|  9|    2666|     23994|     81433(3.39)|     58657(2.44)|27.97|
| 10|    2008|     20080|     68385(3.41)|     46708(2.33)|31.70|
| 11|     859|      9449|     31596(3.34)|     22111(2.34)|30.02|
| 12|     671|      8052|     27039(3.36)|     18135(2.25)|32.93|
| 13|     346|      4498|     14306(3.18)|     10088(2.24)|29.48|
| 14|     235|      3290|     10676(3.24)|      7230(2.20)|32.28|
| 15|     117|      1755|      5568(3.17)|      3854(2.20)|30.78|
| 16|      68|      1088|      3383(3.11)|      2376(2.18)|29.77|
| 17|      21|       357|      1075(3.01)|       750(2.10)|30.23|

|   |  285108|   1121394|   4179780(3.73)|   3274211(2.92)|21.67|



The following table is computed from
207207 Hangul ML.com samples (old raw data from VGRS).

|  N|    FREQ|    N*FREQ|  SUM OF DUDE(X)| SUM OF LDUDE(Y)| COMP|

|  1|    1941|      1941|      7764(4.00)|      7764(4.00)| 0.00|
|  2|   16978|     33956|    129239(3.81)|    111308(3.28)|13.87|
|  3|   38852|    116556|    436845(3.75)|    341333(2.93)|21.86|
|  4|   61642|    246568|    915355(3.71)|    653736(2.65)|28.58|
|  5|   40375|    201875|    743090(3.68)|    502097(2.49)|32.43|
|  6|   24561|    147366|    540245(3.67)|    349710(2.37)|35.27|
|  7|   13034|     91238|    332964(3.65)|    211206(2.31)|36.57|
|  8|    5596|     44768|    162833(3.64)|    100618(2.25)|38.21|
|  9|    2421|     21789|     78945(3.62)|     48633(2.23)|38.40|
| 10|    1033|     10330|     36144(3.50)|     22323(2.16)|38.24|
| 11|     427|      4697|     16744(3.56)|     10259(2.18)|38.73|
| 12|     173|      2076|      7178(3.46)|      4578(2.21)|36.22|
| 13|      96|      1248|      4386(3.51)|      2725(2.18)|37.87|
| 14|      32|       448|      1656(3.70)|      1006(2.25)|39.25|
| 15|      22|       330|      1168(3.54)|       750(2.27)|35.79|
| 16|      15|       240|       757(3.15)|       529(2.20)|30.12|
| 17|       8|       136|       470(3.46)|       299(2.20)|36.38|
| 19|       1|        19|        30(1.58)|        30(1.58)| 0.00|

|   |  207207|    925581|   3415813(3.69)|   2368904(2.56)|30.65|
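
The last row of each table is the column-wise sum of the per-length
rows. A minimal Python sketch for re-checking such totals from the
plain-text rows follows. The names ROW, parse_row and column_totals are
hypothetical, and the regular expression assumes the exact row layout
used in this message.

```python
import re

# One data row: | N| FREQ| N*FREQ| DUDE(X)| LDUDE(Y)| COMP|
# Only the five integer fields are captured; X, Y and COMP are skipped.
ROW = re.compile(
    r"\|\s*(\d+)\|\s*(\d+)\|\s*(\d+)\|"
    r"\s*(\d+)\([\d.]+\)\|\s*(\d+)\([\d.]+\)\|\s*[\d.]+\|"
)


def parse_row(line):
    """Return (N, FREQ, N*FREQ, DUDE, LDUDE) or None for non-data lines."""
    m = ROW.fullmatch(line.strip())
    return tuple(int(g) for g in m.groups()) if m else None


def column_totals(lines):
    """Sum FREQ, N*FREQ, SUM OF DUDE and SUM OF LDUDE over all data rows."""
    rows = [r for r in (parse_row(l) for l in lines) if r is not None]
    return tuple(sum(r[i] for r in rows) for i in range(1, 5))


# The last three rows of the Hangul table, as a small example.
sample = [
    "| 16|      15|       240|       757(3.15)|       529(2.20)|30.12|",
    "| 17|       8|       136|       470(3.46)|       299(2.20)|36.38|",
    "| 19|       1|        19|        30(1.58)|        30(1.58)|0.00|",
]
print(column_totals(sample))
```

Header lines and the totals line itself have no number in the N column,
so parse_row returns None for them and they are left out of the sum.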