[idn] Latin script and REORDERING
This posting explains why REORDERING brings a benefit only for non-Latin scripts.
I have 47263 Latin ML.com samples from various parts of Eastern and Western Europe.
Total # of code points: 464915
# of basic Latin letters: 409922
# of extended Latin letters: 54993 (0x00A0 ~ 0x0370)
This data shows that extended Latin letters account for about 12% (roughly 1/9)
of the code points in typical Latin labels. Since the mean length of Latin
labels is about 10, a typical Latin label contains only about 1 extended Latin letter.
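The arithmetic behind these figures can be rechecked with a short Python sketch. The variable names are illustrative only. Note that the last value is an average per label, so it supports "about 1 extended letter per label" rather than proving it for every label.

```python
# Figures quoted in the post.
total_code_points = 464915
basic_latin = 409922
extended_latin = 54993
labels = 47263

# The two letter counts should add up to the total.
assert basic_latin + extended_latin == total_code_points

extended_share = extended_latin / total_code_points  # fraction of extended letters
mean_length = total_code_points / labels             # mean label length
extended_per_label = extended_latin / labels         # mean extended letters per label

print(f"extended share:             {extended_share:.1%}")
print(f"mean label length:          {mean_length:.2f}")
print(f"extended letters per label: {extended_per_label:.2f}")
```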
As you know, AMC-Z is designed to favor Latin letters (called basic code points in AMC-Z):
it encodes basic Latin letters in literal mode, "as they are",
and reordering v2.0 does nothing to those letters:
One typical Latin domain label of length 12: b<diaeresis u>stenhalter
U+0062 U+00FC U+0073 U+0074 U+0065 U+006E U+0068 U+0061 U+006C U+0074 U+0065 U+0072:
AMC-Z: bstenhalter-thB
The non-basic (extended) Latin letter <diaeresis u> is encoded as -thB
at the end of the ACE label.
AMC-Z+REORDERING: bstenhalter-ymB (the same length)
Reordering v2.0 reorders only the code point values of EXTENDED Latin letters,
based on frequency distribution data from ML.com samples from various countries.
Since most labels contain just 1 extended Latin letter, reordering the
extended Latin letters does not help: reordering is designed to reduce
the distances between successive non-basic code points, which requires
2 or more non-basic code points in a label.
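To make the "2 or more non-basic code points" condition concrete, here is a toy Python sketch. It does not implement AMC-Z or the reordering tables. It only lists the distances between consecutive non-basic code points in a label, assuming "basic" means below 0x80. The helper names and the Hangul sample label are hypothetical.

```python
def extended_code_points(label, basic_limit=0x80):
    """Return the code points of `label` that are not basic (assumed < 0x80)."""
    return [ord(ch) for ch in label if ord(ch) >= basic_limit]


def successive_distances(label):
    """Absolute differences between consecutive non-basic code points."""
    cps = extended_code_points(label)
    return [abs(b - a) for a, b in zip(cps, cps[1:])]


latin = "b\u00fcstenhalter"      # the example label from the post
hangul = "\ud55c\uad6d\uc5b4"    # hypothetical three-syllable Hangul label

# One extended letter gives no pair, so there is no distance to shrink.
print(successive_distances(latin))
# Three non-basic code points give two distances that a remapping could shrink.
print(successive_distances(hangul))
```

With a single extended letter the list of distances is empty, which is the situation described above for most Latin labels.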
I think it's fair to compensate for the disadvantage that AMC-Z imposes on non-Latin scripts.
The next table, from my I-D, summarizes the reordering
improvements for the Latin ML.com samples of each length N:
(How to read the tables)
N: length of a domain label (# of code points)
FREQ: number of domains of length N
N*FREQ: total # of code points in domains of length N
SUM OF AMCZ: sum of lengths of AMCZ labels
X: SUM OF AMCZ / (N * FREQ)
SUM OF LAMCZ: sum of lengths of LAMCZ labels
Y: SUM OF LAMCZ / (N * FREQ)
COMP: (SUM OF AMCZ - SUM OF LAMCZ) / SUM OF AMCZ * 100
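As a cross-check of the legend, the following Python sketch recomputes X, Y and COMP for a few rows. The function name is illustrative. COMP is computed as (AMCZ - LAMCZ) / AMCZ * 100, since the tabulated values are positive whenever the LAMCZ total is the shorter one.

```python
def table_row(n, freq, sum_amcz, sum_lamcz):
    """Recompute (X, Y, COMP) for one table row, rounded to two decimals."""
    code_points = n * freq                      # the N*FREQ column
    x = sum_amcz / code_points                  # ACE characters per code point, AMCZ
    y = sum_lamcz / code_points                 # ACE characters per code point, LAMCZ
    comp = (sum_amcz - sum_lamcz) / sum_amcz * 100
    return round(x, 2), round(y, 2), round(comp, 2)


print(table_row(1, 87, 260, 259))       # latin, N=1
print(table_row(30, 9, 298, 299))       # latin, N=30 (LAMCZ is longer here)
print(table_row(5, 1122, 9550, 8529))   # arabic, N=5
```

The three rows reproduce the X, Y and COMP values printed in the tables, including the negative COMP for the Latin row with N=30.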
16. latin
| N| FREQ| N*FREQ| SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP|
| 1| 87| 87| 260(2.99)| 259(2.98)| 0.38|
| 2| 1043| 2086| 5140(2.46)| 5134(2.46)| 0.12|
| 3| 1046| 3138| 6274(2.00)| 6241(1.99)| 0.53|
| 4| 1812| 7248| 12750(1.76)| 12715(1.75)| 0.27|
| 5| 3238| 16190| 26129(1.61)| 26047(1.61)| 0.31|
| 6| 3956| 23736| 35894(1.51)| 35802(1.51)| 0.26|
| 7| 4340| 30380| 43756(1.44)| 43633(1.44)| 0.28|
| 8| 4639| 37112| 51351(1.38)| 51286(1.38)| 0.13|
| 9| 4551| 40959| 54994(1.34)| 54873(1.34)| 0.22|
| 10| 4289| 42890| 56159(1.31)| 56058(1.31)| 0.18|
| 11| 3778| 41558| 53227(1.28)| 53157(1.28)| 0.13|
| 12| 2967| 35604| 44820(1.26)| 44754(1.26)| 0.15|
| 13| 2501| 32513| 40264(1.24)| 40197(1.24)| 0.17|
| 14| 2058| 28812| 35212(1.22)| 35174(1.22)| 0.11|
| 15| 1653| 24795| 29947(1.21)| 29918(1.21)| 0.10|
| 16| 1372| 21952| 26264(1.20)| 26224(1.19)| 0.15|
| 17| 1094| 18598| 22053(1.19)| 21994(1.18)| 0.27|
| 18| 839| 15102| 17782(1.18)| 17722(1.17)| 0.34|
| 19| 632| 12008| 14045(1.17)| 13988(1.16)| 0.41|
| 20| 464| 9280| 10778(1.16)| 10721(1.16)| 0.53|
| 21| 312| 6552| 7539(1.15)| 7516(1.15)| 0.31|
| 22| 194| 4268| 4905(1.15)| 4876(1.14)| 0.59|
| 23| 124| 2852| 3242(1.14)| 3234(1.13)| 0.25|
| 24| 71| 1704| 1935(1.14)| 1925(1.13)| 0.52|
| 25| 71| 1775| 2011(1.13)| 2002(1.13)| 0.45|
| 26| 37| 962| 1083(1.13)| 1080(1.12)| 0.28|
| 27| 33| 891| 1004(1.13)| 996(1.12)| 0.80|
| 28| 17| 476| 535(1.12)| 529(1.11)| 1.12|
| 29| 13| 377| 422(1.12)| 420(1.11)| 0.47|
| 30| 9| 270| 298(1.10)| 299(1.11)|-0.34|
| 31| 7| 217| 243(1.12)| 238(1.10)| 2.06|
| 32| 9| 288| 321(1.11)| 316(1.10)| 1.56|
| 33| 4| 132| 146(1.11)| 144(1.09)| 1.37|
| 34| 2| 68| 76(1.12)| 74(1.09)| 2.63|
| 35| 1| 35| 38(1.09)| 38(1.09)| 0.00|
For Arabic labels of length >13, the compression ratio is
close to 13% or higher in most rows (11.99% over all lengths).
Compare this result with the Latin table above.
1. arabic
| N| FREQ| N*FREQ| SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP|
| 1| 42| 42| 126(3.00)| 126(3.00)| 0.00|
| 2| 59| 118| 258(2.19)| 249(2.11)| 3.49|
| 3| 363| 1089| 2121(1.95)| 1992(1.83)| 6.08|
| 4| 888| 3552| 6359(1.79)| 5811(1.64)| 8.62|
| 5| 1122| 5610| 9550(1.70)| 8529(1.52)|10.69|
| 6| 1009| 6054| 9890(1.63)| 8620(1.42)|12.84|
| 7| 845| 5915| 9309(1.57)| 8134(1.38)|12.62|
| 8| 378| 3024| 4590(1.52)| 3992(1.32)|13.03|
| 9| 263| 2367| 3523(1.49)| 3063(1.29)|13.06|
| 10| 152| 1520| 2230(1.47)| 1941(1.28)|12.96|
| 11| 130| 1430| 2058(1.44)| 1787(1.25)|13.17|
| 12| 110| 1320| 1873(1.42)| 1614(1.22)|13.83|
| 13| 67| 871| 1230(1.41)| 1040(1.19)|15.45|
| 14| 61| 854| 1211(1.42)| 1015(1.19)|16.18|
| 15| 52| 780| 1085(1.39)| 924(1.18)|14.84|
| 16| 34| 544| 743(1.37)| 630(1.16)|15.21|
| 17| 11| 187| 256(1.37)| 218(1.17)|14.84|
| 18| 19| 342| 465(1.36)| 392(1.15)|15.70|
| 19| 8| 152| 201(1.32)| 175(1.15)|12.94|
| 20| 10| 200| 268(1.34)| 235(1.18)|12.31|
| 21| 3| 63| 85(1.35)| 75(1.19)|11.76|
| 22| 4| 88| 116(1.32)| 99(1.12)|14.66|
| 23| 3| 69| 89(1.29)| 76(1.10)|14.61|
| 24| 2| 48| 62(1.29)| 55(1.15)|11.29|
| 25| 5| 125| 165(1.32)| 143(1.14)|13.33|
| 26| 2| 52| 67(1.29)| 56(1.08)|16.42|
| 27| 2| 54| 73(1.35)| 61(1.13)|16.44|
| 33| 1| 33| 41(1.24)| 37(1.12)| 9.76|
| 34| 1| 34| 45(1.32)| 36(1.06)|20.00|
|All| 5646| 36537| 58089(1.59)| 51125(1.40)|11.99|
For Hangul, the compression ratio reaches 31% for longer labels (24.97% over all lengths).
8. hangul-1024
| N| FREQ| N*FREQ| SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP|
| 1| 1953| 1953| 7812(4.00)| 7812(4.00)| 0.00|
| 2| 17149| 34298| 124782(3.64)| 106238(3.10)|14.86|
| 3| 39643| 118929| 403205(3.39)| 323801(2.72)|19.69|
| 4| 62285| 249140| 816093(3.28)| 622067(2.50)|23.77|
| 5| 39675| 198375| 636102(3.21)| 470174(2.37)|26.09|
| 6| 23891| 143346| 452483(3.16)| 326242(2.28)|27.90|
| 7| 12448| 87136| 271953(3.12)| 192139(2.21)|29.35|
| 8| 5441| 43528| 134600(3.09)| 94322(2.17)|29.92|
| 9| 2264| 20376| 62405(3.06)| 43266(2.12)|30.67|
| 10| 895| 8950| 27223(3.04)| 18764(2.10)|31.07|
| 11| 373| 4103| 12420(3.03)| 8511(2.07)|31.47|
| 12| 141| 1692| 5080(3.00)| 3505(2.07)|31.00|
| 13| 77| 1001| 2986(2.98)| 2039(2.04)|31.71|
| 14| 32| 448| 1331(2.97)| 911(2.03)|31.56|
| 15| 20| 300| 884(2.95)| 603(2.01)|31.79|
| 16| 10| 160| 460(2.88)| 337(2.11)|26.74|
| 17| 7| 119| 354(2.97)| 243(2.04)|31.36|
|All| 206304| 913854| 2960173(3.24)| 2220974(2.43)|24.97|
REORDERING consists only of character mappings. It resembles a
legacy-to-UCS2 mapping, which preserves the character entity itself
while assigning a new code value to it.
It's clear that REORDERING adds only about as much complexity as
the legacy-to-UCS mappings that IDNA-aware applications perform
before the nameprep/ACE process.
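The idea of a pure character mapping can be sketched in Python as a permutation table plus its inverse. This is only an illustration: the frequency ranking below is made up, the function names are hypothetical, and the actual reordering tables are the ones defined in the I-D.

```python
def build_reordering(by_frequency):
    """Build forward and inverse tables from a frequency ranking.

    `by_frequency` lists code points from most to least frequent. The new
    values are the same code points in ascending order, so the most frequent
    character gets the smallest value and the mapping is a permutation.
    """
    targets = sorted(by_frequency)
    forward = dict(zip(by_frequency, targets))
    inverse = {new: old for old, new in forward.items()}
    return forward, inverse


def remap(label, table):
    """Apply a code point table; code points not in the table pass through."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in label)


# Made-up ranking of four extended Latin letters, for illustration only.
ranking = [0x00FC, 0x00E9, 0x00E4, 0x00F6]
forward, inverse = build_reordering(ranking)

label = "b\u00fcstenhalter"
reordered = remap(label, forward)        # only the code value of U+00FC changes
restored = remap(reordered, inverse)     # the inverse table undoes the mapping
print(restored == label)
```

Because the table is a permutation, the original label is always recoverable, and basic Latin letters are not touched at all.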
Soobok Lee