Re: [idn] call for comments for REORDERING
Another answer to your concern.
----- Original Message -----
From: "James Seng/Personal" <jseng@pobox.org.sg>
> The bigger concern I have with re-ordering remains in the fact that
> tables mappings proves efficient with existing IDN names in some
> registries *BUT* it does not indicate what performance it would be like
> in the future. We do not know what happened when the names space get
> saturated and would other names which would have been useable without
> lsb become un-usuable due to lsb.
1) Saturation of TLD namespaces would require longer names, for which
REORDERING is designed to give greater benefits (a better compression ratio).
2) Future variations in character usage frequency in each script
2.0) The character frequency tables are constructed from
Verisign GRS' ML.com testbeds.
Even for the Chinese han script, the
registrations came from China, Taiwan, Japan, Korea and
non-Asian squatters.
Each of the four countries has its own han character
usage patterns. The reordering table for han, therefore,
cannot be tuned to any single one of them. Even in the worst case,
the mutual difference in improvement ratios
did not exceed +-2% around 20%.
2.1) This issue is already answered by the latest REORDERING I-D 2.0;
see the enclosed excerpts from it. The influence of these
frequency variations is marginal.
My REORDERING I-D contains experiments with han frequency tables of
various lengths (1024, 2048, 3072 and 4096), in order to look into
the influence of han characters moving in and out of the most frequent
2048 han characters.
Between table sizes 2048 and 4096, the achieved improvements differed
by merely +-1%. I also reversed the order of the table itself, but it
produced nearly the same result.
A partial change in the order of the reordering table does not make a
big difference. Some will lose and some will win; the net effect is
near zero.
I believe most SJIS and KSC5601 han characters are included in the
most frequent 4096 han character table, because their government bodies
selected a subset of a few thousand of the most frequent han characters.
I believe that the frequency fluctuation of han characters over time stays
WITHIN the frequent set. INs and OUTs from the 4096 are rare and do not
invalidate the most frequent 1024 and 2048.
Moreover, TC/SC/KC characters are put side by side to avoid
country-specific biases in the han reordering table.
Non-CJK scripts often have a small set of basic letters, and their
character usage patterns are more stable than those for han/hangeul.
REORDERING does not recommend reordering on the shared latin script,
because latin characters are already encoded as they are (in literal mode,
the most efficient form). The latin script of Europeans (0.6 billion) is
the most favored one in ACE-Z. There should be some compensation for
non-Europeans. Han script: 2 billion, Arabic: 0.7 billion, Hindi: 0.5 billion.
This new frequency-based reordering is always more efficient than
the original lexicographical ordering in UCS,
even with some fluctuation in future script usage patterns.
We are not pursuing an elusive "perfect and optimal" solution.
REORDERING tables cannot be modified once they are frozen as standards.
Therefore, REORDERING is sub-optimal by nature but will remain
a valid and effective solution for a long time.
Unified Han and Hangeul
11172 Hangul syllables and 20912 CJK Unified Han ideographs occupy
roughly two thirds of the currently assigned Unicode code points.
Their lexicographical ordering makes various ACE compression
algorithms work poorly for them, because they are spread evenly
throughout those wide code blocks.
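The effect of wide code point spread can be seen with a minimal sketch.
It uses Python's built-in "punycode" codec as a stand-in for AMC-ACE-Z
(an assumption: both are bootstring-style encodings, but the parameters
may differ, so the absolute lengths will not match the tables in this
message). It compares two han code points that sit next to each other
with two that sit at opposite ends of the U+4E00..U+9FA5 range.

```python
# Stand-in experiment: Python's "punycode" codec, not AMC-ACE-Z itself.
near = "\u4e00\u4e01"   # two adjacent code points in the Unified Han block
far = "\u4e00\u9fa5"    # first and last code points of U+4E00..U+9FA5


def ace_length(label):
    """Length in ASCII characters of the punycode form of a label."""
    return len(label.encode("punycode"))


near_len = ace_length(near)
far_len = ace_length(far)
print("adjacent code points:", near_len, "characters")
print("distant code points: ", far_len, "characters")
```

The second label needs more characters only because of the distance
between its two code points, which is the cost that reordering tries
to reduce.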
According to one set of usage frequency statistics on hangeul syllables
in general hangeul texts, the most frequent 256 Hangul syllables
have a cumulative frequency of 88.2%, and the top
512 reach 99.9%. That means the maximum variation of
code point values (11172) can be shrunk to 512 in the reordered
hangeul block with a probability of 99.9%.
Likewise, the most frequent 256 Han letters have a cumulative
frequency of 58.2%, and the top 512, 1024, 2048 and
4096 reach 72.8%, 85.9%, 95.4% and 99.4%, respectively.
That means the maximum variation of code point values (20912) can
be shrunk to 2048 with a probability of 95.4%.
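Cumulative figures of this kind can be derived from raw character
counts roughly as follows. This is only a sketch: the function names
and the three-character toy counts are hypothetical and are not the
statistics quoted above.

```python
from collections import Counter
from itertools import accumulate


def coverage_by_rank(counts):
    """Entry k-1 is the share of all character occurrences covered
    by the k most frequent characters."""
    ordered = [n for _, n in counts.most_common()]
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


def smallest_table(counts, target):
    """Smallest table size whose cumulative share reaches the target."""
    for k, share in enumerate(coverage_by_rank(counts), start=1):
        if share >= target:
            return k
    return len(counts)


# Hypothetical toy counts, not the real usage statistics.
toy_counts = Counter({"\u4e2d": 6, "\u56fd": 3, "\u7db2": 1})
shares = coverage_by_rank(toy_counts)
print(shares)
print(smallest_table(toy_counts, 0.85))
```

With real counts, smallest_table(counts, 0.954) would give the table
size that corresponds to a 95.4% coverage figure.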
The han/hangul frequency mapping tables are constructed from
nameprepped ML.com domains from the VGRS MultiLingual testbeds.
The frequent characters in the tables are arranged in
increasing frequency order to minimize the AMC-ACE-Z bootstring
delta values, which can be lowered when the bigger code distances
fall at the lower positions of the sorted labels in AMC-ACE-Z step 2.
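One way to picture such a mapping table is as a reversible permutation
of a script block that packs the frequent set at the front. The sketch
below is an assumption about the general shape only, not the table
format of the I-D; build_reorder_map, invert and reorder_label are
hypothetical names, and the 8-code-point block is a toy.

```python
def build_reorder_map(block_start, block_size, frequent_sorted):
    """Map each code point of a block to a new position in the same block.

    frequent_sorted: frequent code points in increasing frequency order
    (assumed to be given). They are packed at the front of the block;
    all other code points follow in their original relative order.
    """
    block = range(block_start, block_start + block_size)
    frequent = set(frequent_sorted)
    rest = [cp for cp in block if cp not in frequent]
    new_order = list(frequent_sorted) + rest
    return {old: block_start + i for i, old in enumerate(new_order)}


def invert(mapping):
    """Inverse table, needed to restore the original code points."""
    return {new: old for old, new in mapping.items()}


def reorder_label(label, mapping):
    """Apply a table to a label; code points outside the block are kept."""
    return "".join(chr(mapping.get(ord(ch), ord(ch))) for ch in label)


# Toy block of 8 code points with two "frequent" characters.
forward = build_reorder_map(0x4E00, 8, [0x4E07, 0x4E03])
backward = invert(forward)
label = "\u4e03\u4e07"
reordered = reorder_label(label, forward)
restored = reorder_label(reordered, backward)
print([hex(ord(ch)) for ch in reordered])
```

Because the table is a permutation, the reordering loses no
information: applying the inverse table gives back the original label.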
In general, character frequency distributions in any script block
may undergo some shifts within the frequent set with the passage of
time, but moves of characters in and out of the frequent set
are very rare. So their impact may be as marginal and negligible
as the following comparison of experimental results shows.
Reordering tables based on the most frequent 1024, 2048, 3072 and 4096 han
and hangul letters in increasing frequency order produced marginal
differences in improvements.
N is the length of the sample labels, and the other
decimal values (in percent) are the improvement ratios for
all combinations of N and the 4 reordering tables.
| N| HAN-4096| HAN-3072| HAN-2048| HAN-1024|
| 1| 7.07 | 5.49 | 3.58 | 1.64|
| 2| 13.61 | 13.22 | 11.57 | 8.06|
| 3| 16.26 | 16.05 | 15.10 | 12.26|
| 4| 20.80 | 20.71 | 20.19 | 18.11|
| 5| 22.17 | 22.03 | 21.47 | 19.41|
| 6| 24.85 | 24.77 | 24.41 | 22.48|
| 7| 25.52 | 25.40 | 24.99 | 23.17|
| 8| 26.47 | 26.36 | 26.00 | 24.15|
| 9| 26.54 | 26.46 | 26.04 | 24.26|
| 10| 27.47 | 27.40 | 27.01 | 25.09|
| 11| 27.30 | 27.26 | 26.85 | 25.12|
| 12| 27.74 | 27.64 | 27.41 | 25.60|
| 13| 27.27 | 27.17 | 26.78 | 25.28|
| 14| 27.48 | 27.35 | 27.08 | 24.94|
| 15| 28.60 | 28.43 | 28.56 | 26.54|
| 16| 27.70 | 27.84 | 27.70 | 25.51|
| 17| 25.68 | 25.68 | 25.43 | 23.70|
|ALL| 20.30 | 20.14 | 19.43 | 17.09|
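The improvement ratios appear to be the relative reduction in total
ACE length, in percent. A minimal sketch (improvement_ratio is a
hypothetical helper name) that reproduces three figures from the
totals in the appendix tables 19 and 23 at the end of this message:

```python
def improvement_ratio(base_len, reordered_len):
    """Relative reduction in total ACE length, in percent (2 decimals)."""
    return round(100.0 * (base_len - reordered_len) / base_len, 2)


# Totals taken from the appendix tables 19 (unihan-1024) and 23 (unihan-4096).
han1024_n1 = improvement_ratio(14957, 14711)
han1024_all = improvement_ratio(2868626, 2378268)
han4096_all = improvement_ratio(2868626, 2286300)
print(han1024_n1, han1024_all, han4096_all)
```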
Experiments with two reordering tables, in increasing and decreasing
order, for the most frequent 2048 and 4096 han letters also produced
marginal differences in improvements:
(4096D means the reordering table is in decreasing frequency order)
| N| HAN-4096| HAN-4096D| HAN-2048| HAN-2048D|
| 1| 7.07 | 7.01 | 3.58 | 3.51 |
| 2| 13.61 | 13.44 | 11.57 | 11.27 |
| 3| 16.26 | 16.35 | 15.10 | 14.93 |
| 4| 20.80 | 20.56 | 20.19 | 19.90 |
| 5| 22.17 | 21.80 | 21.47 | 21.12 |
| 6| 24.85 | 24.21 | 24.41 | 23.82 |
| 7| 25.52 | 24.59 | 24.99 | 24.14 |
| 8| 26.47 | 25.68 | 26.00 | 25.36 |
| 9| 26.54 | 25.55 | 26.04 | 25.18 |
| 10| 27.47 | 26.79 | 27.01 | 26.42 |
| 11| 27.30 | 26.82 | 26.85 | 26.36 |
| 12| 27.74 | 27.46 | 27.41 | 27.13 |
| 13| 27.27 | 26.97 | 26.78 | 26.59 |
| 14| 27.48 | 27.31 | 27.08 | 26.99 |
| 15| 28.60 | 28.60 | 28.56 | 28.56 |
| 16| 27.70 | 27.55 | 27.70 | 27.20 |
| 17| 25.68 | 25.93 | 25.43 | 25.68 |
|ALL| 20.30 | 20.00 | 19.43 | 19.07 |
These experiments show that the influence of some fluctuations in
character frequency distributions within the frequent set of a script
would not be great enough to invalidate or outdate this
reordering approach in the foreseeable future.
But, to be as neutral and fair as possible in dealing with the
different usage patterns in China, Japan, Korea and Taiwan, some
provisions are made here for grouping country-specific variants
of certain han letters. Specifically, a group consisting of a simplified
Chinese letter (SC), a traditional Chinese letter (TC) and a Kanji-specific
letter (KC) is ranked by the sum of their frequencies, and its members
are placed side by side in the reordering table for the Unified Han block.
For example, the reordering table looks like:
(TC1) (TC2 SC2) (TC3 KC3) (TC4) (TC5 SC5 KC5) (TC6) .....
This grouping will serve to prevent the frequency orders from being
skewed toward one of those country-specific usage patterns.
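The grouping can be sketched as follows. The group members and counts
are placeholders (the strings "TC1", "SC2" and so on stand for real
characters), the function name is hypothetical, and the increasing
sort direction is assumed from the increasing frequency order
described earlier.

```python
def order_variant_groups(groups, freq, descending=False):
    """Rank variant groups by the sum of their members' frequencies and
    flatten them, so that the members of a group stay side by side."""
    ranked = sorted(
        groups,
        key=lambda group: sum(freq.get(ch, 0) for ch in group),
        reverse=descending,
    )
    return [ch for group in ranked for ch in group]


# Placeholder groups and made-up counts, for illustration only.
groups = [("TC1",), ("TC2", "SC2"), ("TC3", "KC3")]
freq = {"TC1": 50, "TC2": 10, "SC2": 30, "TC3": 5, "KC3": 100}
table_order = order_variant_groups(groups, freq)
print(table_order)
```

Here TC3 is rare on its own, but it is carried to the most frequent
position together with KC3 because the group is ranked by its sum.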
Experiment results 27 and 28 in [A3] show that this reordering
scheme improves SC and TC labels by 21.95% and 18.50%, respectively.
According to experiments with large han/hangeul domain samples,
for han/hangeul domains of 15 or more letters, AMC-ACE-Z with
reordering produced the shortest ACE labels, whose lengths are
approximately 2.0*n to 2.2*n (n = number of han/hangul code points in
a label), 33.3% more efficient than bare AMC-ACE-Z without the reordering.
This efficiency is close to that of UCS-2 (2.0*n) and much better
than that of UTF8 (3.0*n).
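The UCS-2 and UTF8 reference points can be checked directly. In this
sketch the four-character label is an arbitrary example, "utf-16-be"
is used for UCS-2 (the two coincide for these BMP characters), and the
ACE figure comes from the N=15 row of appendix table 23.

```python
label = "\u4e2d\u6587\u57df\u540d"   # arbitrary example: 4 han code points
n = len(label)

utf8_per_cp = len(label.encode("utf-8")) / n        # octets per code point
ucs2_per_cp = len(label.encode("utf-16-be")) / n    # octets per code point

# N=15 row of appendix table 23: 1630 ACE characters for 750 code points.
ace_per_cp = round(1630 / 750, 2)

print(utf8_per_cp, ucs2_per_cp, ace_per_cp)
```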
The appendix [A3] also contains some tuning experiments on ACE-Z's
skew and damp parameters. With skew==48 and damp==75, +1.3% in
compression ratio was achieved for han domains with some marginal
loss of efficiency in non-CJK scripts.
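To show where skew and damp enter, here is the bias adaptation
function of Punycode (RFC 3492) with both as parameters. Treating this
as the AMC-ACE-Z formula is an assumption; the draft may differ in
detail. BASE, TMIN, TMAX and the defaults skew=38, damp=700 are the
RFC 3492 values, not values taken from this message.

```python
BASE, TMIN, TMAX = 36, 1, 26   # RFC 3492 values, assumed here


def adapt(delta, numpoints, first_time, skew=38, damp=700):
    """Bias adaptation as specified in RFC 3492, section 6.1."""
    # damp scales down the very first delta, which is usually large.
    delta = delta // damp if first_time else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    # skew shapes how quickly the bias grows with delta.
    return k + ((BASE - TMIN + 1) * delta) // (delta + skew)


# First delta of a label starting with U+4E00: 0x4E00 - 0x80 = 19840.
rfc_bias = adapt(19840, 1, True)
tuned_bias = adapt(19840, 1, True, skew=48, damp=75)
print(rfc_bias, tuned_bias)
```

With the smaller damp, the bias after the first han code point settles
higher, so the encoder expects the following deltas to be large as well.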
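Columns: N | labels | code points | ACE length before reordering (per code point) | ACE length after reordering (per code point) | improvement (%)
19. unihan-1024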
| 1| 4427| 4427| 14957(3.38)| 14711(3.32)| 1.64|
| 2| 57418| 114836| 384468(3.35)| 353466(3.08)| 8.06|
| 3| 41335| 124005| 401283(3.24)| 352095(2.84)|12.26|
| 4| 89296| 357184| 1139404(3.19)| 933070(2.61)|18.11|
| 5| 21091| 105455| 332420(3.15)| 267893(2.54)|19.41|
| 6| 15128| 90768| 284134(3.13)| 220263(2.43)|22.48|
| 7| 5181| 36267| 112576(3.10)| 86487(2.38)|23.17|
| 8| 3082| 24656| 76272(3.09)| 57854(2.35)|24.15|
| 9| 1417| 12753| 39319(3.08)| 29779(2.34)|24.26|
| 10| 1203| 12030| 37136(3.09)| 27817(2.31)|25.09|
| 11| 474| 5214| 16072(3.08)| 12035(2.31)|25.12|
| 12| 398| 4776| 14714(3.08)| 10947(2.29)|25.60|
| 13| 164| 2132| 6532(3.06)| 4881(2.29)|25.28|
| 14| 122| 1708| 5232(3.06)| 3927(2.30)|24.94|
| 15| 50| 750| 2283(3.04)| 1677(2.24)|26.54|
| 16| 29| 464| 1419(3.06)| 1057(2.28)|25.51|
| 17| 8| 136| 405(2.98)| 309(2.27)|23.70|
|All| 240823| 897561| 2868626(3.20)| 2378268(2.65)|17.09|
20. unihan-2048
| 1| 4427| 4427| 14957(3.38)| 14422(3.26)| 3.58|
| 2| 57418| 114836| 384468(3.35)| 339996(2.96)|11.57|
| 3| 41335| 124005| 401283(3.24)| 340675(2.75)|15.10|
| 4| 89296| 357184| 1139404(3.19)| 909323(2.55)|20.19|
| 5| 21091| 105455| 332420(3.15)| 261039(2.48)|21.47|
| 6| 15128| 90768| 284134(3.13)| 214781(2.37)|24.41|
| 7| 5181| 36267| 112576(3.10)| 84440(2.33)|24.99|
| 8| 3082| 24656| 76272(3.09)| 56439(2.29)|26.00|
| 9| 1417| 12753| 39319(3.08)| 29082(2.28)|26.04|
| 10| 1203| 12030| 37136(3.09)| 27106(2.25)|27.01|
| 11| 474| 5214| 16072(3.08)| 11756(2.25)|26.85|
| 12| 398| 4776| 14714(3.08)| 10681(2.24)|27.41|
| 13| 164| 2132| 6532(3.06)| 4783(2.24)|26.78|
| 14| 122| 1708| 5232(3.06)| 3815(2.23)|27.08|
| 15| 50| 750| 2283(3.04)| 1631(2.17)|28.56|
| 16| 29| 464| 1419(3.06)| 1026(2.21)|27.70|
| 17| 8| 136| 405(2.98)| 302(2.22)|25.43|
|All| 240823| 897561| 2868626(3.20)| 2311297(2.58)|19.43|
21. unihan-2048-D ( the reordering in decreasing frequency order)
| 1| 4427| 4427| 14957(3.38)| 14432(3.26)| 3.51|
| 2| 57418| 114836| 384468(3.35)| 341134(2.97)|11.27|
| 3| 41335| 124005| 401283(3.24)| 341362(2.75)|14.93|
| 4| 89296| 357184| 1139404(3.19)| 912694(2.56)|19.90|
| 5| 21091| 105455| 332420(3.15)| 262224(2.49)|21.12|
| 6| 15128| 90768| 284134(3.13)| 216465(2.38)|23.82|
| 7| 5181| 36267| 112576(3.10)| 85401(2.35)|24.14|
| 8| 3082| 24656| 76272(3.09)| 56931(2.31)|25.36|
| 9| 1417| 12753| 39319(3.08)| 29420(2.31)|25.18|
| 10| 1203| 12030| 37136(3.09)| 27324(2.27)|26.42|
| 11| 474| 5214| 16072(3.08)| 11835(2.27)|26.36|
| 12| 398| 4776| 14714(3.08)| 10722(2.24)|27.13|
| 13| 164| 2132| 6532(3.06)| 4795(2.25)|26.59|
| 14| 122| 1708| 5232(3.06)| 3820(2.24)|26.99|
| 15| 50| 750| 2283(3.04)| 1631(2.17)|28.56|
| 16| 29| 464| 1419(3.06)| 1033(2.23)|27.20|
| 17| 8| 136| 405(2.98)| 301(2.21)|25.68|
|All| 240823| 897561| 2868626(3.20)| 2321524(2.59)|19.07|
22. unihan-3072
| 1| 4427| 4427| 14957(3.38)| 14136(3.19)| 5.49|
| 2| 57418| 114836| 384468(3.35)| 333660(2.91)|13.22|
| 3| 41335| 124005| 401283(3.24)| 336865(2.72)|16.05|
| 4| 89296| 357184| 1139404(3.19)| 903458(2.53)|20.71|
| 5| 21091| 105455| 332420(3.15)| 259189(2.46)|22.03|
| 6| 15128| 90768| 284134(3.13)| 213746(2.35)|24.77|
| 7| 5181| 36267| 112576(3.10)| 83977(2.32)|25.40|
| 8| 3082| 24656| 76272(3.09)| 56168(2.28)|26.36|
| 9| 1417| 12753| 39319(3.08)| 28917(2.27)|26.46|
| 10| 1203| 12030| 37136(3.09)| 26962(2.24)|27.40|
| 11| 474| 5214| 16072(3.08)| 11690(2.24)|27.26|
| 12| 398| 4776| 14714(3.08)| 10647(2.23)|27.64|
| 13| 164| 2132| 6532(3.06)| 4757(2.23)|27.17|
| 14| 122| 1708| 5232(3.06)| 3801(2.23)|27.35|
| 15| 50| 750| 2283(3.04)| 1634(2.18)|28.43|
| 16| 29| 464| 1419(3.06)| 1024(2.21)|27.84|
| 17| 8| 136| 405(2.98)| 301(2.21)|25.68|
|All| 240823| 897561| 2868626(3.20)| 2290932(2.55)|20.14|
23. unihan-4096
| 1| 4427| 4427| 14957(3.38)| 13899(3.14)| 7.07|
| 2| 57418| 114836| 384468(3.35)| 332156(2.89)|13.61|
| 3| 41335| 124005| 401283(3.24)| 336045(2.71)|16.26|
| 4| 89296| 357184| 1139404(3.19)| 902406(2.53)|20.80|
| 5| 21091| 105455| 332420(3.15)| 258709(2.45)|22.17|
| 6| 15128| 90768| 284134(3.13)| 213522(2.35)|24.85|
| 7| 5181| 36267| 112576(3.10)| 83844(2.31)|25.52|
| 8| 3082| 24656| 76272(3.09)| 56083(2.27)|26.47|
| 9| 1417| 12753| 39319(3.08)| 28883(2.26)|26.54|
| 10| 1203| 12030| 37136(3.09)| 26935(2.24)|27.47|
| 11| 474| 5214| 16072(3.08)| 11684(2.24)|27.30|
| 12| 398| 4776| 14714(3.08)| 10632(2.23)|27.74|
| 13| 164| 2132| 6532(3.06)| 4751(2.23)|27.27|
| 14| 122| 1708| 5232(3.06)| 3794(2.22)|27.48|
| 15| 50| 750| 2283(3.04)| 1630(2.17)|28.60|
| 16| 29| 464| 1419(3.06)| 1026(2.21)|27.70|
| 17| 8| 136| 405(2.98)| 301(2.21)|25.68|
|All| 240823| 897561| 2868626(3.20)| 2286300(2.55)|20.30|
24. unihan-4096-D
| 1| 4427| 4427| 14957(3.38)| 13909(3.14)| 7.01|
| 2| 57418| 114836| 384468(3.35)| 332799(2.90)|13.44|
| 3| 41335| 124005| 401283(3.24)| 335682(2.71)|16.35|
| 4| 89296| 357184| 1139404(3.19)| 905086(2.53)|20.56|
| 5| 21091| 105455| 332420(3.15)| 259944(2.46)|21.80|
| 6| 15128| 90768| 284134(3.13)| 215353(2.37)|24.21|
| 7| 5181| 36267| 112576(3.10)| 84893(2.34)|24.59|
| 8| 3082| 24656| 76272(3.09)| 56682(2.30)|25.68|
| 9| 1417| 12753| 39319(3.08)| 29273(2.30)|25.55|
| 10| 1203| 12030| 37136(3.09)| 27189(2.26)|26.79|
| 11| 474| 5214| 16072(3.08)| 11762(2.26)|26.82|
| 12| 398| 4776| 14714(3.08)| 10674(2.23)|27.46|
| 13| 164| 2132| 6532(3.06)| 4770(2.24)|26.97|
| 14| 122| 1708| 5232(3.06)| 3803(2.23)|27.31|
| 15| 50| 750| 2283(3.04)| 1630(2.17)|28.60|
| 16| 29| 464| 1419(3.06)| 1028(2.22)|27.55|
| 17| 8| 136| 405(2.98)| 300(2.21)|25.93|
|All| 240823| 897561| 2868626(3.20)| 2294777(2.56)|20.00|
25. unihan-4096-DAMP075-SKEW48
| 1| 4427| 4427| 14957(3.38)| 13899(3.14)| 7.07|
| 2| 57418| 114836| 375901(3.27)| 324587(2.83)|13.65|
| 3| 41335| 124005| 394416(3.18)| 330550(2.67)|16.19|
| 4| 89296| 357184| 1126357(3.15)| 890277(2.49)|20.96|
| 5| 21091| 105455| 329783(3.13)| 255913(2.43)|22.40|
| 6| 15128| 90768| 282751(3.12)| 211339(2.33)|25.26|
| 7| 5181| 36267| 112181(3.09)| 83126(2.29)|25.90|
| 8| 3082| 24656| 76111(3.09)| 55712(2.26)|26.80|
| 9| 1417| 12753| 39285(3.08)| 28699(2.25)|26.95|
| 10| 1203| 12030| 37150(3.09)| 26767(2.23)|27.95|
| 11| 474| 5214| 16028(3.07)| 11603(2.23)|27.61|
| 12| 398| 4776| 14712(3.08)| 10567(2.21)|28.17|
| 13| 164| 2132| 6528(3.06)| 4735(2.22)|27.47|
| 14| 122| 1708| 5248(3.07)| 3762(2.20)|28.32|
| 15| 50| 750| 2281(3.04)| 1628(2.17)|28.63|
| 16| 29| 464| 1425(3.07)| 1017(2.19)|28.63|
| 17| 8| 136| 404(2.97)| 301(2.21)|25.50|
|All| 240823| 897561| 2835518(3.16)| 2254482(2.51)|20.49|
26. unihan-4096-DUDE
| 1| 4427| 4427| 17708(4.00)| 17708(4.00)| 0.00|
| 2| 57418| 114836| 443874(3.87)| 409657(3.57)| 7.71|
| 3| 41335| 124005| 474117(3.82)| 408039(3.29)|13.94|
| 4| 89296| 357184| 1361917(3.81)| 1074237(3.01)|21.12|
| 5| 21091| 105455| 401146(3.80)| 308378(2.92)|23.13|
| 6| 15128| 90768| 344208(3.79)| 250925(2.76)|27.10|
| 7| 5181| 36267| 137275(3.79)| 99475(2.74)|27.54|
| 8| 3082| 24656| 93013(3.77)| 65889(2.67)|29.16|
| 9| 1417| 12753| 48000(3.76)| 34230(2.68)|28.69|
| 10| 1203| 12030| 45427(3.78)| 31663(2.63)|30.30|
| 11| 474| 5214| 19564(3.75)| 13708(2.63)|29.93|
| 12| 398| 4776| 18013(3.77)| 12468(2.61)|30.78|
| 13| 164| 2132| 7969(3.74)| 5590(2.62)|29.85|
| 14| 122| 1708| 6377(3.73)| 4476(2.62)|29.81|
| 15| 50| 750| 2811(3.75)| 1926(2.57)|31.48|
| 16| 29| 464| 1749(3.77)| 1213(2.61)|30.65|
| 17| 8| 136| 508(3.74)| 355(2.61)|30.12|
|All| 240823| 897561| 3423676(3.81)| 2739937(3.05)|19.97|
27. unihan-SC-4096 ( SC only or SC+TC mixed )
| 1| 769| 769| 2717(3.53)| 2378(3.09)|12.48|
| 2| 16065| 32130| 108598(3.38)| 92597(2.88)|14.73|
| 3| 14315| 42945| 139693(3.25)| 116054(2.70)|16.92|
| 4| 48871| 195484| 623650(3.19)| 491073(2.51)|21.26|
| 5| 12135| 60675| 190928(3.15)| 147721(2.43)|22.63|
| 6| 10463| 62778| 196038(3.12)| 146516(2.33)|25.26|
| 7| 3594| 25158| 77931(3.10)| 57412(2.28)|26.33|
| 8| 2373| 18984| 58686(3.09)| 42907(2.26)|26.89|
| 9| 1078| 9702| 29875(3.08)| 21736(2.24)|27.24|
| 10| 934| 9340| 28786(3.08)| 20855(2.23)|27.55|
| 11| 392| 4312| 13279(3.08)| 9612(2.23)|27.62|
| 12| 314| 3768| 11579(3.07)| 8376(2.22)|27.66|
| 13| 144| 1872| 5724(3.06)| 4158(2.22)|27.36|
| 14| 104| 1456| 4455(3.06)| 3226(2.22)|27.59|
| 15| 41| 615| 1868(3.04)| 1348(2.19)|27.84|
| 16| 25| 400| 1219(3.05)| 887(2.22)|27.24|
| 17| 7| 119| 353(2.97)| 264(2.22)|25.21|
|All| 111624| 470507| 1495379(3.18)| 1167120(2.48)|21.95|
28. unihan-TC-4096 ( TC only )
| 1| 3658| 3658| 12240(3.35)| 11521(3.15)| 5.87|
| 2| 41353| 82706| 275870(3.34)| 239559(2.90)|13.16|
| 3| 27020| 81060| 261590(3.23)| 219991(2.71)|15.90|
| 4| 40425| 161700| 515754(3.19)| 411333(2.54)|20.25|
| 5| 8956| 44780| 141492(3.16)| 110988(2.48)|21.56|
| 6| 4665| 27990| 88096(3.15)| 67006(2.39)|23.94|
| 7| 1587| 11109| 34645(3.12)| 26432(2.38)|23.71|
| 8| 709| 5672| 17586(3.10)| 13176(2.32)|25.08|
| 9| 339| 3051| 9444(3.10)| 7147(2.34)|24.32|
| 10| 269| 2690| 8350(3.10)| 6080(2.26)|27.19|
| 11| 82| 902| 2793(3.10)| 2072(2.30)|25.81|
| 12| 84| 1008| 3135(3.11)| 2256(2.24)|28.04|
| 13| 20| 260| 808(3.11)| 593(2.28)|26.61|
| 14| 18| 252| 777(3.08)| 568(2.25)|26.90|
| 15| 9| 135| 415(3.07)| 282(2.09)|32.05|
| 16| 4| 64| 200(3.12)| 139(2.17)|30.50|
| 17| 1| 17| 52(3.06)| 37(2.18)|28.85|
|All| 129199| 427054| 1373247(3.22)| 1119180(2.62)|18.50|