[idn] [FYI] improving ACE using code point reordering v1.0
Hi,
I have a new revision of my I-D, "Improving ACE using code point
reordering v1.0":
http://www.postel.co.kr/lsb-ace-01.txt (sources for DUDE and
AMC-ACE-W are included)
It reports a 30%~58% improvement in the compression of ACE labels
for typical Han/Hangul business names in CJK.
Reordering v1.0 is based on character frequency plus
WORD ADJACENCY statistics for modern Han/Hangul business names,
and on the fact that the 256 most frequent Han letters have a
cumulative usage frequency of nearly 60% (about 80% for the 256
most frequent Hangul syllables).
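To illustrate the frequency part of this idea, here is a minimal
sketch in Python. It is not the mapping from the I-D: the names
build_tables and reorder, the tiny corpus, and the choice to permute
only the code points seen in that corpus are assumptions made for
illustration, and word-adjacency statistics are not modelled.

```python
from collections import Counter

def build_tables(corpus):
    """Return (forward, inverse) translation tables (hypothetical helper).

    The code points seen in the corpus are permuted among themselves so
    that the most frequent one takes the lowest code point value.
    """
    counts = Counter(ch for label in corpus for ch in label)
    # Most frequent first; ties broken by code point for determinism.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ord(ch)))
    # Available target positions, lowest code point first.
    slots = sorted(counts)
    forward = {ord(src): dst for src, dst in zip(ranked, slots)}
    inverse = {ord(dst): src for src, dst in zip(ranked, slots)}
    return forward, inverse

def reorder(label, table):
    """Apply a table; code points outside the table pass through."""
    return label.translate(table)

# Made-up three-label corpus; U+D55C is the most frequent code point.
corpus = ["\ud55c\uad6d", "\ud55c\uae00", "\ud55c"]
forward, inverse = build_tables(corpus)
mapped = reorder("\ud55c\uad6d", forward)
restored = reorder(mapped, inverse)
```

Because the table is a permutation, the inverse table recovers the
original label exactly, which any reordering layer would need.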
I applied this reordering to both DUDE and AMC-ACE-W, and found
DUDE outperforms AMC-ACE-W even for han/hangul.
I think tricks, tuning or heuristics (not based on language-specific
knowledge) are not enough to get the 'ceiling' ACE compression ratio.
I propose: "Let languages compress themselves in a new reordering layer
between NAMEPREP and ACE, and leave the ACE encoding as simple as
possible".
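The proposed layering can be sketched as follows. This is an
illustration only: the two-entry table is made up, the input is
assumed to have already passed through NAMEPREP, and Python's built-in
punycode codec stands in for the ACE step because DUDE and AMC-ACE-W
are not in the standard library.

```python
# Hypothetical reordering table: a bijection on code points.
# Here two Hangul syllables simply swap places.
FORWARD = str.maketrans({"\ud55c": "\uac00", "\uac00": "\ud55c"})
INVERSE = str.maketrans({"\uac00": "\ud55c", "\ud55c": "\uac00"})

def to_ace(prepped_label):
    """Reorder, then hand the result to an unmodified ACE encoder."""
    reordered = prepped_label.translate(FORWARD)
    return reordered.encode("punycode").decode("ascii")

def from_ace(ace_label):
    """Decode the ACE form, then undo the reordering."""
    reordered = ace_label.encode("ascii").decode("punycode")
    return reordered.translate(INVERSE)

label = "\ud55c\uad6d"   # assumed to be NAMEPREP output already
ace = to_ace(label)
round_trip = from_ace(ace)
```

The ACE encoder itself is untouched; only the code points it sees
change, which is the separation of concerns the proposal argues for.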
I would appreciate your careful evaluation and feedback.
Thanks.
Soobok, lsb@postel.co.kr
-------------------------------------------------------------------------------
Example Strings
These Hangul examples show about a 30%~58% improvement in DUDE
compression ratio.
LDUDE and LAMCW denote DUDE-02 and AMC-ACE-W with reordering
applied, respectively (AMCW stands for AMC-ACE-W).
In most examples LDUDE outperforms LAMCW.
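The "% shorter" figures in the examples appear to be the relative
reduction in encoded length. Assuming that reading, they can be
reproduced as follows (percent_shorter is a name chosen here, not one
from the I-D):

```python
def percent_shorter(before, after):
    """Relative length reduction, in percent, of `after` versus `before`."""
    return 100.0 * (before - after) / before

# Lengths taken from examples K2 and K5 in this message.
k2 = percent_shorter(32, 14)   # DUDE-02 32 chars -> LDUDE 14 chars
k5 = percent_shorter(50, 29)   # DUDE-02 50 chars -> LDUDE 29 chars
```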
(K1) Korean String 1: ( 24 hangul syllables )
u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4
u+C774 u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C
u+B2E4 u+BA74 u+C5BC u+B9C8 u+B098 u+C88B
DUDE-02 : 6txiy79ny53nz79a8wizwwnzzuavyizv3atuuiz2vby27jz66iz8sit\
usauiyz5i23az96iz6ze3xaz2td ( 82 chars )
LDUDE : 5suhxb9jt2pydtwetwkxhtsrxhbyhvsmvvk7r2ityd6atqt8etvittk
( 55 chars, 32.9% shorter )
AMCW : 6tvifgem42ixihhakfnh6nhhem5wrk6fmpmpwim6m5wrmwxn5u8eivw\
mp6iqige2nem ( 67 chars )
LAMCW : 5swhtg8r5tycsb5swfgirxi5sxhsabyg5vypgcz2isa5tyd4d5p5sxj\
gmbgd5 ( 61 chars )
(K2) Korean String 2: ( 9 hangul syllables )
<KRNIC in korean>
U+D55C U+AD6D U+C778 U+D130 U+B137 U+C815 U+BCF4 U+C13C
DUDE-02 : 7xvNz2vBy4tFtywIyssHz3uCzw8Bz76I
( 32 chars )
LDUDE : 5syAB3BIJ7BB7N
( 14 chars, 56.2% shorter )
AMCW : 7xxNFmpM52QjsGjzNaxJhwKj6
( 25 chars )
LAMCW : 5ssAsB3AIBwAB3P
( 15 chars )
(K3) Korean String 3: ( 18 hangul syllables )
U+C804 U+AD6D U+C2E4 U+C9C1 U+B178 U+C219 U+C790 U+B300
U+CC45 U+C885 U+AD50 U+C2DC U+BBFC U+B2E8 U+CCB4 U+D611
U+C758 U+D68C
DUDE-02 : 62yEyxyJy92J5uFz25JzvyBx2Jzw3Az9wFw6Ayx7Fy92Nz3uA3tEz8\
xNt44FttwJtt7E ( 68 chars )
LDUDE : 5szAtBtvBt7Mt2Qv4Qu7KtFt5It3MuEvAtvDyJCtuC4G4J
( 46 chars, 32% shorter )
AMCW : 62sEFmpKzeNqbGm2Ks3M6sG2aPcfNefFksKy6I96GziPfwRstM42Rwn
( 55 chars )
LAMCW : 5stAsB5tvAGhmGmgG2mGatsE5t7JGbhsDvD5tsAyIK5swJ8RwG
( 50 chars )
(K4) Korean String 4: ( 7 hangul syllables )
<Hynics Semiconductor in korean>
U+D558 U+C774 U+B2C9 U+C2A4 U+BC18 U+B3C4 U+CCB4
DUDE-02 : 7xvItuuNzx5PzsyPz85N97Nz9zA
( 27 chars )
LDUDE : 5s3C4F5Q7PtwRtMK
( 16 chars, 40% shorter )
AMCW : 7xxIM5wGyjKxeJa2G8ePfw
( 22 chars )
LAMCW : 5s9CxH8JvE5tzMyAK
( 17 chars )
(K5) Korean String 5: ( 13 hangul syllables )
U+D658 U+ACBD U+C6B4 U+B3D9 U+C5F0 U+D569 U+BC18 U+D575
U+D2B9 U+BCC4 U+C704 U+C6D0 U+D68C
DUDE-02 : 7yvIz48Fy4sJzxyPzyuJts3Jy3zBy3yPz6Ny8zPz56At7EtsxN
( 50 chars )
LDUDE : 5s7NB4EDvHFtxDv5Kv6NtIt4R5GwK
( 29 chars, 42% shorter )
AMCW : 7yxIFf7MxwG83MrsRmjJa2RmxQx3JgeM2eMysRwn
( 40 chars )
LAMCW : 5s5N5PtJKuPI5tzMGybGiptF5s5KsNwG
( 32 chars )
About 32%~45% improvement in DUDE compression ratio is achieved in
these UniHan examples.
(TC1) Traditional Chinese String 1: ( 16 letters )
u+5354 u+91c7 u+5065 u+5eb7 u+4e8b u+696d u+670d u+52d9
u+7db2 u+002d u+5354 u+91c7 u+6709 u+9650 u+516c u+53f8
DUDE-02 : xvve6u3d6t4c87ctsvnuz8g8yavx7eu9ym-u88g6u3d9y6q9txj6z\
vnu3e ( 58 chars)
LDUDE : xs8qy7ny9jhyi6f6bb8h-4iy7nyxkbed
( 32 chars, 44.8% shorter)
AMCW : xvxen8huyfafzs2mc5pcipw7jh7u--xxen8hcijqcsvynx9i
( 48 chars )
LAMCW : xs2q2xcu4m4n6esb6abug--2q2xcusijpq
( 34 chars )
(TC2) Traditional Chinese String 2: ( 21 letters )
u+5317 u+4eac u+5e02 u+91ab u+85e5 u+7d93 u+6fdf u+6280
u+8853 u+7d93 u+71df u+516c u+53f8 u+5fa1 u+91ab u+7db2
u+7d61 u+83ef u+91ab u+7db2 u+8def
DUDE-02 : xvzht75mts4q694jtwwq92zgtuwn7xr847d9x6a6wnus5du3e6xj6\
8sk86tj7d982qtuwe86tj9sxp ( 78 chars)
LDUDE : xtwicfz6b99a38g27c2vdd8cz7mzuqdt6izuiy6iz5nz5fy6by6ib
( 53 chars, 32.0% shorter)
AMCW : xvths4naacn7mj9fh6veq9beakuvh6ve89vynx9iapbn7mh7uyb2v\
8rn7mh7um9r ( 64 chars )
LAMCW : xtuiukr28q5tqu9i4ukutjk9i3uduspqv6g28quug33kuur28quugh
( 54 chars )
(TC3) Traditional Chinese String 3: ( 18 letters )
u+795e u+8fb2 u+7db2 u+990a u+8eab u+4fdd u+5065 u+7db2
u+5065 u+5eb7 u+4e16 u+754c u+5065 u+5eb7 u+8a2d u+8a08
u+5bb6 u+60e0
DUDE-02 : z3vq9y8n9usa8w5itz4b6tzgt95iu77hu77h87cts4bv5xkuxuj87\
c7w3kuf7t5qv5xg ( 68 chars )
LDUDE : xwsiw5e9kzyqz8fhb2p2phtvgxtbwuah8qbtwmyg
( 40 chars, 41.1% shorter )
AMCW : z3xqnpuh7uq2knfmt7puyfh7uuyfafzstgf4nuyfafzmbpsi75gys\
8a ( 55 chars )
LAMCW : xwyiu7nug3wiu4pkmug4mnv3ky2mu4mnwcdvsiyq
( 40 chars )
(SC1) Simplified Chinese String 1 : ( 16 letters )
<ministry of foreign trade and economic cooperation, PRC>
u+4e2d u+534e u+4eba u+6c11 u+5171 u+548c u+56fd u+5bf9
u+5916 u+8d38 u+6613 u+7ecf u+6d4e u+5408 u+4f5c u+90e8
DUDE-02 : w8wpt7ydt79euu4mv7yax9puzb7seu8r7wuq85umt27ntv2bv3wgt\
5xe795e ( 60 chars )
LDUDE : xswjuzru6nu7fv7kv4gutrwgb7mbwiu6cuzqqxm
( 39 chars, 35.0% shorter )
AMCW : w8up29ps5kdst5uh7ygsup29pm3cb39n8tknpb39hkygswhdysupa\
qd ( 55 chars )
LAMCW : xsujwxgu3kwwrv3fwvduunykm5ab9jwvmuwfmta
( 39 chars )
(SC2) Simplified Chinese String 2 : ( 18 letters )
u+4e2d u+56fd u+4eba u+6c11 u+5927 u+5b66 u+4e2d u+56fd
u+8d22 u+653f u+91d1 u+878d u+653f u+7b56 u+7814 u+7a76
u+4e2d u+5fc3
DUDE-02 : w8wpt27at2whuu4mvxvguwbtxwmt27a757r82tp9w8qtyxn8u5ct\
8yjvwcuycvwxmtt8q ( 69 chars )
LDUDE : xswjf5gu7fu6rb4ifz8dx6ju8gnu8kwugy8fd8rd
( 40 chars, 42.2% shorter )
LAMCW : xsujun3kwwru2abujn36rwsgu8anwsg2uau6fgujk
( 41 chars )
About 20%~35% improvement in DUDE compression ratio is achieved in
these Japanese Kanji/Katakana examples.
(JP1) Japanese String 1: ( 25 letters )
U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3 U+30C8
U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9 U+30E1 U+30FC
U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF U+30FC
DUDE-02 : z3xQu97Pv4vGuuyRu5xRu6Jxz8BQMuHtDxDMxHuGzNwItPwMxAtE\
wIwIwNwD (60 chars)
LDUDE : xs8Nu2Cu4RvMGBysxGyCKtHtQCPFtAyPyKtPBGPyAyAyFyR
( 47 chars, 21.6% shorter)
AMCW : z3vQ28DDyxs5KB9fCjnvs6P6DI8R9N4RE9D7F4J8B9N5H8H9D5M9\
D5R9N ( 57 chars )
LAMCW : xs2NwsQu4B3KNPvs6M4JD5E4KIFA5A7P5H4KMPA6A4A6F4K
( 47 chars )
(JP2) Japanese String 2: ( 16 letters )
U+8CA1 U+56E3 U+6CD5 U+4EBA U+5317 U+6D77 U+9053 U+81EA
U+7136 U+4FDD U+8B77 U+63A8 U+9032 U+5354 U+4F1A
DUDE-02 : 266B74wCv4vGuuyRt74Pv8yA97uEtt5J9s7Nv88M6w4K827R9v3K\
6vyGt6wQ (60 chars)
LDUDE : xs3Hu9Ju4RvMt5CFvuGvsRxtGw5Iz2Ev6BzIwtJE
( 40 chars, 33.3% shorter)
AMCW : 264B28DDyxs5KxtHD5zNuvI9kE3yt7PMmzBpiNtuxxEttK
( 46 chars )
LAMCW : xs9HwsQu4B3KvuIPwsMvsEytCu4K3uQy8R3Hu2QK
( 40 chars )
(JP3) Japanese String 3: ( 16 letters )
U+6771 U+4EAC U+90FD U+60C5 U+5831 U+30B5 U+30FC U+30D3 U+30B9
U+7523 U+696D U+5065 U+5EB7 U+4FDD U+967A U+7D44 U+5408
DUDE-02 : yztBu37P78xB9svIv29Ey22EwJuRyKwx3Kt6wQv3sI87CttyK734\
H85vQu3wN (61 chars)
LDUDE : xttHxPvtFu9CDyssAyEyHyRys9PxQ4KHGEu4CuwJ
( 40 chars, 34.4% shorter)
AMCW : z3vQ28DDyxs5KB9fCjnvs6P6DI8R9N4RE9D7F4J8B9N5H8H9D5M9\
D5R9N ( 57 chars )
LAMCW : xs2NwsQu4B3KNPvs6M4JD5E4KIFA5A7P5H4KMPA6A4A6F4K
( 47 chars )
LDUDE-2 also shows a good compression ratio for the Latin family of
scripts.
(L1) Vietnamese: ( 38 syllables using diacritical marks )
Ta<dotbelow>isaoho<dotbelow>kh<ocirc>ngth<ecirc><hookabove>chi\
<hookabove>no<acute>iti<ecirc><acute>ngVi<ecirc><dotbelow>t
U+0054 u+0061 u+0323 u+0069 u+0073 u+0061 u+006F u+0068 u+006F
u+0323 u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+00EA
u+0309 u+0063 u+0068 u+0069 u+0309 u+006E u+006F u+0301 u+0069
u+0074 u+0069 u+00EA u+0301 u+006E u+0067 U+0056 u+0069 u+00EA
u+0323 u+0074
DUDE-02 : vEvfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\
yitptp2dv8mvyrjvBvr2dv6jvxh ( 82 chars )
LDUDE : uGuh5c5kckqhh5n4atm3n3ktmtdq2cxd7kmb7a7hb7q7irr2dxm7rt\
muDvr2dvj5f (66 chars, 16 chars (19%) shorter)
(L2) Spanish: ( using basic Latin & Latin Supplement )
Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
u+0061 u+00F1 u+006F u+006C
DUDE-02 : vAvrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmuMvgdt\
b3a3qd (61 chars)
LDUDE : uAurftmtg2q2hbrhcbbmfcepnjiimidpjdqpmrmuMuqmb3a3qd
(51 chars, 10 chars (16%) shorter)
(L3) Czech: (using Latin Extended A)
Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
u+0065 u+0073 u+006B u+0079
DUDE-02 : vAuctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc
(45 chars)
LDUDE : uAukfycypkfepzpzfmibmtb3m8ayiqtik
(34 chars, 24% shorter)