
[idn] [FYI] improving ACE using code point reordering v1.0



Hi,

I have a new revision of my I-D, "improving ACE using code point
reordering v1.0":
 http://www.postel.co.kr/lsb-ace-01.txt   ( sources included for DUDE and
AMC-ACE-W )

It reports 30%~58% shorter ACE labels for typical han/hangul
business names in CJK.

Reordering v1.0 is based on character frequency plus
WORD ADJACENCY statistics for modern han/hangul business names,
and on the fact that the 256 most frequent han letters have a cumulative
usage frequency of nearly 60% ( about 80% for the 256 most frequent
hangul syllables ).
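
A minimal sketch of how such frequency statistics could be gathered. It
is not the code from the I-D: the sample names are made up for
illustration, the function names are hypothetical, and the sketch ignores
the word adjacency statistics that reordering v1.0 also uses.

```python
from collections import Counter

def rank_code_points(names):
    """Count code points in a corpus and rank them, most frequent first.

    Ties are broken by code point value so the ranking is deterministic.
    """
    counts = Counter(ord(ch) for name in names for ch in name)
    ranked = sorted(counts, key=lambda cp: (-counts[cp], cp))
    return ranked, counts

def cumulative_coverage(ranked, counts, top_n):
    """Fraction of all character occurrences covered by the top_n code points."""
    total = sum(counts.values())
    return sum(counts[cp] for cp in ranked[:top_n]) / total

# Made-up sample corpus (an assumption, not the author's data).
sample_names = ["한국전자", "한국통신", "대한전선"]
ranked, counts = rank_code_points(sample_names)
top3 = cumulative_coverage(ranked, counts, 3)
```

On a real corpus of business names, calling cumulative_coverage with
top_n=256 would give the kind of figure quoted above.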

I applied this reordering to both DUDE and AMC-ACE-W, and found
DUDE outperforms AMC-ACE-W even for han/hangul.

I think tricks, tuning or heuristics that are not based on
language-specific knowledge are not enough to reach the 'ceiling' ACE
compression ratio.

I propose: "Let languages compress themselves in a new reordering layer
between NAMEPREP and ACE, and leave ACE encoding as simple as possible".
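
One way to picture such a layer, as a sketch under stated assumptions
rather than the mapping defined in the I-D: a reversible permutation of
one script block that moves the most frequent code points to the lowest
positions. All names here are hypothetical, and `nameprep` and
`ace_encode` are stand-in callables supplied by the caller, not real
NAMEPREP or DUDE/AMC-ACE-W implementations.

```python
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3  # Unicode Hangul Syllables block

def build_permutation(ranked, first=HANGUL_FIRST, last=HANGUL_LAST):
    """Return (forward, inverse) tables that permute the block [first, last].

    `ranked` lists distinct code points inside the block, most frequent
    first. They take the lowest positions; every other code point follows
    in its original relative order, so the mapping stays reversible.
    """
    seen = set(ranked)
    order = list(ranked) + [cp for cp in range(first, last + 1) if cp not in seen]
    forward = {cp: first + i for i, cp in enumerate(order)}
    inverse = {new: old for old, new in forward.items()}
    return forward, inverse

def apply_table(label, table):
    """Remap each character through the table; others pass through unchanged."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in label)

def to_ace(label, nameprep, forward, ace_encode):
    """Reordering sits between the nameprep step and the ACE encoder."""
    return ace_encode(apply_table(nameprep(label), forward))

# Illustration with a two-entry ranking (an assumption, not real statistics).
forward, inverse = build_permutation([0xD55C, 0xAD6D])
reordered = apply_table("\ud55c\uad6d", forward)
restored = apply_table(reordered, inverse)
```

Because the table is a bijection on the block, the decoder only needs the
inverse table to recover the original label after ACE decoding.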


Your careful evaluation and feedback would be appreciated.

Thanks.

Soobok, lsb@postel.co.kr

-------------------------------------------------------------------------------
Example Strings

    About 30%~58% improvement in DUDE compression ratio is achieved in
    these Hangul examples.

    LDUDE and LAMCW denote DUDE-02 and AMC-ACE-W with reordering
    applied, respectively ( AMCW stands for AMC-ACE-W ).
    In most examples LDUDE outperforms LAMCW.
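
    The "N% shorter" figures in the examples compare the length of the
    reordered encoding with the length of the plain one. A small sketch
    of that arithmetic follows; the function name is hypothetical, and
    the two strings are the DUDE-02 and LDUDE encodings from example K2.

```python
def percent_shorter(before, after):
    """Relative length reduction, in percent, going from `before` to `after`."""
    return 100.0 * (before - after) / before

# Encodings copied from example (K2).
dude_k2 = "7xvNz2vBy4tFtywIyssHz3uCzw8Bz76I"
ldude_k2 = "5syAB3BIJ7BB7N"
k2_gain = percent_shorter(len(dude_k2), len(ldude_k2))
```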

    (K1) Korean String 1: ( 24 hangul syllables )
      u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4
      u+C774 u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C
      u+B2E4 u+BA74 u+C5BC u+B9C8 u+B098 u+C88B

      DUDE-02 : 6txiy79ny53nz79a8wizwwnzzuavyizv3atuuiz2vby27jz66iz8sit\
                usauiyz5i23az96iz6ze3xaz2td ( 82 chars )
      LDUDE   : 5suhxb9jt2pydtwetwkxhtsrxhbyhvsmvvk7r2ityd6atqt8etvittk
                ( 55 chars, 32.9% shorter )
      AMCW    : 6tvifgem42ixihhakfnh6nhhem5wrk6fmpmpwim6m5wrmwxn5u8eivw\
                mp6iqige2nem ( 67 chars )
      LAMCW   : 5swhtg8r5tycsb5swfgirxi5sxhsabyg5vypgcz2isa5tyd4d5p5sxj\
                gmbgd5 ( 61 chars )


    (K2) Korean String 2:  ( 9 hangul syllables )
      <KRNIC in Korean>
      U+D55C U+AD6D U+C778 U+D130 U+B137 U+C815 U+BCF4 U+C13C

      DUDE-02 : 7xvNz2vBy4tFtywIyssHz3uCzw8Bz76I
                ( 32 chars )
      LDUDE   : 5syAB3BIJ7BB7N
                ( 14 chars,  56.2% shorter )
      AMCW    : 7xxNFmpM52QjsGjzNaxJhwKj6
                ( 25 chars )
      LAMCW   : 5ssAsB3AIBwAB3P
                ( 15 chars )


    (K3) Korean String 3:  ( 18 hangul syllables )
      U+C804 U+AD6D U+C2E4 U+C9C1 U+B178 U+C219 U+C790 U+B300
      U+CC45 U+C885 U+AD50 U+C2DC U+BBFC U+B2E8 U+CCB4 U+D611
      U+C758 U+D68C

      DUDE-02 : 62yEyxyJy92J5uFz25JzvyBx2Jzw3Az9wFw6Ayx7Fy92Nz3uA3tEz8\
                xNt44FttwJtt7E ( 68 chars )
      LDUDE   : 5szAtBtvBt7Mt2Qv4Qu7KtFt5It3MuEvAtvDyJCtuC4G4J
                ( 46 chars,  32% shorter )
      AMCW    : 62sEFmpKzeNqbGm2Ks3M6sG2aPcfNefFksKy6I96GziPfwRstM42Rwn
                ( 55 chars )
      LAMCW   : 5stAsB5tvAGhmGmgG2mGatsE5t7JGbhsDvD5tsAyIK5swJ8RwG
                ( 50 chars )


    (K4) Korean String 4:  ( 7 hangul syllables )
      <Hynix Semiconductor in Korean>
      U+D558 U+C774 U+B2C9 U+C2A4 U+BC18 U+B3C4 U+CCB4

      DUDE-02 : 7xvItuuNzx5PzsyPz85N97Nz9zA
                ( 27 chars )
      LDUDE   : 5s3C4F5Q7PtwRtMK
                ( 16 chars,  40% shorter )
      AMCW    : 7xxIM5wGyjKxeJa2G8ePfw
                ( 22 chars )
      LAMCW   : 5s9CxH8JvE5tzMyAK
                ( 17 chars )


    (K5) Korean String 5:  ( 13 hangul syllables )
      U+D658 U+ACBD U+C6B4 U+B3D9 U+C5F0 U+D569 U+BC18 U+D575
      U+D2B9 U+BCC4 U+C704 U+C6D0 U+D68C

      DUDE-02 : 7yvIz48Fy4sJzxyPzyuJts3Jy3zBy3yPz6Ny8zPz56At7EtsxN
                ( 50 chars )
      LDUDE   : 5s7NB4EDvHFtxDv5Kv6NtIt4R5GwK
                ( 29 chars, 42% shorter )
      AMCW    : 7yxIFf7MxwG83MrsRmjJa2RmxQx3JgeM2eMysRwn
                ( 40 chars )
      LAMCW   : 5s5N5PtJKuPI5tzMGybGiptF5s5KsNwG
                ( 32 chars )



    About 32%~45% improvement in DUDE compression ratio is achieved in
    these UniHan examples.

    (TC1) Traditional Chinese String 1: ( 16 letters )
      u+5354 u+91c7 u+5065 u+5eb7 u+4e8b u+696d u+670d u+52d9
      u+7db2 u+002d u+5354 u+91c7 u+6709 u+9650 u+516c u+53f8

      DUDE-02 : xvve6u3d6t4c87ctsvnuz8g8yavx7eu9ym-u88g6u3d9y6q9txj6z\
                vnu3e  ( 58 chars)
      LDUDE   : xs8qy7ny9jhyi6f6bb8h-4iy7nyxkbed
                ( 32 chars, 44.8% shorter)
      AMCW    : xvxen8huyfafzs2mc5pcipw7jh7u--xxen8hcijqcsvynx9i
                ( 48 chars )
      LAMCW   : xs2q2xcu4m4n6esb6abug--2q2xcusijpq
                ( 34 chars )


    (TC2) Traditional Chinese String 2: ( 21 letters )
      u+5317 u+4eac u+5e02 u+91ab u+85e5 u+7d93 u+6fdf u+6280
      u+8853 u+7d93 u+71df u+516c u+53f8 u+5fa1 u+91ab u+7db2
      u+7d61 u+83ef u+91ab u+7db2 u+8def

      DUDE-02 : xvzht75mts4q694jtwwq92zgtuwn7xr847d9x6a6wnus5du3e6xj6\
                8sk86tj7d982qtuwe86tj9sxp ( 78 chars)
      LDUDE   : xtwicfz6b99a38g27c2vdd8cz7mzuqdt6izuiy6iz5nz5fy6by6ib
                ( 53 chars, 32.0% shorter)
      AMCW    : xvths4naacn7mj9fh6veq9beakuvh6ve89vynx9iapbn7mh7uyb2v\
                8rn7mh7um9r ( 64 chars )
      LAMCW   : xtuiukr28q5tqu9i4ukutjk9i3uduspqv6g28quug33kuur28quugh
                ( 54 chars )


    (TC3) Traditional Chinese String 3: ( 18 letters )
      u+795e u+8fb2 u+7db2 u+990a u+8eab u+4fdd u+5065 u+7db2
      u+5065 u+5eb7 u+4e16 u+754c u+5065 u+5eb7 u+8a2d u+8a08
      u+5bb6 u+60e0

      DUDE-02 : z3vq9y8n9usa8w5itz4b6tzgt95iu77hu77h87cts4bv5xkuxuj87\
                c7w3kuf7t5qv5xg ( 68 chars )
      LDUDE   : xwsiw5e9kzyqz8fhb2p2phtvgxtbwuah8qbtwmyg
                ( 40 chars, 41.1% shorter )
      AMCW    : z3xqnpuh7uq2knfmt7puyfh7uuyfafzstgf4nuyfafzmbpsi75gys\
                8a ( 55 chars )
      LAMCW   : xwyiu7nug3wiu4pkmug4mnv3ky2mu4mnwcdvsiyq
                ( 40 chars )


    (SC1) Simplified Chinese String  1 : ( 16 letters )
      <ministry of foreign trade and economic cooperation, PRC>
      u+4e2d u+534e u+4eba u+6c11 u+5171 u+548c u+56fd u+5bf9
      u+5916 u+8d38 u+6613 u+7ecf u+6d4e u+5408 u+4f5c u+90e8

      DUDE-02 : w8wpt7ydt79euu4mv7yax9puzb7seu8r7wuq85umt27ntv2bv3wgt\
                5xe795e ( 60 chars )
      LDUDE   : xswjuzru6nu7fv7kv4gutrwgb7mbwiu6cuzqqxm
                ( 39 chars, 35.0% shorter )
      AMCW    : w8up29ps5kdst5uh7ygsup29pm3cb39n8tknpb39hkygswhdysupa\
                qd ( 55 chars )
      LAMCW   : xsujwxgu3kwwrv3fwvduunykm5ab9jwvmuwfmta
                ( 39 chars )


    (SC2) Simplified Chinese String  2 : ( 18 letters )
      u+4e2d u+56fd u+4eba u+6c11 u+5927 u+5b66 u+4e2d u+56fd
      u+8d22 u+653f u+91d1 u+878d u+653f u+7b56 u+7814 u+7a76
      u+4e2d u+5fc3

      DUDE-02 : w8wpt27at2whuu4mvxvguwbtxwmt27a757r82tp9w8qtyxn8u5ct\
                8yjvwcuycvwxmtt8q ( 69 chars )
      LDUDE   : xswjf5gu7fu6rb4ifz8dx6ju8gnu8kwugy8fd8rd
                ( 40 chars, 42.0% shorter )
      LAMCW   : xsujun3kwwru2abujn36rwsgu8anwsg2uau6fgujk
                ( 41 chars )


    About 20%~35% improvement in DUDE compression ratio is achieved in
    these Japanese Kanji/Katakana examples.

    (JP1) Japanese String 1: ( 25 letters )
      U+793E U+56E3 U+6CD5 U+4EBA U+65E5 U+672C U+30CD U+30C3 U+30C8
      U+30EF U+30FC U+30AF U+30A4 U+30F3 U+30D5 U+30A9 U+30E1 U+30FC
      U+30B7 U+30E7 U+30F3 U+30BB U+30F3 U+30BF U+30FC

      DUDE-02 : z3xQu97Pv4vGuuyRu5xRu6Jxz8BQMuHtDxDMxHuGzNwItPwMxAtE\
                wIwIwNwD  (60 chars)
      LDUDE   : xs8Nu2Cu4RvMGBysxGyCKtHtQCPFtAyPyKtPBGPyAyAyFyR
                ( 47 chars, 21.6% shorter)
      AMCW    : z3vQ28DDyxs5KB9fCjnvs6P6DI8R9N4RE9D7F4J8B9N5H8H9D5M9\
                D5R9N ( 57 chars )
      LAMCW   : xs2NwsQu4B3KNPvs6M4JD5E4KIFA5A7P5H4KMPA6A4A6F4K
                ( 47 chars )


    (JP2) Japanese String 2: ( 16 letters )
      U+8CA1 U+56E3 U+6CD5 U+4EBA U+5317 U+6D77 U+9053 U+81EA
      U+7136 U+4FDD U+8B77 U+63A8 U+9032 U+5354 U+4F1A

      DUDE-02 : 266B74wCv4vGuuyRt74Pv8yA97uEtt5J9s7Nv88M6w4K827R9v3K\
                6vyGt6wQ (60 chars)
      LDUDE   : xs3Hu9Ju4RvMt5CFvuGvsRxtGw5Iz2Ev6BzIwtJE
                ( 40 chars, 33.3% shorter)
      AMCW    : 264B28DDyxs5KxtHD5zNuvI9kE3yt7PMmzBpiNtuxxEttK
                ( 46 chars )
      LAMCW   : xs9HwsQu4B3KvuIPwsMvsEytCu4K3uQy8R3Hu2QK
                ( 40 chars )


    (JP3) Japanese String 3: ( 16 letters )
      U+6771 U+4EAC U+90FD U+60C5 U+5831 U+30B5 U+30FC U+30D3 U+30B9
      U+7523 U+696D U+5065 U+5EB7 U+4FDD U+967A U+7D44 U+5408

      DUDE-02 : yztBu37P78xB9svIv29Ey22EwJuRyKwx3Kt6wQv3sI87CttyK734\
                H85vQu3wN (61 chars)
      LDUDE   : xttHxPvtFu9CDyssAyEyHyRys9PxQ4KHGEu4CuwJ
                ( 40 chars, 34.4% shorter)
      AMCW    : z3vQ28DDyxs5KB9fCjnvs6P6DI8R9N4RE9D7F4J8B9N5H8H9D5M9\
                D5R9N ( 57 chars )
      LAMCW   : xs2NwsQu4B3KNPvs6M4JD5E4KIFA5A7P5H4KMPA6A4A6F4K
                ( 47 chars )



    LDUDE-2 also shows a good compression ratio for the Latin family of
    scripts.

    (L1) Vietnamese: ( 38 code points, using combining diacritical marks )
      Ta<dotbelow>isaoho<dotbelow>kh<ocirc>ngth<ecirc><hookabove>chi\
      <hookabove>no<acute>iti<ecirc><acute>ngVi<ecirc><dotbelow>t
      U+0054 u+0061 u+0323 u+0069 u+0073 u+0061 u+006F u+0068 u+006F
      u+0323 u+006B u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+00EA
      u+0309 u+0063 u+0068 u+0069 u+0309 u+006E u+006F u+0301 u+0069
      u+0074 u+0069 u+00EA u+0301 u+006E u+0067 U+0056 u+0069 u+00EA
      u+0323 u+0074

      DUDE-02 : vEvfvwcvwktktcqhhvwnvwid3n3kjtdtn2cv8dvykmbvyavyhbvyqv\
                yitptp2dv8mvyrjvBvr2dv6jvxh ( 82 chars )
      LDUDE   : uGuh5c5kckqhh5n4atm3n3ktmtdq2cxd7kmb7a7hb7q7irr2dxm7rt\
                muDvr2dvj5f (66 chars, 16 chars (19%) shorter)


    (L2) Spanish: ( using basic Latin & Latin Supplement )
      Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
      U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
      u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
      u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
      u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
      u+0061 u+00F1 u+006F u+006C

      DUDE-02 : vAvrtpde3n2hbtrftabbmtptketptnjiimtktbpjdqptdthmuMvgdt\
                b3a3qd  (61 chars)
      LDUDE   : uAurftmtg2q2hbrhcbbmfcepnjiimidpjdqpmrmuMuqmb3a3qd
                (51 chars, 10 chars (16%) shorter)


    (L3) Czech:  (using Latin Extended A)
      Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
      U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
      u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
      u+0065 u+0073 u+006B u+0079

      DUDE-02 : vAuctptyctzpctptnhtyrtzfmibtjd3mt8atyitgtitc
                (45 chars)
      LDUDE   : uAukfycypkfepzpzfmibmtb3m8ayiqtik
                (34 chars, 24% shorter)