
Re: [idn] new I-D: Safely Encoding of likeness information into ACE label version 0.2



Hi, Eric

For pure Han labels (150,000 sample domains),
  the mean number of distinct code points is roughly the same as
  the number of code points in the label, i.e. most characters are distinct.

For Latin (European) labels (44,000 sample domains),
  the mean number of distinct code points is roughly HALF of
  the number of code points.
  We need only one look-alike encoding per distinct code point,
  and I guess only half to two-thirds of those have look-alikes.
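
A minimal sketch of the per-label count behind these averages, assuming that
an "atom" is simply a single code point (the I-D may define atoms differently)
and using made-up example labels rather than the sample data. The function
names are hypothetical.

```python
def distinct_code_points(label: str) -> int:
    """Count the distinct code points in a label (a Python str is a
    sequence of code points, so a set removes the repeats)."""
    return len(set(label))


def mean_distinct(labels: list[str]) -> float:
    """Mean number of distinct code points per label over a sample."""
    return sum(distinct_code_points(label) for label in labels) / len(labels)


if __name__ == "__main__":
    # Illustrative labels only; they are not taken from the sample domains.
    for label in ["banana", "mississippi", "\u4e2d\u6587"]:
        print(label, len(label), distinct_code_points(label))
```

Only the distinct code points of a label need a look-alike encoding, which is
why this count matters more than the raw label length.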

This should partly answer some of your questions.

See the ATOM# columns in the statistics below if you have time. :-)

Soobok Lee
------------------------------------------------------------------------

For .Latin

N:            length of a domain label (# of code points)
FREQ:         number of domains of length N
N*FREQ:       total # of code points over all domains of length N
SUM OF AMCZ:  sum of lengths of AMCZ labels
X:            SUM OF AMCZ / (N*FREQ)
SUM OF LAMCZ: sum of lengths of LAMCZ labels
Y:            SUM OF LAMCZ / (N*FREQ)
COMP:         (SUM OF AMCZ - SUM OF LAMCZ) / SUM OF AMCZ * 100
ATOM#:        mean # of DISTINCT ATOMS per label
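
As a rough sketch of how the derived columns follow from the raw ones, the
snippet below recomputes X, Y and COMP for a single table row. It assumes X
and Y are divided by the product N*FREQ, and that COMP is the percentage by
which LAMCZ is shorter than AMCZ (which is what the positive COMP values in
the tables imply). The function name is hypothetical.

```python
def derived_columns(n: int, freq: int, sum_amcz: int, sum_lamcz: int):
    """Return (X, Y, COMP) for one row, rounded to two decimals."""
    code_points = n * freq                  # the N*FREQ column
    x = sum_amcz / code_points              # AMCZ characters per code point
    y = sum_lamcz / code_points             # LAMCZ characters per code point
    comp = (sum_amcz - sum_lamcz) / sum_amcz * 100   # % saved by LAMCZ
    return round(x, 2), round(y, 2), round(comp, 2)


if __name__ == "__main__":
    # N=2 row of the .Latin table: AMCZ and LAMCZ sums are identical.
    print(derived_columns(2, 1026, 4031, 4031))
    # N=2 row of the .Unihan table.
    print(derived_columns(2, 42793, 322222, 289658))
```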

|  N|    FREQ|    N*FREQ|  SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP| ATOM#

|  1|      85|        85|       169(1.99)|       169(1.99)|0.00|1.00
|  2|    1026|      2052|      4031(1.96)|      4031(1.96)|0.00|1.96
|  3|     993|      2979|      4954(1.66)|      4954(1.66)|0.00|2.84
|  4|    1730|      6920|     10440(1.51)|     10440(1.51)|0.00|3.81
|  5|    3089|     15445|     22020(1.43)|     22020(1.43)|0.00|4.75
|  6|    3760|     22560|     30529(1.35)|     30529(1.35)|0.00|5.57
|  7|    4134|     28938|     37669(1.30)|     37669(1.30)|0.00|6.32
|  8|    4405|     35240|     44532(1.26)|     44532(1.26)|0.00|7.02
|  9|    4286|     38574|     47653(1.24)|     47653(1.24)|0.00|7.73
| 10|    4027|     40270|     49106(1.22)|     49106(1.22)|0.00|8.33
| 11|    3523|     38753|     47178(1.22)|     47178(1.22)|0.00|8.89
| 12|    2777|     33324|     40538(1.22)|     40538(1.22)|0.00|9.43
| 13|    2325|     30225|     36672(1.21)|     36672(1.21)|0.00|9.96
| 14|    1928|     26992|     32752(1.21)|     32752(1.21)|0.00|10.44
| 15|    1537|     23055|     27801(1.21)|     27801(1.21)|0.00|10.86
| 16|    1276|     20416|     24422(1.20)|     24422(1.20)|0.00|11.24
| 17|    1029|     17493|     20741(1.19)|     20741(1.19)|0.00|11.68
| 18|     776|     13968|     16438(1.18)|     16438(1.18)|0.00|12.05
| 19|     588|     11172|     13057(1.17)|     13057(1.17)|0.00|12.40
| 20|     429|      8580|      9956(1.16)|      9956(1.16)|0.00|12.67
| 21|     296|      6216|      7146(1.15)|      7146(1.15)|0.00|12.93
| 22|     177|      3894|      4476(1.15)|      4476(1.15)|0.00|13.20
| 23|     116|      2668|      3032(1.14)|      3032(1.14)|0.00|13.59
| 24|      65|      1560|      1772(1.14)|      1772(1.14)|0.00|13.40
| 25|      67|      1675|      1899(1.13)|      1899(1.13)|0.00|14.16
| 26|      36|       936|      1054(1.13)|      1054(1.13)|0.00|13.53
| 27|      29|       783|       880(1.12)|       880(1.12)|0.00|15.00
| 28|      10|       280|       317(1.13)|       317(1.13)|0.00|14.60
| 29|      10|       290|       321(1.11)|       321(1.11)|0.00|18.30
| 30|       7|       210|       231(1.10)|       231(1.10)|0.00|14.14
| 31|       7|       217|       243(1.12)|       243(1.12)|0.00|19.00
| 32|       9|       288|       321(1.11)|       321(1.11)|0.00|15.67
| 33|       4|       132|       146(1.11)|       146(1.11)|0.00|18.50


For .Unihan

N:            length of a domain label (# of code points)
FREQ:         number of domains of length N
N*FREQ:       total # of code points over all domains of length N
SUM OF AMCZ:  sum of lengths of AMCZ labels
X:            SUM OF AMCZ / (N*FREQ)
SUM OF LAMCZ: sum of lengths of LAMCZ labels
Y:            SUM OF LAMCZ / (N*FREQ)
COMP:         (SUM OF AMCZ - SUM OF LAMCZ) / SUM OF AMCZ * 100
ATOM#:        mean # of DISTINCT ATOMS per label

|  N|    FREQ|    N*FREQ|  SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP| ATOM#

|  1|    3735|      3735|     14940(4.00)|     14940(4.00)|0.00|1.00
|  2|   42793|     85586|    322222(3.76)|    289658(3.38)|10.11|1.99
|  3|   28033|     84099|    295183(3.51)|    256250(3.05)|13.19|2.98
|  4|   54607|    218428|    740073(3.39)|    608569(2.79)|17.77|3.98
|  5|   12591|     62955|    208185(3.31)|    167839(2.67)|19.38|4.97
|  6|    7680|     46080|    149606(3.25)|    116532(2.53)|22.11|5.95
|  7|    2761|     19327|     61992(3.21)|     47043(2.43)|24.11|6.94
|  8|    1336|     10688|     33890(3.17)|     25154(2.35)|25.78|7.83
|  9|     641|      5769|     18131(3.14)|     13383(2.32)|26.19|8.86
| 10|     298|      2980|      9260(3.11)|      6821(2.29)|26.34|9.72
| 11|     137|      1507|      4712(3.13)|      3330(2.21)|29.33|10.79
| 12|      57|       684|      2114(3.09)|      1569(2.29)|25.78|11.40
| 13|      25|       325|      1008(3.10)|       760(2.34)|24.60|12.44
| 14|       6|        84|       259(3.08)|       187(2.23)|27.80|13.17
| 15|       6|        90|       272(3.02)|       199(2.21)|26.84|14.50
| 17|       1|        17|        49(2.88)|        33(1.94)|32.65|16.00

|   |  154707|    542354|   1861896(3.43)|   1552267(2.86)|16.63|
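
The totals row can be cross-checked by summing the per-length rows. The sketch
below does that for the .Unihan table, with the row values copied from the
table above; the variable and function names are hypothetical, and COMP is
taken as the percentage reduction from AMCZ to LAMCZ.

```python
# (N, FREQ, SUM OF AMCZ, SUM OF LAMCZ) for each row of the .Unihan table.
UNIHAN_ROWS = [
    (1, 3735, 14940, 14940), (2, 42793, 322222, 289658),
    (3, 28033, 295183, 256250), (4, 54607, 740073, 608569),
    (5, 12591, 208185, 167839), (6, 7680, 149606, 116532),
    (7, 2761, 61992, 47043), (8, 1336, 33890, 25154),
    (9, 641, 18131, 13383), (10, 298, 9260, 6821),
    (11, 137, 4712, 3330), (12, 57, 2114, 1569),
    (13, 25, 1008, 760), (14, 6, 259, 187),
    (15, 6, 272, 199), (17, 1, 49, 33),
]


def totals(rows):
    """Sum FREQ, N*FREQ, SUM OF AMCZ and SUM OF LAMCZ over all rows."""
    freq = sum(f for _, f, _, _ in rows)
    code_points = sum(n * f for n, f, _, _ in rows)
    amcz = sum(a for _, _, a, _ in rows)
    lamcz = sum(l for _, _, _, l in rows)
    return freq, code_points, amcz, lamcz


if __name__ == "__main__":
    freq, code_points, amcz, lamcz = totals(UNIHAN_ROWS)
    print(freq, code_points, amcz, lamcz)
    print(round(amcz / code_points, 2),          # overall X
          round(lamcz / code_points, 2),         # overall Y
          round((amcz - lamcz) / amcz * 100, 2)) # overall COMP
```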



----- Original Message -----
From: "Eric Brunner-Williams in Portland Maine" <brunner@nic-naa.net>
To: "Soobok Lee" <lsb@postel.co.kr>
Cc: <idn@ops.ietf.org>; <brunner@nic-naa.net>
Sent: Saturday, July 28, 2001 12:51 PM
Subject: Re: [idn] new I-D: Safely Encoding of likeness information into ACE
label version 0.2


> Soobok,
>
> Your example of U+30AB and U+529B is clear, but as you mention, most
> Han/Hangeul do not have look-alike characters.
>
> Consider a hypothetical proposal to place distinct abstract character
> repertoires in disjoint blocks in some equally hypothetical standard
> coded character set, and assume that some character(s) have "look-alike"
> correspondences in multiple disjoint blocks.
>
> In particular, assume that in blocks i, j, and k, there exist code points
> for the characters A(i), A(j), and A(k), and that A(i) "looks-like" A(j)
> and A(j) "looks-like" A(k).
>
> Further assume that this property of "similarity" or "borrowing" (at
> least of the visual form) in each of these blocks is unlike the case in
> Han/Hangeul (sparse), and that most characters do have look-alikes (dense).
> Is there some point at which the "non-rarity" of "look-alikes" would make a
> scheme such as the one you've proposed cumbersome? If so, this could be
> a constraint upon future modifications of a particular CCS.
>
> Thanks for taking the time to think about this, I know it isn't what you
> had in mind when taking on the Latin/Greek and Han/Hangeul similarities.
>
> Eric
>