Re: [idn] new I-D: Safely Encoding of likeness information into ACE label version 0.2
- To: "Eric Brunner-Williams in Portland Maine" <brunner@nic-naa.net>
- Subject: Re: [idn] new I-D: Safely Encoding of likeness information into ACE label version 0.2
- From: "Soobok Lee" <lsb@postel.co.kr>
- Date: Tue, 31 Jul 2001 08:58:07 +0900
- Cc: <idn@ops.ietf.org>
Hi, Eric
For pure Han labels (150,000 sample domains),
the mean number of distinct code points is roughly the same as
the number of code points in the label, i.e. most of the letters are distinct.
For Latin (European) labels (44,000 sample domains),
the mean number of distinct code points is roughly HALF of
the number of code points.
We need only one look-alike encoding per distinct code point.
I guess only half or two-thirds of them are look-alike ones.
This should partly answer some of your questions.
See the ATOM# columns in the statistics below if you have time. :-)
Soobok Lee
------------------------------------------------------------------------
For .Latin
N: length of a domain label (# of code points)
FREQ: number of domains of length N
N*FREQ: total # of code points over all domains of length N
SUM OF AMCZ: sum of lengths of AMCZ labels
X: SUM OF AMCZ / (N*FREQ)
SUM OF LAMCZ: sum of lengths of LAMCZ labels
Y: SUM OF LAMCZ / (N*FREQ)
COMP: (SUM OF AMCZ - SUM OF LAMCZ) / SUM OF AMCZ * 100
ATOM#: mean # of DISTINCT ATOMS per label
| N| FREQ| N*FREQ| SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP| ATOM#
| 1| 85| 85| 169(1.99)| 169(1.99)|0.00|1.00
| 2| 1026| 2052| 4031(1.96)| 4031(1.96)|0.00|1.96
| 3| 993| 2979| 4954(1.66)| 4954(1.66)|0.00|2.84
| 4| 1730| 6920| 10440(1.51)| 10440(1.51)|0.00|3.81
| 5| 3089| 15445| 22020(1.43)| 22020(1.43)|0.00|4.75
| 6| 3760| 22560| 30529(1.35)| 30529(1.35)|0.00|5.57
| 7| 4134| 28938| 37669(1.30)| 37669(1.30)|0.00|6.32
| 8| 4405| 35240| 44532(1.26)| 44532(1.26)|0.00|7.02
| 9| 4286| 38574| 47653(1.24)| 47653(1.24)|0.00|7.73
| 10| 4027| 40270| 49106(1.22)| 49106(1.22)|0.00|8.33
| 11| 3523| 38753| 47178(1.22)| 47178(1.22)|0.00|8.89
| 12| 2777| 33324| 40538(1.22)| 40538(1.22)|0.00|9.43
| 13| 2325| 30225| 36672(1.21)| 36672(1.21)|0.00|9.96
| 14| 1928| 26992| 32752(1.21)| 32752(1.21)|0.00|10.44
| 15| 1537| 23055| 27801(1.21)| 27801(1.21)|0.00|10.86
| 16| 1276| 20416| 24422(1.20)| 24422(1.20)|0.00|11.24
| 17| 1029| 17493| 20741(1.19)| 20741(1.19)|0.00|11.68
| 18| 776| 13968| 16438(1.18)| 16438(1.18)|0.00|12.05
| 19| 588| 11172| 13057(1.17)| 13057(1.17)|0.00|12.40
| 20| 429| 8580| 9956(1.16)| 9956(1.16)|0.00|12.67
| 21| 296| 6216| 7146(1.15)| 7146(1.15)|0.00|12.93
| 22| 177| 3894| 4476(1.15)| 4476(1.15)|0.00|13.20
| 23| 116| 2668| 3032(1.14)| 3032(1.14)|0.00|13.59
| 24| 65| 1560| 1772(1.14)| 1772(1.14)|0.00|13.40
| 25| 67| 1675| 1899(1.13)| 1899(1.13)|0.00|14.16
| 26| 36| 936| 1054(1.13)| 1054(1.13)|0.00|13.53
| 27| 29| 783| 880(1.12)| 880(1.12)|0.00|15.00
| 28| 10| 280| 317(1.13)| 317(1.13)|0.00|14.60
| 29| 10| 290| 321(1.11)| 321(1.11)|0.00|18.30
| 30| 7| 210| 231(1.10)| 231(1.10)|0.00|14.14
| 31| 7| 217| 243(1.12)| 243(1.12)|0.00|19.00
| 32| 9| 288| 321(1.11)| 321(1.11)|0.00|15.67
| 33| 4| 132| 146(1.11)| 146(1.11)|0.00|18.50
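
The ATOM# column can be reproduced along the following lines. This is only a minimal sketch: it assumes an "atom" is a single code point, and the function name and the sample labels are made up for illustration (they are not the survey data or the script used for the tables).

```python
from collections import defaultdict


def mean_distinct_by_length(labels):
    """Map each label length N to the mean number of distinct code points.

    Assumption: one "atom" == one code point (Python str items are code points).
    """
    totals = defaultdict(lambda: [0, 0])  # N -> [FREQ, sum of distinct counts]
    for label in labels:
        n = len(label)                    # N: number of code points in the label
        totals[n][0] += 1                 # FREQ for this length
        totals[n][1] += len(set(label))   # distinct code points in this label
    return {n: distinct / freq for n, (freq, distinct) in sorted(totals.items())}


# Hypothetical sample labels, for illustration only.
sample = ["aa", "ab", "abc", "aba"]
print(mean_distinct_by_length(sample))
```

A value close to N means nearly every code point in a label is distinct; a value well below N means code points repeat within the label.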
For .Unihan
N: length of a domain label (# of code points)
FREQ: number of domains of length N
N*FREQ: total # of code points over all domains of length N
SUM OF AMCZ: sum of lengths of AMCZ labels
X: SUM OF AMCZ / (N*FREQ)
SUM OF LAMCZ: sum of lengths of LAMCZ labels
Y: SUM OF LAMCZ / (N*FREQ)
COMP: (SUM OF AMCZ - SUM OF LAMCZ) / SUM OF AMCZ * 100
ATOM#: mean # of DISTINCT ATOMS per label
| N| FREQ| N*FREQ| SUM OF AMCZ(X)| SUM OF LAMCZ(Y)| COMP| ATOM#
| 1| 3735| 3735| 14940(4.00)| 14940(4.00)|0.00|1.00
| 2| 42793| 85586| 322222(3.76)| 289658(3.38)|10.11|1.99
| 3| 28033| 84099| 295183(3.51)| 256250(3.05)|13.19|2.98
| 4| 54607| 218428| 740073(3.39)| 608569(2.79)|17.77|3.98
| 5| 12591| 62955| 208185(3.31)| 167839(2.67)|19.38|4.97
| 6| 7680| 46080| 149606(3.25)| 116532(2.53)|22.11|5.95
| 7| 2761| 19327| 61992(3.21)| 47043(2.43)|24.11|6.94
| 8| 1336| 10688| 33890(3.17)| 25154(2.35)|25.78|7.83
| 9| 641| 5769| 18131(3.14)| 13383(2.32)|26.19|8.86
| 10| 298| 2980| 9260(3.11)| 6821(2.29)|26.34|9.72
| 11| 137| 1507| 4712(3.13)| 3330(2.21)|29.33|10.79
| 12| 57| 684| 2114(3.09)| 1569(2.29)|25.78|11.40
| 13| 25| 325| 1008(3.10)| 760(2.34)|24.60|12.44
| 14| 6| 84| 259(3.08)| 187(2.23)|27.80|13.17
| 15| 6| 90| 272(3.02)| 199(2.21)|26.84|14.50
| 17| 1| 17| 49(2.88)| 33(1.94)|32.65|16.00
| | 154707| 542354| 1861896(3.43)| 1552267(2.86)|16.63|
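
The derived columns X, Y and COMP can be checked from the raw sums. The sketch below is an illustration, not the original script: the function name is hypothetical, and COMP is taken as the relative reduction from AMCZ to LAMCZ, which is the reading that reproduces the tabulated values. The inputs are the N=2 row of the .Unihan table.

```python
def derived_columns(n_times_freq, sum_amcz, sum_lamcz):
    """Return (X, Y, COMP) for one table row.

    X    = SUM OF AMCZ  / (N*FREQ)   encoded length per code point, AMCZ
    Y    = SUM OF LAMCZ / (N*FREQ)   encoded length per code point, LAMCZ
    COMP = percentage by which LAMCZ is shorter than AMCZ (assumed reading)
    """
    x = sum_amcz / n_times_freq
    y = sum_lamcz / n_times_freq
    comp = (sum_amcz - sum_lamcz) / sum_amcz * 100
    return x, y, comp


# N=2 row of the .Unihan table: N*FREQ=85586, AMCZ=322222, LAMCZ=289658
x, y, comp = derived_columns(85586, 322222, 289658)
print(f"X={x:.2f} Y={y:.2f} COMP={comp:.2f}")  # X=3.76 Y=3.38 COMP=10.11
```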
----- Original Message -----
From: "Eric Brunner-Williams in Portland Maine" <brunner@nic-naa.net>
To: "Soobok Lee" <lsb@postel.co.kr>
Cc: <idn@ops.ietf.org>; <brunner@nic-naa.net>
Sent: Saturday, July 28, 2001 12:51 PM
Subject: Re: [idn] new I-D: Safely Encoding of likeness information into ACE
label version 0.2
> Soobok,
>
> Your example of U+30AB and U+529B is clear, but as you mention, most
> Han/Hangeul do not have look-alike characters.
>
> Consider a hypothetical proposal to place distinct abstract character
> repertoires in disjoint blocks in some equally hypothetical standard
> coded character set, and assume that some character(s) have "look-alike"
> correspondences in multiple disjoint blocks.
>
> In particular, assume that in blocks i, j, and k, there exist code points
> for the characters A(i), A(j), and A(k), and that A(i) "looks-like" A(j)
> and A(j) "looks-like" A(k).
>
> Further assume that this property of "similarity" or "borrowing" (at
> least of the visual form) in each of these blocks is unlike the case in
> Han/Hangeul (sparse), and that most characters do have look-alikes (dense).
> Is there some point at which the "non-rarity" of "look-alikes" would make a
> scheme such as the one you've proposed cumbersome? If so, this could be
> a constraint upon future modifications of a particular CCS.
>
> Thanks for taking the time to think about this, I know it isn't what you
> had in mind when taking on the Latin/Greek and Han/Hangeul similarities.
>
> Eric
>