Re: [idn] Re: ave length, best compression etc
The first 512 (or more) letters in my reordering table were
manually chosen from frequent business category names
and region names in Taiwan, China, and Japan.
They are placed so as to reflect word adjacencies.
SC/TC pairs are placed together within these 512 letters.
For the ordering of the letters that follow the first 512,
I merged two relative letter frequency distribution data sets with
an SC:TC weight allocation of 1:1 (excluding the first 512 letters).
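
The merge described above can be sketched roughly as follows. This is
only an illustration under assumptions, not the actual procedure used
for the table: the function name, the toy counts, and the choice to
normalize each data set before dropping the reserved letters are all
hypothetical.

```python
def merge_frequencies(sc_freq, tc_freq, reserved, sc_weight=0.5, tc_weight=0.5):
    """Order letters by weighted relative frequency, most frequent first.

    sc_freq / tc_freq: raw letter counts for each data set (hypothetical input).
    reserved: letters already placed by hand (the first 512), skipped here.
    """
    def normalize(freq):
        # Turn raw counts into relative frequencies that sum to 1.
        total = sum(freq.values())
        return {ch: n / total for ch, n in freq.items()}

    sc_rel = normalize(sc_freq)
    tc_rel = normalize(tc_freq)

    merged = {}
    for ch in set(sc_rel) | set(tc_rel):
        if ch in reserved:
            continue
        # 1:1 weighting by default; a letter missing from one set counts as 0 there.
        merged[ch] = sc_weight * sc_rel.get(ch, 0.0) + tc_weight * tc_rel.get(ch, 0.0)

    # Highest merged frequency first; ties broken by code point for determinism.
    return sorted(merged, key=lambda ch: (-merged[ch], ch))


# Toy data, not real measurements.
sc = {"国": 50, "中": 30, "学": 20}
tc = {"國": 50, "中": 40, "學": 10}
order = merge_frequencies(sc, tc, reserved={"学"})
```

With this toy data the letter shared by both sets ("中") gets credit
from both distributions and is ordered ahead of "国" and "國", each of
which appears in only one set.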
An SC letter and its TC counterpart (and even its Kanji counterpart)
should have been treated as one letter and placed together,
but my current reordering table does not do this, and it
may therefore favor the letters common to the SC and TC character sets
over SC-only and TC-only letters.
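
A minimal sketch of the pair-aware ordering, treating an SC letter and
its TC counterpart as one letter and placing them together, might look
like this. It is an assumption-laden illustration only: the function
name and the `sc_to_tc` mapping are hypothetical, the counts are toy
data, and the mapping is assumed to be one-to-one, which real SC/TC
data is not always.

```python
def order_with_pairs(sc_freq, tc_freq, sc_to_tc):
    """Order letters so that each SC/TC pair is scored as one letter
    and its two forms sit next to each other in the result.

    sc_to_tc: hypothetical one-to-one map from an SC letter to its TC form.
    """
    def normalize(freq):
        total = sum(freq.values())
        return {ch: n / total for ch, n in freq.items()}

    tc_to_sc = {tc: sc for sc, tc in sc_to_tc.items()}

    # Accumulate 1:1 weighted frequency per class, keyed by the TC form.
    score = {}
    for rel in (normalize(sc_freq), normalize(tc_freq)):
        for ch, p in rel.items():
            key = sc_to_tc.get(ch, ch)
            score[key] = score.get(key, 0.0) + 0.5 * p

    order = []
    for key in sorted(score, key=lambda k: (-score[k], k)):
        if key in tc_to_sc:
            order.append(tc_to_sc[key])  # SC form placed first
        order.append(key)                # then the TC (or shared) form
    return order


# Toy data, not real measurements.
sc = {"国": 50, "中": 30, "学": 20}
tc = {"國": 50, "中": 40, "學": 10}
paired = order_with_pairs(sc, tc, {"国": "國", "学": "學"})
```

In this toy case the pair "国"/"國" is scored as a single letter and
moves ahead of the shared letter "中", so shared letters no longer get
an advantage merely because they occur in both data sets.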
I am reading the CNNIC-submitted draft on SC/TC
transformation, which is a good source of TC/SC pairs
for the next version of my reordering tables.
Regards,
Soobok
----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
> > The most frequent 256 Han letters has cumulative frequency sum
> > of 58.2% and for the cases of top 512, 1024, 2048 and 4096 ones,
> > it reaches 72.8, 85.9, 95.4 and 99.4%, respectively.
> > 4096 is roughly close to the size of Simplified Chinese Character
> > sets.
>
> By the way, how do you weight simplified, traditional, Japanese,
> and Korean when they use different codepoints?
>
> Regards, Martin.
>