Re: [idn] call for comments for REORDERING
----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
> >
> >1) saturations in TLD namespaces would require longer names for which
> > REORDERING is designed to give greater benefits/compression ratio.
>
> No. What James referred to is that saturation tends to fill up the
> short name slots, and thus flatten the probability distribution.
> I.e. if somebody doesn't get the name they wanted, the chance is
> that they go for something like xq.com, because it's easy to
> remember because it's short. Neither x nor q are very frequent
> letters.
Han/hangeul characters carry meanings, while Latin letters
denote phonemes. Therefore your analogy between Latin and Han domains
may not hold. Chinese registrants would rather register
digit-suffixed variants of already-taken desired domains in a saturated ML.com
than choose nonsensical, irrelevant rare Han characters.
Later I will provide some evidence that SC and TC each have only a
small subset of frequent characters. That is already clear from the
SJIS and KSC5601 Han character sets, whose size is less than 5000.
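One way such evidence could be gathered is to count the characters in a text
sample and measure what share of the text the k most frequent ones cover.
The following is only a minimal sketch: the helper name `coverage` is
hypothetical, and the ten-character string is a toy input, not real corpus data.

```python
from collections import Counter


def coverage(text, k):
    """Return the fraction of `text` covered by its k most frequent characters."""
    counts = Counter(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(k) yields (character, count) pairs, highest count first.
    return sum(n for _, n in counts.most_common(k)) / total


# Toy sample (an assumption, not measured data): 4 + 3 + 2 + 1 characters.
sample = "的的的的一一一是是了"
print(coverage(sample, 2))
```

Run on real SC and TC samples, a curve of coverage against k would show how
quickly a small set of characters accounts for most of the text.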
> >Characters moving in and out of the top 4096 are rare, and they do not
> >invalidate the most frequent 1024 and 2048 ones.
> >Moreover, TC/SC/KC characters are put side-by-side
>
> Can you explain that better? What about Japanese cases?
KC means "Kanji-specific". Already addressed. Korea shares TC with Taiwan.
KC/TC/SC are put side by side in the reordering table.
>
>
> >to avoid country-specific biases in han reordering table.
> >
> >non-CJK scripts often have a small set of basic letters, and their
> >character usage patterns are more stable than those for han/hangeul.
>
> No, many other scripts are used for many more languages, with
> quite different usage patterns. (A lot of Han usage in Japan,
> and most of it in Korea, is due to loanwords from Chinese.)
>
But even without taking Urdu into account in the
Arabic reordering, the efficiency with reordering is always better than
without it, because the lexicographic order of the un-reordered
Arabic script block can be regarded as a *RANDOM* ordering
with respect to frequency (maximum entropy).
Partial reordering (without Urdu consideration) is always better than
no reordering.
If Urdu text samples become available, my Arabic reordering table may be
improved to reflect them, though.
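
The comparison can be illustrated with a toy calculation. This is a minimal
sketch, not the actual REORDERING table or its encoding: the 16 code points,
their counts, the choice of code points 9 and 14 as letters missing from the
sample, and the cost model (one symbol for the four lowest ranks, two
otherwise) are all invented for illustration.

```python
def code_cost(rank):
    """Assumed cost model: the four lowest ranks cost 1 symbol, the rest 2."""
    return 1 if rank < 4 else 2


def expected_cost(true_counts, order):
    """Average symbols per character when order[i] is assigned rank i."""
    total = sum(true_counts.values())
    weighted = sum(true_counts[cp] * code_cost(rank)
                   for rank, cp in enumerate(order))
    return weighted / total


def frequency_order(counts, alphabet):
    """Sort code points by descending count; ties fall back to code point order."""
    return sorted(alphabet, key=lambda cp: (-counts.get(cp, 0), cp))


# Invented counts for a 16-letter block, keyed by code point offset.
TRUE_COUNTS = {0: 1, 1: 2, 2: 40, 3: 3, 4: 1, 5: 30, 6: 2, 7: 25,
               8: 1, 9: 20, 10: 3, 11: 2, 12: 10, 13: 1, 14: 8, 15: 6}
ALPHABET = sorted(TRUE_COUNTS)

# The sample used to build the table never sees letters 9 and 14
# (standing in for letters used only by a language missing from the sample).
SAMPLE_COUNTS = {cp: n for cp, n in TRUE_COUNTS.items() if cp not in (9, 14)}

unordered = expected_cost(TRUE_COUNTS, ALPHABET)
partial = expected_cost(TRUE_COUNTS, frequency_order(SAMPLE_COUNTS, ALPHABET))
full = expected_cost(TRUE_COUNTS, frequency_order(TRUE_COUNTS, ALPHABET))

print(f"unordered={unordered:.3f} partial={partial:.3f} full={full:.3f}")
```

With these made-up numbers the averages come out at about 1.70 (code point
order), 1.32 (table built from the incomplete sample) and 1.26 (table built
from the full counts). It shows one case only and is not a proof for every
possible distribution.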
Soobok Lee