
Re: [idn] call for comments for REORDERING



At 17:39 01/10/19 +0900, Soobok Lee wrote:

>----- Original Message -----
>From: "Martin Duerst" <duerst@w3.org>
> > >
> > >1) saturations in TLD namespaces would require longer names for which
> > >     REORDERING is designed to give greater benefits/compression ratio.
> >
> > No. What James referred to is that saturation tends to fill up the
> > short name slots, and thus flatten the probability distribution.
> > I.e. if somebody doesn't get the name they wanted, the chance is
> > that they go for something like xq.com, because it's easy to
> > remember because it's short. Neither x nor q are very frequent
> > letters.
>
>Han/hangeul characters carry meanings, while latin alphabets
>denote phonemes. Therefore your analogy between latin and han domains
>may be false. Chinese people would rather choose to register
>digit-added variants of already-taken desired domains in a saturated ML.com,
>instead of choosing nonsensical, irrelevant, rare han characters.

Some really rare and irrelevant han characters may indeed never
be chosen. But still, if you want to name a company, there are
many different possibilities, and people will look for short,
not-yet-used possibilities (which still make some sense)
rather than use longer and longer names.


>Later, I will provide some proof that SC and TC only have
>a small subset of frequent characters. That's already clear in
>the SJIS and KSC5601 han character sets, whose size is less than 5000.

Yes, this is true.


> > >to avoid country-specific biases in the han reordering table.
> > >
> > >non-CJK scripts often have a small set of basic alphabets, and their
> > >character usage patterns are more stable than those for han/hangeul.
> >
> > No, many other scripts are used for many more languages, with
> > quite different usage patterns. (A lot of Han usage in Japan,
> > and most of it in Korea, is due to loanwords from Chinese.)
> >
>
>But, even without Urdu consideration in
>arabic reordering, the efficiency with reordering is always better than
>without it, because the lexicographic ordering in the un-reordered
>arabic script block can be regarded as *RANDOM* ordering
>in frequency measure (maximum entropy).

It's probably not random, because most alphabets contain a few
'late additions'. And just using first-order frequency
to bring the most frequent characters to the front may
not be the most efficient way for compression.
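
To make concrete what "first-order frequency" reordering means here, a
minimal sketch in Python follows. It is not the algorithm of the REORDERING
proposal; the function names and the sample string are made up for
illustration. It ranks characters by how often they occur in a sample and
compares the mean index against plain code-point order. It looks at single
characters only, so it ignores any context between characters.

```python
from collections import Counter


def build_reorder_table(sample: str) -> dict[str, int]:
    """Map each character to its frequency rank (0 = most frequent).

    Hypothetical helper: a first-order table built from one text sample.
    """
    counts = Counter(sample)
    # Descending count; ties broken by code point so the result is stable.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ord(ch)))
    return {ch: rank for rank, ch in enumerate(ranked)}


def codepoint_table(sample: str) -> dict[str, int]:
    """Map each character to its position in plain code-point order."""
    return {ch: i for i, ch in enumerate(sorted(set(sample)))}


def mean_index(text: str, table: dict[str, int]) -> float:
    """Average table index over the text (smaller = cheaper to encode)."""
    return sum(table[ch] for ch in text) / len(text)


# Made-up sample in which the most frequent letter sorts last.
sample = "zzzzzzya"
reordered = mean_index(sample, build_reorder_table(sample))
unordered = mean_index(sample, codepoint_table(sample))
print(reordered, unordered)
```

A smaller mean index only helps if the encoding that follows spends fewer
bits on small values, and the gain depends on how far the original order
already tracks frequency.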


>Partial reordering (without Urdu consideration) is always better than
>no reordering.

I don't deny that you may be able to squeeze out a few bits.
But I don't think that should be the aim of this exercise.


>If Urdu text samples are available, my arabic reordering table may be
>improved to reflect them, though.

Which might then make it less efficient for Arabic.
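
A toy illustration of that effect, again only a sketch: the two "corpora"
below are made-up strings of Latin letters standing in for two languages
that share a script, and the helper names are hypothetical. The table built
from the combined sample gives a worse mean index on the first corpus than
the table built from that corpus alone.

```python
from collections import Counter


def build_reorder_table(sample: str) -> dict[str, int]:
    """Frequency-rank table (0 = most frequent); hypothetical helper."""
    counts = Counter(sample)
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ord(ch)))
    return {ch: rank for rank, ch in enumerate(ranked)}


def mean_index(text: str, table: dict[str, int]) -> float:
    """Average table index over the text (smaller = cheaper to encode)."""
    return sum(table[ch] for ch in text) / len(text)


# Made-up stand-ins for two languages written in the same script.
corpus_a = "aaaabbc"   # 'a' dominates
corpus_b = "ccccddb"   # 'c' dominates

table_a_only = build_reorder_table(corpus_a)
table_mixed = build_reorder_table(corpus_a + corpus_b)

cost_a_only = mean_index(corpus_a, table_a_only)
cost_mixed = mean_index(corpus_a, table_mixed)
print(cost_a_only, cost_mixed)
```

With real text the size of the effect would depend on how different the two
languages' letter frequencies actually are.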


Regards,   Martin.