
Re: [idn] call for comments for REORDERING




----- Original Message ----- 
From: "Martin Duerst" <duerst@w3.org>
To: "Soobok Lee" <lsb@postel.co.kr>; "James Seng/Personal" <jseng@pobox.org.sg>; <idn@ops.ietf.org>
Sent: Friday, October 19, 2001 5:58 PM
Subject: Re: [idn] call for comments for REORDERING


> At 17:39 01/10/19 +0900, Soobok Lee wrote:
> 
> >----- Original Message -----
> >From: "Martin Duerst" <duerst@w3.org>
> > > >
> > > >1) saturations in TLD namespaces would require longer names for which
> > > >     REORDERING is designed to give greater benefits/compression ratio.
> > >
> > > No. What James referred to is that saturation tends to fill up the
> > > short name slots, and thus flatten the probability distribution.
> > > I.e. if somebody doesn't get the name they wanted, the chance is
> > > that they go for something like xq.com, because it's easy to
> > > remember because it's short. Neither x nor q are very frequent
> > > letters.
> >
> >Han/hangeul characters carry meanings while the Latin alphabet
> >denotes phonemes. Therefore your analogy between Latin and han domains
> >may be false. Chinese people would rather register
> >digit-added variants of already-taken desired domains in a saturated ML.com
> >than choose nonsensical, irrelevant, rare han characters.
> 
> Some really rare and irrelevant han characters may indeed never
> be chosen. But still if you want to name a company, there are
> many different possibilities, and people will look for short,
> not yet used possibilities (which still make some sense)
> rather than use longer and longer names.
> 
In most cases, they add Latin digits. CJK people would know what I am saying.


> 
> >Later, I will provide some evidence that SC and TC have only
> >a small subset of frequent characters. That is already clear from the
> >SJIS and KSC5601 han character sets, whose size is less than 5000.
> 
> Yes, this is true.
> 
> 
> > > >to avoid country-specific biases in the han reordering table.
> > > >
> > > >non-CJK scripts often have a small set of basic letters, and their
> > > >character usage patterns are more stable than those for han/hangeul.
> > >
> > > No, many other scripts are used for many more languages, with
> > > quite different usage patterns. (A lot of Han usage in Japan,
> > > and most of it in Korea, is due to loanwords from Chinese.)
> > >
> >
> >But, even without Urdu consideration in
> >arabic reordering, the efficiency of reordering is always  better than
> >without it, because the lexicographic ordering in un-reordered
> >arabic script block can be regarded as *RANDOM* ordering
> >in frequency measure (maximum entropy).
> 
> It's probably not, because most alphabets contain a few
> 'late additions'.
 
If the reordering table for a script needs modification
for added characters, that can be done in the
next version of nameprep/ACE with a new ACE prefix.


> And just using first order frequency
> to bring the most frequent characters to the front may
> not be the most efficient way for compression.
> 

Do you have a good idea that could replace the current first-order frequency reordering?
Any changes to it are welcome.

If someone devises a new ordering scheme in the future, it may replace
the current reordering scheme in the next nameprep/ACE version with
a new ACE prefix.


> 
> >Partial reordering (without Urdu consideration) is always better than
> >no reordering.
> 
> I don't deny that you may be able to squeeze out a few bits.
> But I don't think that should be the aim of this exercise.
> 
> >If Urdu text samples  are available, my arabic reordering table may be
> >improved to reflect them, though.
> 
> Which might then make it less efficient for Arabic.

Yes, but marginally.

> 
> 
> Regards,   Martin.
>