
[idn] Analysis : maximum length of hangul/han domains



Even in DUDE, a hangul domain label can be as long as 56 characters
(first <4*base32> + 55 <1*base32>s = 59 octets = 63 - 4 (prefix))
if we form the label only from hangul syllables in U+B000~U+B00F,
because their XOR diff values are always less than 0x10.

If we form a hangul domain label only from hangul syllables in
U+B000~U+B0FF (256 code points), their XOR diff values are always less
than 0x100, and in the worst case the maximum length is 28
(first <4*base32> + 27 <2*base32>s = 58).

If we form a hangul domain label only from hangul syllables in
U+B000~U+BFFF (4096 code points), their XOR diff values are always less
than 0x1000, and in the worst case the maximum length is 19
(first <4*base32> + 18 <3*base32>s = 58).

If we form a hangul domain label from the entire hangul code range
U+AC00~U+D7AF, their XOR diff values are sometimes 0x1000 or greater,
and in the worst case the maximum length is 14
(first <4*base32> + 13 <4*base32>s = 56).
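
The arithmetic in the four cases above can be checked with a short
Python sketch. It assumes what the figures in this message imply: a
63-octet label limit, a 4-octet prefix, 4 base32 characters for the
first code point, and one base32 character per 4-bit nibble of each
XOR diff. The function names are hypothetical and are not taken from
the DUDE draft.

```python
LABEL_LIMIT = 63      # maximum octets in a DNS label
PREFIX_LEN = 4        # ACE prefix length used in the arithmetic above
FIRST_CHAR_COST = 4   # base32 characters spent on the first code point


def diff_cost(a: int, b: int) -> int:
    """Base32 characters for the XOR diff of two code points.

    Assumption: one base32 character per 4-bit nibble, minimum one.
    """
    diff = a ^ b
    return max(1, (diff.bit_length() + 3) // 4)


def max_label_length(cost_per_diff: int) -> int:
    """Longest label when every diff costs `cost_per_diff` characters."""
    budget = LABEL_LIMIT - PREFIX_LEN - FIRST_CHAR_COST  # 55 characters
    return 1 + budget // cost_per_diff


# Widest possible diff inside each range discussed above.
ranges = [(0xB000, 0xB00F), (0xB000, 0xB0FF), (0xB000, 0xBFFF), (0xAC00, 0xD7AF)]
for low, high in ranges:
    cost = diff_cost(low, high)
    print(f"U+{low:04X}~U+{high:04X}: {cost} per diff, max {max_label_length(cost)}")
```

The loop uses the widest diff in each range (lowest XOR highest code
point), so it reproduces the per-diff costs 1, 2, 3 and 4.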

My frequency- and adjacency-based codepoint reordering is based on this
analysis.

If we map the 256 and 4096 most frequent hangul code points to
aligned blocks of 256 and 4096 code points starting at 0xB000,
we can express most typical modern hangul domains with hangul syllables
mainly from the inner 256-code block (~ 0xB0FF)
and occasionally from the outer 4096-code block (~ 0xBFFF).

In this case, the max length of a hangul domain varies between 19 and 28,
depending on how many deviations from the 256 most frequent hangul
syllables occur in the domain label.
In my study of samples of typical hangul business names,
the average percentage of deviations does not exceed 20~30%.
I guess that the average maximum length of hangul domains is around 25
for both 'reordered' DUDE and 'reordered' AMC-ACE-W.
This reordering can be implemented with code mapping tables alone, which
are much smaller than those for KC normalization and NAMEPREP.
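
As a rough check on the "around 25" guess, here is a small Python
sketch. It rests on a simplifying assumption that is mine, not the
draft's: a given percentage of XOR diffs cost 3 base32 characters
(a deviation into the outer 4096 block) and the rest cost 2. The budget
of 55 characters is 63 - 4 (prefix) - 4 (first code point). The
function name is hypothetical.

```python
def estimated_max_length(deviation_pct: int) -> int:
    """Estimated longest label when `deviation_pct` percent of diffs
    cost 3 base32 characters and the remainder cost 2 (assumed model)."""
    budget = 63 - 4 - 4  # 55 characters left after prefix and first code point
    # Average cost per diff, scaled by 100 to stay in integer arithmetic.
    avg_cost_x100 = 2 * (100 - deviation_pct) + 3 * deviation_pct
    return 1 + (budget * 100) // avg_cost_x100


for pct in (0, 20, 30, 100):
    print(pct, estimated_max_length(pct))
```

Under this model, 0% and 100% deviation give the bounds 28 and 19, and
20~30% deviation gives 26 down to 24, which is consistent with the
estimate of around 25.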

This analysis also works for the Unified Han code range, for which the
average maximum length seems to be around 22, slightly less than that
for hangul.

If you want to read more about this topic, see
 http://www.postel.co.kr/lsb-ace-01.txt
This is the new version of my draft.

Soobok Lee

----- Original Message -----
From: "Harald Tveit Alvestrand" <harald@alvestrand.no>
To: "James Seng/Personal" <James@Seng.cc>; <idn@ops.ietf.org>
Sent: Friday, July 06, 2001 8:30 PM
Subject: Re: [idn] Fw: Hello, This is KyungJae Park (KRNIC)


> this points to a very interesting property of the ACEs:
> the upper limit on length is not a fixed value, it depends on the character
> values used, even within a single script.
>
> For ams-ace-m:
>   len          maxlen enc    minlen enc  average
>     22            66           35        39.6
>
> so most of the 5 Hangul names of length 22 are acceptable, but some are not.
>
> This will be fun to explain to customers.....
>
>