
Hangul and IDN (was Re: [idn] reordering strawpoll)




Regardless of reordering, there is an actual problem for Hangul which
I don't think has been addressed. Lately, with the help of a Korean
colleague, I have been looking fairly deeply into the problem of
collating (ordering) Hangul strings properly.  So even though I cannot
understand Korean, and am only beginning to be able to read the letters
(I look more at code point numbers than glyphs), I've looked quite a lot
into this. See also page 53 of The Unicode Standard 3.0, which deals with
Hangul syllables. Let me just pick an example. The number of instances
is in the thousands, but the basic problem is the same.

The precomposed Hangul syllable U+AE4C (GGA) is canonically equivalent
to <U+1101, U+1161> (GG, A), through algorithmic decomposition. That is fine
so far. <U+1101, U+1161> is in turn equivalent to <U+1100, U+1100, U+1161>
(G, G, A), but this equivalence is neither a canonical equivalence, as it
should have been, nor a compatibility equivalence. Still, the last letter
sequence represents EXACTLY the same syllable as the two earlier character
sequences, and a proper rendering engine (of which there are already some,
I'm told) would render all three sequences in the same way.
For historical reasons, though, Unicode defines neither a canonical nor a
compatibility equivalence here.  There is just an equivalence, within the
same script, in syllabic meaning and (when properly implemented) in display.
(Yes, G and GG are pronounced differently, but this is about spelling.)
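
To make the example concrete, here is a small Python sketch (not part of
the analysis above, just an illustration). It assumes that the standard
library's unicodedata module follows the Unicode normalization rules; the
names SYLLABLE, CLUSTER, BASIC and normal_forms are made up for this example.

```python
import unicodedata

SYLLABLE = "\uAE4C"            # precomposed syllable GGA
CLUSTER = "\u1101\u1161"       # cluster Jamo GG + vowel A
BASIC = "\u1100\u1100\u1161"   # basic Jamo G, G + vowel A

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def normal_forms(text):
    """Return a dict mapping each normalization form name to the result."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


if __name__ == "__main__":
    for label, text in (("syllable", SYLLABLE),
                        ("cluster", CLUSTER),
                        ("basic", BASIC)):
        for form, result in normal_forms(text).items():
            # Print code points rather than glyphs, to avoid font issues.
            codes = " ".join("U+%04X" % ord(ch) for ch in result)
            print(label, form, codes)
```

The first two spellings normalize to each other, as expected. The third
never reaches U+AE4C under any of the four forms: composition only merges
the second G with the A, leaving <U+1100, U+AC00>.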

This is something that 'nameprep' should handle, since it is unfortunately not
handled by NFKC.  The logical steps would be to 1) algorithmically decompose
Hangul syllables, and 2) map cluster Jamos to the basic letter sequences each
represents. Then either (design decision) invoke NFKC, or an NFKC augmented
to also compose "modern" cluster Jamos before the part of NFKC formation
that does algorithmic composition of Hangul syllables (the historic cluster
Jamos can (design decision) stay decomposed).  Or, indeed, do the
decomposition into basic (i.e. non-cluster) Hangul Jamo letters after
conversion to NFKC form, leaving Hangul "subnames" as sequences of
letter characters, just as for other alphabetic scripts (I don't know how this
would affect the length of ACE-encoded IDN names).  (Some thought
needs to go into how the (Halfwidth) Compatibility Hangul Letters are to be
handled. The compatibility mappings are, ahem, not fully appropriate...
The Hangul "filler" characters are also a problem, which needs to be
considered.)
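
Steps 1) and 2) could be sketched roughly as follows, in Python. This is
only an illustration under stated assumptions: the syllable decomposition
uses the arithmetic given in The Unicode Standard, but the CLUSTER_TO_BASIC
table is a hypothetical, deliberately tiny subset (the five doubled initial
consonants only), and the function names are invented for this example. A
real table would need to cover many more cluster Jamos and would be a design
decision in itself.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT          # 11172 precomposed syllables

# Hypothetical, incomplete mapping: doubled initial consonants only.
CLUSTER_TO_BASIC = {
    "\u1101": "\u1100\u1100",   # GG -> G, G
    "\u1104": "\u1103\u1103",   # DD -> D, D
    "\u1108": "\u1107\u1107",   # BB -> B, B
    "\u110A": "\u1109\u1109",   # SS -> S, S
    "\u110D": "\u110C\u110C",   # JJ -> J, J
}


def decompose_syllable(ch):
    """Step 1: algorithmically decompose one precomposed Hangul syllable."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return ch                       # not a precomposed syllable
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead + vowel + trail


def to_basic_jamo(text):
    """Steps 1 and 2: decompose syllables, then expand cluster Jamos."""
    jamos = "".join(decompose_syllable(ch) for ch in text)
    return "".join(CLUSTER_TO_BASIC.get(ch, ch) for ch in jamos)
```

With this sketch, U+AE4C, <U+1101, U+1161> and <U+1100, U+1100, U+1161>
all map to the same sequence <U+1100, U+1100, U+1161>, which is the
property asked for above. Whether to recompose afterwards is left open,
as in the text.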

          Kind regards
          /kent k