Re: Combining characters (was: Re: [idn] hostname historyhell)
>
> David Hopwood wrote:
> > Soobok Lee wrote:
> > > Now that <I><dot-above> is downcased to <i> as an exceptional case,
> > > Then, we have an interesting question:
> > > which direction should we lowercase <I><dot-above><acute> into?
> >
> > To <i acute>. That is, the equivalence class is:
> >
> >   <I><dot-above><acute>             U+0049 U+0307 U+0301
> >   <I dot-above><acute>              U+0130 U+0301
> >   <I><acute>                        U+0049 U+0301
> >   <I acute>                         U+00CD
> >   <i><acute>                        U+0069 U+0301
> >   <i acute>                         U+00ED
> >   <dotless i><acute>                U+0131 U+0301
> >   <fullwidth I><acute>              U+FF29 U+0301
> >   <fullwidth I><dot-above><acute>   U+FF29 U+0307 U+0301
> >   <fullwidth i><acute>              U+FF49 U+0301
> >
> > and if NFKC is used, also: [snip]
> >
> > <i acute> U+00ED is the normalised representative for all of these.
> >
> > <i><dot-above><acute> is in a different equivalence class (AFAIK, no
> > language uses it, so this doesn't matter).
>
> My mistake; it is used in Lithuanian. The Lithuanian usage would argue
> for <i><dot-above><acute> being in the same equivalence class (since its
> Lithuanian uppercase form is <I acute>). So, another solution that
> should be considered is to use NFC o fold as in the current version of
> stringprep, but map out U+0307 whenever it is attached to a character
> based on 'i' or 'I'. That wouldn't cause any problems for Turkish or
> Azeri. I'll list all the options in another post.
>
This is a good demonstration that, when we deal with symbol
usage across different locales, a procedure like NFC will
overlook something. It is better to list all the input and
output characters in a table, so that it is easy to check
and easy to understand what is in each equivalence set.
However, something like NFC is still necessary as a tool
to check consistency and as a guide for building such
a table.
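
As a rough illustration of such a table check, here is a minimal
Python sketch of the option from the quoted message: normalise and
fold, but map out U+0307 whenever it is attached to a character based
on 'i' or 'I'. This is not stringprep itself. The name fold_i_dot is
hypothetical, using NFKD/NFKC instead of NFC is an assumption made so
that the fullwidth forms fold too, and mapping dotless i (U+0131) to
'i' is an assumption taken from the equivalence class listed above.

```python
import unicodedata

DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE
DOTLESS_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def fold_i_dot(s: str) -> str:
    """Hypothetical fold: lowercase, drop U+0307 after an 'i' base, recompose."""
    # Decompose so that precomposed letters (U+0130, U+00CD, fullwidth I)
    # become a base letter followed by combining marks, then lowercase.
    decomposed = unicodedata.normalize("NFKD", s).lower()
    out = []
    base = None  # most recent non-combining character seen
    for ch in decomposed:
        if ch == DOTLESS_I:
            ch = "i"  # assumption: dotless i joins the 'i' class
        if unicodedata.combining(ch) == 0:
            base = ch
        elif ch == DOT_ABOVE and base == "i":
            continue  # map out the dot above attached to an i-based letter
        out.append(ch)
    return unicodedata.normalize("NFKC", "".join(out))


# The equivalence class from the quoted message, plus the Lithuanian form.
FORMS = [
    "\u0049\u0307\u0301",  # <I><dot-above><acute>
    "\u0130\u0301",        # <I dot-above><acute>
    "\u0049\u0301",        # <I><acute>
    "\u00cd",              # <I acute>
    "\u0069\u0301",        # <i><acute>
    "\u00ed",              # <i acute>
    "\u0131\u0301",        # <dotless i><acute>
    "\uff29\u0301",        # <fullwidth I><acute>
    "\uff29\u0307\u0301",  # <fullwidth I><dot-above><acute>
    "\uff49\u0301",        # <fullwidth i><acute>
    "\u0069\u0307\u0301",  # <i><dot-above><acute> (Lithuanian)
]

if __name__ == "__main__":
    # Print an input/output table of code points for easy checking.
    for form in FORMS:
        src = " ".join(f"U+{ord(c):04X}" for c in form)
        dst = " ".join(f"U+{ord(c):04X}" for c in fold_i_dot(form))
        print(f"{src:<24} -> {dst}")
```

Under these assumptions every listed form folds to U+00ED, while a
dot above on any other base letter is left alone.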
Liana