
Re: Combining characters (was: Re: [idn] hostname historyhell)



> 
> David Hopwood wrote:
> > Soobok Lee wrote:
> > > Now that <I><dot-above> is downcased to <i> as an exceptional case,
> > > Then, we have an interesting question:
> > > which direction should we lowercase <I><dot-above><acute> into?
> > 
> > To <i acute>. That is, the equivalence class is:
> > 
> >   <I><dot-above><acute>                         U+0049 U+0307 U+0301
> >   <I dot-above><acute>                          U+0130 U+0301
> >   <I><acute>                                    U+0049 U+0301
> >   <I acute>                                     U+00CD
> >   <i><acute>                                    U+0069 U+0301
> >   <i acute>                                     U+00ED
> >   <dotless i><acute>                            U+0131 U+0301
> >   <fullwidth I><acute>                          U+FF29 U+0301
> >   <fullwidth I><dot-above><acute>               U+FF29 U+0307 U+0301
> >   <fullwidth i><acute>                          U+FF49 U+0301
> > 
> > and if NFKC is used, also: [snip]
> > 
> > <i acute> U+00ED is the normalised representative for all of these.
> > 
> > <i><dot-above><acute> is in a different equivalence class (AFAIK, no
> > language uses it, so this doesn't matter).
> 
> My mistake; it is used in Lithuanian. The Lithuanian usage would argue
> for <i><dot-above><acute> being in the same equivalence class (since its
> Lithuanian uppercase form is <I acute>). So, another solution that
> should be considered is to use NFC o fold as in the current version of
> stringprep, but map out U+0307 whenever it is attached to a character
> based on 'i' or 'I'. That wouldn't cause any problems for Turkish or
> Azeri. I'll list all the options in another post.
> 

This is a good demonstration that, when we deal with
character usage across different locales, a procedure
like NFC alone will overlook something.  It is better to
list all the input and output characters in a table, so
that it is easy to check and easy to see what belongs to
each equivalence set.

However, something like NFC is still necessary as a tool
to check consistency and as a guide for building such
a table.
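
As a rough illustration of how such a table could be generated and
checked, here is a minimal Python sketch of the alternative David
describes (fold, map out U+0307 when it is attached to an i-based
letter, then recompose with NFC).  The function name fold_i is
hypothetical, and two choices are my assumptions rather than anything
from stringprep: it decomposes with NFKD so that the fullwidth forms
collapse too, and it folds dotless i to i because David's list puts
<dotless i><acute> in the same class.

```python
import unicodedata

DOT_ABOVE = "\u0307"   # COMBINING DOT ABOVE
DOTLESS_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def fold_i(s: str) -> str:
    """Hypothetical fold: lowercase, drop U+0307 on i-based letters, recompose."""
    out = []
    base = ""
    # NFKD splits precomposed and fullwidth letters into base + combining marks.
    for ch in unicodedata.normalize("NFKD", s):
        if unicodedata.combining(ch) == 0:
            # Combining class 0: treat this as the start of a new base character.
            base = ch.lower()
            if base == DOTLESS_I:
                base = "i"  # assumption: dotless i joins the same class
            out.append(base)
        elif ch == DOT_ABOVE and base == "i":
            # Map out the dot above attached to 'i' or 'I'.
            continue
        else:
            out.append(ch)
    return unicodedata.normalize("NFC", "".join(out))


# The ten sequences from David's list, in the same order.
SEQUENCES = [
    "\u0049\u0307\u0301",
    "\u0130\u0301",
    "\u0049\u0301",
    "\u00cd",
    "\u0069\u0301",
    "\u00ed",
    "\u0131\u0301",
    "\uff29\u0301",
    "\uff29\u0307\u0301",
    "\uff49\u0301",
]

# Print an input -> output table of code points for easy checking.
for seq in SEQUENCES:
    src = " ".join(f"U+{ord(c):04X}" for c in seq)
    dst = " ".join(f"U+{ord(c):04X}" for c in fold_i(seq))
    print(f"{src:<24} -> {dst}")
```

Under these assumptions every row maps to <i acute> U+00ED, and so does
the Lithuanian <i><dot-above><acute>.  A dot above on any other base
letter is left alone.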

Liana