
Re: Combining characters (was: Re: [idn] hostname historyhell)



-----BEGIN PGP SIGNED MESSAGE-----

['o' means functional composition. Everything below applies equally to
NFKC and NFC, but I'll say "NFC".]

Patrik Fältström <paf@cisco.com> wrote:
> On 01-11-26 01.12 +0000 David Hopwood <david.hopwood@zetnet.co.uk> wrote:
> 
> > My preference to fix the problems with dotless-i and ypogegrammeni would
> > be NFC(fold(NFC(x))).
> >
> > (This can be optimised to NFC-compose(fold(NFD(x))), or alternatively
> > the folding can detect cases where the second NFC application is needed -
> > usually it is not.)
> 
> See section 3 of draft-hoffman-stringprep-00.txt.

That section doesn't address the problems we're talking about.
There are four problems with the interaction between case folding and
NFC that need to be considered:

1. For a string that includes a character with Greek ypogegrammeni/
   prosgegrammeni (U+0345) in its decomposition, folding only works
   correctly if the input string is normalised (i.e. either U+0345
   must be last in the combining sequence, or the character must be
   fully composed).

2. A string that includes decomposed <I><dot-above> will be case-folded
   to <i><dot-above>, which is inconsistent with the folding of composed
   <I dot-above> to <i>. Any solution to this must also handle the case
   where there are other combining characters between the <I> and
   <dot-above>, or composed with the <I>.

3. If a character X folds to a sequence of two characters Y Z, then
   Z may have precomposed forms that do not correspond to precomposed
   forms of X.

   For example, <sharp-s> U+00DF does not precompose with anything,
   but <s> U+0073 does, so <sharp-s><acute> will fold to <s><s><acute>,
   instead of the NFC-normalised form <s><s acute>.
   (In practice this will not occur in real-world names, so strings
   like <sharp-s><acute> *could* be disallowed.)

4. There are a small number of lowercase precomposed characters that do
   not have uppercase precomposed equivalents; they are listed in
   the SpecialCasing-5.txt file:

     U+0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
     U+03B0 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
     U+01F0 LATIN SMALL LETTER J WITH CARON
     U+1E96 LATIN SMALL LETTER H WITH LINE BELOW
     U+1E97 LATIN SMALL LETTER T WITH DIAERESIS
     U+1E98 LATIN SMALL LETTER W WITH RING ABOVE
     U+1E99 LATIN SMALL LETTER Y WITH RING ABOVE
     U+1F50 GREEK SMALL LETTER UPSILON WITH PSILI
     U+1F52 GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
     U+1F54 GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
     U+1F56 GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
     U+1FB6 GREEK SMALL LETTER ALPHA WITH PERISPOMENI
     U+1FC6 GREEK SMALL LETTER ETA WITH PERISPOMENI
     U+1FD2 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
     U+1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
     U+1FD6 GREEK SMALL LETTER IOTA WITH PERISPOMENI
     U+1FD7 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
     U+1FE2 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
     U+1FE3 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
     U+1FE4 GREEK SMALL LETTER RHO WITH PSILI
     U+1FE6 GREEK SMALL LETTER UPSILON WITH PERISPOMENI
     U+1FE7 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
     U+1FF6 GREEK SMALL LETTER OMEGA WITH PERISPOMENI

   Folding an NFC-normalised string that contains the decomposed
   uppercase forms of these characters will result in a string that is
   not NFC-normalised.

   (The file also lists

     U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
     U+1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING

   but U+0149 causes only problem 3 above, and U+1E9A does not cause
   any problems for normalisation.)
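
Problems 3 and 4 can be reproduced with a short Python sketch. The
assumptions here are not from the original discussion: Python's
unicodedata module supplies NFC, and str.casefold() stands in for the
case folding. Both follow current Unicode data rather than the 2001
tables, so details may differ from the drafts under discussion. The
helper names nfc, fold and report are hypothetical.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def fold(s: str) -> str:
    # Assumption: str.casefold() approximates the folding being discussed.
    return s.casefold()

def report(label: str, s: str) -> str:
    """Apply fold after NFC and say whether the result is still NFC."""
    folded = fold(nfc(s))
    points = " ".join(f"U+{ord(c):04X}" for c in folded)
    print(f"{label:18} {points:28} still NFC: {nfc(folded) == folded}")
    return folded

# Problem 3: <sharp-s><acute> is already NFC, but folds to <s><s><acute>.
p3 = report("sharp-s + acute", "\u00DF\u0301")

# Problem 4: <J><caron> is already NFC (there is no precomposed uppercase),
# but its folded form <j><caron> has the precomposed equivalent U+01F0.
p4 = report("J + caron", "J\u030C")
```

In both cases the input is NFC-normalised and the folded output is not,
which is why a second normalisation pass (or a restriction on the input)
is needed.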

Problems 1 and 2 mean that "NFC o fold" doesn't work, unless the
foldings that map ypogegrammeni/prosgegrammeni to iota and dotted-I
to i are omitted.

Problems 3 and 4 mean that "fold o NFC" doesn't work, regardless
of which foldings are used.
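
As a rough illustration of why "NFC o fold" fails for problem 1, the
sketch below feeds it two canonically equivalent spellings of the same
Greek letter. It assumes Python's unicodedata module and str.casefold()
(current Unicode data, not the 2001 tables); nfc and nfc_o_fold are
hypothetical helper names.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def nfc_o_fold(s: str) -> str:
    # "NFC o fold": fold first, then normalise.
    # Assumption: str.casefold() approximates the folding being discussed.
    return nfc(s.casefold())

def show(s: str) -> str:
    return " ".join(f"U+{ord(c):04X}" for c in s)

# Alpha with varia and ypogegrammeni, spelled two ways.
composed = "\u1FB2"               # fully composed
unordered = "\u03B1\u0345\u0300"  # U+0345 is NOT last in the combining sequence

# The two inputs are canonically equivalent ...
print("equivalent inputs:", nfc(composed) == nfc(unordered))

# ... yet folding before normalising treats them differently, because
# U+0345 is folded on its own before the sequence has been reordered.
a = nfc_o_fold(composed)
b = nfc_o_fold(unordered)
print("from composed: ", show(a))
print("from unordered:", show(b))
print("same result:", a == b)
```

Normalising before folding removes this particular inconsistency, at
the cost of running into problems 3 and 4.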

A pathological example that shows all four problems is:

  <alpha ypogegrammeni><varia><I><dot-below><dot-above><sharp-s><acute><J><caron>
  "\u1FB3\u0300\u0049\u0323\u0307\u00DF\u0301\u004A\u030C"

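To make the example string easier to read, the following sketch lists
each code point with its canonical combining class and character name.
It assumes Python's unicodedata module, which reflects current Unicode
data rather than the version in use when this was written.

```python
import unicodedata

example = "\u1FB3\u0300\u0049\u0323\u0307\u00DF\u0301\u004A\u030C"

# One line per code point: value, canonical combining class, name.
for ch in example:
    ccc = unicodedata.combining(ch)
    print(f"U+{ord(ch):04X}  ccc={ccc:3d}  {unicodedata.name(ch)}")

# The string is deliberately not NFC-normalised to begin with.
is_nfc = unicodedata.normalize("NFC", example) == example
print("already NFC:", is_nfc)
```
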
There are several ways of solving all the problems simultaneously:

 a) NFC o fold o NFC.

 b) foldafter o NFC o foldbefore, where foldafter handles only <dotted-I>
    and <ypogegrammeni> and foldbefore handles the rest.

 c) foldafter' o NFC o foldbefore', where foldbefore' handles the characters
    that only have lowercase precomposed variants, and foldafter' handles
    the rest. If a character folds to two characters, then it must not be
    immediately followed by a combining mark [*].

 d) NFC o fold', where fold' maps out <dot-above> when it applies to
    <i> or <I> or their fullwidth variants, and includes only the
    "simple" foldings for characters containing ypogegrammeni/prosgegrammeni.
    (Unlike the standard Unicode case folding algorithm and options a)-c)
    and e), this handles Lithuanian correctly.)

 e) NFC o fold'', where fold'' does not include a case folding for
    <dotted-I>, and includes only the "simple" foldings for characters
    containing ypogegrammeni/prosgegrammeni. This means that names are
    not fully case-insensitive for Turkish, Azeri or Lithuanian, but it
    is simpler than a)-d).

 f) foldascii, where encoded strings are required to already be
    NFC-normalised, and cannot contain uppercase or titlecase non-ASCII
    characters. This approach does not support case preservation for
    non-ASCII strings; it also requires that any process that destroys
    NFC normalisation must re-normalise the name. OTOH, it means that
    systems that only resolve names and do not generate them need not
    implement NFC. If I understand correctly, this is what Dan Bernstein
    is arguing for.

I'm now leaning towards d) or f), but any of these approaches would work.
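
Options a) and f) are the simplest to sketch. The Python below is only
an illustration under stated assumptions: str.casefold() stands in for
"fold", unicodedata supplies NFC, and "uppercase or titlecase" is
approximated by the general categories Lu and Lt. The function names
option_a and option_f are hypothetical. Options b)-e) need custom
folding tables and are not attempted here.

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def option_a(s: str) -> str:
    """a) NFC o fold o NFC: normalise, fold, then normalise again."""
    # Assumption: str.casefold() approximates the folding being discussed.
    return nfc(nfc(s).casefold())

def option_f(s: str) -> str:
    """f) foldascii: validate the name, then lowercase ASCII letters only."""
    if nfc(s) != s:
        raise ValueError("name is not NFC-normalised")
    for ch in s:
        # Assumption: Lu/Lt approximates "uppercase or titlecase".
        if ord(ch) > 0x7F and unicodedata.category(ch) in ("Lu", "Lt"):
            raise ValueError(f"non-ASCII uppercase/titlecase U+{ord(ch):04X}")
    return "".join(c.lower() if "A" <= c <= "Z" else c for c in s)

example = "\u1FB3\u0300\u0049\u0323\u0307\u00DF\u0301\u004A\u030C"
out = option_a(example)
print("option a result is NFC:", nfc(out) == out)
print(option_f("Example.COM"))
```

Note that option_f never calls the normaliser on the output; a resolver
that trusts its input to be valid could drop the checks entirely, which
is the point of that option.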


[*] The definition of a combining mark here is a character with canonical
    combining class > 0.
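
The definition in [*] and the restriction in option c) could be checked
roughly as follows. This is a sketch only: it assumes Python's
unicodedata.combining() for the canonical combining class and
str.casefold() as the folding, and the function names are hypothetical.

```python
import unicodedata

def is_combining_mark(ch: str) -> bool:
    # [*]: a combining mark is a character with canonical combining class > 0.
    return unicodedata.combining(ch) > 0

def violates_option_c(s: str) -> bool:
    """True if a character that folds to more than one character is
    immediately followed by a combining mark."""
    for cur, nxt in zip(s, s[1:]):
        # Assumption: str.casefold() approximates the folding being discussed.
        if len(cur.casefold()) > 1 and is_combining_mark(nxt):
            return True
    return False

print(violates_option_c("\u00DF\u0301"))  # <sharp-s><acute>
print(violates_option_c("\u00DFa"))       # <sharp-s><a>
```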

- -- 
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPAKwRTkCAxeYt5gVAQHbLwgAu/UlyPRf8DNgU5e4W2GlYj+UbueIUDhj
TW8svhJ2adJxIJSS7ymE5jbub+TPIlaxNfO84MYB/BDWNXdlXeOKSxXJSKh1r9SH
0UmI5YvS5jY470OzMseSpMlAh8ef8+Iz+SSI6sFEpgdAJBlImpYTtwMCRj1HINhH
IS6WgFUl8DYo3Ip+Rw22v8EgTq2dQ4xgyvAplmZYFiKNrcwxQmQAFpASoALyGFSO
cPYmw2JqzIG4yb3PzVfQKESmcZ17SMTfSU83l/2adPrS7SwLH7eLzv6c651rZTiC
qwnI5WTEREduOK90C2cKke6xD3PJRJD06MXrWZsBnakb2+ilWpkVGA==
=DRF1
-----END PGP SIGNATURE-----