Re: Combining characters (was: Re: [idn] hostname historyhell)
-----BEGIN PGP SIGNED MESSAGE-----
['o' means functional composition. Everything below applies equally to
NFKC and NFC, but I'll say "NFC".]
Patrik Fältström <paf@cisco.com> wrote:
> On 01-11-26 01.12 +0000 David Hopwood <david.hopwood@zetnet.co.uk> wrote:
>
> > My preference to fix the problems with dotless-i and ypogegrammeni would
> > be NFC(fold(NFC(x))).
> >
> > (This can be optimised to NFC-compose(fold(NFD(x))), or alternatively
> > the folding can detect cases where the second NFC application is needed -
> > usually it is not.)
>
> See section 3 of draft-hoffman-stringprep-00.txt.
That section doesn't address the problems we're talking about.
There are four problems with the interaction between case folding and
NFC that need to be considered:
1. For a string that includes a character with Greek ypogegrammeni/
prosgegrammeni (U+0345) in its decomposition, folding only works
correctly if the input string is normalised (i.e. either U+0345
must be last in the combining sequence, or the character must be
fully composed).
2. A string that includes decomposed <I><dot-above> will be case-folded
to <i><dot-above>, which is inconsistent with the folding of composed
<I dot-above> to <i>. Any solution to this must also handle the case
where there are other combining characters between the <I> and
<dot-above>, or composed with the <I>.
3. If a character X folds to a sequence of two characters Y Z, then
Z may have precomposed forms that do not correspond to precomposed
forms of X.
For example, <sharp-s> U+00DF does not precompose with anything,
but <s> U+0073 does, so <sharp-s><acute> will fold to <s><s><acute>,
instead of the NFC-normalised form <s><s acute>.
(In practice this will not occur in real-world names, so strings
like <sharp-s><acute> *could* be disallowed.)
4. There are a small number of lowercase precomposed characters that do
not have uppercase precomposed equivalents; they are listed in
the SpecialCasing-5.txt file:
U+0390 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
U+03B0 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
U+01F0 LATIN SMALL LETTER J WITH CARON
U+1E96 LATIN SMALL LETTER H WITH LINE BELOW
U+1E97 LATIN SMALL LETTER T WITH DIAERESIS
U+1E98 LATIN SMALL LETTER W WITH RING ABOVE
U+1E99 LATIN SMALL LETTER Y WITH RING ABOVE
U+1F50 GREEK SMALL LETTER UPSILON WITH PSILI
U+1F52 GREEK SMALL LETTER UPSILON WITH PSILI AND VARIA
U+1F54 GREEK SMALL LETTER UPSILON WITH PSILI AND OXIA
U+1F56 GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENI
U+1FB6 GREEK SMALL LETTER ALPHA WITH PERISPOMENI
U+1FC6 GREEK SMALL LETTER ETA WITH PERISPOMENI
U+1FD2 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIA
U+1FD3 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIA
U+1FD6 GREEK SMALL LETTER IOTA WITH PERISPOMENI
U+1FD7 GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENI
U+1FE2 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIA
U+1FE3 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIA
U+1FE4 GREEK SMALL LETTER RHO WITH PSILI
U+1FE6 GREEK SMALL LETTER UPSILON WITH PERISPOMENI
U+1FE7 GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENI
U+1FF6 GREEK SMALL LETTER OMEGA WITH PERISPOMENI
Folding an NFC-normalised string containing the decomposed uppercase
forms of these characters will result in a string that is not
NFC-normalised.
(The file also lists
U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
U+1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING
but U+0149 causes only problem 3 above, and U+1E9A does not cause
any problems for normalisation.)
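
Problems 1, 3 and 4 can be reproduced with a short Python sketch. It assumes that Python's str.casefold is an acceptable stand-in for "fold"; it follows the CaseFolding data bundled with the interpreter, which is newer than the tables discussed in this thread. Problem 2 is left out because the bundled data may not treat dotted-I the same way. The helper name nfc is hypothetical.

```python
import unicodedata

def nfc(s):
    # Hypothetical helper: canonical composition (NFC).
    return unicodedata.normalize("NFC", s)

# Problem 1: U+0345 is not last in the combining sequence here,
# so folding before normalising gives a different result.
unordered = "\u03B1\u0345\u0300"        # alpha, ypogegrammeni, varia
fold_first = nfc(unordered.casefold())  # fold, then NFC
nfc_first = nfc(unordered).casefold()   # NFC, then fold
print(fold_first != nfc_first)

# Problem 3: sharp-s folds to two characters, and the second one
# has a precomposed form with the following acute.
sharp = "\u00DF\u0301"
print(nfc(sharp) == sharp)              # the input is already NFC
print(nfc(sharp.casefold()) == sharp.casefold())

# Problem 4: there is no precomposed uppercase J with caron, so
# <J><caron> is NFC, but its folded form is not.
j = "J\u030C"
print(nfc(j) == j)
print(nfc(j.casefold()) == j.casefold())
```

In each case the interesting comparison is between the folded string and its NFC form: where they differ, folding has broken normalisation.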
Problems 1 and 2 mean that "NFC o fold" doesn't work, unless the
foldings that map ypogegrammeni/prosgegrammeni to iota and dotted-I
to i are omitted.
Problems 3 and 4 mean that "fold o NFC" doesn't work, regardless
of which foldings are used.
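
As a rough check of both claims, the two orderings can be run over the pathological string quoted in this message. This is only a sketch: str.casefold is assumed as a stand-in for "fold" (its tables are newer than the ones discussed here), and the names nfc, fold_then_nfc, nfc_then_fold and both_sides are hypothetical.

```python
import unicodedata

def nfc(s):
    # Hypothetical helper: canonical composition (NFC).
    return unicodedata.normalize("NFC", s)

x = "\u1FB3\u0300\u0049\u0323\u0307\u00DF\u0301\u004A\u030C"

fold_then_nfc = nfc(x.casefold())    # "NFC o fold": fold first, NFC last
nfc_then_fold = nfc(x).casefold()    # "fold o NFC": NFC first, fold last
both_sides = nfc(nfc(x).casefold())  # NFC before and after folding

def codepoints(s):
    # Render a string as space-separated hex code points for inspection.
    return " ".join(f"{ord(c):04X}" for c in s)

print("NFC o fold      :", codepoints(fold_then_nfc))
print("fold o NFC      :", codepoints(nfc_then_fold))
print("NFC o fold o NFC:", codepoints(both_sides))

# "fold o NFC" leaves a string that is no longer NFC-normalised.
print(nfc(nfc_then_fold) != nfc_then_fold)
# "NFC o fold" disagrees with normalising before the fold.
print(fold_then_nfc != both_sides)
```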
A pathological example that shows all four problems is:
<alpha ypogegrammeni><varia><I><dot-below><dot-above><sharp-s><acute><J><caron>
"\u1FB3\u0300\u0049\u0323\u0307\u00DF\u0301\u004A\u030C"
There are several ways of solving all the problems simultaneously:
a) NFC o fold o NFC.
b) foldafter o NFC o foldbefore, where foldafter handles only <dotted-I>
and <ypogegrammeni> and foldbefore handles the rest.
c) foldafter' o NFC o foldbefore', where foldbefore' handles the characters
that only have lowercase precomposed variants, and foldafter' handles
the rest. If a character folds to two characters, then it must not be
immediately followed by a combining mark [*].
d) NFC o fold', where fold' maps out <dot-above> when it applies to
<i> or <I> or their fullwidth variants, and includes only the
"simple" foldings for characters containing ypogegrammeni/prosgegrammeni.
(Unlike the standard Unicode case folding algorithm and options a)-c)
or e), this handles Lithuanian correctly.)
e) NFC o fold'', where fold'' does not include a case folding for
<dotted-I>, and includes only the "simple" foldings for characters
containing ypogegrammeni/prosgegrammeni. This means that names are
not fully case-insensitive for Turkish, Azeri or Lithuanian, but it
is simpler than a)-d).
f) foldascii, where encoded strings are required to already be
NFC-normalised, and cannot contain uppercase or titlecase non-ASCII
characters. This approach does not support case preservation for
non-ASCII strings; it also requires that any process that destroys
NFC normalisation must re-normalise the name. OTOH, it means that
systems that only resolve names and do not generate them need not
implement NFC. If I understand correctly, this is what Dan Bernstein
is arguing for.
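
The two simplest options, a) and f), might be sketched in Python as follows. Several things here are assumptions rather than part of the proposal: str.casefold stands in for "fold", "uppercase or titlecase" is read as general category Lu or Lt, and the function names fold_a and fold_f are hypothetical.

```python
import unicodedata

def nfc(s):
    # Hypothetical helper: canonical composition (NFC).
    return unicodedata.normalize("NFC", s)

def fold_a(name):
    """Option a): NFC o fold o NFC."""
    return nfc(nfc(name).casefold())

def fold_f(name):
    """Option f): fold ASCII only, rejecting names that are not
    already NFC or that contain uppercase/titlecase non-ASCII."""
    if nfc(name) != name:
        raise ValueError("name is not NFC-normalised")
    for ch in name:
        # Assumption: Lu/Lt is what "uppercase or titlecase" means.
        if ord(ch) > 0x7F and unicodedata.category(ch) in ("Lu", "Lt"):
            raise ValueError(f"uppercase non-ASCII character U+{ord(ch):04X}")
    # Only A-Z are folded; everything else passes through untouched.
    return "".join(ch.lower() if "A" <= ch <= "Z" else ch for ch in name)

print(fold_a("\u00DF\u0301") == "s\u015B")
print(fold_f("Example.COM"))
```

Note that fold_f never calls a normaliser on the success path except to validate its input, which is the property that lets resolve-only systems skip NFC if they trust the encoder.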
I'm now leaning towards d) or f), but any of these approaches would work.
[*] The definition of a combining mark here is a character with canonical
combining class > 0.
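
The restriction in c) could be checked along these lines. This is a sketch under assumptions: str.casefold stands in for the folding table, "folds to two characters" is taken as "folds to more than one character", and the function names are hypothetical. unicodedata.combining returns the canonical combining class, which matches the definition in [*].

```python
import unicodedata

def is_combining_mark(ch):
    # [*]: a combining mark is a character with canonical combining class > 0.
    return unicodedata.combining(ch) > 0

def violates_option_c(name):
    """True if a character with a multi-character folding is
    immediately followed by a combining mark."""
    for cur, nxt in zip(name, name[1:]):
        if len(cur.casefold()) > 1 and is_combining_mark(nxt):
            return True
    return False

print(violates_option_c("\u00DF\u0301"))  # <sharp-s><acute>
print(violates_option_c("stra\u00DFe"))   # sharp-s followed by a letter
```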
- --
David Hopwood <david.hopwood@zetnet.co.uk>
Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip
-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv
iQEVAwUBPAKwRTkCAxeYt5gVAQHbLwgAu/UlyPRf8DNgU5e4W2GlYj+UbueIUDhj
TW8svhJ2adJxIJSS7ymE5jbub+TPIlaxNfO84MYB/BDWNXdlXeOKSxXJSKh1r9SH
0UmI5YvS5jY470OzMseSpMlAh8ef8+Iz+SSI6sFEpgdAJBlImpYTtwMCRj1HINhH
IS6WgFUl8DYo3Ip+Rw22v8EgTq2dQ4xgyvAplmZYFiKNrcwxQmQAFpASoALyGFSO
cPYmw2JqzIG4yb3PzVfQKESmcZ17SMTfSU83l/2adPrS7SwLH7eLzv6c651rZTiC
qwnI5WTEREduOK90C2cKke6xD3PJRJD06MXrWZsBnakb2+ilWpkVGA==
=DRF1
-----END PGP SIGNATURE-----