Re: Combining characters (was: Re: [idn] hostname historyhell)
-----BEGIN PGP SIGNED MESSAGE-----
Soobok Lee wrote:
> Now that <I><dot-above> is downcased to <i> as an exceptional case,
> Then, we have an interesting question:
> which direction should we lowercase <I><dot-above><acute> into ?
To <i acute>. That is, the equivalence class is:
<I><dot-above><acute> U+0049 U+0307 U+0301
<I dot-above><acute> U+0130 U+0301
<I><acute> U+0049 U+0301
<I acute> U+00CD
<i><acute> U+0069 U+0301
<i acute> U+00ED
<dotless i><acute> U+0131 U+0301
<fullwidth I><acute> U+FF29 U+0301
<fullwidth I><dot-above><acute> U+FF29 U+0307 U+0301
<fullwidth i><acute> U+FF49 U+0301
and if NFKC is used, also:
<information source><acute> U+2139 U+0301
<roman numeral one><acute> U+2160 U+0301
<roman numeral one><dot-above><acute> U+2160 U+0307 U+0301
<small roman numeral one><acute> U+2170 U+0301
<circled I><acute> U+24BE U+0301
<circled I><dot-above><acute> U+24BE U+0307 U+0301
<circled i><acute> U+24D8 U+0301
<bold I><acute> U+1D408 U+0301
<bold I><dot-above><acute> U+1D408 U+0307 U+0301
<bold i><acute> U+1D422 U+0301
<italic I><acute> U+1D43C U+0301
<italic I><dot-above><acute> U+1D43C U+0307 U+0301
<italic i><acute> U+1D456 U+0301
<bold italic I><acute> U+1D470 U+0301
<bold italic I><dot-above><acute> U+1D470 U+0307 U+0301
<bold italic i><acute> U+1D48A U+0301
<script I><acute> U+2110 U+0301
<script I><dot-above><acute> U+2110 U+0307 U+0301
<script i><acute> U+1D4BE U+0301
<bold script I><acute> U+1D4D8 U+0301
<bold script I><dot-above><acute> U+1D4D8 U+0307 U+0301
<bold script i><acute> U+1D4F2 U+0301
<fraktur I><acute> U+2111 U+0301
<fraktur I><dot-above><acute> U+2111 U+0307 U+0301
<fraktur i><acute> U+1D526 U+0301
<double-struck I><acute> U+1D540 U+0301
<double-struck I><dot-above><acute> U+1D540 U+0307 U+0301
<double-struck i><acute> U+1D55A U+0301
<bold fraktur I><acute> U+1D574 U+0301
<bold fraktur I><dot-above><acute> U+1D574 U+0307 U+0301
<bold fraktur i><acute> U+1D58E U+0301
<sans-serif I><acute> U+1D5A8 U+0301
<sans-serif I><dot-above><acute> U+1D5A8 U+0307 U+0301
<sans-serif i><acute> U+1D5C2 U+0301
<sans-serif bold I><acute> U+1D5DC U+0301
<sans-serif bold I><dot-above><acute> U+1D5DC U+0307 U+0301
<sans-serif bold i><acute> U+1D5F6 U+0301
<sans-serif italic I><acute> U+1D610 U+0301
<sans-serif italic I><dot-above><acute> U+1D610 U+0307 U+0301
<sans-serif italic i><acute> U+1D62A U+0301
<sans-serif bold italic I><acute> U+1D644 U+0301
<sans-serif bold italic I><dot-above><acute> U+1D644 U+0307 U+0301
<sans-serif bold italic i><acute> U+1D65E U+0301
<monospace I><acute> U+1D678 U+0301
<monospace I><dot-above><acute> U+1D678 U+0307 U+0301
<monospace i><acute> U+1D692 U+0301
<i acute> U+00ED is the normalised representative for all of these.
<i><dot-above><acute> is in a different equivalence class (AFAIK, no
language uses it, so this doesn't matter).
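
A rough way to check a few members of the class above, using Python's
unicodedata module. The function fold_variant is hypothetical: it stands in
for the folding variant under discussion by taking Python's standard full
case folding and adding the assumption that dotted capital I and dotless i
both fold to plain <i>. It is a sketch, not the actual nameprep tables.

```python
import unicodedata

def fold_variant(s):
    """Hypothetical folding (an assumption for this sketch): Unicode full
    case folding, except dotted capital I and dotless i fold to plain i."""
    s = s.replace("\u0130", "i")   # <I dot-above>  -> <i>
    s = s.replace("I\u0307", "i")  # <I><dot-above> -> <i>
    s = s.replace("\u0131", "i")   # <dotless i>    -> <i>
    return s.casefold()

def representative(s, form="NFC"):
    """Normalise, fold, then normalise again; form is 'NFC' or 'NFKC'."""
    folded = fold_variant(unicodedata.normalize(form, s))
    return unicodedata.normalize(form, folded)

I_ACUTE = "\u00ED"

# A sample of sequences from the first part of the list.
nfc_members = ["I\u0307\u0301", "\u0130\u0301", "I\u0301", "\u00CD",
               "i\u0301", "\u00ED", "\u0131\u0301"]

# A sample of sequences that rely on compatibility decompositions.
nfkc_members = ["\u2139\u0301", "\u2160\u0307\u0301", "\u24BE\u0301",
                "\U0001D408\u0307\u0301", "\u2111\u0301"]

nfc_ok = all(representative(m) == I_ACUTE for m in nfc_members)
nfkc_ok = all(representative(m, "NFKC") == I_ACUTE for m in nfkc_members)

# <i><dot-above><acute> is left alone, so it stays in its own class.
outsider = representative("i\u0307\u0301")
```

With Python's tables, the fullwidth forms only reach <i acute> in this sketch
when form="NFKC" is passed, because their decompositions are compatibility
mappings.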
> 1. <I dot-above><acute> ===> <i><acute> : CaseMap(NFKC(x))
> 2. <i><dot-above><acute> : NFKC(CaseMap(x))
Again, we're doing a variant of case folding, not case mapping. UTR #21
describes the difference.
Note that neither fold(NFKC(x)) nor NFKC(fold(x)) is sufficient in general,
because fold(NFKC(x)) is not guaranteed to be NFC-normalised, and this
function is not idempotent.
My preferred fix for the problems with dotless-i and ypogegrammeni would
be NFC(fold(NFC(x))).
(This can be optimised to NFC-compose(fold(NFD(x))), or alternatively
the folding can detect cases where the second NFC application is needed -
usually it is not.)
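
The composition orders above can be compared with a small Python sketch. It
assumes Python's built-in str.casefold() (standard Unicode full case folding)
as a stand-in for fold, so it does not include the dotless-i or ypogegrammeni
changes. The helper names nfc, nfkc, fold, nfkc_of_fold and preferred are
just labels for this example.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def nfkc(s):
    return unicodedata.normalize("NFKC", s)

def fold(s):
    # Stand-in for the folding step: standard full case folding.
    return s.casefold()

# 1. fold(NFKC(x)) can produce text that is not NFC-normalised.
x = "\u01F0"                    # LATIN SMALL LETTER J WITH CARON
y = fold(nfkc(x))               # folds to <j><combining caron>
not_normalised = nfc(y) != y    # NFC would recompose it to U+01F0

# 2. NFKC(fold(x)) is not idempotent.
def nfkc_of_fold(s):
    return nfkc(fold(s))

z = "\U0001D408"                # MATHEMATICAL BOLD CAPITAL I, no case folding
once = nfkc_of_fold(z)          # NFKC turns it into plain "I"
twice = nfkc_of_fold(once)      # a second pass folds that to "i"

# 3. The preferred order: normalise, fold, normalise again.
def preferred(s):
    return nfc(fold(nfc(s)))

stable = preferred(preferred(x)) == preferred(x)
```

For these two inputs, the outer NFC in preferred() is what repairs the
unnormalised output of the folding step. The sketch only shows specific
examples and does not prove anything about the general case.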
> Does the whole combining sequence inherit the bicameral property of
> the base character or the partial combining sequence minus the last mark ?
Combining sequences technically don't have properties; only characters have
properties. However, for most purposes a combining sequence should be treated
as having the case of its base character.
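
As an illustration of treating a sequence as having the case of its base
character, here is a minimal Python sketch. The function sequence_case is
hypothetical, and it assumes the sequence starts with its base character and
that the General Category (Lu, Ll, Lt) is an adequate approximation of case.

```python
import unicodedata

def sequence_case(seq):
    """Report the case of a combining sequence from its first (base)
    character only; any combining marks after it are ignored."""
    category = unicodedata.category(seq[0])
    return {"Lu": "upper", "Ll": "lower", "Lt": "title"}.get(category, "uncased")

upper_example = sequence_case("I\u0307\u0301")   # <I><dot-above><acute>
lower_example = sequence_case("\u0131\u0301")    # <dotless i><acute>
mark_only = sequence_case("\u0301")              # no cased base character
```

This is only an approximation: some characters that behave as cased, such as
the circled letters, have a General Category outside Lu/Ll/Lt and would be
reported as uncased here.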
- --
David Hopwood <david.hopwood@zetnet.co.uk>
Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip
-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv
iQEVAwUBPAGW8zkCAxeYt5gVAQFe0wf/eajQdoFWC7ptKw/l2NfUFJtyTrMBdW3Y
t1RGh+/fp5iGovIYM3XGHA0S/CZhOgufhMuDy22bsvJ/qVYRUjOYgbkPA+C1X0ER
tShXO7TjM8YPxjfWQuQjCVw0qm/qak19Nykz3HydLxM3nrW0oWkGegSB6krw/5ze
JnUj3YZfOs043P2kJCj6Ai29DIDkFMSTTBqOClZpqihGFu5eUp9tp/rcRUDFLT+s
B6wsgiFi18SDW2ut1LQlbQUQPLoU0Uy+KOxhF+ECzR8Uq/uoR2SQjsjp2DMrQg+0
DfzU0NZObLLC41QQl6R88rcR6I58rdvDwEwc6iPVDH96Op9HejD2GQ==
=WjQi
-----END PGP SIGNATURE-----