Re: Combining characters (was: Re: [idn] hostname historyhell)



-----BEGIN PGP SIGNED MESSAGE-----

Soobok Lee wrote:
> Now that <I><dot-above> is downcased to <i> as an exceptional case,
> Then, we have an interesting question:
> which direction should we lowercase <I><dot-above><acute> into?

To <i acute>. That is, the equivalence class is:

  <I><dot-above><acute>                         U+0049 U+0307 U+0301
  <I dot-above><acute>                          U+0130 U+0301
  <I><acute>                                    U+0049 U+0301
  <I acute>                                     U+00CD
  <i><acute>                                    U+0069 U+0301
  <i acute>                                     U+00ED
  <dotless i><acute>                            U+0131 U+0301
  <fullwidth I><acute>                          U+FF29 U+0301
  <fullwidth I><dot-above><acute>               U+FF29 U+0307 U+0301
  <fullwidth i><acute>                          U+FF49 U+0301

and if NFKC is used, also:

  <information source><acute>                   U+2139 U+0301
  <roman numeral one><acute>                    U+2160 U+0301
  <roman numeral one><dot-above><acute>         U+2160 U+0307 U+0301
  <small roman numeral one><acute>              U+2170 U+0301
  <circled I><acute>                            U+24BE U+0301
  <circled I><dot-above><acute>                 U+24BE U+0307 U+0301
  <circled i><acute>                            U+24D8 U+0301
  <bold I><acute>                               U+1D408 U+0301
  <bold I><dot-above><acute>                    U+1D408 U+0307 U+0301
  <bold i><acute>                               U+1D422 U+0301
  <italic I><acute>                             U+1D43C U+0301
  <italic I><dot-above><acute>                  U+1D43C U+0307 U+0301
  <italic i><acute>                             U+1D456 U+0301
  <bold italic I><acute>                        U+1D470 U+0301
  <bold italic I><dot-above><acute>             U+1D470 U+0307 U+0301
  <bold italic i><acute>                        U+1D48A U+0301
  <script I><acute>                             U+2110 U+0301
  <script I><dot-above><acute>                  U+2110 U+0307 U+0301
  <script i><acute>                             U+1D4BE U+0301
  <bold script I><acute>                        U+1D4D8 U+0301
  <bold script I><dot-above><acute>             U+1D4D8 U+0307 U+0301
  <bold script i><acute>                        U+1D4F2 U+0301
  <fraktur I><acute>                            U+2111 U+0301
  <fraktur I><dot-above><acute>                 U+2111 U+0307 U+0301
  <fraktur i><acute>                            U+1D526 U+0301
  <double-struck I><acute>                      U+1D540 U+0301
  <double-struck I><dot-above><acute>           U+1D540 U+0307 U+0301
  <double-struck i><acute>                      U+1D55A U+0301
  <bold fraktur I><acute>                       U+1D574 U+0301
  <bold fraktur I><dot-above><acute>            U+1D574 U+0307 U+0301
  <bold fraktur i><acute>                       U+1D58E U+0301
  <sans-serif I><acute>                         U+1D5A8 U+0301
  <sans-serif I><dot-above><acute>              U+1D5A8 U+0307 U+0301
  <sans-serif i><acute>                         U+1D5C2 U+0301
  <sans-serif bold I><acute>                    U+1D5DC U+0301
  <sans-serif bold I><dot-above><acute>         U+1D5DC U+0307 U+0301
  <sans-serif bold i><acute>                    U+1D5F6 U+0301
  <sans-serif italic I><acute>                  U+1D610 U+0301
  <sans-serif italic I><dot-above><acute>       U+1D610 U+0307 U+0301
  <sans-serif italic i><acute>                  U+1D62A U+0301
  <sans-serif bold italic I><acute>             U+1D644 U+0301
  <sans-serif bold italic I><dot-above><acute>  U+1D644 U+0307 U+0301
  <sans-serif bold italic i><acute>             U+1D65E U+0301
  <monospace I><acute>                          U+1D678 U+0301
  <monospace I><dot-above><acute>               U+1D678 U+0307 U+0301
  <monospace i><acute>                          U+1D692 U+0301

<i acute> U+00ED is the normalised representative for all of these.

<i><dot-above><acute> is in a different equivalence class (AFAIK, no
language uses it, so this doesn't matter).

>   1.    <I dot-above><acute>     ===>     <i><acute>    :  CaseMap(NFKC(x))
>   2.    <i><dot-above><acute>                           :  NFKC(CaseMap(x))

Again, we're doing a variant of case folding, not case mapping. UTR #21
describes the difference.

Note that neither fold(NFKC(x)) nor NFKC(fold(x)) is sufficient in general:
fold(NFKC(x)) is not guaranteed to be NFC-normalised, and NFKC(fold(x)) is
not idempotent.
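
Both failure modes can be seen in a short Python sketch. It assumes that
Python's str.casefold() (full Unicode case folding, without the dotless-i
special cases discussed in this thread) is an acceptable stand-in for fold,
and it takes NFKC(fold(x)) as the composition that fails to be idempotent.
The helper names are made up for illustration.

```python
import unicodedata

def nfc(s):
    return unicodedata.normalize("NFC", s)

def nfkc(s):
    return unicodedata.normalize("NFKC", s)

def fold(s):
    # Stand-in for the folding step: plain full case folding (assumption).
    return s.casefold()

def fold_nfkc(s):
    return fold(nfkc(s))

def nfkc_fold(s):
    return nfkc(fold(s))

# 1. fold(NFKC(x)) can leave NFC: <j caron> U+01F0 folds to <j><caron>,
#    which NFC would compose back into U+01F0.
y = fold_nfkc("\u01f0")
result_is_nfc = (y == nfc(y))

# 2. NFKC(fold(x)) is not idempotent: <script I> U+2110 has no case folding,
#    so the first pass only reaches "I"; a second pass is needed to reach "i".
once = nfkc_fold("\u2110")
twice = nfkc_fold(once)
is_idempotent_here = (once == twice)

print(result_is_nfc, is_idempotent_here)
```

In both cases a further normalisation or folding pass changes the result,
which is why a single composition of the two steps is not enough.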

My preference to fix the problems with dotless-i and ypogegrammeni would
be NFC(fold(NFC(x))).

(This can be optimised to NFC-compose(fold(NFD(x))), or alternatively
the folding can detect cases where the second NFC application is needed -
usually it is not.)
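
A minimal Python sketch of NFC(fold(NFC(x))) follows. The fold() here is
an assumption, not a specified algorithm: it is str.casefold() plus two
hypothetical special cases that send <I dot-above> U+0130 and <dotless i>
U+0131 to plain <i>, as proposed in this thread. The `form` parameter is
also an addition, so that the NFKC variant can be tried with the same code.

```python
import unicodedata

# Hypothetical special cases: both dotted capital I and dotless i fold to "i".
SPECIAL = {0x0130: "i", 0x0131: "i"}

def fold(s):
    # Apply the special cases first, then ordinary full case folding.
    return s.translate(SPECIAL).casefold()

def normalise(x, form="NFC"):
    # form(fold(form(x))): the first pass composes <I><dot-above> into
    # U+0130 so that fold() sees it; the second pass recomposes whatever
    # folding left decomposed.
    inner = unicodedata.normalize(form, x)
    return unicodedata.normalize(form, fold(inner))

I_ACUTE = "\u00ed"  # <i acute>

same_class = [
    "I\u0307\u0301",   # <I><dot-above><acute>
    "\u0130\u0301",    # <I dot-above><acute>
    "I\u0301",         # <I><acute>
    "\u0131\u0301",    # <dotless i><acute>
]
print(all(normalise(s) == I_ACUTE for s in same_class))

# <i><dot-above><acute> stays in a different class.
print(normalise("i\u0307\u0301") == I_ACUTE)

# Compatibility characters only join the class under NFKC.
print(normalise("\U0001D408\u0307\u0301", "NFKC") == I_ACUTE)
```

The sketch does not attempt the NFD-based optimisation, and it ignores
sequences where another mark sits between <I> and <dot-above>.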

> Does the whole combining sequence inherit the bicameral property of
> the base character or the partial combining sequence minus the last mark ?

Combining sequences technically don't have properties; only characters have
properties. However, for most purposes a combining sequence should be treated
as having the case of its base character.
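
As a rough illustration of that convention, the sketch below reports the
case of a combining sequence by looking only at its base character. The
function name is hypothetical, and taking the first non-mark character as
the base is a simplifying assumption.

```python
import unicodedata

def sequence_case(seq):
    """Return "upper", "lower" or None, judged by the base character only."""
    for ch in seq:
        # Skip combining marks (general categories Mn, Mc, Me).
        if unicodedata.category(ch).startswith("M"):
            continue
        if ch.isupper():
            return "upper"
        if ch.islower():
            return "lower"
        return None  # base character is uncased
    return None      # no base character at all

print(sequence_case("I\u0307\u0301"))  # <I><dot-above><acute>
print(sequence_case("i\u0301"))        # <i><acute>
```

The marks themselves never change the answer; only the base does.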

- -- 
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPAGW8zkCAxeYt5gVAQFe0wf/eajQdoFWC7ptKw/l2NfUFJtyTrMBdW3Y
t1RGh+/fp5iGovIYM3XGHA0S/CZhOgufhMuDy22bsvJ/qVYRUjOYgbkPA+C1X0ER
tShXO7TjM8YPxjfWQuQjCVw0qm/qak19Nykz3HydLxM3nrW0oWkGegSB6krw/5ze
JnUj3YZfOs043P2kJCj6Ai29DIDkFMSTTBqOClZpqihGFu5eUp9tp/rcRUDFLT+s
B6wsgiFi18SDW2ut1LQlbQUQPLoU0Uy+KOxhF+ECzR8Uq/uoR2SQjsjp2DMrQg+0
DfzU0NZObLLC41QQl6R88rcR6I58rdvDwEwc6iPVDH96Op9HejD2GQ==
=WjQi
-----END PGP SIGNATURE-----