
[idn] NFC vs NFKC



-----BEGIN PGP SIGNED MESSAGE-----

David Hopwood wrote:
> ... I've been preparing a detailed rationale for using NFC in
> preference to NFKC, that considers all the categories of compatibility
> mappings. I'll post it tomorrow.

Took me a bit longer than I thought to finish, but here it is:

- -----
This post categorises and describes all of the compatibility
mappings in Unicode 3.1. The intention is to show that these
mappings are of little value for name preparation, because in
almost all cases, one or more of the following applies:

DISALLOWED: the source characters are already disallowed by
   nameprep-06, so removing the mapping has no effect.
NOT-USEFUL: the source characters are not useful in domain names.
PUNCT-SYMBOL: the source characters are punctuation or symbols,
   similar to characters that are disallowed for ASCII.
NOT-EQUIV: the source characters are not semantically equivalent
   to the target characters - i.e. folding them would be more
   confusing than not folding them (if, for the sake of argument,
   they were allowed).
LEGACY: the source characters are *only* intended to be used for
   round-trip mappings from legacy charsets.
DEPRECATED: the source characters are formally deprecated by
   Unicode 3.1.
OBSCURE: the source characters are very difficult to type, or
   produce using an input method, so not folding them is unlikely
   to cause any practical difficulty to users.
INTERNAL: the source characters are normally only intended for
   internal use within an application or rendering engine.
INCONSISTENT: a category of mappings is arbitrary and inconsistent,
   with only some of the potential mappings in that category being
   defined as compatibility equivalences.

There are a very small number of exceptions:

EXCEPTION: the mapping could be quite useful, and there is no
   reason to exclude it on any of the grounds above.

I claim that there are few enough exceptions to show that NFKC
is not the right mapping to use: *if* some of these [EXCEPTION]
mappings are wanted, they should be handled as an additional step
to NFC normalisation (similar to the mapped-out characters).
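
The "additional step" idea can be sketched roughly as follows. This is
only an illustration, not anything defined by nameprep-06: the table
name EXTRA_MAP and the function prepare() are made up here, and which
entries belong in the table is exactly the open question. It also
assumes Python's unicodedata module, which follows whatever Unicode
version the interpreter ships rather than 3.1.

```python
import unicodedata

# Hypothetical table of extra foldings, applied in addition to NFC.
# The entries are examples taken from the [EXCEPTION] cases only.
EXTRA_MAP = {
    0x2011: "-",  # NON-BREAKING HYPHEN
    0xFF0D: "-",  # FULLWIDTH HYPHEN-MINUS
}
# Fullwidth digits and Latin letters sit 0xFEE0 above their ASCII forms.
for first, last in ((0xFF10, 0xFF19), (0xFF21, 0xFF3A), (0xFF41, 0xFF5A)):
    for cp in range(first, last + 1):
        EXTRA_MAP[cp] = chr(cp - 0xFEE0)


def prepare(name: str) -> str:
    """Fold the listed exceptions, then apply NFC (not NFKC)."""
    return unicodedata.normalize("NFC", name.translate(EXTRA_MAP))
```

With this approach a circled digit such as U+2460 passes through
unchanged (to be disallowed or not by a later step), whereas NFKC would
silently fold it to "1".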


Here are the mapping categories. They're in no particular order,
and some characters appear in more than one category.
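
For anyone who wants to check the categories against the character
database, the compatibility tags can be tallied with a few lines of
Python. This is a sketch under the assumption that the interpreter's
unicodedata is an acceptable stand-in for the Unicode 3.1 data; the
counts will be larger than in 3.1 because later versions added
characters. The function name is made up for illustration.

```python
import unicodedata
from collections import Counter


def compat_tag_counts() -> Counter:
    """Count compatibility decompositions by their <tag>."""
    counts = Counter()
    for cp in range(0x110000):
        decomp = unicodedata.decomposition(chr(cp))
        # Compatibility mappings start with a tag such as "<wide>";
        # canonical mappings have no tag, and most characters have
        # no decomposition at all (empty string).
        if decomp.startswith("<"):
            counts[decomp.split()[0]] += 1
    return counts


if __name__ == "__main__":
    for tag, n in compat_tag_counts().most_common():
        print(tag, n)
```

Note that the database spells the first tag "<noBreak>".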

<nobreak>
  The source character is a non-breaking version of the target
  character:
    00A0 NO-BREAK SPACE                       [DISALLOWED]
    0F0C TIBETAN MARK DELIMITER TSHEG BSTAR   [NOT-USEFUL; see below]
    2007 FIGURE SPACE                         [DISALLOWED]
    2011 NON-BREAKING HYPHEN                  [EXCEPTION]
    202F NARROW NO-BREAK SPACE                [DISALLOWED]

  Tibetan script consists of morphemes separated by tseks (also
  transliterated as "tsheg"); see section 9.13 of the Unicode
  standard. U+0F0C is a non-breaking tsek (the character name is
  a mistake). In domain names, either an ordinary tsek (U+0F0B) or
  a hyphen should be used instead.

  It's possible that a document might use a non-breaking hyphen
  to prevent a domain name or URI being split over lines, and it
  could be useful to map it to a hyphen when cutting and pasting,
  so this is an [EXCEPTION].
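
  The targets of the <nobreak> mappings can be inspected directly. A
  small sketch, assuming Python's unicodedata (the helper name is made
  up):

```python
import unicodedata

NOBREAK = ["\u00a0", "\u0f0c", "\u2007", "\u2011", "\u202f"]


def nobreak_targets() -> dict:
    """Map each source character to its raw decomposition field."""
    return {f"U+{ord(c):04X}": unicodedata.decomposition(c) for c in NOBREAK}
```

  One detail this brings out: the target of U+2011 is U+2010 HYPHEN,
  not the ASCII U+002D HYPHEN-MINUS, so even NFKC would not yield an
  LDH hyphen here. An explicit mapping step would be needed either way.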

<super> and <sub>
  The source character is a superscripted or subscripted version of
  the target character. These can be further categorised as:

    letters (ordinal indicators, modifier letters, and 'n')
       [NOT-EQUIV]

    digits [NOT-USEFUL, OBSCURE, NOT-EQUIV]

    symbols (including superscript SM and TM)
       [NOT-USEFUL, OBSCURE, NOT-EQUIV]

    Kanbun (annotation of classical Chinese in Japanese texts)
       [NOT-USEFUL, OBSCURE, NOT-EQUIV]

  Whether the letters are useful is arguable, but if they are, they
  should not be folded (since they are definitely not semantically
  equivalent to the target character).

<fraction>
  Various legacy character sets have characters for fractions, e.g.
  1/2. [LEGACY, OBSCURE, NOT-USEFUL].

<circle> and <square>,
also <compat> 3036 CIRCLED POSTAL MARK

  Circled and squared variants. [NOT-USEFUL, NOT-EQUIV]

  (Note that the decomposition is to the uncircled/unsquared character
  on its own, without a U+20DD COMBINING ENCLOSING CIRCLE or
  U+20DE COMBINING ENCLOSING SQUARE. So the effect of this folding
  is that names with and without the circle/square are equivalent,
  despite being visually distinct.)
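
  The effect is easy to reproduce. A minimal sketch, assuming Python's
  unicodedata; the sample string is arbitrary:

```python
import unicodedata

# CIRCLED DIGIT ONE followed by CIRCLED LATIN CAPITAL LETTER A
circled = "\u2460\u24b6"


def fold(text: str, form: str) -> str:
    """Apply the named normalization form ("NFC" or "NFKC")."""
    return unicodedata.normalize(form, text)
```

  fold(circled, "NFKC") gives plain "1A" with no enclosing mark left
  behind, while fold(circled, "NFC") returns the string unchanged.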

<wide> Full-width variants
  FF01..FF0C Full-width symbols/punctuation        [DISALLOWED, PUNCT-SYMBOL]
  FF0E       FULLWIDTH FULL STOP                   [DISALLOWED]
  FF0F       FULLWIDTH SOLIDUS                     [DISALLOWED, PUNCT-SYMBOL]
  FF1A..FF20 Full-width symbols/punctuation        [DISALLOWED, PUNCT-SYMBOL]
  FF3B..FF40 Full-width symbols/punctuation        [DISALLOWED, PUNCT-SYMBOL]
  FFE0..FFE6 Full-width symbols                    [PUNCT-SYMBOL]

  FF0D       FULLWIDTH HYPHEN-MINUS                [EXCEPTION]
  FF10..FF19 FULLWIDTH DIGIT ZERO..NINE            [EXCEPTION]
  FF21..FF3A FULLWIDTH LATIN CAPITAL LETTER A..Z   [EXCEPTION]
  FF41..FF5A FULLWIDTH LATIN SMALL LETTER A..Z     [EXCEPTION]

  CJK input methods can sometimes produce full-width characters,
  and it may be useful to map these to half-width (normal) LDH
  ASCII characters.

  However, nameprep is probably not the best place to do that.
  Doing it there would mean that it is valid for full-width ASCII
  to appear in an encoded name [*]. This will display as replacement
  boxes when viewed on a system without CJK fonts. It would be far
  preferable to make sure that encoded names always use normal
  ASCII. That suggests doing this folding in name input widgets,
  and/or defining a way to tell input methods when a domain name (or
  similar identifier) is being entered.

  Note that CJK users already have to set input methods to produce
  half-width ASCII characters, in order to type existing LDH ASCII
  domain names. So although this folding may improve usability, it
  isn't essential.

  [*] I'm making the assumption that whatever IDN solution is chosen
      will allow names to be encoded transparently in at least some
      cases, i.e. it won't force ACE to be used everywhere.

<narrow>
  FF61..FF64 Half-width punctuation          [PUNCT-SYMBOL]
  FF65..FF9F Half-width Katakana             [EXCEPTION]
  FFE8..FFEE Half-width symbols              [PUNCT-SYMBOL]

  The same comments apply to half-width Katakana as to the <wide>
  mappings above.

<compat>, <narrow> Hangul Compatibility Jamo
  3131..318E Full-width Compatibility Jamo
  FFA0..FFDC Half-width Compatibility Jamo

  The normal set of Jamo encoded at 1100..11FF is conjoining, that
  is, sequences of Jamo are displayed as, and are NFC-equivalent
  to, the corresponding syllables. The Compatibility Jamo (both
  full-width and half-width) are non-conjoining, i.e. they each
  take up a character cell; that is the only reason why they were
  encoded separately.

  So, the effect of using NFKC is that a domain name could be
  displayed with Jamo in separate character cells, but would
  actually be equivalent to the corresponding name displayed as
  syllables. I can't see any reason why that would be desirable.
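
  To make that concrete, here is a sketch using one syllable (GA),
  assuming Python's unicodedata; the variable and function names are
  only for illustration:

```python
import unicodedata

conjoining = "\u1100\u1161"     # HANGUL CHOSEONG KIYEOK + JUNGSEONG A
compatibility = "\u3131\u314f"  # HANGUL LETTER KIYEOK + HANGUL LETTER A
syllable = "\uac00"             # HANGUL SYLLABLE GA


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)
```

  NFC composes the conjoining pair into the syllable and leaves the
  compatibility pair alone. NFKC first maps the compatibility Jamo to
  conjoining Jamo and then composes them, so two separately displayed
  letters end up equivalent to the single syllable.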

  Also, section 10.4 of [Unicode3.0] says, "These characters are
  provided solely for compatibility with the KS C 5601 standard."
  [LEGACY, NOT-EQUIV].

<small>
  These are all in the Small Form Variants block (FE50..FE6B).
  They were only encoded for compatibility with CNS 11643.

  Most are [DISALLOWED] because the corresponding ASCII symbol
  is disallowed; the following are not:
    FE51 SMALL IDEOGRAPHIC COMMA              [LEGACY, PUNCT-SYMBOL]
    FE58 SMALL EM DASH                        [LEGACY, PUNCT-SYMBOL]
    FE5D SMALL LEFT TORTOISE SHELL BRACKET    [LEGACY, PUNCT-SYMBOL]
    FE5E SMALL RIGHT TORTOISE SHELL BRACKET   [LEGACY, PUNCT-SYMBOL]
    FE63 SMALL HYPHEN-MINUS                   [LEGACY]

  Note that it would probably be more useful for a converter from
  CNS 11643 to map to the ordinary variants of these characters,
  anyway, rather than the small variants, which no-one uses.

<compat> Overline variants
  FE49..FE4C
  [NOT-USEFUL, PUNCT-SYMBOL]

<compat> Spaces
  (mapping is U+0020 SPACE; also <wide> 3000 IDEOGRAPHIC SPACE).
  [DISALLOWED]

<compat> Spacing marks
  (mapping starts with U+0020 SPACE)
  These are mappings from a spacing diacritical mark, to <space> +
  the corresponding combining mark. They are [DISALLOWED] because
  the <space> is disallowed.

<compat> Maps to disallowed ASCII (other than space)
  2024 ONE DOT LEADER
  2025 TWO DOT LEADER
  2026 HORIZONTAL ELLIPSIS
  203C DOUBLE EXCLAMATION MARK
  2048 QUESTION EXCLAMATION MARK
  2049 EXCLAMATION QUESTION MARK
  2474..2487 PARENTHESIZED DIGIT/NUMBER ONE..TWENTY
  2488..249B DIGIT/NUMBER ONE..TWENTY FULL STOP
  249C..24B5 PARENTHESIZED LATIN SMALL LETTER A..Z
  3200..321C PARENTHESIZED HANGUL *
  3220..3243 PARENTHESIZED IDEOGRAPH *
  FE4D DASHED LOW LINE
  FE4E CENTRELINE LOW LINE
  FE4F WAVY LOW LINE

  [DISALLOWED].

<compat> Hangzhou numerals
  3038 HANGZHOU NUMERAL TEN
  3039 HANGZHOU NUMERAL TWENTY
  303A HANGZHOU NUMERAL THIRTY

  These map to the ideographs U+5341 meaning ten (or complete or
  perfect), U+5344 meaning twenty, and U+5345 meaning thirty.
  I suspect that input methods will normally produce those
  ideographs, not the numeral characters (i.e. these characters
  are [OBSCURE]) - can anyone confirm that?

<compat> Ideographic telegraph symbols for months, hours, and days
  32C0..32CB
  3358..3370
  33E0..33FE

  These map to a decimal ASCII number, followed by the ideograph
  U+6708 (for months) or U+70B9 (for hours) or U+65E5 (for days).
  Again, I suspect that input methods will produce those sequences
  rather than the symbols. [OBSCURE].
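
  For reference, the expansions look like this. A sketch assuming
  Python's unicodedata, taking the first symbol of each range (the
  dictionary and function names are made up):

```python
import unicodedata

# First symbol in each of the three ranges above.
SYMBOLS = {
    "month": "\u32c0",  # TELEGRAPH SYMBOL FOR JANUARY
    "hour": "\u3358",   # TELEGRAPH SYMBOL FOR HOUR ZERO
    "day": "\u33e0",    # TELEGRAPH SYMBOL FOR DAY ONE
}


def expand(kind: str) -> str:
    """Return the NFKC expansion: ASCII digits plus one ideograph."""
    return unicodedata.normalize("NFKC", SYMBOLS[kind])
```

  So expand("month") is "1" followed by U+6708, expand("hour") is "0"
  followed by U+70B9, and expand("day") is "1" followed by U+65E5.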

<compat> CJK Radicals
  2E9F CJK RADICAL MOTHER
  2EF3 CJK RADICAL C-SIMPLIFIED TURTLE
  2F00..2FD5 KangXi radicals block

  See section 10.1 of [Unicode3.0] for a discussion of radicals. Their
  main uses are:
    - to categorize or collate ideographs (e.g. in an index)
    - to describe new ideographs, especially using the "ideographic
      description sequence" convention.

  The first of these isn't applicable to domain names, and nameprep
  already disallows ideographic description characters. Therefore,
  the simplest approach would be to disallow all radicals.

  Note that even if mapping from radicals to ideographs were a good
  idea, the selection of such mappings defined by NFKC is highly
  inconsistent - e.g. the following radicals from the CJK Radicals
  Supplement block correspond to unified ideographs:

    2E83 -> 4E5A  2E85 -> 4EBB  2E8E -> 5140  2E8F -> 5C23  2E90 -> 5C22
    2E92 -> 5DF3  2E96 -> 5FC4  2E98 -> 624C  2E9F -> 6BCD  2EA0 -> 6C11
    2EA1 -> 6C35  2EA3 -> 706C  2EA8 -> 72AD  2EAD -> 793B  2EAF -> 7CF9
    2EB0 -> 7E9F  2EB1 -> 7F53  2EB2 -> 7F52  2EBD -> 81FC? 2EBE -> 8279
    2EC1 -> 864E  2EC2 -> 8864  2EC3 -> 8980  2EC8 -> 8BA0  2ECC -> 8FB6
    2ED0 -> 9485  2ED1 -> 9577  2ED2 -> 9578  2ED3 -> 957F  2ED4 -> 95E8
    2ED6 -> 961D  2ED8 -> 9752  2ED9 -> 97E6  2EDB -> 98CE  2EDC -> 98DE
    2EDD -> 98DF  2EDF -> 98E0  2EE0 -> 9963  2EE2 -> 9A6C  2EE3 -> 9AA8
    2EE5 -> 9C7C  2EE6 -> 9E1F  2EEA -> 9EFE  2EEC -> 9F50  2EEE -> 9F7F
    2EF0 -> 9F99  2EF1 -> 9F9C  2EF3 -> 9F9F

  but only two of these are compatibility mappings (2E9F and 2EF3).
  [NOT-USEFUL, NOT-EQUIV, INCONSISTENT]
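
  The "only two" claim can be checked mechanically. A sketch, assuming
  Python's unicodedata (which tracks a later Unicode version than 3.1,
  though this block has not gained any mappings that I know of); the
  function name is hypothetical:

```python
import unicodedata


def radicals_with_mappings(first: int = 0x2E80, last: int = 0x2EF3) -> list:
    """Code points in the CJK Radicals Supplement that have any decomposition."""
    return [
        cp
        for cp in range(first, last + 1)
        # Unassigned code points and unmapped radicals both give "".
        if unicodedata.decomposition(chr(cp))
    ]
```

  The result is just [0x2E9F, 0x2EF3], both tagged <compat>.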

  The Yi radicals should probably also be disallowed because they
  are not useful in domain names, even though they don't have any
  compatibility mappings.

<vertical>
  Presentation forms of symbols for use in vertical (top-to-bottom)
  layout. These are all in the CJK Compatibility Forms block (FE30..FE44).
  These mappings are not useful because:

  - the corresponding left-to-right symbols are not normally
    used in domain names (most of them are brackets).
  - domain names are not normally laid out vertically (it would
    be better to use a left-to-right footnote in most cases).

  [NOT-USEFUL, PUNCT-SYMBOL]

<font>, <compat>, <initial>, <medial>, <final>, <isolated>
Presentation forms and some ligatures:

  Latin:    0132..0133, FB00..FB06
  Armenian: 0587, FB13..FB17
  Arabic:   0675..0678
  Lao:      0EDC..0EDD
  Hebrew:   FB20..FB29, FB4F
  Arabic:   <initial>, <medial>, <final>, <isolated>

  Presentation ligatures/forms are rendering variants, so these
  characters should not normally appear in external representations
  of text (they are often used internally as part of a rendering
  implementation, but that isn't relevant for nameprep).

  Note that the word "ligature" is overloaded: some ligatures behave
  like presentation forms (e.g. ff, fi, ffi, ij in Latin scripts),
  while others (e.g. oe and ae) are part of the spelling of words,
  such as "arch<ae>ology" (British English spelling). The argument
  above does not apply to the second type of ligature, but those
  don't have compatibility mappings.

  An input method/keyboard driver should never generate a ligature
  or presentation form, and lots of existing software would break
  if it did. (In general, language-specific rules are necessary to
  properly ligaturize text - e.g. in English the "ff" in "shelfful"
  should not ligaturize because the two "f"s are in different
  syllables.) Even if a user copies text containing presentation
  ligatures from a word processor, they will be decomposed on the
  clipboard, unless the word processor is completely broken in this
  respect.
  [INTERNAL, OBSCURE]
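
  The two kinds of ligature behave differently under NFKC, which can
  be seen with a short sketch (assuming Python's unicodedata; the
  helper name is made up):

```python
import unicodedata


def fold_ligatures(text: str) -> str:
    """NFKC, which expands presentation ligatures into separate letters."""
    return unicodedata.normalize("NFKC", text)


presentation = ["\ufb01", "\ufb03", "\u0133"]  # fi, ffi, ij
spelling = ["\u00e6", "\u0153"]                # ae, oe
```

  Each entry in presentation expands to its component letters, while
  the entries in spelling come back unchanged because they have no
  compatibility mapping.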

<compat> Deprecated characters
  0F77 TIBETAN VOWEL SIGN VOCALIC RR
  0F79 TIBETAN VOWEL SIGN VOCALIC LL

  The character descriptions say that "use of this character is
  strongly discouraged". (ISTR some text in the standard explaining
  why, but I can't find it now.)
  [DEPRECATED]

<compat> Combinations of spacing characters:
  013F LATIN CAPITAL LETTER L WITH MIDDLE DOT
  0140 LATIN SMALL LETTER L WITH MIDDLE DOT
  0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
  0E33 THAI CHARACTER SARA AM
  0EB3 LAO VOWEL SIGN AM
  1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING

  These are combinations of characters that were encoded as a single
  character in other standards. The only reason why they aren't
  canonical equivalences is that the decomposition is to two spacing
  characters, rather than a spacing character and a combining mark.

  These mappings could be treated as [EXCEPTION]s, although the
  combined characters are rare enough that it probably isn't worth
  the hassle to do that. I don't know whether they are produced by
  keyboard drivers.

<compat> Croatian digraphs:
  01C4..01CC
  01F1..01F3

  Chapter 7 of [Unicode3.0] says:

    Croatian Digraphs Matching Serbian Cyrillic Letters.

    Serbo-Croatian is a single language with paired alphabets: a
    Latin script (Croatian) and a Cyrillic script (Serbian). A set
    of compatibility digraph codes is provided for one-to-one
    transliteration.

  IOW, these digraphs should occur only in text that has been
  automatically transliterated from Serbian to Croatian. Normally the
  digraph would be typed as two separate characters, so there is no
  need for a nameprep mapping. [OBSCURE]
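
  For comparison, this is what the NFKC mapping would do to one of the
  digraphs. A sketch assuming Python's unicodedata; the names are only
  for illustration:

```python
import unicodedata

digraph = "\u01c6"  # LATIN SMALL LETTER DZ WITH CARON (one code point)


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)
```

  nfkc(digraph) is the two-character sequence "d" + U+017E, i.e. what
  a user would normally type, and nfc(digraph) leaves the single
  digraph character in place.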

<compat> Roman numerals
  2160..217F
  [NOT-USEFUL, OBSCURE]

<font>, <compat> Latin and Greek letter-like characters
  Most of the Letter-like Symbols block   [OBSCURE, NOT-USEFUL, NOT-EQUIV]
  00B5 MICRO SIGN                         [OBSCURE, NOT-USEFUL, NOT-EQUIV]
  20A8 RUPEE SIGN                    [INCONSISTENT, NOT-USEFUL, NOT-EQUIV]

  Various symbols that look like stylized letters, sometimes with
  mathematical meanings.

  (It's not clear why the Rupee sign should have a compatibility
  mapping to "Rs", when the same doesn't apply to other currency
  symbols - e.g. the Peseta sign does not have a compatibility
  mapping to "Pts". In any case, that doesn't really matter, since
  currency symbols are not useful in domain names.)

  Greek keyboard drivers will produce the "proper" lowercase mu
  character (U+03BC), not U+00B5.

  Note that the following are canonical equivalents, so they should
  not be disallowed (in order to satisfy the Unicode requirement of
  treating canonical equivalents identically):
    U+2126 OHM SIGN -> Omega
    U+212A KELVIN SIGN -> K
    U+212B ANGSTROM SIGN -> A with ring above

  All of the remaining Letter-like Symbols should be disallowed.
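
  The difference between the three canonical equivalents and MICRO
  SIGN shows up as soon as NFC is applied. A sketch, assuming Python's
  unicodedata (the table name is made up):

```python
import unicodedata

# Canonical (singleton) equivalents: NFC alone already folds these.
CANONICAL_SIGNS = {
    "\u2126": "\u03a9",  # OHM SIGN -> GREEK CAPITAL LETTER OMEGA
    "\u212a": "K",       # KELVIN SIGN -> LATIN CAPITAL LETTER K
    "\u212b": "\u00c5",  # ANGSTROM SIGN -> A WITH RING ABOVE
}


def nfc(text: str) -> str:
    return unicodedata.normalize("NFC", text)


def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)
```

  NFC maps each key of CANONICAL_SIGNS to its value, so they cannot be
  told apart after normalisation anyway. U+00B5 MICRO SIGN survives NFC
  and is only folded to U+03BC by NFKC.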

<font> Mathematical Alphanumeric Symbols block
  1D400..1D7FF

  These characters are for specialised use in mathematical text.
  (In fact the whole point of encoding them was that they are
  not semantically equivalent to the corresponding plain letters
  and digits - so folding them would be pointless.)

  [NOT-USEFUL, OBSCURE, NOT-EQUIV].

<compat> Greek symbols:
  03D0..03D6
  03F0..03F2
  03F4..03F5
  These are technical symbols, not normal Greek text.
  [NOT-USEFUL, OBSCURE, NOT-EQUIV]

<compat> Miscellaneous
  U+017F LATIN SMALL LETTER LONG S

  This is really a glyph variant of 's'. It is rarely used, so it
  doesn't really matter if it is not mapped to 's'. [OBSCURE].

<compat> Repeated characters
  2033 DOUBLE PRIME
  2034 TRIPLE PRIME
  2036 REVERSED DOUBLE PRIME
  2037 REVERSED TRIPLE PRIME
  203C DOUBLE EXCLAMATION MARK
  222C DOUBLE INTEGRAL
  222D TRIPLE INTEGRAL
  222F SURFACE INTEGRAL
  2230 VOLUME INTEGRAL

  [OBSCURE, PUNCT-SYMBOL].

- -- 
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBO9UXrDkCAxeYt5gVAQGspwf8DcHtURizKZj5I5mN/oE4krd7WfIXNwNj
F7KIdavHPkNrL9JUt9j1vBr8iJ7eaYaTZ0zns0l3kL9m9QUWpmCiuqyWsdRKPRJQ
w3mwDbartNV/en+OFp2qY8uHC1WAlcwZwcgS+RmSzfuSDdiYZ2gvXbySZjVNTAk1
+LjaGuoBu8bL+0YDNClWpwQha5uPUkYvw2WvKUr5+F0ASLwoMmSqnHSIlvHVX0rd
mOphfQfgo6k/4yG6YZKmp3F+8Onfs/IC2jZeorCWBMmre9uWO49Cf+WfO0C8CzOe
o17SWXla+oNqo5dasb9ewSAlehdGxMi5Lx4HDZoqshJe4Fh2P8Ea2A==
=iMh9
-----END PGP SIGNATURE-----