[idn] NFC vs NFKC
-----BEGIN PGP SIGNED MESSAGE-----
David Hopwood wrote:
> ... I've been preparing a detailed rationale for using NFC in
> preference to NFKC, that considers all the categories of compatibility
> mappings. I'll post it tomorrow.
Took me a bit longer than I thought to finish, but here it is:
- -----
This post categorises and describes all of the compatibility
mappings in Unicode 3.1. The intention is to show that these
mappings are of little value for name preparation, because in
almost all cases, one or more of the following applies:
DISALLOWED: the source characters are already disallowed by
nameprep-06, so removing the mapping has no effect.
NOT-USEFUL: the source characters are not useful in domain names.
PUNCT-SYMBOL: the source characters are punctuation or symbols,
similar to characters that are disallowed for ASCII.
NOT-EQUIV: the source characters are not semantically equivalent
to the target characters - i.e. folding them would be more
confusing than not folding them (if, for the sake of argument,
they were allowed).
LEGACY: the source characters are *only* intended to be used for
round-trip mappings from legacy charsets.
DEPRECATED: the source characters are formally deprecated by
Unicode 3.1.
OBSCURE: the source characters are very difficult to type, or
produce using an input method, so not folding them is unlikely
to cause any practical difficulty to users.
INTERNAL: the source characters are normally only intended for
internal use within an application or rendering engine.
INCONSISTENT: a category of mappings is arbitrary and inconsistent,
with only some of the potential mappings in that category being
defined as compatibility equivalences.
There are a very small number of exceptions:
EXCEPTION: the mapping could be quite useful, and there is no
reason to exclude it on any of the grounds above.
I claim that there are few enough exceptions to show that NFKC
is not the right mapping to use: *if* some of these [EXCEPTION]
mappings are wanted, they should be handled as an additional step
to NFC normalisation (similar to the mapped-out characters).
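As a rough illustration of that last point, here is a minimal Python
sketch of "NFC plus a short list of extra folds". The EXTRA_FOLDS
table and the prepare() name are hypothetical, and the two entries
are only examples taken from the [EXCEPTION] cases discussed in this
post; this is not what nameprep-06 specifies.

```python
import unicodedata

# Hypothetical exception table: only those [EXCEPTION] mappings that a
# profile actually wants. The choice of entries here is illustrative.
EXTRA_FOLDS = {
    0x2011: "-",  # NON-BREAKING HYPHEN    -> HYPHEN-MINUS
    0xFF0D: "-",  # FULLWIDTH HYPHEN-MINUS -> HYPHEN-MINUS
}

def prepare(label: str) -> str:
    """Apply the explicit folds, then NFC (not NFKC)."""
    return unicodedata.normalize("NFC", label.translate(EXTRA_FOLDS))
```

With this, prepare("a\u2011b") gives "a-b", while a circled digit
such as U+2460 is left alone (NFKC would fold it to "1").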
Here are the mapping categories. They're in no particular order,
and some characters appear in more than one category.
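For anyone who wants to check the categories against the data, the
tags can be enumerated from the character database. A sketch using
Python's unicodedata module follows; it reflects whatever Unicode
version the interpreter ships with, not 3.1, so the counts will not
match the 3.1 repertoire exactly. The function name is mine.

```python
import sys
import unicodedata
from collections import Counter

def compat_tag_counts() -> Counter:
    """Count characters per compatibility tag (<wide>, <compat>, ...)."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Canonical decompositions have no tag; compatibility ones
        # start with a tag in angle brackets (the database spells the
        # first one "<noBreak>").
        if decomp.startswith("<"):
            counts[decomp.split(" ", 1)[0]] += 1
    return counts
```

Calling compat_tag_counts().most_common() lists the tags by size,
which gives a feel for how much of NFKC each category accounts for.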
<nobreak>
The source character is a non-breaking version of the target
character:
00A0 NO-BREAK SPACE [DISALLOWED]
0F0C TIBETAN MARK DELIMITER TSHEG BSTAR [NOT-USEFUL; see below]
2007 FIGURE SPACE [DISALLOWED]
2011 NON-BREAKING HYPHEN [EXCEPTION]
202F NARROW NO-BREAK SPACE [DISALLOWED]
Tibetan script consists of morphemes separated by tseks (also
transliterated as "tsheg"); see section 9.13 of the Unicode
standard. U+0F0C is a non-breaking tsek (the character name is
a mistake). In domain names, either an ordinary tsek (U+0F0B),
or a hyphen should be used instead.
It's possible that a document might use a non-breaking hyphen
to prevent a domain name or URI being split over lines, and it
could be useful to map it to a hyphen when cutting and pasting,
so this is an [EXCEPTION].
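One detail worth checking here: in the character database the
compatibility mapping of U+2011 targets U+2010 HYPHEN, not U+002D
HYPHEN-MINUS, so a paste-time fold to the LDH hyphen would need an
explicit rule either way. A small sketch (fold_nb_hyphen is a
hypothetical helper, and folding U+2010 as well is my assumption):

```python
import unicodedata

# Raw database entries for the two <nobreak> characters that are not
# already disallowed.
NB_HYPHEN_MAPPING = unicodedata.decomposition("\u2011")
NB_TSEK_MAPPING = unicodedata.decomposition("\u0F0C")

def fold_nb_hyphen(text: str) -> str:
    """Hypothetical paste-time fold of U+2011 and U+2010 to '-'."""
    return text.replace("\u2011", "-").replace("\u2010", "-")
```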
<super> and <sub>
The source character is a superscripted or subscripted version of
the target character. These can be further categorised as:
letters (ordinal indicators, modifier letters, and 'n')
[NOT-EQUIV]
digits [NOT-USEFUL, OBSCURE, NOT-EQUIV]
symbols (including superscript SM and TM)
[NOT-USEFUL, OBSCURE, NOT-EQUIV]
Kanbun (annotation of classical Chinese in Japanese texts)
[NOT-USEFUL, OBSCURE, NOT-EQUIV]
Whether the letters are useful is arguable, but if they are, they
should not be folded (since they are definitely not semantically
equivalent to the target character).
<fraction>
Various legacy character sets have characters for fractions, e.g.
1/2, etc. [LEGACY, OBSCURE, NOT-USEFUL].
<circle> and <square>,
also <compat> 3036 CIRCLED POSTAL MARK
Circled and squared variants. [NOT-USEFUL, NOT-EQUIV]
(Note that the decomposition is to the uncircled/unsquared character
on its own, without a U+20DD COMBINING ENCLOSING CIRCLE or
U+20DE COMBINING ENCLOSING SQUARE. So the effect of this folding
is that names with and without the circle/square are equivalent,
despite being visually distinct.)
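The effect described in the note above can be seen directly. A short
sketch, assuming Python's unicodedata (the label "shop" is just an
example):

```python
import unicodedata

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

circled = "shop\u2460"  # "shop" followed by CIRCLED DIGIT ONE
plain = "shop1"

# Under NFKC the circle is dropped entirely, so the labels collide;
# under NFC they stay distinct.
same_under_nfkc = nfkc(circled) == nfkc(plain)
same_under_nfc = nfc(circled) == nfc(plain)

# The <compat> case: CIRCLED POSTAL MARK folds to the bare POSTAL MARK.
postal = nfkc("\u3036")
```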
<wide> Full-width variants
FF01..FF0C Full-width symbols/punctuation [DISALLOWED, PUNCT-SYMBOL]
FF0E FULLWIDTH FULL STOP [DISALLOWED]
FF0F FULLWIDTH SOLIDUS [DISALLOWED, PUNCT-SYMBOL]
FF1A..FF20 Full-width symbols/punctuation [DISALLOWED, PUNCT-SYMBOL]
FF3B..FF40 Full-width symbols/punctuation [DISALLOWED, PUNCT-SYMBOL]
FFE0..FFE6 Full-width symbols [PUNCT-SYMBOL]
FF0D FULLWIDTH HYPHEN-MINUS [EXCEPTION]
FF10..FF19 FULLWIDTH DIGIT ZERO..NINE [EXCEPTION]
FF21..FF3A FULLWIDTH LATIN CAPITAL LETTER A..Z [EXCEPTION]
FF41..FF5A FULLWIDTH LATIN SMALL LETTER A..Z [EXCEPTION]
CJK input methods can sometimes produce full-width characters,
and it may be useful to map these to half-width (normal) LDH
ASCII characters.
However, nameprep is probably not the best place to do that.
Doing it there would mean that it is valid for full-width ASCII
to appear in an encoded name [*]. This will display as replacement
boxes when viewed on a system without CJK fonts. It would be far
preferable to make sure that encoded names always use normal
ASCII. That suggests doing this folding in name input widgets,
and/or defining a way to tell input methods when a domain name (or
similar identifier) is being entered.
Note that CJK users already have to set input methods to produce
half-width ASCII characters, in order to type existing LDH ASCII
domain names. So although this folding may improve usability, it
isn't essential.
[*] I'm making the assumption that whatever IDN solution is chosen
will allow names to be encoded transparently in at least some
cases, i.e. it won't force ACE to be used everywhere.
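To make the "fold in the input widget" suggestion concrete, here is
a minimal sketch of such a fold, restricted to the full-width LDH
characters marked [EXCEPTION] above. The function name is
hypothetical; the only fact relied on is that the full-width forms
sit at a fixed offset of 0xFEE0 from their ASCII counterparts.

```python
FULLWIDTH_OFFSET = 0xFEE0  # U+FF21 - U+0041, and so on

def fold_fullwidth_ldh(text: str) -> str:
    """Fold full-width letters, digits and hyphen-minus to ASCII.

    Everything else (including other full-width punctuation) is left
    untouched, so it can still be rejected later.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if (cp == 0xFF0D                 # FULLWIDTH HYPHEN-MINUS
                or 0xFF10 <= cp <= 0xFF19    # digits
                or 0xFF21 <= cp <= 0xFF3A    # capital letters
                or 0xFF41 <= cp <= 0xFF5A):  # small letters
            out.append(chr(cp - FULLWIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)
```

A widget would call this on each keystroke or on paste, so that the
name it hands on never contains full-width ASCII in the first place.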
<narrow>
FF61..FF64 Half-width punctuation [PUNCT-SYMBOL]
FF65..FF9F Half-width Katakana [EXCEPTION]
FFE8..FFEE Half-width symbols [PUNCT-SYMBOL]
The same comments apply to half-width Katakana as to the <wide>
mappings above.
<compat>, <narrow> Hangul Compatibility Jamo
3131..318E Full-width Compatibility Jamo
FFA0..FFDC Half-width Compatibility Jamo
The normal set of Jamo encoded at 1100..11FF is conjoining, that
is, sequences of Jamo are displayed as, and are NFC-equivalent
to, the corresponding syllables. The Compatibility Jamo (both
full-width and half-width) are non-conjoining, i.e. they each
take up a character cell; that is the only reason why they were
encoded separately.
So, the effect of using NFKC is that a domain name could be
displayed with Jamo in separate character cells, but would
actually be equivalent to the corresponding name displayed as
syllables. I can't see any reason why that would be desirable.
Also, section 10.4 of [Unicode3.0] says, "These characters are
provided solely for compatibility with the KS C 5601 standard."
[LEGACY, NOT-EQUIV].
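A quick way to see the Jamo behaviour, sketched with Python's
unicodedata (variable names are mine):

```python
import unicodedata

compat = "\u3131\u314F"      # HANGUL LETTER KIYEOK, HANGUL LETTER A
halfwidth = "\uFFA1\uFFC2"   # the half-width versions of the same two
conjoining = "\u1100\u1161"  # CHOSEONG KIYEOK, JUNGSEONG A
syllable = "\uAC00"          # HANGUL SYLLABLE GA

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

# NFC composes only the conjoining Jamo; the compatibility Jamo stay
# as separate cells. NFKC folds all three spellings to the syllable.
results = {
    "nfc(conjoining)": nfc(conjoining),
    "nfc(compat)": nfc(compat),
    "nfkc(compat)": nfkc(compat),
    "nfkc(halfwidth)": nfkc(halfwidth),
}
```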
<small>
These are all in the Small Form Variants block (FE50..FE6B).
They were only encoded for compatibility with CNS 11643.
Most are [DISALLOWED] because the corresponding ASCII symbol
is disallowed; the following are not:
FE51 SMALL IDEOGRAPHIC COMMA [LEGACY, PUNCT-SYMBOL]
FE58 SMALL EM DASH [LEGACY, PUNCT-SYMBOL]
FE5D SMALL LEFT TORTOISE SHELL BRACKET [LEGACY, PUNCT-SYMBOL]
FE5E SMALL RIGHT TORTOISE SHELL BRACKET [LEGACY, PUNCT-SYMBOL]
FE63 SMALL HYPHEN-MINUS [LEGACY]
Note that it would probably be more useful for a converter from
CNS 11643 to map to the ordinary variants of these characters,
anyway, rather than the small variants, which no-one uses.
<compat> Overline variants
FE49..FE4C
[NOT-USEFUL, PUNCT-SYMBOL]
<compat> Spaces
(mapping is U+0020 SPACE; also <wide> 3000 IDEOGRAPHIC SPACE).
[DISALLOWED]
<compat> Spacing marks
(mapping starts with U+0020 SPACE)
These are mappings from a spacing diacritical mark, to <space> +
the corresponding combining mark. They are [DISALLOWED] because
the <space> is disallowed.
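For example, the database entries for a few of the Latin-1 spacing
marks can be inspected as below. This is a sketch; the list of four
characters is just a sample I picked, not the full category.

```python
import unicodedata

# DIAERESIS, MACRON, ACUTE ACCENT, CEDILLA
SAMPLE_SPACING_MARKS = ["\u00A8", "\u00AF", "\u00B4", "\u00B8"]

def mapping_table(chars):
    """Return {character name: raw decomposition field}."""
    return {unicodedata.name(ch): unicodedata.decomposition(ch)
            for ch in chars}

def nfkc_starts_with_space(ch: str) -> bool:
    # True when NFKC would introduce a (disallowed) U+0020.
    return unicodedata.normalize("NFKC", ch).startswith(" ")
```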
<compat> Maps to disallowed ASCII (other than space)
2024 ONE DOT LEADER
2025 TWO DOT LEADER
2026 HORIZONTAL ELLIPSIS
203C DOUBLE EXCLAMATION MARK
2048 QUESTION EXCLAMATION MARK
2049 EXCLAMATION QUESTION MARK
2474..2487 PARENTHESIZED DIGIT/NUMBER ONE..TWENTY
2488..249B DIGIT/NUMBER ONE..TWENTY FULL STOP
249C..24B5 PARENTHESIZED LATIN SMALL LETTER A..Z
3200..321C PARENTHESIZED HANGUL *
3220..3243 PARENTHESIZED IDEOGRAPH *
FE4D DASHED LOW LINE
FE4E CENTRELINE LOW LINE
FE4F WAVY LOW LINE
[DISALLOWED].
<compat> Hangzhou numerals
3038 HANGZHOU NUMERAL TEN
3039 HANGZHOU NUMERAL TWENTY
303A HANGZHOU NUMERAL THIRTY
These map to the ideographs U+5341 meaning ten (or complete or
perfect), U+5344 meaning twenty, and U+5345 meaning thirty.
I suspect that input methods will normally produce those
ideographs, not the numeral characters (i.e. these characters
are [OBSCURE]) - can anyone confirm that?
<compat> Ideographic telegraph symbols for months, hours, and days
32C0..32CB
3358..3370
33E0..33FE
These map to a decimal ASCII number, followed by the ideograph
U+6708 (for months) or U+70B9 (for hours) or U+65E5 (for days).
Again, I suspect that input methods will produce those sequences
rather than the symbols. [OBSCURE].
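To show what these two groups of mappings actually produce, here is
a short sketch using Python's unicodedata (one sample character from
each range; the dictionary name is mine):

```python
import unicodedata

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

SAMPLES = {
    "\u3038": "HANGZHOU NUMERAL TEN",
    "\u32C0": "TELEGRAPH SYMBOL FOR JANUARY",
    "\u3358": "TELEGRAPH SYMBOL FOR HOUR ZERO",
    "\u33E0": "TELEGRAPH SYMBOL FOR DAY ONE",
}

# Map each sample to the code points of its NFKC form,
# e.g. "0031 6708" for the January symbol.
folded = {
    label: " ".join(f"{ord(c):04X}" for c in nfkc(ch))
    for ch, label in SAMPLES.items()
}
```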
<compat> CJK Radicals
2E9F CJK RADICAL MOTHER
2EF3 CJK RADICAL C-SIMPLIFIED TURTLE
2F00..2FD5 KangXi radicals block
See section 10.1 of [Unicode3.0] for a discussion of radicals. Their
main uses are:
- to categorize or collate ideographs (e.g. in an index)
- to describe new ideographs, especially using the "ideographic
description sequence" convention.
The first of these isn't applicable to domain names, and nameprep
already disallows ideographic description characters. Therefore,
the simplest approach would be to disallow all radicals.
Note that even if mapping from radicals to ideographs were a good
idea, the selection of such mappings defined by NFKC is highly
inconsistent - e.g. the following radicals from the CJK Radicals
Supplement block correspond to unified ideographs:
2E83 -> 4E5A 2E85 -> 4EBB 2E8E -> 5140 2E8F -> 5C23 2E90 -> 5C22
2E92 -> 5DF3 2E96 -> 5FC4 2E98 -> 624C 2E9F -> 6BCD 2EA0 -> 6C11
2EA1 -> 6C35 2EA3 -> 706C 2EA8 -> 72AD 2EAD -> 793B 2EAF -> 7CF9
2EB0 -> 7E9F 2EB1 -> 7F53 2EB2 -> 7F52 2EBD -> 81FC? 2EBE -> 8279
2EC1 -> 864E 2EC2 -> 8864 2EC3 -> 8980 2EC8 -> 8BA0 2ECC -> 8FB6
2ED0 -> 9485 2ED1 -> 9577 2ED2 -> 9578 2ED3 -> 957F 2ED4 -> 95E8
2ED6 -> 961D 2ED8 -> 9752 2ED9 -> 97E6 2EDB -> 98CE 2EDC -> 98DE
2EDD -> 98DF 2EDF -> 98E0 2EE0 -> 9963 2EE2 -> 9A6C 2EE3 -> 9AA8
2EE5 -> 9C7C 2EE6 -> 9E1F 2EEA -> 9EFE 2EEC -> 9F50 2EEE -> 9F7F
2EF0 -> 9F99 2EF1 -> 9F9C 2EF3 -> 9F9F
but only two of these are compatibility mappings (2E9F and 2EF3).
[NOT-USEFUL, NOT-EQUIV, INCONSISTENT]
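The inconsistency is easy to verify mechanically. A sketch, assuming
Python's unicodedata (which tracks a later Unicode version than 3.1;
the function name is mine):

```python
import unicodedata

def radicals_with_mappings(first: int = 0x2E80, last: int = 0x2EF3):
    """Code points in the CJK Radicals Supplement range that have any
    decomposition mapping in the character database."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.decomposition(chr(cp))]

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

mapped = [f"{cp:04X}" for cp in radicals_with_mappings()]
```

So, for instance, nfkc("\u2E9F") yields U+6BCD, but nfkc("\u2E85")
returns U+2E85 unchanged even though it corresponds to a unified
ideograph just as closely.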
The Yi radicals should probably also be disallowed because they
are not useful in domain names, even though they don't have any
compatibility mappings.
<vertical>
Presentation forms of symbols for use in vertical (top-to-bottom)
layout. These are all in the CJK Compatibility Forms block (FE30..FE44).
These mappings are not useful because:
- the corresponding left-to-right symbols are not normally
used in domain names (most of them are brackets).
- domain names are not normally laid out vertically (it would
be better to use a left-to-right footnote in most cases).
[NOT-USEFUL, PUNCT-SYMBOL]
<font>, <compat>, <initial>, <medial>, <final>, <isolated>
Presentation forms and some ligatures:
Latin: 0132..0133, FB00..FB06
Armenian: 0587, FB13..FB17
Arabic: 0675..0678
Lao: 0EDC..0EDD
Hebrew: FB20..FB29, FB4F
Arabic: <initial>, <medial>, <final>, <isolated>
Presentation ligatures/forms are rendering variants, so these
characters should not normally appear in external representations
of text (they are often used internally as part of a rendering
implementation, but that isn't relevant for nameprep).
Note that the word "ligature" is overloaded: some ligatures behave
like presentation forms (e.g. ff, fi, ffi, ij in Latin scripts),
while others (e.g. oe and ae) are part of the spelling of words,
such as "arch<ae>ology" (British English spelling). The argument
above does not apply to the second type of ligature, but those
don't have compatibility mappings.
An input method/keyboard driver should never generate a ligature
or presentation form, and lots of existing software would break
if it did. (In general, language-specific rules are necessary to
properly ligaturize text - e.g. in English the "ff" in "shelfful"
should not ligaturize because the two "f"s are in different
syllables.) Even if a user copies text containing presentation
ligatures from a word processor, they will be decomposed on the
clipboard, unless the word processor is completely broken in this
respect.
[INTERNAL, OBSCURE]
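For reference, this is what the two normalisation forms do with a
few presentation ligatures; a sketch using Python's unicodedata,
with "shelfful" spelt using the U+FB00 ligature purely as a test
string:

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

fi_lig = "\uFB01"        # LATIN SMALL LIGATURE FI
ffi_lig = "\uFB03"       # LATIN SMALL LIGATURE FFI
ij_lig = "\u0133"        # LATIN SMALL LIGATURE IJ
word = "shel\uFB00ul"    # "shelfful" with an (incorrect) ff ligature

# NFC leaves every one of these alone; NFKC spells them out.
unchanged_by_nfc = all(nfc(s) == s
                       for s in (fi_lig, ffi_lig, ij_lig, word))
spelt_out = [nfkc(s) for s in (fi_lig, ffi_lig, ij_lig, word)]
```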
<compat> Deprecated characters
0F77 TIBETAN VOWEL SIGN VOCALIC RR
0F79 TIBETAN VOWEL SIGN VOCALIC LL
The character descriptions say that "use of this character is
strongly discouraged". (ISTR some text in the standard explaining
why, but I can't find it now.)
[DEPRECATED]
<compat> Combinations of spacing characters:
013F LATIN CAPITAL LETTER L WITH MIDDLE DOT
0140 LATIN SMALL LETTER L WITH MIDDLE DOT
0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
0E33 THAI CHARACTER SARA AM
0EB3 LAO VOWEL SIGN AM
1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING
These are combinations of characters that were encoded as a single
character in other standards. The only reason why they aren't
canonical equivalences, is that the decomposition is to two spacing
characters, rather than a spacing character and a combining mark.
These mappings could be treated as [EXCEPTION]s, although the
combined characters are rare enough that it probably isn't worth
the hassle to do that. I don't know whether they are produced by
keyboard drivers.
<compat> Croatian digraphs:
01C4..01CC
01F1..01F3
Chapter 7 of [Unicode3.0] says:
Croatian Digraphs Matching Serbian Cyrillic Letters.
Serbo-Croatian is a single language with paired alphabets: a
Latin script (Croatian) and a Cyrillic script (Serbian). A set
of compatibility digraph codes is provided for one-to-one
transliteration.
IOW, these digraphs should occur only in text that has been
automatically transliterated from Serbian to Croatian. Normally the
digraph would be typed as two separate characters, so there is no
need for a nameprep mapping. [OBSCURE]
<compat> Roman numerals
2160..217F
[NOT-USEFUL, OBSCURE]
<font>, <compat> Latin and Greek letter-like characters
Most of the Letter-like Symbols block [OBSCURE, NOT-USEFUL, NOT-EQUIV]
00B5 MICRO SIGN [OBSCURE, NOT-USEFUL, NOT-EQUIV]
20A8 RUPEE SIGN [INCONSISTENT, NOT-USEFUL, NOT-EQUIV]
Various symbols that look like stylized letters, sometimes with
mathematical meanings.
(It's not clear why the Rupee sign should have a compatibility
mapping to "Rs", when the same doesn't apply to other currency
symbols - e.g. the Peseta sign does not have a compatibility
mapping to "Pts". In any case, that doesn't really matter, since
currency symbols are not useful in domain names.)
Greek keyboard drivers will produce the "proper" lowercase mu
character (U+03BC), not U+00B5.
Note that the following are canonical equivalents, so they should
not be disallowed (in order to satisfy the Unicode requirement of
treating canonical equivalents identically):
U+2126 OHM SIGN -> Omega
U+212A KELVIN SIGN -> K
U+212B ANGSTROM SIGN -> A with ring above
All of the remaining Letter-like Symbols should be disallowed.
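The difference between the canonical and the compatibility cases in
this block shows up as follows (a sketch with Python's unicodedata;
the helper names are mine):

```python
import unicodedata

def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

def nfkc(s: str) -> str:
    return unicodedata.normalize("NFKC", s)

# Canonical equivalents: folded even by plain NFC.
canonical = {
    "OHM SIGN": nfc("\u2126"),       # GREEK CAPITAL LETTER OMEGA
    "KELVIN SIGN": nfc("\u212A"),    # LATIN CAPITAL LETTER K
    "ANGSTROM SIGN": nfc("\u212B"),  # A WITH RING ABOVE
}

# Compatibility only: NFC keeps them, NFKC folds them.
micro_nfc, micro_nfkc = nfc("\u00B5"), nfkc("\u00B5")
rupee_nfkc = nfkc("\u20A8")
peseta_nfkc = nfkc("\u20A7")  # no mapping, so it stays as it is
```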
<font> Mathematical Alphanumeric Symbols block
1D400..1D7FF
These characters are for specialised use in mathematical text.
(In fact the whole point of encoding them was that they are
not semantically equivalent to the corresponding plain letters
and digits - so folding them would be pointless.)
[NOT-USEFUL, OBSCURE, NOT-EQUIV].
<compat> Greek symbols:
03D0..03D6
03F0..03F2
03F4..03F5
These are technical symbols, not normal Greek text.
[NOT-USEFUL, OBSCURE, NOT-EQUIV]
<compat> Miscellaneous
U+017F LATIN SMALL LETTER LONG S
This is really a glyph variant of 's'. It is rarely used, so it
doesn't really matter if it is not mapped to 's'. [OBSCURE].
<compat> Repeated characters
2033 DOUBLE PRIME
2034 TRIPLE PRIME
2036 REVERSED DOUBLE PRIME
2037 REVERSED TRIPLE PRIME
203C DOUBLE EXCLAMATION MARK
222C DOUBLE INTEGRAL
222D TRIPLE INTEGRAL
222F SURFACE INTEGRAL
2230 VOLUME INTEGRAL
[OBSCURE, PUNCT-SYMBOL].
- --
David Hopwood <david.hopwood@zetnet.co.uk>
Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip
-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv
iQEVAwUBO9UXrDkCAxeYt5gVAQGspwf8DcHtURizKZj5I5mN/oE4krd7WfIXNwNj
F7KIdavHPkNrL9JUt9j1vBr8iJ7eaYaTZ0zns0l3kL9m9QUWpmCiuqyWsdRKPRJQ
w3mwDbartNV/en+OFp2qY8uHC1WAlcwZwcgS+RmSzfuSDdiYZ2gvXbySZjVNTAk1
+LjaGuoBu8bL+0YDNClWpwQha5uPUkYvw2WvKUr5+F0ASLwoMmSqnHSIlvHVX0rd
mOphfQfgo6k/4yG6YZKmp3F+8Onfs/IC2jZeorCWBMmre9uWO49Cf+WfO0C8CzOe
o17SWXla+oNqo5dasb9ewSAlehdGxMi5Lx4HDZoqshJe4Fh2P8Ea2A==
=iMh9
-----END PGP SIGNATURE-----