Re: [idn] NFC vs NFKC
I disagree with your assessment. For example, I believe that it is a good
thing that the non-breaking feature is suppressed -- that is irrelevant to
IDNs.
However, it would make your paper much easier to examine if you removed all
the characters that end up getting disallowed -- they are not
counter-examples. Of the characters that are left that you feel are
problematic, there are three possibilities that the committee has to judge:
A. they don't matter if they are converted
B. they do matter, but they can be remapped, or prohibited*
C. they do matter, and enough to make us abandon NFKC
The downside of abandoning NFKC is that we lose the equivalence between
thousands of characters that do represent the same fundamental abstract
characters, such as width variants, Arabic ligatures, etc.
Note: a character can be deleted by the remapping phase. It can also be
effectively prohibited *before* NFKC by simply mapping it to an invalid
character, like space, that is not affected by normalization and ends up
being prohibited.
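
To make the note above concrete, here is a minimal Python sketch of that ordering: remap first, then NFKC, then the prohibition check. The names PRE_MAP, PROHIBITED and prepare are hypothetical, and both tables are tiny placeholders chosen for illustration, not the real nameprep tables.

```python
import unicodedata

# Hypothetical pre-normalisation map (placeholder entries only):
# None deletes a character; mapping to SPACE marks it for rejection.
PRE_MAP = {
    0x00AD: None,  # SOFT HYPHEN: deleted by the remapping phase
    0x2460: " ",   # CIRCLED DIGIT ONE: effectively prohibited before NFKC
}

# Placeholder prohibited set. SPACE is not changed by normalisation,
# so it survives NFKC and is caught by the check below.
PROHIBITED = {" "}


def prepare(label):
    """Remap, apply NFKC, then reject labels containing prohibited characters."""
    remapped = label.translate(PRE_MAP)
    normalized = unicodedata.normalize("NFKC", remapped)
    bad = [ch for ch in normalized if ch in PROHIBITED]
    if bad:
        raise ValueError(f"prohibited character(s): {bad!r}")
    return normalized
```

With these placeholder tables, a label containing U+2460 is rejected instead of being folded to "1", while width variants are still folded by NFKC.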
BTW I have very little hope that this committee will ever produce a result
if issues keep getting re-raised time and time and time again. For the
committee to ever reach some kind of resolution, people have to ask
themselves, *not* whether they think the current situation is absolutely
optimal (in their view) -- since it *never* will be -- but instead, whether
they can live with the result; whether there are really any results that
will cause significant problems in practice.
Mark
—————
Δός μοι ποῦ στῶ, καὶ κινῶ τὴν γῆν — Ἀρχιμήδης
[http://www.macchiato.com]
----- Original Message -----
From: "David Hopwood" <david.hopwood@zetnet.co.uk>
To: <idn@ops.ietf.org>
Sent: Tuesday, October 23, 2001 00:17
Subject: [idn] NFC vs NFKC
> -----BEGIN PGP SIGNED MESSAGE-----
>
> David Hopwood wrote:
> > ... I've been preparing a detailed rationale for using NFC in
> > preference to NFKC, that considers all the categories of compatibility
> > mappings. I'll post it tomorrow.
>
> Took me a bit longer than I thought to finish, but here it is:
>
> - -----
> This post categorises and describes all of the compatibility
> mappings in Unicode 3.1. The intention is to show that these
> mappings are of little value for name preparation, because in
> almost all cases, one or more of the following applies:
>
> DISALLOWED: the source characters are already disallowed by
> nameprep-06, so removing the mapping has no effect.
> NOT-USEFUL: the source characters are not useful in domain names.
> PUNCT-SYMBOL: the source characters are punctuation or symbols,
> similar to characters that are disallowed for ASCII.
> NOT-EQUIV: the source characters are not semantically equivalent
> to the target characters - i.e. folding them would be more
> confusing than not folding them (if, for the sake of argument,
> they were allowed).
> LEGACY: the source characters are *only* intended to be used for
> round-trip mappings from legacy charsets.
> DEPRECATED: the source characters are formally deprecated by
> Unicode 3.1.
> OBSCURE: the source characters are very difficult to type, or
> produce using an input method, so not folding them is unlikely
> to cause any practical difficulty to users.
> INTERNAL: the source characters are normally only intended for
> internal use within an application or rendering engine.
> INCONSISTENT: a category of mappings is arbitrary and inconsistent,
> with only some of the potential mappings in that category being
> defined as compatibility equivalences.
>
> There are a very small number of exceptions:
>
> EXCEPTION: the mapping could be quite useful, and there is no
> reason to exclude it on any of the grounds above.
>
> I claim that there are few enough exceptions to show that NFKC
> is not the right mapping to use: *if* some of these [EXCEPTION]
> mappings are wanted, they should be handled as an additional step
> to NFC normalisation (similar to the mapped-out characters).
>
>
> Here are the mapping categories. They're in no particular order,
> and some characters appear in more than one category.
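
For anyone who wants to check the categories below against real data, the following Python sketch tallies compatibility mappings by tag. It is only an approximation of the survey: it uses whatever Unicode version ships with the Python interpreter, not Unicode 3.1, so the counts will differ from those in 2001. The function name compat_tag_counts is made up for this example.

```python
import sys
import unicodedata
from collections import Counter


def compat_tag_counts():
    """Count compatibility decompositions by tag (e.g. 'wide', 'noBreak')."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Compatibility mappings start with a tag such as "<wide>";
        # canonical mappings have no tag and are skipped here.
        if decomp.startswith("<"):
            tag = decomp.split(">", 1)[0][1:]
            counts[tag] += 1
    return counts


if __name__ == "__main__":
    for tag, n in compat_tag_counts().most_common():
        print(f"{tag:10} {n}")
```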
>
> <nobreak>
> The source character is a non-breaking version of the target
> character:
> 00A0 NO-BREAK SPACE [DISALLOWED]
> 0F0C TIBETAN MARK DELIMITER TSHEG BSTAR [NOT-USEFUL; see below]
> 2007 FIGURE SPACE [DISALLOWED]
> 2011 NON-BREAKING HYPHEN [EXCEPTION]
> 202F NARROW NO-BREAK SPACE [DISALLOWED]
>
> Tibetan script consists of morphemes separated by tseks (also
> transliterated as "tsheg"); see section 9.13 of the Unicode
> standard. U+0F0C is a non-breaking tsek (the character name is
> a mistake). In domain names, either an ordinary tsek (U+0F0B),
> or a hyphen should be used instead.
>
> It's possible that a document might use a non-breaking hyphen
> to prevent a domain name or URI being split over lines, and it
> could be useful to map it to a hyphen when cutting and pasting,
> so this is an [EXCEPTION].
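
A small Python sketch of what NFC and NFKC do to the <nobreak> characters, using the data bundled with Python rather than Unicode 3.1. The helper names are invented for the example. One detail worth noticing: in that data U+2011 folds to U+2010 HYPHEN, not to ASCII U+002D, so getting an LDH hyphen would still need a separate mapping step.

```python
import unicodedata

# The five <noBreak> source characters listed above.
NOBREAK_SOURCES = ["\u00A0", "\u0F0C", "\u2007", "\u2011", "\u202F"]


def code_points(text):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def compare_forms(ch):
    """Return (NFC result, NFKC result) for one character, as code points."""
    nfc = unicodedata.normalize("NFC", ch)
    nfkc = unicodedata.normalize("NFKC", ch)
    return code_points(nfc), code_points(nfkc)


if __name__ == "__main__":
    for ch in NOBREAK_SOURCES:
        nfc, nfkc = compare_forms(ch)
        print(f"{code_points(ch)}: NFC -> {nfc}, NFKC -> {nfkc}")
```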
>
> <super> and <sub>
> The source character is a superscripted or subscripted version of
> the target character. These can be further categorised as:
>
> letters (ordinal indicators, modifier letters, and 'n')
> [NOT-EQUIV]
>
> digits [NOT-USEFUL, OBSCURE, NOT-EQUIV]
>
> symbols (including superscript SM and TM)
> [NOT-USEFUL, OBSCURE, NOT-EQUIV]
>
> Kanbun (annotation of classical Chinese in Japanese texts)
> [NOT-USEFUL, OBSCURE, NOT-EQUIV]
>
> Whether the letters are useful is arguable, but if they are, they
> should not be folded (since they are definitely not semantically
> equivalent to the target character).
>
> <fraction>
> Various legacy character sets have characters for fractions, e.g.
> 1/2, etc. [LEGACY, OBSCURE, NOT-USEFUL].
>
> <circle> and <square>,
> also <compat> 3036 CIRCLED POSTAL MARK
>
> Circled and squared variants. [NOT-USEFUL, NOT-EQUIV]
>
> (Note that the decomposition is to the uncircled/unsquared character
> on its own, without a U+20DD COMBINING ENCLOSING CIRCLE or
> U+20DE COMBINING ENCLOSING SQUARE. So the effect of this folding
> is that names with and without the circle/square are equivalent,
> despite being visually distinct.)
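
The effect described in that note can be seen with a short Python sketch (the function name same_name is hypothetical; the data is whatever Unicode version Python bundles). Under NFKC the circled and plain spellings compare equal; under NFC they stay distinct.

```python
import unicodedata


def same_name(form, a, b):
    """True if two strings are identical after the given normalisation form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# U+24D0 CIRCLED LATIN SMALL LETTER A, U+2460 CIRCLED DIGIT ONE
circled = "\u24D0\u2460"
plain = "a1"

if __name__ == "__main__":
    print("NFKC equal:", same_name("NFKC", circled, plain))
    print("NFC equal: ", same_name("NFC", circled, plain))
```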
>
> <wide> Full-width variants
> FF01..FF0C Full-width symbols/punctuation [DISALLOWED, PUNCT-SYMBOL]
> FF0E FULLWIDTH FULL STOP [DISALLOWED]
> FF0F FULLWIDTH SOLIDUS [DISALLOWED, PUNCT-SYMBOL]
> FF1A..FF20 Full-width symbols/punctuation [DISALLOWED, PUNCT-SYMBOL]
> FF3B..FF40 Full-width symbols/punctuation [DISALLOWED, PUNCT-SYMBOL]
> FFE0..FFE6 Full-width symbols [PUNCT-SYMBOL]
>
> FF0D FULLWIDTH HYPHEN-MINUS [EXCEPTION]
> FF10..FF19 FULLWIDTH DIGIT ZERO..NINE [EXCEPTION]
> FF21..FF3A FULLWIDTH LATIN CAPITAL LETTER A..Z [EXCEPTION]
> FF41..FF5A FULLWIDTH LATIN SMALL LETTER A..Z [EXCEPTION]
>
> CJK input methods can sometimes produce full-width characters,
> and it may be useful to map these to half-width (normal) LDH
> ASCII characters.
>
> However, nameprep is probably not the best place to do that.
> Doing it there would mean that it is valid for full-width ASCII
> to appear in an encoded name [*]. This will display as replacement
> boxes when viewed on a system without CJK fonts. It would be far
> preferable to make sure that encoded names always use normal
> ASCII. That suggests doing this folding in name input widgets,
> and/or defining a way to tell input methods when a domain name (or
> similar identifier) is being entered.
>
> Note that CJK users already have to set input methods to produce
> half-width ASCII characters, in order to type existing LDH ASCII
> domain names. So although this folding may improve usability, it
> isn't essential.
>
> [*] I'm making the assumption that whatever IDN solution is chosen
> will allow names to be encoded transparently in at least some
> cases, i.e. it won't force ACE to be used everywhere.
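
One way the "fold in the input widget" idea might look, as a Python sketch only. It folds just the full-width LDH characters marked [EXCEPTION] above and then applies NFC, leaving every other compatibility character alone. The function names are hypothetical; the fixed offset 0xFEE0 is the distance between the full-width ASCII block and ASCII.

```python
import unicodedata

# Distance between full-width ASCII variants (FF01..FF5E) and ASCII (21..7E).
FULLWIDTH_OFFSET = 0xFEE0


def is_fullwidth_ldh(cp):
    """True for full-width hyphen-minus, digits, and Latin letters."""
    return (
        cp == 0xFF0D
        or 0xFF10 <= cp <= 0xFF19
        or 0xFF21 <= cp <= 0xFF3A
        or 0xFF41 <= cp <= 0xFF5A
    )


def fold_fullwidth_ldh(text):
    """Fold full-width LDH characters to ASCII, then apply NFC only."""
    folded = "".join(
        chr(ord(ch) - FULLWIDTH_OFFSET) if is_fullwidth_ldh(ord(ch)) else ch
        for ch in text
    )
    return unicodedata.normalize("NFC", folded)
```

Full-width punctuation such as U+FF01 is deliberately left untouched here, so a later prohibition step can still see and reject it.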
>
> <narrow>
> FF61..FF64 Half-width punctuation [PUNCT-SYMBOL]
> FF65..FF9F Half-width Katakana [EXCEPTION]
> FFE8..FFEE Half-width symbols [PUNCT-SYMBOL]
>
> The same comments apply to half-width Katakana as to the <wide>
> mappings above.
>
> <compat>, <narrow> Hangul Compatibility Jamo
> 3131..318E Full-width Compatibility Jamo
> FFA0..FFDC Half-width Compatibility Jamo
>
> The normal set of Jamo encoded at 1100..11FF is conjoining, that
> is, sequences of Jamo are displayed as, and are NFC-equivalent
> to, the corresponding syllables. The Compatibility Jamo (both
> full-width and half-width) are non-conjoining, i.e. they each
> take up a character cell; that is the only reason why they were
> encoded separately.
>
> So, the effect of using NFKC is that a domain name could be
> displayed with Jamo in separate character cells, but would
> actually be equivalent to the corresponding name displayed as
> syllables. I can't see any reason why that would be desirable.
>
> Also, section 10.4 of [Unicode3.0] says, "These characters are
> provided solely for compatibility with the KS C 5601 standard."
> [LEGACY, NOT-EQUIV].
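
The Jamo behaviour can be checked with a few lines of Python (a sketch against the Unicode data bundled with Python; the variable names are arbitrary). Two compatibility Jamo that display in separate cells become a single precomposed syllable under NFKC, while NFC leaves them alone.

```python
import unicodedata

# U+3131 HANGUL LETTER KIYEOK + U+314F HANGUL LETTER A (compatibility Jamo)
compat_jamo = "\u3131\u314F"
# U+1100 HANGUL CHOSEONG KIYEOK + U+1161 HANGUL JUNGSEONG A (conjoining Jamo)
conjoining_jamo = "\u1100\u1161"
# U+AC00 HANGUL SYLLABLE GA
syllable = "\uAC00"

nfc_compat = unicodedata.normalize("NFC", compat_jamo)
nfkc_compat = unicodedata.normalize("NFKC", compat_jamo)
nfc_conjoining = unicodedata.normalize("NFC", conjoining_jamo)

if __name__ == "__main__":
    print("NFC keeps compatibility Jamo: ", nfc_compat == compat_jamo)
    print("NFKC yields the syllable:     ", nfkc_compat == syllable)
    print("NFC composes conjoining Jamo: ", nfc_conjoining == syllable)
```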
>
> <small>
> These are all in the Small Form Variants block (FE50..FE6B).
> They were only encoded for compatibility with CNS 11643.
>
> Most are [DISALLOWED] because the corresponding ASCII symbol
> is disallowed; the following are not:
> FE51 SMALL IDEOGRAPHIC COMMA [LEGACY, PUNCT-SYMBOL]
> FE58 SMALL EM DASH [LEGACY, PUNCT-SYMBOL]
> FE5D SMALL LEFT TORTOISE SHELL BRACKET [LEGACY, PUNCT-SYMBOL]
> FE5E SMALL RIGHT TORTOISE SHELL BRACKET [LEGACY, PUNCT-SYMBOL]
> FE63 SMALL HYPHEN-MINUS [LEGACY]
>
> Note that it would probably be more useful for a converter from
> CNS 11643 to map to the ordinary variants of these characters,
> anyway, rather than the small variants, which no-one uses.
>
> <compat> Overline variants
> FE49..FE4C
> [NOT-USEFUL, PUNCT-SYMBOL]
>
> <compat> Spaces
> (mapping is U+0020 SPACE; also <wide> 3000 IDEOGRAPHIC SPACE).
> [DISALLOWED]
>
> <compat> Spacing marks
> (mapping starts with U+0020 SPACE)
> These are mappings from a spacing diacritical mark, to <space> +
> the corresponding combining mark. They are [DISALLOWED] because
> the <space> is disallowed.
>
> <compat> Maps to disallowed ASCII (other than space)
> 2024 ONE DOT LEADER
> 2025 TWO DOT LEADER
> 2026 HORIZONTAL ELLIPSIS
> 203C DOUBLE EXCLAMATION MARK
> 2048 QUESTION EXCLAMATION MARK
> 2049 EXCLAMATION QUESTION MARK
> 2474..2487 PARENTHESIZED DIGIT/NUMBER ONE..TWENTY
> 2488..249B DIGIT/NUMBER ONE..TWENTY FULL STOP
> 249C..24B5 PARENTHESIZED LATIN SMALL LETTER A..Z
> 3200..321C PARENTHESIZED HANGUL *
> 3220..3243 PARENTHESIZED IDEOGRAPH *
> FE4D DASHED LOW LINE
> FE4E CENTRELINE LOW LINE
> FE4F WAVY LOW LINE
>
> [DISALLOWED].
>
> <compat> Hangzhou numerals
> 3038 HANGZHOU NUMERAL TEN
> 3039 HANGZHOU NUMERAL TWENTY
> 303A HANGZHOU NUMERAL THIRTY
>
> These map to the ideographs U+5341 meaning ten (or complete or
> perfect), U+5344 meaning twenty, and U+5345 meaning thirty.
> I suspect that input methods will normally produce those
> ideographs, not the numeral characters (i.e. these characters
> are [OBSCURE]) - can anyone confirm that?
>
> <compat> Ideographic telegraph symbols for months, hours, and days
> 32C0..32CB
> 3358..3370
> 33E0..33FE
>
> These map to a decimal ASCII number, followed by the ideograph
> U+6708 (for months) or U+70B9 (for hours) or U+65E5 (for days).
> Again, I suspect that input methods will produce those sequences
> rather than the symbols. [OBSCURE].
>
> <compat> CJK Radicals
> 2E9F CJK RADICAL MOTHER
> 2EF3 CJK RADICAL C-SIMPLIFIED TURTLE
> 2F00..2FD5 KangXi radicals block
>
> See section 10.1 of [Unicode3.0] for a discussion of radicals. Their
> main uses are:
> - to categorize or collate ideographs (e.g. in an index)
> - to describe new ideographs, especially using the "ideographic
> description sequence" convention.
>
> The first of these isn't applicable to domain names, and nameprep
> already disallows ideographic description characters. Therefore,
> the simplest approach would be to disallow all radicals.
>
> Note that even if mapping from radicals to ideographs were a good
> idea, the selection of such mappings defined by NFKC is highly
> inconsistent - e.g. the following radicals from the CJK Radicals
> Supplement block correspond to unified ideographs:
>
> 2E83 -> 4E5A 2E85 -> 4EB8 2E8E -> 5140 2E8F -> 5C23 2E90 -> 5C22
> 2E92 -> 5DF3 2E96 -> 5FC4 2E98 -> 624C 2E9F -> 6BCD 2EC0 -> 535D
> 2EA1 -> 6C35 2EA3 -> 706C 2EA8 -> 72AD 2EAD -> 793B 2EAF -> 7CF9
> 2EB0 -> 7E9F 2EB1 -> 7F53 2EB2 -> 7F52 2EBD -> 81FC? 2EBE -> 8279
> 2EC1 -> 864E 2EC2 -> 8864 2EC3 -> 8980 2EC8 -> 8BA0 2ECC -> 8FB6
> 2ED0 -> 9485 2ED1 -> 9577 2ED2 -> 9578 2ED3 -> 957F 2ED4 -> 95E8
> 2ED6 -> 961D 2ED8 -> 9752 2ED9 -> 97E6 2EDB -> 98CE 2EDC -> 98DE
> 2EDD -> 98DF 2EDF -> 98E0 2EE0 -> 9963 2EE2 -> 9A6C 2EE3 -> 9AA8
> 2EE5 -> 9C7C 2EE6 -> 9E1F 2EEA -> 9EFE 2EEC -> 9F50 2EEE -> 9F7F
> 2EF0 -> 9F99 2EF1 -> 9F9C 2EF3 -> 9F9F
>
> but only two of these are compatibility mappings (2E9F and 2EF3).
> [NOT-USEFUL, NOT-EQUIV, INCONSISTENT]
>
> The Yi radicals should probably also be disallowed because they
> are not useful in domain names, even though they don't have any
> compatibility mappings.
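
The inconsistency is easy to confirm from the character database. The Python sketch below (hypothetical helper name, Unicode data as bundled with Python) looks up the decomposition field for a sample of the radicals listed above; only U+2E9F and U+2EF3 report a mapping, while every Kangxi radical does.

```python
import unicodedata

# A sample of the CJK Radicals Supplement characters from the list above.
SAMPLE_RADICALS = [0x2E83, 0x2E85, 0x2E8E, 0x2E9F, 0x2EC1, 0x2EF0, 0x2EF3]


def compat_mapping(cp):
    """Return the raw decomposition field, or None if the character has none."""
    return unicodedata.decomposition(chr(cp)) or None


if __name__ == "__main__":
    for cp in SAMPLE_RADICALS:
        print(f"U+{cp:04X}: {compat_mapping(cp)}")
    # For comparison, the first Kangxi radical:
    print(f"U+2F00: {compat_mapping(0x2F00)}")
```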
>
> <vertical>
> Presentation forms of symbols for use in vertical (top-to-bottom)
> layout. These are all in the CJK Compatibility Forms block (FE30..FE44).
> These mappings are not useful because:
>
> - the corresponding left-to-right symbols are not normally
> used in domain names (most of them are brackets).
> - domain names are not normally laid out vertically (it would
> be better to use a left-to-right footnote in most cases).
>
> [NOT-USEFUL, PUNCT-SYMBOL]
>
> <font>, <compat>, <initial>, <medial>, <final>, <isolated>
> Presentation forms and some ligatures:
>
> Latin: 0132..0133, FB00..FB06
> Armenian: 0587, FB13..FB17
> Arabic: 0675..0678
> Lao: 0EDC..0EDD
> Hebrew: FB20..FB29, FB4F
> Arabic: <initial>, <medial>, <final>, <isolated>
>
> Presentation ligatures/forms are rendering variants, so these
> characters should not normally appear in external representations
> of text (they are often used internally as part of a rendering
> implementation, but that isn't relevant for nameprep).
>
> Note that the word "ligature" is overloaded: some ligatures behave
> like presentation forms (e.g. ff, fi, ffi, ij in Latin scripts),
> while others (e.g. oe and ae) are part of the spelling of words,
> such as "arch<ae>ology" (British English spelling). The argument
> above does not apply to the second type of ligature, but those
> don't have compatibility mappings.
>
> An input method/keyboard driver should never generate a ligature
> or presentation form, and lots of existing software would break
> if it did. (In general, language-specific rules are necessary to
> properly ligaturize text - e.g. in English the "ff" in "shelfful"
> should not ligaturize because the two "f"s are in different
> syllables.) Even if a user copies text containing presentation
> ligatures from a word processor, they will be decomposed on the
> clipboard, unless the word processor is completely broken in this
> respect.
> [INTERNAL, OBSCURE]
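
The two kinds of "ligature" can be told apart mechanically, as this Python sketch suggests (the function name is made up; the data is Python's bundled Unicode version). Presentation ligatures are folded by NFKC, whereas the spelling ligatures oe and ae have no compatibility mapping and pass through unchanged.

```python
import unicodedata


def nfkc(text):
    """Apply NFKC normalisation."""
    return unicodedata.normalize("NFKC", text)


# Presentation ligatures: U+FB01 (fi), U+FB03 (ffi), U+0133 (ij)
presentation = {"\uFB01": "fi", "\uFB03": "ffi", "\u0133": "ij"}
# Spelling ligatures: U+0153 (oe), U+00E6 (ae)
spelling = ["\u0153", "\u00E6"]

if __name__ == "__main__":
    for lig, expected in presentation.items():
        print(f"U+{ord(lig):04X} -> {nfkc(lig)!r} (expected {expected!r})")
    for lig in spelling:
        print(f"U+{ord(lig):04X} unchanged: {nfkc(lig) == lig}")
```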
>
> <compat> Deprecated characters
> 0F77 TIBETAN VOWEL SIGN VOCALIC RR
> 0F79 TIBETAN VOWEL SIGN VOCALIC LL
>
> The character descriptions say that "use of this character is
> strongly discouraged". (ISTR some text in the standard explaining
> why, but I can't find it now.)
> [DEPRECATED]
>
> <compat> Combinations of spacing characters:
> 013F LATIN CAPITAL LETTER L WITH MIDDLE DOT
> 0140 LATIN SMALL LETTER L WITH MIDDLE DOT
> 0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
> 0E33 THAI CHARACTER SARA AM
> 0EB3 LAO VOWEL SIGN AM
> 1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING
>
> These are combinations of characters that were encoded as a single
> character in other standards. The only reason why they aren't
> canonical equivalences, is that the decomposition is to two spacing
> characters, rather than a spacing character and a combining mark.
>
> These mappings could be treated as [EXCEPTION]s, although the
> combined characters are rare enough that it probably isn't worth
> the hassle to do that. I don't know whether they are produced by
> keyboard drivers.
>
> <compat> Croatian digraphs:
> 01C4..01CC
> 01F1..01F3
>
> Chapter 7 of [Unicode3.0] says:
>
> Croatian Digraphs Matching Serbian Cyrillic Letters.
>
> Serbo-Croatian is a single language with paired alphabets: a
> Latin script (Croatian) and a Cyrillic script (Serbian). A set
> of compatibility digraph codes is provided for one-to-one
> transliteration.
>
> IOW, these digraphs should occur only in text that has been
> automatically transliterated from Serbian to Croatian. Normally the
> digraph would be typed as two separate characters, so there is no
> need for a nameprep mapping. [OBSCURE]
>
> <compat> Roman numerals
> 2160..217F
> [NOT-USEFUL, OBSCURE]
>
> <font>, <compat> Latin and Greek letter-like characters
> Most of the Letter-like Symbols block [OBSCURE, NOT-USEFUL, NOT-EQUIV]
> 00B5 MICRO SIGN [OBSCURE, NOT-USEFUL, NOT-EQUIV]
> 20A8 RUPEE SIGN [INCONSISTENT, NOT-USEFUL, NOT-EQUIV]
>
> Various symbols that look like stylized letters, sometimes with
> mathematical meanings.
>
> (It's not clear why the Rupee sign should have a compatibility
> mapping to "Rs", when the same doesn't apply to other currency
> symbols - e.g. the Peseta sign does not have a compatibility
> mapping to "Pts". In any case, that doesn't really matter, since
> currency symbols are not useful in domain names.)
>
> Greek keyboard drivers will produce the "proper" lowercase mu
> character (U+03BC), not U+00B5.
>
> Note that the following are canonical equivalents, so they should
> not be disallowed (in order to satisfy the Unicode requirement of
> treating canonical equivalents identically):
> U+2126 OHM SIGN -> Omega
> U+212A KELVIN SIGN -> K
> U+212B ANGSTROM SIGN -> A with ring above
>
> All of the remaining Letter-like Symbols should be disallowed.
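
The canonical/compatibility split among these letter-like characters shows up directly in the normalisation results. A Python sketch, using the Unicode data bundled with Python and arbitrary helper names: the three signs above are already folded by NFC, while MICRO SIGN and RUPEE SIGN change only under NFKC, and PESETA SIGN changes under neither.

```python
import unicodedata


def nfc(text):
    """Apply NFC normalisation."""
    return unicodedata.normalize("NFC", text)


def nfkc(text):
    """Apply NFKC normalisation."""
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Canonical equivalents: folded even by NFC.
    for sign in ("\u2126", "\u212A", "\u212B"):
        print(f"U+{ord(sign):04X} NFC -> U+{ord(nfc(sign)):04X}")
    # Compatibility mappings: NFC leaves them alone, NFKC folds them.
    for sign in ("\u00B5", "\u20A8", "\u20A7"):
        print(f"U+{ord(sign):04X} NFC same: {nfc(sign) == sign}, NFKC -> {nfkc(sign)!r}")
```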
>
> <font> Mathematical Alphanumeric Symbols block
> 1D400..1D7FF
>
> These characters are for specialised use in mathematical text.
> (In fact the whole point of encoding them was that they are
> not semantically equivalent to the corresponding plain letters
> and digits - so folding them would be pointless.)
>
> [NOT-USEFUL, OBSCURE, NOT-EQUIV].
>
> <compat> Greek symbols:
> 03D0..03D6
> 03F0..03F2
> 03F4..03F5
> These are technical symbols, not normal Greek text.
> [NOT-USEFUL, OBSCURE, NOT-EQUIV]
>
> <compat> Miscellaneous
> U+017F LATIN SMALL LETTER LONG S
>
> This is really a glyph variant of 's'. It is rarely used, so it
> doesn't really matter if it is not mapped to 's'. [OBSCURE].
>
> <compat> Repeated characters
> 2033 DOUBLE PRIME
> 2034 TRIPLE PRIME
> 2036 REVERSED DOUBLE PRIME
> 2037 REVERSED TRIPLE PRIME
> 203C DOUBLE EXCLAMATION MARK
> 222C DOUBLE INTEGRAL
> 222D TRIPLE INTEGRAL
> 222F SURFACE INTEGRAL
> 2230 VOLUME INTEGRAL
>
> [OBSCURE, PUNCT-SYMBOL].
>
> - --
> David Hopwood <david.hopwood@zetnet.co.uk>
>
> Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
> RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
> Nothing in this message is intended to be legally binding. If I revoke a
> public key but refuse to specify why, it is because the private key has been
> seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip
>
>
> -----BEGIN PGP SIGNATURE-----
> Version: 2.6.3i
> Charset: noconv
>
> iQEVAwUBO9UXrDkCAxeYt5gVAQGspwf8DcHtURizKZj5I5mN/oE4krd7WfIXNwNj
> F7KIdavHPkNrL9JUt9j1vBr8iJ7eaYaTZ0zns0l3kL9m9QUWpmCiuqyWsdRKPRJQ
> w3mwDbartNV/en+OFp2qY8uHC1WAlcwZwcgS+RmSzfuSDdiYZ2gvXbySZjVNTAk1
> +LjaGuoBu8bL+0YDNClWpwQha5uPUkYvw2WvKUr5+F0ASLwoMmSqnHSIlvHVX0rd
> mOphfQfgo6k/4yG6YZKmp3F+8Onfs/IC2jZeorCWBMmre9uWO49Cf+WfO0C8CzOe
> o17SWXla+oNqo5dasb9ewSAlehdGxMi5Lx4HDZoqshJe4Fh2P8Ea2A==
> =iMh9
> -----END PGP SIGNATURE-----
>
>