
Re: [idn] NFC vs NFKC



This is very interesting material. Thanks a lot for all this work.

The conclusions coincide mostly with mine, although I haven't
done such an explicit analysis.

Some comments below:

At 08:17 01/10/23 +0100, David Hopwood wrote:
>-----BEGIN PGP SIGNED MESSAGE-----
>
>David Hopwood wrote:
> > ... I've been preparing a detailed rationale for using NFC in
> > preference to NFKC, that considers all the categories of compatibility
> > mappings. I'll post it tomorrow.
>
>Took me a bit longer than I thought to finish, but here it is:
>
>- -----
>This post categorises and describes all of the compatibility
>mappings in Unicode 3.1. The intention is to show that these
>mappings are of little value for name preparation, because in
>almost all cases, one or more of the following applies:
>
>DISALLOWED: the source characters are already disallowed by
>    nameprep-06, so removing the mapping has no effect.
>NOT-USEFUL: the source characters are not useful in domain names.
>PUNCT-SYMBOL: the source characters are punctuation or symbols,
>    similar to characters that are disallowed for ASCII.
>NOT-EQUIV: the source characters are not semantically equivalent
>    to the target characters - i.e. folding them would be more
>    confusing than not folding them (if, for the sake of argument,
>    they were allowed).
>LEGACY: the source characters are *only* intended to be used for
>    round-trip mappings from legacy charsets.
>DEPRECATED: the source characters are formally deprecated by
>    Unicode 3.1.
>OBSCURE: the source characters are very difficult to type, or
>    produce using an input method, so not folding them is unlikely
>    to cause any practical difficulty to users.
>INTERNAL: the source characters are normally only intended for
>    internal use within an application or rendering engine.
>INCONSISTENT: a category of mappings is arbitrary and inconsistent,
>    with only some of the potential mappings in that category being
>    defined as compatibility equivalences.
>
>There are a very small number of exceptions:
>
>EXCEPTION: the mapping could be quite useful, and there is no
>    reason to exclude it on any of the grounds above.
>
>I claim that there are few enough exceptions to show that NFKC
>is not the right mapping to use: *if* some of these [EXCEPTION]
>mappings are wanted, they should be handled as an additional step
>to NFC normalisation (similar to the mapped-out characters).

I fully agree.
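
As a rough sketch of what "NFC plus an additional mapping step" could
look like: the following assumes Python's standard unicodedata module
(which tracks a newer Unicode version than 3.1). The EXTRA_FOLDS table
and the name prepare_label are hypothetical and only hold a few of the
[EXCEPTION] mappings; they are not taken from any nameprep draft.

```python
import unicodedata

# Hypothetical table of hand-picked [EXCEPTION] mappings (an assumption
# for illustration; not from any nameprep draft).
EXTRA_FOLDS = {0x2011: "-", 0xFF0D: "-"}  # non-breaking / full-width hyphen
for first, last in ((0xFF10, 0xFF19), (0xFF21, 0xFF3A), (0xFF41, 0xFF5A)):
    for cp in range(first, last + 1):
        # Full-width ASCII sits at a fixed offset of 0xFEE0 from ASCII.
        EXTRA_FOLDS[cp] = chr(cp - 0xFEE0)


def prepare_label(label: str) -> str:
    """Apply the explicit folds, then NFC (no blanket NFKC)."""
    return unicodedata.normalize("NFC", label.translate(EXTRA_FOLDS))


# Full-width "abc", non-breaking hyphen, full-width "1", trade mark sign.
sample = "\uff41\uff42\uff43\u2011\uff11\u2122"
folded = prepare_label(sample)                    # trade mark sign is kept
blanket = unicodedata.normalize("NFKC", sample)   # trade mark sign is rewritten
```

With this shape, only the characters listed in the table are folded;
everything else is left for the prohibition step to accept or reject.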


>Here are the mapping categories. They're in no particular order,
>and some characters appear in more than one category.
>
><nobreak>
>   The source character is a non-breaking version of the target
>   character:
>     00A0 NO-BREAK SPACE                       [DISALLOWED]
>     0F0C TIBETAN MARK DELIMITER TSHEG BSTAR   [NOT-USEFUL; see below]
>     2007 FIGURE SPACE                         [DISALLOWED]
>     2011 NON-BREAKING HYPHEN                  [EXCEPTION]
>     202F NARROW NO-BREAK SPACE                [DISALLOWED]
>
>   Tibetan script consists of morphemes separated by tseks (also
>   transliterated as "tsheg"); see section 9.13 of the Unicode
>   standard. U+0F0C is a non-breaking tsek (the character name is
>   a mistake). In domain names, either an ordinary tsek (U+0F0B),
>   or a hyphen should be used instead.
>
>   It's possible that a document might use a non-breaking hyphen
>   to prevent a domain name or URI being split over lines, and it
>   could be useful to map it to a hyphen when cutting and pasting,
>   so this is an [EXCEPTION].

It seems inconsistent to treat a non-breaking hyphen as an
[EXCEPTION], but a non-breaking tsheg as [NOT-USEFUL].
The non-breaking tsheg will be as useful to Tibetan domain
names as the non-breaking hyphen to domain names that use it,
or actually more so because there is a tsheg between any
two syllables in Tibetan, but hyphens appear less often.
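
To compare the two cases side by side, here is a small sketch that
lists what NFKC does to each <nobreak> character. It assumes Python's
unicodedata module, whose data is newer than Unicode 3.1; the variable
names are illustrative.

```python
import unicodedata

# The <nobreak> source characters listed in the quoted table.
NOBREAK = ["\u00a0", "\u0f0c", "\u2007", "\u2011", "\u202f"]

# Map each source character to its NFKC result.
nobreak_targets = {ch: unicodedata.normalize("NFKC", ch) for ch in NOBREAK}

for src, dst in nobreak_targets.items():
    print(f"U+{ord(src):04X} {unicodedata.name(src)}"
          f" -> U+{ord(dst):04X} {unicodedata.name(dst)}")
```

In this data the non-breaking tsheg maps to the ordinary tsheg, and
the non-breaking hyphen maps to U+2010 HYPHEN rather than to ASCII
HYPHEN-MINUS, so an explicit mapping would still be needed to arrive
at an LDH hyphen.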


><fraction>
>   Various legacy character sets have characters for fractions, e.g.
>   1/2, etc. [LEGACY, OBSCURE, NOT-USEFUL].

This is a typical case: there is no problem for IDN, because the
'/' produced is disallowed anyway, but problems will appear if
NFKC is applied to something similar.
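
A quick way to see what the <fraction> mapping actually produces,
sketched with Python's unicodedata module (newer data than Unicode
3.1; variable names are illustrative):

```python
import unicodedata

# U+00BD VULGAR FRACTION ONE HALF under NFKC.
half = unicodedata.normalize("NFKC", "\u00bd")

# Look at the separator character that comes out in the middle.
slash = half[1]
slash_name = unicodedata.name(slash)
```

In this data the separator is U+2044 FRACTION SLASH, so whether the
result is acceptable depends on how the consumer treats that
character.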


><wide> Full-width variants

>   FF0D       FULLWIDTH HYPHEN-MINUS                [EXCEPTION]
>   FF10..FF19 FULLWIDTH DIGIT ZERO..NINE            [EXCEPTION]
>   FF21..FF3A FULLWIDTH LATIN CAPITAL LETTER A..Z   [EXCEPTION]
>   FF41..FF5A FULLWIDTH LATIN SMALL LETTER A..Z     [EXCEPTION]
>
>   CJK input methods can sometimes produce full-width characters,
>   and it may be useful to map these to half-width (normal) LDH
>   ASCII characters.
>
>   However, nameprep is probably not the best place to do that.
>   Doing it there would mean that it is valid for full-width ASCII
>   to appear in an encoded name [*]. This will display as replacement
>   boxes when viewed on a system without CJK fonts. It would be far
>   preferable to make sure that encoded names always use normal
>   ASCII. That suggests doing this folding in name input widgets,

This could be handled together with the 'ideographic fullstop'
issue, for which special treatment is necessary anyway.


>   and/or defining a way to tell input methods when a domain name (or
>   similar identifier) is being entered.
>
>   Note that CJK users already have to set input methods to produce
>   half-width ASCII characters, in order to type existing LDH ASCII
>   domain names. So although this folding may improve usability, it
>   isn't essential.

I disagree quite a bit. For current all-ASCII domain names, it's
not a big issue, but for mixed Kanji/Latin domain names, it's
much more of an issue. Users, in particular in Japan, are used to
switching to a different mode for ASCII-only work, but there are
a lot of input methods that produce full-width Latin.
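
For illustration, a minimal sketch of folding at input time that
handles full-width ASCII and the ideographic full stop in one pass.
It assumes Python's unicodedata module; the DOT_LIKE set and the name
fold_for_input are hypothetical choices, not something any draft
specifies.

```python
import unicodedata

# Hypothetical set of characters an input widget might treat as label
# separators (an assumption for illustration).
DOT_LIKE = {0x3002: ".", 0xFF0E: ".", 0xFF61: "."}


def fold_for_input(name: str) -> str:
    """Fold dot-like characters and full-width ASCII; leave the rest alone."""
    out = []
    for ch in name.translate(DOT_LIKE):
        cp = ord(ch)
        # Full-width ASCII (FF01..FF5E) sits 0xFEE0 above plain ASCII.
        out.append(chr(cp - 0xFEE0) if 0xFF01 <= cp <= 0xFF5E else ch)
    return unicodedata.normalize("NFC", "".join(out))


# Two Kanji, full-width "ABC", ideographic full stop, full-width "jp".
typed = "\u65e5\u672c\uff21\uff22\uff23\u3002\uff4a\uff50"
folded_name = fold_for_input(typed)
```

The ideographic full stop needs the explicit table entry because it
has no compatibility mapping of its own, so NFKC leaves it unchanged.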


><narrow>
>   FF61..FF64 Half-width punctuation          [PUNCT-SYMBOL]
>   FF65..FF9F Half-width Katakana             [EXCEPTION]
>   FFE8..FFEE Half-width symbols              [PUNCT-SYMBOL]
>
>   The same comments apply to half-width Katakana as to the <wide>
>   mappings above.

I have to disagree. Half-width Katakana is much less of a problem.
Some systems (e.g. Unix, Japanese Email) don't handle half-width
Katakana. And they look ugly and clearly 'not right' to the average
Japanese. To input them, you have to switch to a special mode.
(I wrote a longer piece about full-width Latin and half-width
Katakana a while ago for the nameprep design team; I'll
post it separately).
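
As a small sketch of the two options for half-width Katakana (fold
them, or detect and reject them), assuming Python's unicodedata
module; the helper name has_halfwidth_katakana is hypothetical and
the range is the one quoted above.

```python
import unicodedata

# HALFWIDTH KATAKANA LETTER KA + HALFWIDTH KATAKANA VOICED SOUND MARK.
halfwidth_ga = "\uff76\uff9e"

# Option 1: fold with NFKC.
folded_ga = unicodedata.normalize("NFKC", halfwidth_ga)


def has_halfwidth_katakana(label: str) -> bool:
    """Option 2: detect FF65..FF9F so the label can be rejected."""
    return any(0xFF65 <= ord(ch) <= 0xFF9F for ch in label)
```

Folding turns the two half-width characters into the single
full-width letter GA; the detection helper would let a prohibition
step reject them instead.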



><compat>, <narrow> Hangul Compatibility Jamo
>   3131..318E Full-width Compatibility Jamo
>   FFA0..FFDC Half-width Compatibility Jamo
>
>   The normal set of Jamo encoded at 1100..11FF is conjoining, that
>   is, sequences of Jamo are displayed as, and are NFC-equivalent
>   to, the corresponding syllables. The Compatibility Jamo (both
>   full-width and half-width) are non-conjoining, i.e. they each
>   take up a character cell; that is the only reason why they were
>   encoded separately.
>
>   So, the effect of using NFKC is that a domain name could be
>   displayed with Jamo in separate character cells, but would
>   actually be equivalent to the corresponding name displayed as
>   syllables. I can't see any reason why that would be desirable.
>
>   Also, section 10.4 of [Unicode3.0] says, "These characters are
>   provided solely for compatibility with the KS C 5601 standard."
>   [LEGACY, NOT-EQUIV].

The half-width ones should be prohibited, like the half-width
Katakana. There may be some point in allowing the full-width ones,
but the discussion on this in the nameprep team wasn't conclusive.
Anyway, as already discussed in the nameprep team, not folding
them into the basic Jamo (in the U+1100 block) would address
a problem that Soobok mentioned again very recently.
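
The difference between the two kinds of Jamo under NFC and NFKC can
be checked with a short sketch like this one, which assumes Python's
unicodedata module (newer data than Unicode 3.1); the sample letters
are an arbitrary choice.

```python
import unicodedata

# HIEUH + A + final NIEUN, taken from the conjoining U+1100 block.
conjoining = "\u1112\u1161\u11ab"
# The same three letters as full-width Compatibility Jamo.
compat = "\u314e\u314f\u3134"

nfc_conjoining = unicodedata.normalize("NFC", conjoining)
nfc_compat = unicodedata.normalize("NFC", compat)
nfkc_compat = unicodedata.normalize("NFKC", compat)
```

NFC composes the conjoining sequence into a single syllable and
leaves the Compatibility Jamo untouched, while NFKC rewrites the
Compatibility Jamo into conjoining ones and then composes what it
can.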


><compat> Hangzhou numerals
>   3038 HANGZHOU NUMERAL TEN
>   3039 HANGZHOU NUMERAL TWENTY
>   303A HANGZHOU NUMERAL THIRTY
>
>   These map to the ideographs U+5341 meaning ten (or complete or
>   perfect), U+5344 meaning twenty, and U+5345 meaning thirty.
>   I suspect that input methods will normally produce those
>   ideographs, not the numeral characters (i.e. these characters
>   are [OBSCURE]) - can anyone confirm that?

I can't confirm, but I would strongly suspect that to be true.


><compat> Ideographic telegraph symbols for months, hours, and days
>   32C0..32CB
>   3358..3370
>   33E0..33FE
>
>   These map to a decimal ASCII number, followed by the ideograph
>   U+6708 (for months) or U+70B9 (for hours) or U+65E5 (for days).
>   Again, I suspect that input methods will produce those sequences
>   rather than the symbols. [OBSCURE].

Same comment as above.
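
For reference, the mappings for the Hangzhou numerals and the
telegraph symbols can be inspected with a sketch like the following,
assuming Python's unicodedata module; the sample characters and the
descriptions in the table are illustrative.

```python
import unicodedata

# One sample from each range discussed above.
SAMPLES = {
    "\u3038": "HANGZHOU NUMERAL TEN",
    "\u32c0": "telegraph symbol for January",
    "\u3358": "telegraph symbol for hour zero",
    "\u33e0": "telegraph symbol for day one",
}

# NFKC replaces each symbol; NFC leaves it as it is.
nfkc_forms = {src: unicodedata.normalize("NFKC", src) for src in SAMPLES}
nfc_forms = {src: unicodedata.normalize("NFC", src) for src in SAMPLES}
```

Under NFC these symbols stay distinct from the ideograph sequences,
which only matters if input methods actually produce them.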


><vertical>
>   Presentation forms of symbols for use in vertical (top-to-bottom)
>   layout. These are all in the Small Form Variants block (FE50..FE6B).
>   These mappings are not useful because:
>
>   - the corresponding left-to-right symbols are not normally
>     used in domain names (most of them are brackets).
>   - domain names are not normally laid out vertically (it would
>     be better to use a left-to-right footnote in most cases).
>
>   [NOT-USEFUL, PUNCT-SYMBOL]

I slightly disagree with your second point. The current domain
names already appear in vertical print (each letter is turned 90
degrees clockwise) in Japanese newspapers and magazines.

But this doesn't change your classification, because for
punctuation, the correct glyph should be chosen automatically
based on the writing direction. That's what the present-day
systems I know do.


><font>, <compat>, <initial>, <medial>, <final>, <isolated>
>Presentation forms and some ligatures:

>   Arabic:   0675..0678

I would be very careful with these.


>   Note that the word "ligature" is overloaded: some ligatures behave
>   like presentation forms (e.g. ff, fi, ffi, ij in Latin scripts),
>   while others (e.g. oe and ae) are part of the spelling of words,
>   such as "arch<ae>ology" (British English spelling). The argument
>   above does not apply to the second type of ligature, but those
>   don't have compatibility mappings.
>
>   An input method/keyboard driver should never generate a ligature
>   or presentation form, and lots of existing software would break
>   if it did. (In general, language-specific rules are necessary to
>   properly ligaturize text - e.g. in English the "ff" in "shelfful"
>   should not ligaturize because the two "f"s are in different
>   syllables.) Even if a user copies text containing presentation
>   ligatures from a word processor, they will be decomposed on the
>   clipboard, unless the word processor is completely broken in this
>   respect.

I wouldn't be completely sure about the last sentence. But anyway,
the DNS has worked for ages without mapping ff, fi, ffi, ..., and so
it's rather safe to assume that it will continue without these kinds
of mappings.
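
The two kinds of ligature can be told apart in the data. A minimal
sketch, assuming Python's unicodedata module (newer data than Unicode
3.1), with an illustrative sample word:

```python
import unicodedata

# "shelfful" spelled with U+FB00 LATIN SMALL LIGATURE FF.
word = "shel\ufb00ul"
nfc_word = unicodedata.normalize("NFC", word)
nfkc_word = unicodedata.normalize("NFKC", word)

# An empty string means "no decomposition mapping at all".
ae_mapping = unicodedata.decomposition("\u00e6")   # LATIN SMALL LETTER AE
oe_mapping = unicodedata.decomposition("\u0153")   # LATIN SMALL LIGATURE OE
```

NFC keeps the presentation ligature as typed and NFKC splits it,
while ae and oe have no mapping under either form.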


><compat> Combinations of spacing characters:
>   013F LATIN CAPITAL LETTER L WITH MIDDLE DOT
>   0140 LATIN SMALL LETTER L WITH MIDDLE DOT
>   0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
>   0E33 THAI CHARACTER SARA AM
>   0EB3 LAO VOWEL SIGN AM
>   1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING
>
>   These are combinations of characters that were encoded as a single
>   character in other standards. The only reason why they aren't
>   canonical equivalences, is that the decomposition is to two spacing
>   characters, rather than a spacing character and a combining mark.

I'm not sure. There are some canonical decompositions into
two spacing characters. The algorithm in UTR 15 is carefully
designed to put them back again.


>   These mappings could be treated as [EXCEPTION]s, although the
>   combined characters are rare enough that it probably isn't worth
>   the hassle to do that. I don't know whether they are produced by
>   keyboard drivers.

This probably should be checked on a one-by-one basis.
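
A sketch of such a per-character check, assuming Python's unicodedata
module (newer data than Unicode 3.1); the helper name describe is
hypothetical.

```python
import unicodedata

# The combined characters listed in the quoted table.
COMBINED = ["\u013f", "\u0140", "\u0149", "\u0e33", "\u0eb3", "\u1e9a"]


def describe(ch: str) -> str:
    """Return the mapping tag and the names of the target characters."""
    tag, *parts = unicodedata.decomposition(ch).split()
    targets = " + ".join(unicodedata.name(chr(int(p, 16))) for p in parts)
    return f"U+{ord(ch):04X} {tag} {targets}"


report = [describe(ch) for ch in COMBINED]
for line in report:
    print(line)
```

The report shows the target sequence for each character, which is the
information needed to decide case by case whether folding is wanted.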


>   Note that the following are canonical equivalents, so they should
>   not be disallowed (in order to satisfy the Unicode requirement of
>   treating canonical equivalents identically):
>     U+2126 OHM SIGN -> Omega
>     U+212A KELVIN SIGN -> K
>     U+212B ANGSTROM SIGN -> A with ring above

I don't understand this. Prohibition happens after normalization,
so there is no need to discuss whether they should be prohibited
or not; they will just be normalized away.
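
This is easy to check with a short sketch, assuming Python's
unicodedata module (newer data than Unicode 3.1); the table name
SIGNS is illustrative.

```python
import unicodedata

# Each sign paired with the character it is canonically equivalent to.
SIGNS = {
    "\u2126": "\u03a9",   # OHM SIGN -> GREEK CAPITAL LETTER OMEGA
    "\u212a": "K",        # KELVIN SIGN -> LATIN CAPITAL LETTER K
    "\u212b": "\u00c5",   # ANGSTROM SIGN -> A WITH RING ABOVE
}

# Plain NFC is enough; no compatibility mapping is involved.
nfc_results = {sign: unicodedata.normalize("NFC", sign) for sign in SIGNS}
```

Since NFC already replaces all three, a later prohibition step never
sees the sign characters themselves.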


><compat> Miscellaneous
>   U+017F LATIN SMALL LETTER LONG S
>
>   This is really a glyph variant of 's'. It is rarely used, so it
>   doesn't really matter if it is not mapped to 's'. [OBSCURE].

For some people, it might matter (Irish?).
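
For reference, a minimal sketch of how the long s behaves under the
two forms, assuming Python's unicodedata module; the sample word is
only an illustration.

```python
import unicodedata

# A word written with two U+017F LATIN SMALL LETTER LONG S.
long_s_word = "Wa\u017f\u017fer"

nfc_form = unicodedata.normalize("NFC", long_s_word)     # long s is kept
nfkc_form = unicodedata.normalize("NFKC", long_s_word)   # long s becomes s
```

With NFC alone, a name typed with long s would stay distinct from the
same name typed with an ordinary s.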


>David Hopwood <david.hopwood@zetnet.co.uk>


Thanks again for your thorough work!


Regards,   Martin.