Re: [idn] NFC vs NFKC
David,
trying to map your categories onto "what would happen at the DNS side".
I think Occam's razor on this problem is roughly:
- If there are 2 names that map to the same sequence of characters
under NFKC, but not under NFC, and we think they should be DIFFERENT,
NFKC is bad, and NFC should be used.
- If there are 2 names that map to the same characters under NFKC, but
not under NFC, and we think the standard should force them to be ALWAYS
THE SAME, NFKC is good.
Characters that do not produce valid domain names are Not A Problem.
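
To make the test above concrete, here is a minimal Python sketch. It relies only on the standard `unicodedata` module; the helper name `collides_only_under_nfkc` is made up for illustration and is not part of nameprep. It reports whether two names are folded together by NFKC but kept apart by NFC, which is exactly the case where the choice of normalization form matters.

```python
import unicodedata


def collides_only_under_nfkc(a: str, b: str) -> bool:
    """Return True if a and b are equal under NFKC but different under NFC."""
    same_nfc = unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
    same_nfkc = unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
    return same_nfkc and not same_nfc


if __name__ == "__main__":
    fullwidth = "\uff41\uff42\uff43"  # FULLWIDTH LATIN SMALL LETTER A, B, C
    print(collides_only_under_nfkc(fullwidth, "abc"))
    # Canonically equivalent spellings are already the same under NFC,
    # so they are not a case where NFC and NFKC disagree.
    print(collides_only_under_nfkc("e\u0301", "\u00e9"))
```

Any pair for which this returns True has to be judged by hand: either the two names should be DIFFERENT (argues for NFC) or ALWAYS THE SAME (argues for NFKC).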
--On Tuesday, October 23, 2001 08:17:15 +0100 David Hopwood
<david.hopwood@zetnet.co.uk> wrote:
> This post categorises and describes all of the compatibility
> mappings in Unicode 3.1. The intention is to show that these
> mappings are of little value for name preparation, because in
> almost all cases, one or more of the following applies:
>
> DISALLOWED: the source characters are already disallowed by
> nameprep-06, so removing the mapping has no effect.
Not A Problem.
> NOT-USEFUL: the source characters are not useful in domain names.
"Useful" is a value judgment, so we need to be careful.
Do you suggest outlawing this category?
> PUNCT-SYMBOL: the source characters are punctuation or symbols,
> similar to characters that are disallowed for ASCII.
Do you suggest outlawing this category?
> NOT-EQUIV: the source characters are not semantically equivalent
> to the target characters - i.e. folding them would be more
> confusing than not folding them (if, for the sake of argument,
> they were allowed).
This is the category that would argue strongly for NFC rather than NFKC.
> LEGACY: the source characters are *only* intended to be used for
> round-trip mappings from legacy charsets.
I do not understand what you want to do with those.
Do you suggest outlawing this category?
> DEPRECATED: the source characters are formally deprecated by
> Unicode 3.1.
Do you suggest outlawing this category?
> OBSCURE: the source characters are very difficult to type, or
> produce using an input method, so not folding them is unlikely
> to cause any practical difficulty to users.
Do you suggest outlawing this category?
> INTERNAL: the source characters are normally only intended for
> internal use within an application or rendering engine.
Do you suggest outlawing this category?
> INCONSISTENT: a category of mappings is arbitrary and inconsistent,
> with only some of the potential mappings in that category being
> defined as compatibility equivalences.
I don't understand what you want to do with those.
> There are a very small number of exceptions:
>
> EXCEPTION: the mapping could be quite useful, and there is no
> reason to exclude it on any of the grounds above.
This is the category that would argue in favour of NFKC and against NFC.
Below, I assume that you are proposing to make the following changes to
nameprep:
- Make NFC rather than NFKC the normalization function of nameprep
- Add NOT-USEFUL, PUNCT-SYMBOL, LEGACY, DEPRECATED, OBSCURE, INTERNAL
to the set of DISALLOWED characters
- Add the EXCEPTION rules to the nameprep ruleset
So the important categories are EXCEPTION (argue in favour of NFKC),
NOT-EQUIV (argue strongly in favour of NFC) and INCONSISTENT.
>
> I claim that there are few enough exceptions to show that NFKC
> is not the right mapping to use: *if* some of these [EXCEPTION]
> mappings are wanted, they should be handled as an additional step
> to NFC normalisation (similar to the mapped-out characters).
Here is the list I worry about.
I have followed your recommendation on everything you regard as OBSCURE or
NOT-USEFUL.
> 2011 NON-BREAKING HYPHEN [EXCEPTION]
> FF0D FULLWIDTH HYPHEN-MINUS [EXCEPTION]
> FF10..FF19 FULLWIDTH DIGIT ZERO..NINE [EXCEPTION]
> FF21..FF3A FULLWIDTH LATIN CAPITAL LETTER A..Z [EXCEPTION]
> FF41..FF5A FULLWIDTH LATIN SMALL LETTER A..Z [EXCEPTION]
> FF65..FF9F Half-width Katakana [EXCEPTION]
>
> <compat> Combinations of spacing characters:
> 013F LATIN CAPITAL LETTER L WITH MIDDLE DOT
> 0140 LATIN SMALL LETTER L WITH MIDDLE DOT
> 0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
> 0E33 THAI CHARACTER SARA AM
> 0EB3 LAO VOWEL SIGN AM
> 1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING
>
> These are combinations of characters that were encoded as a single
> character in other standards. The only reason why they aren't
> canonical equivalences, is that the decomposition is to two spacing
> characters, rather than a spacing character and a combining mark.
>
> These mappings could be treated as [EXCEPTION]s, although the
> combined characters are rare enough that it probably isn't worth
> the hassle to do that. I don't know whether they are produced by
> keyboard drivers.
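
The six characters quoted above can be inspected directly. The following is a rough Python sketch using the standard `unicodedata` module; the function name `compat_report` is invented for illustration. It lists each character's decomposition field from the Unicode database bundled with the Python runtime (which is newer than Unicode 3.1) and whether NFC leaves the character unchanged.

```python
import unicodedata

# Code points copied from the quoted list above.
CODE_POINTS = [0x013F, 0x0140, 0x0149, 0x0E33, 0x0EB3, 0x1E9A]


def compat_report(code_points):
    """Return (hex code point, name, decomposition, unchanged by NFC) tuples."""
    rows = []
    for cp in code_points:
        ch = chr(cp)
        rows.append((
            f"{cp:04X}",
            unicodedata.name(ch),
            unicodedata.decomposition(ch),  # e.g. "<compat> 004C 00B7"
            unicodedata.normalize("NFC", ch) == ch,
        ))
    return rows


if __name__ == "__main__":
    for code, name, decomposition, nfc_stable in compat_report(CODE_POINTS):
        print(code, name, "|", decomposition, "| unchanged by NFC:", nfc_stable)
```

Each decomposition carries the `<compat>` tag, so NFC leaves these characters alone while NFKC replaces them with the two-character sequence.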
The INCONSISTENT label is used with the Radicals block and the Rupee sign,
which is also labelled "NOT-USEFUL". Probably right.
*ALL* of the NOT-EQUIV cases are also labelled as NOT-USEFUL, OBSCURE or
another marker I have interpreted as "don't want to use these".
So far, if I have interpreted your missive correctly, you argue for:
- Outlawing a lot of characters (I wouldn't mind that :-)
- Adding the rules from NFKC for the remaining problematic characters
to Nameprep
Either I am missing something, or there isn't a single domain name that
would be legal under your proposed change to Nameprep where you could tell
from the output of Nameprep whether NFKC or NFC was applied.
Did I understand you correctly?
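
One way to check my reading is sketched below in Python. Everything in it is an assumption for illustration: the `EXCEPTION_RANGES` table is copied from the [EXCEPTION] entries quoted in this message, `nfc_plus_exceptions` is a hypothetical stand-in for "NFC plus the EXCEPTION rules" that folds only those characters, and it uses the Unicode data shipped with the Python runtime rather than Unicode 3.1. It does not implement the DISALLOWED step, so names containing outlawed characters must be filtered out separately.

```python
import unicodedata

# Assumed EXCEPTION set, taken from the ranges quoted in this message.
EXCEPTION_RANGES = [
    (0x2011, 0x2011),  # NON-BREAKING HYPHEN
    (0xFF0D, 0xFF0D),  # FULLWIDTH HYPHEN-MINUS
    (0xFF10, 0xFF19),  # FULLWIDTH DIGIT ZERO..NINE
    (0xFF21, 0xFF3A),  # FULLWIDTH LATIN CAPITAL LETTER A..Z
    (0xFF41, 0xFF5A),  # FULLWIDTH LATIN SMALL LETTER A..Z
    (0xFF65, 0xFF9F),  # Half-width Katakana
]


def is_exception(ch: str) -> bool:
    """True if ch falls in one of the assumed EXCEPTION ranges."""
    return any(lo <= ord(ch) <= hi for lo, hi in EXCEPTION_RANGES)


def nfc_plus_exceptions(name: str) -> str:
    """Fold only the EXCEPTION characters via their NFKC mapping, then apply NFC."""
    folded = "".join(
        unicodedata.normalize("NFKC", ch) if is_exception(ch) else ch
        for ch in name
    )
    return unicodedata.normalize("NFC", folded)


def distinguishable(name: str) -> bool:
    """True if the output reveals whether NFKC or NFC-plus-exceptions was applied."""
    return nfc_plus_exceptions(name) != unicodedata.normalize("NFKC", name)


if __name__ == "__main__":
    print(distinguishable("\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45"))  # fullwidth "example"
    print(distinguishable("\uff76\uff9e"))  # half-width KA + voiced sound mark
    print(distinguishable("\ufb01sh"))      # LATIN SMALL LIGATURE FI, not an EXCEPTION
```

If my reading is right, `distinguishable` should return True only for names containing characters that the proposal would outlaw anyway, such as the ligature in the last example.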
Harald
Nit (mandatory lingua-political hobby-horse :-)
> Note that the word "ligature" is overloaded: some ligatures behave
> like presentation forms (e.g. ff, fi, ffi, ij in Latin scripts),
> while others (e.g. oe and ae) are part of the spelling of words,
> such as "arch<ae>ology" (British English spelling). The argument
> above does not apply to the second type of ligature, but those
> don't have compatibility mappings.
As a Scandinavian, I have to point out that æ (ae) is not a ligature, it is
a letter. Others seemingly (ab)use it as a ligature, but it isn't :-)