
Re: [idn] NFC vs NFKC



Hello Harald,

I can't speak for David, but some answers below anyway.

At 10:09 01/10/24 +0200, Harald Alvestrand wrote:
>David,
>trying to map your categories onto "what would happen at the DNS side".
>I think Occam's razor on this problem is roughly:
>
>- If there are 2 names that map to the same sequence of characters
>  under NFKC, but not under NFC, and we think they should be DIFFERENT,
>  NFKC is bad, and NFC should be used.
>- If there are 2 names that map to the same characters under NFKC, but
>  not under NFC, and we think the standard should force them to be ALWAYS THE
>  SAME, NFKC is good.
>
>Characters that do not produce valid domain names are Not A Problem.
>
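[The distinction you describe can be illustrated with a small sketch, using
Python's unicodedata as a stand-in for a normalization library. The ligature
example below is illustrative only; such characters would be prohibited in
domain names anyway.]

```python
import unicodedata

a = "\uFB01"   # LATIN SMALL LIGATURE FI, a compatibility character
b = "fi"       # plain "f" + "i"

# Under NFC the two strings stay distinct; under NFKC they collapse
# into the same sequence of characters.
print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))   # False
print(unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)) # True
```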
>--On Tuesday, October 23, 2001 08:17:15 +0100 David Hopwood 
><david.hopwood@zetnet.co.uk> wrote:

>>NOT-USEFUL: the source characters are not useful in domain names.
>
>"Useful" is a value judgment, so we need to be careful.
>Do you suggest outlawing this category?

Yes.

>>PUNCT-SYMBOL: the source characters are punctuation or symbols,
>>    similar to characters that are disallowed for ASCII.
>
>Do you suggest outlawing this category?

Yes.

>>NOT-EQUIV: the source characters are not semantically equivalent
>>    to the target characters - i.e. folding them would be more
>>    confusing than not folding them (if, for the sake of argument,
>>    they were allowed).
>
>This is the category that would argue strongly for NFC rather than NFKC.

Yes indeed.


>>LEGACY: the source characters are *only* intended to be used for
>>    round-trip mappings from legacy charsets.
>
>I do not understand what you want to do with those.
>Do you suggest outlawing this category?

Yes. IDN is a new system, with the goal of being able to
represent labels in a wide range of scripts. There is no
need to allow artefacts of legacy character encodings,
because IDN isn't designed to map existing names back
and forth between itself and other naming systems.
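[As a concrete example of such a legacy artefact (Python's unicodedata used
for illustration): the Roman numeral characters exist mainly for round-tripping
East Asian legacy charsets, and only NFKC folds them away.]

```python
import unicodedata

# U+2160 ROMAN NUMERAL ONE: NFKC folds it to the letter "I",
# while NFC leaves it untouched.
print(unicodedata.normalize("NFKC", "\u2160"))          # 'I'
print(unicodedata.normalize("NFC", "\u2160") == "\u2160")  # True
```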


>>DEPRECATED: the source characters are formally deprecated by
>>    Unicode 3.1.
>
>Do you suggest outlawing this category?

Yes.


>>OBSCURE: the source characters are very difficult to type, or
>>    produce using an input method, so not folding them is unlikely
>>    to cause any practical difficulty to users.
>
>Do you suggest outlawing this category?

Yes indeed. The argument for using NFKC that I have heard
most often is that we want to avoid the situation where
somebody types in a domain name from something s/he has
seen on a billboard,..., but the name is not found because
compatibility equivalents were input instead of the actual
characters. This category, and many of the others, shows
that (with the exception of [EXCEPTION], of course) there
is really no such problem.


>>INTERNAL: the source characters are normally only intended for
>>    internal use within an application or rendering engine.
>
>Do you suggest outlawing this category?

Yes.


>>INCONSISTENT: a category of mappings is arbitrary and inconsistent,
>>    with only some of the potential mappings in that category being
>>    defined as compatibility equivalences.
>
>I don't understand what you want to do with those.

Outlawing them is the best way to go.

Please note that most characters and character classes in David's
analysis carry more than one label, and that in many if not most
such cases, all of the labels suggest outlawing.


>>There are a very small number of exceptions:
>>
>>EXCEPTION: the mapping could be quite useful, and there is no
>>    reason to exclude it on any of the grounds above.
>
>This is the category that would argue in favour of NFKC and against NFC.

Yes, or in favor of including it in the IDN-specific mapping
before the normalization step, or in delegating this to some
earlier process in the same way we have delegated the
ideographic fullstop to preprocessing.
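[A minimal sketch of what such an IDN-specific pre-normalization mapping
could look like, in Python; the table entries below are illustrative
candidates, not normative nameprep data.]

```python
import unicodedata

# Hypothetical mapping table applied *before* NFC, handling the few
# [EXCEPTION] characters instead of relying on NFKC for everything.
PRE_MAP = {
    "\u3002": ".",   # IDEOGRAPHIC FULL STOP, already delegated to preprocessing
    "\uFF0D": "-",   # FULLWIDTH HYPHEN-MINUS, an [EXCEPTION] candidate
}

def prep(label: str) -> str:
    mapped = "".join(PRE_MAP.get(ch, ch) for ch in label)
    return unicodedata.normalize("NFC", mapped.lower())

print(prep("example\uFF0Dname"))  # 'example-name'
```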


>Below, I assume that you are proposing to do the following changes to
>nameprep:
>
>- Make NFC rather than NFKC the normalization function of nameprep

Yes.

>- Add NOT-USEFUL, PUNCT-SYMBOL, LEGACY, DEPRECATED, OBSCURE, INTERNAL
>to the set of DISALLOWED characters

Yes.

>- Add the EXCEPTION rules to the nameprep ruleset

Yes, add the relevant mappings to the mapping stage before
normalization.


>So the important categories are EXCEPTION (argue in favour of NFKC), 
>NOT-EQUIV (argue strongly in favour of NFC) and INCONSISTENT.

Yes. But the other categories are also important. They show
that NFKC folds away a lot of dubious, very rarely used, or
otherwise legacy characters. This means that NFKC does a lot
of work that isn't needed at all.

Changing from NFKC to NFC will simplify things, in particular
for standalone implementations that don't have access to
normalization data. The number of normalization mappings is
very significantly reduced. The number of pre-normalization
mappings is just slightly increased. The 'prohibited' data
is exactly the same.
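[A rough way to see the difference in table size, counting how many BMP code
points each form changes; this uses Python's unicodedata as a stand-in for
comparing the data a standalone implementation would have to carry.]

```python
import unicodedata

def changed(form: str) -> int:
    """Count BMP code points that the given normalization form alters."""
    count = 0
    for cp in range(0x10000):
        if 0xD800 <= cp <= 0xDFFF:  # skip surrogate code points
            continue
        s = chr(cp)
        if unicodedata.normalize(form, s) != s:
            count += 1
    return count

nfc, nfkc = changed("NFC"), changed("NFKC")
print(nfc, nfkc)  # NFKC changes far more code points than NFC
```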


>>I claim that there are few enough exceptions to show that NFKC
>>is not the right mapping to use: *if* some of these [EXCEPTION]
>>mappings are wanted, they should be handled as an additional step
>>to NFC normalisation (similar to the mapped-out characters).
>
>Here is the list I worry about.
>I have followed your recommendation on everything you regard as OBSCURE or 
>NOT-USEFUL.
>
>>    2011 NON-BREAKING HYPHEN                  [EXCEPTION]

This is open for discussion. We could go either way.


>>  FF0D       FULLWIDTH HYPHEN-MINUS                [EXCEPTION]
>>  FF10..FF19 FULLWIDTH DIGIT ZERO..NINE            [EXCEPTION]
>>  FF21..FF3A FULLWIDTH LATIN CAPITAL LETTER A..Z   [EXCEPTION]

These have to be dealt with in case mapping anyway.
We can just map them directly to halfwidth lowercase.
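[For illustration, with Python's unicodedata: NFKC already folds the
fullwidth letters, and case mapping then produces the halfwidth lowercase
form, so a direct mapping in the pre-normalization stage covers them.]

```python
import unicodedata

fw_a = "\uFF21"  # FULLWIDTH LATIN CAPITAL LETTER A

# NFKC folds it to "A"; lowercasing then gives halfwidth "a".
print(unicodedata.normalize("NFKC", fw_a).lower())  # 'a'

# NFC alone does nothing, so the mapping stage would have to
# fold fullwidth forms directly.
print(unicodedata.normalize("NFC", fw_a) == fw_a)   # True
```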


>>  FF41..FF5A FULLWIDTH LATIN SMALL LETTER A..Z     [EXCEPTION]

Yes.


>>  FF65..FF9F Half-width Katakana             [EXCEPTION]

As I have explained in two earlier postings, these are not
something we have to worry about. Just excluding them should
be fine.
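[To make the behavior concrete (Python's unicodedata for illustration):
NFKC maps half-width katakana to the ordinary full-width forms, while NFC
leaves them untouched, so simply prohibiting the half-width block removes
the issue without any folding.]

```python
import unicodedata

hw_ka = "\uFF76"  # HALFWIDTH KATAKANA LETTER KA

print(unicodedata.normalize("NFKC", hw_ka))          # 'カ' (U+30AB)
print(unicodedata.normalize("NFC", hw_ka) == hw_ka)  # True (unchanged)
```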


>><compat> Combinations of spacing characters:
>>  013F LATIN CAPITAL LETTER L WITH MIDDLE DOT
>>  0140 LATIN SMALL LETTER L WITH MIDDLE DOT
>>  0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
>>  0E33 THAI CHARACTER SARA AM
>>  0EB3 LAO VOWEL SIGN AM
>>  1E9A LATIN SMALL LETTER A WITH RIGHT HALF RING
>>
>>  These are combinations of characters that were encoded as a single
>>  character in other standards. The only reason why they aren't
>>  canonical equivalences, is that the decomposition is to two spacing
>>  characters, rather than a spacing character and a combining mark.
>>
>>  These mappings could be treated as [EXCEPTION]s, although the
>>  combined characters are rare enough that it probably isn't worth
>>  the hassle to do that. I don't know whether they are produced by
>>  keyboard drivers.
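[This can be seen directly (sketch using Python's unicodedata): U+0149
decomposes under NFKC to two *spacing* characters, which is exactly why the
mapping is only a compatibility equivalence rather than a canonical one.]

```python
import unicodedata

n_apos = "\u0149"  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE

# NFKC yields U+02BC MODIFIER LETTER APOSTROPHE + "n", both spacing characters.
print([hex(ord(c)) for c in unicodedata.normalize("NFKC", n_apos)])  # ['0x2bc', '0x6e']

# NFC leaves the character as-is, since the decomposition is not canonical.
print(unicodedata.normalize("NFC", n_apos) == n_apos)  # True
```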
>
>The INCONSISTENT label is used with the Radicals block and the Rupee sign, 
>which is also labelled "NOT-USEFUL". Probably right.
>
>*ALL* of the NOT-EQUIV cases are also labelled as NOT-USEFUL, OBSCURE or 
>another marker I have interpreted as "don't want to use these".
>
>So far, if I have interpreted your missive correctly, you argue for:
>
>- Outlawing a lot of characters (I wouldn't mind that :-)
>- Adding the rules from NFKC for the remaining problematic characters
>  to Nameprep

Yes.


>Either I am missing something, or there isn't a single domain name that
>would be legal under your proposed change to Nameprep where you could tell 
>from the output of Nameprep whether NFKC or NFC was applied.
>
>Did I understand you correctly?

Yes. The set of legal domain labels would stay exactly the same.

Regards,   Martin.