
Re: [idn] NFC vs NFKC



At 08:20 01/10/23 -0700, Mark Davis wrote:
(answering David Hopwood)

>I disagree with your assessment. For example, I believe that it is a good
>thing that the non-breaking feature is suppressed -- that is irrelevant to
>IDNs.

This is a detail that can go one way or another.
The fact that the current DNS has survived a few years
with a hyphen but without the non-breaking hyphen being
accepted suggests that not including it won't hurt too much.
But including it won't hurt too much, either.


>However, it would make your paper much easier to examine if you removed all
>the characters that end up getting disallowed -- they are not
>counter-examples.

I think they are very important. In terms of numbers, there
are I think 3165 compatibility mappings in
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt.
According to David's analysis, which closely matches
my understanding, something on the order of 50 of these
really need to be mapped. This shows that using NFKC is
serious overkill.

Also, there are 1866 NFC (canonical) mappings. Reducing the total
number of mappings from 5031 to 1866 is a saving of over 60%.
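To make these numbers concrete, here is a small Python sketch that counts the two kinds of decomposition mappings using the standard unicodedata module. In UnicodeData.txt, a tagged decomposition (e.g. '<compat> 0066 0069') is a compatibility mapping applied only by NFKC/NFKD, while an untagged one is canonical and applied by NFC/NFD as well. The exact totals will differ from the figures above depending on which Unicode version your Python ships; the 3165/1866 figures were taken from the then-current data file.

```python
import unicodedata

def count_decompositions(limit=0x10000):
    """Count BMP characters carrying a decomposition mapping,
    split into canonical (NFC-relevant) and compatibility
    (NFKC-only) mappings."""
    canonical = compat = 0
    for cp in range(limit):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        d = unicodedata.decomposition(chr(cp))
        if not d:
            continue
        if d.startswith('<'):
            compat += 1     # tagged -> compatibility mapping
        else:
            canonical += 1  # untagged -> canonical mapping
    return canonical, compat

canonical, compat = count_decompositions()
print(canonical, compat)  # totals depend on the Unicode version
```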


>Of the characters that are left that you feel are
>problematic, there are three possibilities that the committee has to judge:
>
>A. they don't matter if they are converted

Yes, most of them will show up so rarely that it indeed
doesn't matter whether they get converted or prohibited.


>B. they do matter, but they can be remapped, or prohibited*
>C. they do matter, and enough to make us abandon NFKC
>
>The downside of abandoning NFKC is that we lose the equivalence between
>thousands of characters that do represent the same fundamental abstract
>characters, such as width variants, Arabic ligatures, etc.

Some of the width variants have been identified as exceptions.
How many of the Arabic ligatures will actually be entered by users
when they type, say, an Arabic domain name on a keyboard?
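For readers who want to see what these particular NFKC foldings look like, here is a minimal sketch with Python's unicodedata (the two code points chosen are just illustrative examples of a width variant and an Arabic presentation-form ligature):

```python
import unicodedata

# Fullwidth 'A' (U+FF21) is a width variant of plain 'A';
# NFKC folds it to the ordinary Latin letter.
assert unicodedata.normalize('NFKC', '\uFF21') == 'A'

# The isolated lam-alef ligature (U+FEFB) unfolds under NFKC
# into the two-letter sequence lam (U+0644) + alef (U+0627).
assert unicodedata.normalize('NFKC', '\uFEFB') == '\u0644\u0627'
```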

Also, please note that the concept of 'fundamental abstract
character' is not something the IDN WG has been really
concerned about. As David has shown, many of the compatibility
(i.e. NFKC) equivalences are disputable. On the other hand,
for many characters not covered by NFKC you would easily
find some people claiming that some of them represent the
same 'fundamental abstract character'. As an example, many
people on this list might argue that there should be a
SC/TC mapping because the simplified and the traditional
variants represent one and the same 'fundamental abstract
character'.
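The SC/TC point can be checked directly: simplified and traditional variants of the same word are entirely unrelated as far as NFKC is concerned, even though many users would consider them the same abstract character. A quick sketch (using 国/國, 'country', as the example pair):

```python
import unicodedata

simplified, traditional = '\u56FD', '\u570B'  # 国 / 國 ('country')

# Neither character has any decomposition, so NFKC leaves
# both unchanged and they remain distinct.
assert unicodedata.normalize('NFKC', traditional) == traditional
assert unicodedata.normalize('NFKC', simplified) != traditional
```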


What this WG has been concerned about is to avoid excluding
variants that could easily be input by the user in place
of some equivalent character (to distinguish this from
a lookalike that represents something completely different).

David's analysis, which coincides with my findings, has shown
that about 50 out of about 3000 compatibility (i.e. NFKC)
equivalents are relevant for user input. Nobody has claimed
anything to the contrary, or brought up any evidence to the
contrary. If anybody has some such evidence, please send it in.


>Note: a character can be deleted by the remapping phase.

Which is not a good idea except for characters whose deletion
doesn't make much of a difference (e.g. Arabic tatweel,
non-spacing marks, ...). If somebody typed in
fooBARbaz and ended up with foobaz, because B, A, and R
were mapped out, that would really be quite bad.
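Both the harmless case and the (deliberately absurd) harmful case can be sketched with str.translate, which deletes characters mapped to None:

```python
# Deleting Arabic tatweel (U+0640) during remapping is harmless:
# it only stretches the joining line between letters.
deletions = {0x0640: None}
assert 'foo\u0640bar'.translate(deletions) == 'foobar'

# A table that deleted real letters, by contrast, would silently
# collapse distinct labels into the same one.
bad = {ord('B'): None, ord('A'): None, ord('R'): None}
assert 'fooBARbaz'.translate(bad) == 'foobaz'
```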


>It can also be
>effectively prohibited *before* NFKC by simply mapping it to an invalid
>character, like space, that is not affected by normalization and ends up
>being prohibited.

Yes. If there were just *a few* characters that we wanted to prohibit
before doing NFKC, while we want to keep most of the NFKC mappings,
then that would be a reasonable idea. But as it turns out, we want
to ignore/prohibit most of the characters mapped by NFKC, while
only mapping very few of them.
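The mechanism Mark describes is easy to sketch. The pre-mapping target and prohibited set below are purely illustrative (a hypothetical policy that rejects the fi ligature U+FB01 outright rather than letting NFKC fold it to 'fi'); the real tables would come from the nameprep profile:

```python
import unicodedata

# Illustrative subset of the prohibited set; space survives NFKC unchanged.
PROHIBITED = {' '}

# Hypothetical premap: send U+FB01 to space so it fails the
# prohibition check instead of being folded to 'fi' by NFKC.
PREMAP = {'\uFB01': ' '}

def prepare(label: str) -> str:
    label = ''.join(PREMAP.get(ch, ch) for ch in label)
    label = unicodedata.normalize('NFKC', label)
    if any(ch in PROHIBITED for ch in label):
        raise ValueError('prohibited character in label')
    return label

assert prepare('example') == 'example'
# Without the premap, NFKC alone would have quietly produced 'finance':
assert unicodedata.normalize('NFKC', '\uFB01nance') == 'finance'
```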


>BTW I have very little hope that this committee will ever produce a result
>if issues keep getting re-raised time and time and time again.

How many times has this issue (NFC vs NFKC) been raised on this list?
How much background material was provided in these discussions?


>For the
>committee to ever reach some kind of resolution, people have to ask
>themselves, *not* whether they think the current situation is absolutely
>optimal (in their view) -- since it *never* will be -- but instead, whether
>they can live with the result; whether there are really any results that
>will cause significant problems in practice.

This is the first time anybody has done such a careful analysis of
NFC vs. NFKC. The choice between NFC and NFKC is rather fundamental,
not least because IDN will most probably also serve as an example
for other, similar problems. The conclusions from this analysis are
quite clear, at least to me.

I agree that asking whether the solution is absolutely optimal
in somebody's personal view is not a question that leads to consensus.
But whether some change is an overall improvement that helps many
while not causing problems for anybody is a very relevant question.


Regards,    Martin.