
Re: [idn] Thoughts on nameprep



First, a comment on the nameprep spec.  The main steps are:

    map, normalize, prohibit

If the steps were rearranged to:

    normalize, fold, normalize, delete, prohibit

then the nameprep spec would get quite a bit smaller.  All the "Case
map" entries would disappear, because it's enough to reference UTR21 for
the fold step, and all the "Additional folding" entries would disappear,
because they are taken care of by normalizing both before and after
folding.  The spec would need to list only the mapped-out, prohibited,
and unassigned sets.

Please note that I am not proposing any change to the algorithm itself!
I'm merely suggesting a more concise way to define it.
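
For illustration, here is a rough Python sketch of the rearranged
sequence, using the standard unicodedata module.  The two sets below
hold just a couple of example entries; the real contents would come
from the spec's mapped-out and prohibited tables:

    import unicodedata

    MAPPED_OUT = {'\u00ad', '\u200b'}   # e.g. soft hyphen, zero width space
    PROHIBITED = {' ', '\x7f'}          # e.g. space, DEL

    def nameprep_sketch(label):
        s = unicodedata.normalize('NFKC', label)            # normalize
        s = s.casefold()                                    # fold (UTR #21)
        s = unicodedata.normalize('NFKC', s)                # normalize again
        s = ''.join(c for c in s if c not in MAPPED_OUT)    # delete
        if any(c in PROHIBITED for c in s):                 # prohibit
            raise ValueError('prohibited character in label')
        return s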

Of course implementations would be free to transform the sequence of
tables into fewer tables, as long as they ultimately produce the same
output given the same input.  I bet in most cases, though, cleverness
would be more trouble than it's worth.
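
For the steps that really are character-by-character maps, collapsing
two tables into one is just function composition.  A sketch (the
tables here are hypothetical, not the spec's):

    def compose_tables(first, second):
        # Build one table equivalent to applying `first`, then `second`.
        composed = dict(second)   # characters touched only by `second`
        for char, out in first.items():
            # run `first`'s output through `second`
            composed[char] = ''.join(second.get(c, c) for c in out)
        return composed

Normalization is not a character-by-character map, though, so it
cannot be folded in this simply.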

Now to continue the other discussion...

Bruce Thomson <bthomson@fm-net.ne.jp> wrote:

> Isn't e-acute-accent just a ligature?

No, it's a letter with a diacritic.  A ligature is two or more letters
that are connected for purely typographic reasons, while remaining
logically separate letters, like your fi, ff, ffi examples.

But for the purposes of this discussion the distinction between
ligatures and letters with diacritics is not relevant.  In both cases
you have a choice between the precomposed representation and the
decomposed representation.
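
Unicode itself records the distinction: a precomposed letter with a
diacritic has a canonical decomposition, while a ligature has a
compatibility decomposition, tagged <compat>.  In Python, for
instance:

    import unicodedata

    # e-acute: canonically equivalent to e + combining acute accent
    print(unicodedata.decomposition('\u00e9'))    # '0065 0301'

    # fi ligature: logically still the two letters f and i
    print(unicodedata.decomposition('\ufb01'))    # '<compat> 0066 0069'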

> If we were to just make ligatures illegal in domain names, would the
> screams be all that loud?

I don't see how that would help.  Depending on how the user types, and
depending on the operating system layers between the keyboard and the
application, the application might read e followed by acute-accent, or
might read a single e-with-acute-accent character.  The application
cannot control what the OS does, and should not try to dictate how
people type a particular letter if there are multiple ways (including
cut-and-paste), so it must be prepared to read either representation
and convert it to a canonical one.
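
That conversion is a single normalization call.  A minimal sketch,
assuming NFC as the canonical form:

    import unicodedata

    precomposed = 'caf\u00e9'      # e-acute as one code point
    decomposed  = 'cafe\u0301'     # e + combining acute accent
    assert precomposed != decomposed

    # After normalizing, the two inputs are indistinguishable.
    assert (unicodedata.normalize('NFC', precomposed)
            == unicodedata.normalize('NFC', decomposed))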

> I wonder whether these characters aren't causing a variety of other
> problems already, because you can't look at them and tell how they
> were typed.

Why should anyone care how they were typed?

> If we are going to allow Unicode in text files in general, variables
> in C programs with accents would be confusing, because you could
> have two seemingly identical variable names that are in fact
> different. Also with user names, passwords, etc. Why do we have to
> solve this problem for idns alone?

We don't, because the Unicode Consortium has already solved it for us,
by defining normalization and case-folding.  :)
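
Both operations are available off the shelf.  In Python, for example,
full case folding is a built-in string method, and it is deliberately
more aggressive than lowercasing:

    assert '\u00df'.lower() == '\u00df'    # lowercasing sharp s changes nothing
    assert '\u00df'.casefold() == 'ss'     # full case folding maps it to 'ss'

    # Capital, small, and final sigma all fold to the same letter.
    assert '\u03a3'.casefold() == '\u03c2'.casefold() == '\u03c3'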

> People expect case to matter.

In domain names?  No, they don't, because today case doesn't matter
in domain names.  It's not uncommon to see domain names in uppercase
or mixed case for stylistic reasons.  If someone tries to use an IDN
containing some uppercase characters and gets a host unknown error even
though the lowercase version exists, they're going to be surprised and
angry, and rightfully so.

> As an app writer, if I can be sure that I just map using one table and
> forbid with a second one, I wouldn't worry too much.

If nameprep were implemented using a straightforward translation of the
spec, there would be one table for mapping, two tables for normalization
(one for compatible decomposition and one for canonical composition),
and one table for prohibition.  There might be a way to optimize that
down to fewer steps and fewer tables, if you're under tight constraints.
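
A minimal sketch of that straightforward translation, with the two
normalization tables hidden behind a library call (the map and
prohibit entries shown are illustrative, not the spec's full tables):

    import unicodedata

    MAP = {'\u00ad': '', '\u00df': 'ss'}   # case map + mapped-out entries
    PROHIBITED = {' ', '\x7f'}

    def nameprep_literal(label):
        s = ''.join(MAP.get(c, c) for c in label)   # step 1: map
        s = unicodedata.normalize('NFKC', s)        # step 2: normalize
        if any(c in PROHIBITED for c in s):         # step 3: prohibit
            raise ValueError('prohibited character in label')
        return s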

> But when I look at the nameprep mapping table there are LOTS of
> entries....  Japanese alone requires over 100 with the current spec,
> and these are unnecessary.

Do you mean the mapping table given in the nameprep spec, or the tables
needed to implement nameprep?  The mapping table that appears in the
nameprep spec doesn't include any Japanese characters.  The tables
needed for normalization do include Japanese characters, but those
tables are defined by Unicode and implicitly referenced by the nameprep
spec.

Maybe the normalization defined by Unicode is overkill for IDN purposes,
but it's already standardized, and off-the-shelf implementations are
likely to exist.
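
Python's unicodedata module is one such off-the-shelf implementation.
Presumably most of the 100-odd Japanese entries are the halfwidth and
fullwidth compatibility forms, which NFKC already handles:

    import unicodedata

    # Halfwidth katakana become fullwidth...
    assert unicodedata.normalize('NFKC', '\uff76') == '\u30ab'        # ｶ -> カ

    # ...and a halfwidth base plus a halfwidth sound mark recombine
    # into a single precomposed character.
    assert unicodedata.normalize('NFKC', '\uff8a\uff9f') == '\u30d1'  # ﾊﾟ -> パ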

AMC