
[idn] Back to work (Nameprep) (was: Re: Just send UTF-8 with nameprep (was: RE: [idn] Reality Check))



At 20:45 01/07/17 -0400, Keith Moore wrote:

>We can probably afford to have two ways of looking up IDNs.  We
>cannot afford a tug-of-war in this group that delays adoption of
>any solution.

Thanks, Keith!
This is an excellent proposal for a forward-looking compromise.


>What I'd really like to see us work on is the transcribability problem.
>This is a problem that all of the proposals have in common - there
>are still too many similar glyphs with different code points that are not
>folded by nameprep.  I see this as the biggest remaining problem that
>must be solved before we can standardize an ACE lookup scheme for IDNs.
>(even if we standardize an alternate one later that uses UTF-8 or some
>other encoding)

Very good point. Do you have any particular kind of similarities
in mind, or can you give some examples? Or do you have some kind
of principles or tests in mind that should be applied? Or any
particular kind of procedure that we should follow?

Also, there are some things that get folded in nameprep that should not be.
This is not so important if nameprep is mainly seen as the definitive
check for registrations. But if it is supposed to be applied strictly
on every occasion, in particular every time a name is resolved,
as Patrick, for example, describes it, these foldings may lead people
to believe that these characters are acceptable in a domain name in the
same way that an upper-case character is.

The characters that should be eliminated from folding are:

1) Characters that are very clearly visually distinct from the ones
    they are mapped to (for obvious reasons).
2) Non-ASCII characters that map entirely to ASCII characters
    (to make sure that current applications and applications doing
    nameprep behave the same way for ASCII-only names).

An example of a character for which both of the above apply is
U+2460 CIRCLED DIGIT ONE. This is a simple digit 'one' in a circle.
Obviously, everybody can immediately see that it's different from
just a '1'. Also, if it's mapped to '1' by nameprep applications,
these applications will behave differently from applications not using
nameprep (i.e. everything out there now).

There are quite a few of these, all identified with <circle> (or a
similar tag) in their Unicode compatibility decompositions. There are
also quite a few to which only 1) applies (circles with things other
than ASCII letters in them, ...) and a few to which only 2) applies
(e.g. the fi, ffi, ... ligatures).

I propose that all of them should be rejected, for the reasons above,
rather than folded.
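
To make the two criteria concrete, here is a rough sketch in Python of how
such characters could be found. It is only an illustration, under these
assumptions: nameprep's folding is approximated by NFKC normalization plus
lowercasing, criterion 1) is approximated by the <circle> compatibility tag
alone, and the function names are made up for this example and are not part
of any draft.

```python
import unicodedata


def compat_tag(ch):
    """Return the compatibility tag of ch (e.g. '<circle>'), or None."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split(" ", 1)[0]
    return None


def folds_to_ascii_only(ch):
    """Criterion 2: a non-ASCII character whose folded form is pure ASCII.

    Assumption: NFKC plus lowercasing stands in for nameprep's folding.
    """
    if ord(ch) < 128:
        return False
    folded = unicodedata.normalize("NFKC", ch).lower()
    return folded != "" and all(ord(c) < 128 for c in folded)


def should_reject(ch):
    """Reject (rather than fold) if either criterion applies.

    Assumption: '<circle>' is a crude proxy for "visually distinct"
    (criterion 1); a real list would need human review.
    """
    return compat_tag(ch) == "<circle>" or folds_to_ascii_only(ch)


if __name__ == "__main__":
    # U+2460 CIRCLED DIGIT ONE, U+3280 CIRCLED IDEOGRAPH ONE,
    # U+FB01 LATIN SMALL LIGATURE FI, plain 'a', U+00E9 e with acute
    for ch in ["\u2460", "\u3280", "\ufb01", "a", "\u00e9"]:
        print(f"U+{ord(ch):04X}", compat_tag(ch),
              folds_to_ascii_only(ch), should_reject(ch))
```

With these definitions, U+2460 matches both criteria, U+3280 CIRCLED
IDEOGRAPH ONE matches only 1), and the fi ligature U+FB01 matches only 2).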


Regards,   Martin.