
RE: [idn] Should we add U+FF0E FULLWIDTH FULL STOP to section 5.10 of Nameprep?



> However, mapping U+3002 to U+002E "before doing nameprep" isn't
> specific enough.  If it happens immediately before doing nameprep,
> then nameprep will fail.  It needs to happen before the domain name
> is split into labels.  Therefore I think the text should read:
> 
>     U+3002 is used as if it were U+002E in many input mechanisms,
>     particularly in Asia.  This prohibition allows input mechanisms to
>     safely map U+3002 to U+002E before splitting a domain name into
>     labels, without worrying about preventing users from accessing
>     legitimate host name parts.

Yes, that would be better. "Before Nameprep" is misleading, you're right.
Also, using "input mechanisms" twice is misleading. I think it should be
"applications" in the second case. It could read:

    U+3002 is generated instead of U+002E by many input mechanisms,
    particularly in Asia.  This prohibition allows applications to
    safely map U+3002 to U+002E before splitting a domain name into
    labels, without worrying about preventing users from accessing
    legitimate host name parts.  Implementors should also be aware
    that U+FF0E is also generated instead of U+002E by many input
    mechanisms, and applications may want to map it to U+002E for
    the same reason they map U+3002 to U+002E.

(I did actually write "is generated instead of U+002E by many input
mechanisms" instead of "is used as if it were U+002E in many input
mechanisms". I think that's clearer.)
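
As a minimal sketch of the ordering discussed above (map the dot-like
characters first, then split into labels), something like the following
Python could be used. The names DOT_LIKE and split_idn_labels are
hypothetical, made up for illustration; the per-label Nameprep step that
would follow is not shown.

```python
# Hypothetical sketch: map dot-like characters to U+002E *before* the
# domain name is split into labels. Names here are illustrative only.
DOT_LIKE = {
    0x3002: 0x002E,  # IDEOGRAPHIC FULL STOP -> FULL STOP
    0xFF0E: 0x002E,  # FULLWIDTH FULL STOP   -> FULL STOP
}

def split_idn_labels(name):
    """Map U+3002 and U+FF0E to U+002E, then split the name into labels."""
    return name.translate(DOT_LIKE).split(".")

# Each label would then be passed to Nameprep separately (not shown).
print(split_idn_labels("www\u3002example\uff0ecom"))
# prints ['www', 'example', 'com']
```

If the mapping were applied only after splitting, the whole string above
would be treated as a single label containing prohibited characters.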

> This recalls an earlier discussion about just how much processing
> should be done before a domain name is split into labels.  Should
> fullwidth-full-stop be mapped to full-stop before the splitting?
> Probably.  How about one-dot-leader (for which full-stop is a
> compatibility decomposition)?  Maybe.  Should digit-one-full-stop be
> replaced by its compatibility decomposition "1."?  Or square-CO by its
> compatibility decomposition "CO."?  Maybe not.
> 
> Should this be left up to the user interface designer, or should there
> be recommendations or requirements?

It's a big can of worms, yes. It's going to be interesting indeed to see (as
in see on your screen) some dots that are not label separators. Another case
of the glyph vs. character issue. Oh well.

Note that my rationale for asking for U+FF0E (as for U+3002) is not that
it decomposes to U+002E. It is that when a user *physically types the
name* on a keyboard, her "dot" key does not produce the ASCII dot (a
workaround is to use the numeric keypad on Windows; oh great: switch
keyboard parts!). Since the only key that is labelled "." does not produce
a "." in some cases, mapping it sounds like a very good idea. Also note
that applications can still decide not to do it. But who would do that to
their users? So I argue that we should at least acknowledge these physical
limitations to help implementers of applications.

The other cases you present are definitely interesting. They point to one of
the main interesting aspects of this work, which is that everything is done
at the level of a label, and never at the level of a full domain name... This
is interesting not only because of the examples you give but also because it
raises questions about what an application does, versus say a new UTF-16
gethostbyiname() [i for internationalized] API call: will gethostbyiname()
do these kinds of mappings (U+3002 etc. to U+002E) for the application's
convenience? If so, will it also pre-NFKD the whole thing, adding more
labels through decomposition? Hmm... The less we recommend or standardize
on this, the more chances there are for non-interoperable DNS resolver
interfaces (or other uses of DNS with IDNs).
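
To illustrate the "adding more labels through decomposition" concern, here
is a small sketch using Python's standard unicodedata module. The helper
name labels_after_nfkd is hypothetical; it only shows what would happen if
a resolver interface applied NFKD to the whole name before splitting, and
is not a proposal for what such an interface should do.

```python
import unicodedata

def labels_after_nfkd(name):
    """Apply NFKD to the whole name, then split on U+002E (illustration only)."""
    return unicodedata.normalize("NFKD", name).split(".")

samples = {
    "ONE DOT LEADER (U+2024)": "a\u2024b",
    "DIGIT ONE FULL STOP (U+2488)": "a\u2488b",
    "SQUARE CO (U+33C7)": "a\u33c7b",
}

for description, raw in samples.items():
    before = raw.split(".")          # no U+002E yet, so a single label
    after = labels_after_nfkd(raw)   # decomposition introduces U+002E
    print(description, len(before), "->", len(after), "labels")
```

In each sample the undecomposed string contains no U+002E and is a single
label, while the decomposed form contains one and splits into two.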

But first, we can address the simple thing: physical limitations of input
methods that prevent users from getting the expected behavior in an
internationalized context.

YA