
Re: [idn] Charter, refocus




Dan Oscarsson wrote:

> I wanted to make it clear that we should have one normalised format
> for all domain names (both host names, and all other types).
> This is to have a simple foundation for all software handling
> text strings. Normalisation may not do case folding.

> And the second should define how names are matched. That is how
> what rules define which names match as equal.
> For example: Abc.com and ABC.com match as equal.
> This is done on top of the (1) normalised strings, when you need
> to check if two domain names represent the same name.

Let's pursue this approach for a while and see where we end up. I see at
least two issues with this tactic:


First of all, not all domain names can be normalized.

STD13 allows octet codepoints to be specified, and for RRs like TXT or RP
or something similar, it is feasible to imagine that an RR exists with a
parent domain name that is using the eight-bit codepoints. These are
manageable via the \esc processing, but only if they exist as exceptions to
normalization. The codepoints MUST NOT be interpreted as character data
and normalized as such.
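
The \esc handling mentioned above can be sketched roughly as follows. This
is only an illustration of the STD13 master-file escapes (\DDD for a decimal
octet value, \X to quote a non-digit character); the function name
unescape_label is hypothetical, and the sketch assumes that unescaped text in
the label is plain ASCII.

```python
DIGITS = "0123456789"


def unescape_label(text: str) -> bytes:
    # Hypothetical helper: turn a master-file label into raw octets.
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            # Assumption: anything not escaped is plain ASCII.
            out += ch.encode("ascii")
            i += 1
            continue
        nxt = text[i + 1:i + 2]
        if nxt == "":
            raise ValueError("dangling backslash at end of label")
        if nxt in DIGITS:
            # \DDD: exactly three decimal digits naming one octet.
            ddd = text[i + 1:i + 4]
            if len(ddd) != 3 or any(c not in DIGITS for c in ddd):
                raise ValueError("\\DDD escape needs three decimal digits")
            value = int(ddd)
            if value > 255:
                raise ValueError("\\DDD escape out of octet range")
            out.append(value)
            i += 4
        else:
            # \X: the next character is taken literally.
            out += nxt.encode("ascii")
            i += 2
    return bytes(out)


# The escaped octets come through untouched; nothing is treated as text.
label_octets = unescape_label(r"caf\195\169")
```

The point of the sketch is that the result is a byte string, never a
character string, so there is nothing for a normalizer to operate on.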

A different version of the above is using non-normalized IDNs for TXT and
other RRs which do not represent IHNs explicitly, which SHOULD be allowed
to use any UCS character code (even unassigned characters). However, this
isn't something people can do today, so not letting them do it in the
future is at least a legitimate option and possibly even a valid tradeoff
(e.g., all RRs with IDN parents MUST use normalized IDNs; if you need
specific values, specify the octets with \esc). I do not like such
mandates (more precisely, I know that others will not like such mandates,
and I don't want to defend them).

Another example is email addresses, which some future rev of 2822 MAY
allow to be non-normalized. In theory, a normalized version of email
addresses should be supported but who knows what will happen. It seems to
me that the proper way to handle this data-type is to defer ownership to
RFC 2822 and its successors, rather than attempting to treat it as DNS
data (it isn't).

In the cases above, an RR of TXT (or some other type with a "binary"
usage) is created with a parent IDN in non-normalized form, so that entry
SHOULD match in its non-normalized form, and the normalized form SHOULD
NOT match since they are different sequences. Forcing normalization on
these entries will be wrong.
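
To make the "different sequences" point concrete, here is a minimal sketch.
It assumes, purely for illustration, that the octets happen to be valid UTF-8
and that the normalization form is NFC; neither choice is settled in this
thread, and the helper name normalize_octets is hypothetical.

```python
import unicodedata


def normalize_octets(raw: bytes) -> bytes:
    # Assumption: interpret the octets as UTF-8 text and apply NFC.
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")


# 'e' followed by U+0301 COMBINING ACUTE ACCENT, as stored by the zone owner.
stored = b"cafe\xcc\x81"
forced = normalize_octets(stored)

# NFC composes the pair into U+00E9, so the octet sequence changes
# and a lookup for the stored octets no longer finds the entry.
changed = forced != stored
```

A server that normalized the owner name on load would therefore publish a
different name than the one the administrator entered.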


As for case-neutral comparisons, the biggest problem is with ACE, which
requires downcasing in order for the encoded representation to be
consistently matched in legacy systems. A query for Ex'AmPlE.com encoded
as zz-12345 will have a different representation than ex'ample.com at
zz-54321, which will be different from EX'AMPLE.com at zz-15243, etc. It
seems to me that downcasing is required and non-optional for delegation
entries and all other IHNs.
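
The effect can be reproduced with any ACE that encodes non-ASCII code points
into the ASCII string. The sketch below uses Python's built-in "punycode"
codec only as a stand-in, since no ACE has been chosen here; that choice, the
helper names ace and legacy_equal, and the use of str.lower() as the
downcasing step are all assumptions made for illustration.

```python
def ace(label: str) -> bytes:
    # Stand-in ACE: Python's built-in punycode codec (an assumption).
    return label.encode("punycode")


def legacy_equal(a: bytes, b: bytes) -> bool:
    # What a legacy resolver can do: ASCII-only case-insensitive comparison.
    return a.lower() == b.lower()


upper = "B\u00dcCHER"   # capital U with diaeresis
lower = "b\u00fccher"   # small u with diaeresis

# Without downcasing, the two encodings differ beyond ASCII case,
# so the legacy comparison fails.
match_without_downcasing = legacy_equal(ace(upper), ace(lower))

# With downcasing applied before encoding, both queries yield the same label.
match_with_downcasing = legacy_equal(ace(upper.lower()), ace(lower))
```

Here match_without_downcasing is False and match_with_downcasing is True: the
case difference in the non-ASCII character is baked into the encoded digits,
where an ASCII-only comparison cannot see it.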

For the "binary" IDNs described above, case-neutral comparison is forbidden
anyway, so we couldn't do it for those.

I just don't think that case-neutral comparison is an option, unless every
system decodes the values for comparison, which we know won't happen.
Legacy systems will not decode ACE.


What are your thoughts on the above? It seems that we cannot normalize
everything, and that we can't do case-neutral comparison on the IHNs that
are normalized. Am I missing something? I agree with the spirit of your
objectives, but I don't see any easy way to do it cleanly.

-- 
Eric A. Hall                                        http://www.ehsco.com/
Internet Core Protocols          http://www.oreilly.com/catalog/coreprot/