
Re: [idn] An experiment with UTF-8 domain names



Dan,

you're testing the wrong things.  we already know that the DNS protocol
and common DNS implementations will handle arbitrary 8bit character
strings, and those can be UTF-8 if we want them to.  we also know that 
the vast majority of applications cannot properly input or display
UTF-8 strings, and that some applications will either break or
improperly handle UTF-8 strings when they appear in domain names.
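
to make the first point concrete, here is a minimal sketch of how a name is laid out on the wire as length-prefixed labels (RFC 1035).  the function name and the sample name "é.example" are made up for illustration; nothing here is taken from any particular DNS implementation.

```python
def encode_dns_name(name: str, charset: str = "utf-8") -> bytes:
    """Encode a dotted name as RFC 1035 length-prefixed labels."""
    wire = bytearray()
    for label in name.rstrip(".").split("."):
        raw = label.encode(charset)
        # the limits are counted in octets, not in characters
        if not 1 <= len(raw) <= 63:
            raise ValueError(f"label must be 1..63 octets, got {len(raw)}")
        wire.append(len(raw))
        wire += raw
    wire.append(0)  # the zero-length root label terminates the name
    if len(wire) > 255:
        raise ValueError("encoded name exceeds 255 octets")
    return bytes(wire)


# hypothetical name with one non-ASCII label
wire = encode_dns_name("é.example")
print(wire)
```

the wire format only counts octets, so the 8-bit label goes through untouched.  nothing in it tells the receiver which charset those octets are in, which is why the hard part is in the applications rather than in the protocol.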

> This sent a message to postmaster@S.cr.yp.to. I read the message using
> the UTF-8 version of less, and it was displayed comprehensibly, with the
> To line shown as follows:
> 
> To: postmaster@"S".cr.yp.to
> 
> qmail-inject doesn't mind weird characters, such as control characters
> and 8-bit characters, in atoms. It converts the atoms to quoted strings.
> Of course, RFC 822 doesn't allow quoted strings, never mind 8-bit
> characters, in domain names, but these are easy protocol extensions.
...
> What's wrong with handling S this way? The answer seems to be that some
> other programs don't work. What are those programs? What exactly do they
> do wrong? How hard is it to fix them? Why should we believe that the
> other IDN proposals will require less effort?
> 
> Keith Moore writes:
> > all apps will have to be modified (many will require significant
> > modification) if they want to deal meaningfully with IDNs.
> 
> False. As demonstrated above, qmail and djbdns already work with UTF-8
> domain names. They're both widely deployed. Apparently Microsoft also
> has some clients and servers that work with UTF-8 domain names.

no, you've just demonstrated that qmail does not work with UTF-8 domain
names, because it generates an illegal message header.  other mail
parsers will choke on the message header that qmail generates because
it violates RFC822 syntax.  other programs may (in some sense) also 
treat the same input "reasonably" but will behave differently than
qmail under such conditions.  from experience, some of them will
generate RFC 2047 encoded-words (different ones in different ways) - 
which is also a violation of the specifications, but a different 
violation of the specifications.  replies to such messages will fail.
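
as a rough illustration of why a strict parser rejects that header, here is a sketch of the RFC 822 rule that a domain is a dot-separated sequence of atoms.  it is deliberately simplified (domain-literals such as [10.0.0.1] and comments are left out), and the function names are hypothetical, not taken from any real mail parser.

```python
# RFC 822 "specials": characters that may not appear inside an atom
SPECIALS = set('()<>@,;:\\".[]')


def is_atom(s: str) -> bool:
    """True if s is a non-empty run of printable ASCII with no specials."""
    return bool(s) and all(
        0x21 <= ord(c) <= 0x7E and c not in SPECIALS for c in s
    )


def is_plain_domain(domain: str) -> bool:
    """True if domain is atom *("." atom); domain-literals are not handled."""
    return all(is_atom(part) for part in domain.split("."))


print(is_plain_domain("cr.yp.to"))       # ordinary ASCII domain
print(is_plain_domain('"S".cr.yp.to'))   # quoted string in the domain
print(is_plain_domain("\u00e9.cr.yp.to"))  # 8-bit character in the domain
```

only the first name passes.  a parser built on this rule has no defined behavior for the other two, so different implementations are free to fail in different ways.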

> Changing those programs has a cost. What is the benefit?

interoperability.

Brian W. Spolarich writes:
> > Using 8-bit data will break some applications.
> > Using 7-bit data (presumably) will not.
> 
> False. If the user's S.cr.yp.to has to be encoded inside DNS and mail
> messages as ace-blah.cr.yp.to, then qmail will be faced with S.cr.yp.to
> in (e.g.) /var/qmail/control/virtualdomains, and ace-blah.cr.yp.to in
> SMTP. This simply won't work unless the software is changed.

that's a good point.  but until qmail is updated to accept UTF-8 
in virtualdomains and other places, users can at least get IDNs
to work by cutting-and-pasting the ACE form of the name.  presumably
we would want similar support for DNS zone files.

but it's even worse than that, because most UNIX users today are not
using UTF-8 as a local charset.  so they'll quite naturally edit the
virtualdomains file using iso-8859-1, or iso-2022-jp, or whatever is
the default for their environment.  either tools that read IDNs will
have to know how to translate from the local charset to UTF-8, or users
will have to learn to use different commands to edit files containing
IDNs than they use to edit "normal" (for them) files.  (I think the
former is more likely; users are loath to give up their familiar
editors.)
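
a minimal sketch of the translation step such a tool would need, assuming it somehow knows the charset the file was written in (which is the hard part).  the helper name and the sample names are invented for illustration.

```python
def to_utf8(raw: bytes, local_charset: str) -> bytes:
    """Re-encode octets from a known local charset as UTF-8."""
    return raw.decode(local_charset).encode("utf-8")


# the same (hypothetical) name as an iso-8859-1 editor would save it
latin1_line = "é.example".encode("iso-8859-1")
utf8_line = to_utf8(latin1_line, "iso-8859-1")
print(latin1_line, utf8_line)

# the untranslated octets are not even well-formed UTF-8
try:
    latin1_line.decode("utf-8")
    valid_as_utf8 = True
except UnicodeDecodeError:
    valid_as_utf8 = False
print(valid_as_utf8)

# the same helper handles a file saved as iso-2022-jp
jp_utf8 = to_utf8("日本.example".encode("iso2022_jp"), "iso2022_jp")
```

a tool that skips this step and passes the file's octets straight through will compare them against UTF-8 names from the network and never find a match.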

(the problem is slightly different on windows and mac platforms, but
exists nonetheless)

actually the problem is much the same in both cases, because in 
both cases the sysadmin has to type in a form of the domain name
which is not the same as is used natively - unless the local
charset just happens to be UTF-8.

> If, on the other hand, S.cr.yp.to is used as is, then the software will
> work fine.

no, it just breaks a different set of things.  and (at least in this case)
it moves the points at which the failures are noticed away from the systems 
that are controlled by the folks who are setting up the IDN, to remote 
systems that are used by folks who are less likely to be able to identify,
much less fix, the problem.

> James Seng/Personal writes:
> > Patching sendmail might be trival for a good programmer like yourself.
> > How fast do you think you can get everyone to use your patch and would
> > unpatch software fallback safely?
> 
> All the proposals require a sendmail patch. To tell sendmail to accept
> mail for S.foo.dom, the user adds S.foo.dom to a file with his UTF-8
> editor; sendmail mishandles the \210 if it isn't patched.
> 
> The patch required for direct use of UTF-8 is by far the simplest. No,
> deployment isn't free, but the other proposals don't change this fact.

true, deployment isn't free for any of these proposals, nor can it be.
but your scenario doesn't even begin to address the impact on MUAs
which cannot properly parse or generate replies for messages with UTF-8
in the message header.  nor does it consider what happens when different
legacy mail software handles the UTF-8 IDN in different ways.

and of course email is just one application; lots of applications use DNS names.

                                  --

you're a smart guy, so I suspect that you've realized most or all of 
this already.  I'm wondering whether the real difference of opinion 
might be about how much disruption to the installed base is acceptable
in the name of producing an apparently "cleaner" result in the long run.

taking email as an example, what percentage of each of the following would
be acceptable (relative to the number of messages sent), in your opinion:

- catastrophic MTA failures (crashes)
- catastrophic UA failures (crashes)
- unreported delivery failures
- reported delivery failures
- messages delivered but undisplayable due to parse errors etc.
- messages delivered but unreplyable due to lack of UA support
  (would work if a different or upgraded UA were used) 
- messages delivered but unreplyable due to being modified in transit


Keith

p.s.

> P.S. I'm a subscriber to this mailing list. I don't want to receive
> extra copies of messages sent to the list. I've set Mail-Followup-To
> accordingly.

and since Mail-Followup-To is a nonstandard extension, I assume you're
willing to accept the lack of reliability that naturally comes from
use of nonstandard extensions.  :)
 
> P.P.S. You may have noticed an unusual From line on this message. The
> problem is that the software running this mailing list can't deal with
> the concept of sublist subscribers; it forwards my messages to Seng.
> Seng eventually approved two of my messages, editing the Date field and
> removing the Received lines to hide the delay. He has refused to approve
> this one unless I take special actions to fool the list software. So I'm
> using his address in From, and my address in Reply-To.

Hmm.  I use subaddresses too, and at first I had the same problem.
If memory serves, it was worked around by adding my address to a
list of non-subscribers who are authorized to post.  Not an ideal
solution to be sure, but the problem of mailing lists that don't
allow postings from non-subscribers isn't likely to get fixed here.