
Re: [idn] UTF-8 as the long-term IDN solution



>I would like to hear some support on this before we run this poll.
>
>-James Seng
>
>----- Original Message -----
>From: "D. J. Bernstein" <djb@cr.yp.to>
>To: <idn@ops.ietf.org>
>Sent: Tuesday, May 29, 2001 12:25 PM
>Subject: [idn] UTF-8 as the long-term IDN solution
>

>> If we do, in fact, have consensus on this point, then we can stop
>> wasting time considering ACE-now-and-forever. We can instead compare
>> the actual possibilities: (1) ACE-now-UTF-8-later; (2) simply UTF-8.
>>

This will be a mixture of comments on several letters. There are too
many to comment on separately.

I have since the beginning wanted UCS as the standard in DNS, and have
seen ACE as something needed for backward compatibility. My drafts have
supported both at the same time.

Unless we go for "simply UTF-8", we should go for "ACE and UTF-8 now".
This works, unless my UDNS draft is wrong, and can work with IDNA.

There is one big difference between ACE/IDNA and UTF-8:
ACE:   to work as in IDNA, the name must be nameprepped before it is
       encoded using ACE.
UTF-8: the name need only be normalised. (You could force it to be
       nameprepped so that it works in servers that do not understand
       how to compare UTF-8, but then all names would be forced into one
       specific form, such as lower case.) Because names are not
       nameprepped, they can keep the name owner's favorite form (for
       example, mixed case).
The above is one reason why I cannot accept using IDNA: I want to be
able, just like in DNS today, to use names containing upper case letters.
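
A rough sketch of the UTF-8 side of this difference, in Python. The
stored name is only normalised, so it keeps its case, and case is folded
only when two names are compared. Everything here is an assumption made
for illustration: the function names are hypothetical, NFC stands in for
"normalised" (no particular form is named above), and Python's
casefold() stands in for whatever matching rule a server would really
use. It is not the algorithm from any draft.

```python
import unicodedata


def normalise(name: str) -> str:
    # Assumption: NFC is used as the normalisation form.
    # The stored name keeps whatever case the owner chose.
    return unicodedata.normalize("NFC", name)


def comparison_key(name: str) -> str:
    # Hypothetical server-side key: normalise, fold case, then
    # normalise again because case folding can denormalise text.
    folded = normalise(name).casefold()
    return unicodedata.normalize("NFC", folded)


def same_name(a: str, b: str) -> bool:
    # Two names match when their comparison keys are equal.
    return comparison_key(a) == comparison_key(b)


# "u" followed by a combining diaeresis, in mixed case.
stored = normalise("Bu\u0308cher.Example")

print(stored)                                   # mixed case is preserved
print(same_name(stored, "b\u00fccher.example"))  # matched without regard to case
```

With nameprep applied before storage, the stored form itself would be
the folded one, and the mixed-case spelling would be lost.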

As a software developer I have one major reason why I want UTF-8:
- I can spend time to add code to convert between ONE external character
encoding and my own local one. I do not have time to implement handling
of international characters if every protocol uses its own encoding.

If we use UTF-8 with ACE for backward compatibility, I could implement handling
of UTF-8 but ignore most handling of ACE names. This is because having the ACE
name inside my UTF-8 text will do no harm (other than looking ugly) and will
work when pasted into other applications.

As many have pointed out, if UTF-8 is selected it cannot just go into
everybody's text, because locally many use something other than UTF-8.
But as a software developer, it is much easier to handle one standard
character encoding instead of many. Just think about this:
In my e-mail I get a line:
To: text QUOTED-PRINTABLE-TEXT <USER-NAME-COMPATIBLE-TEXT@ACE-NAME.ACE-NAME.com>

To handle this I must parse the line, identify each part that has its
own encoding, have a decoder for each type, and decode each part.
If we used just UTF-8, then to convert into the local character set you
would just do it for the entire line, without having to parse out each
part and decode them separately.
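
The contrast above can be sketched in Python. This is an illustration
under stated assumptions, not a full header parser: the helper names are
hypothetical, the display name is assumed to be an RFC 2047
encoded-word, the user name is assumed to be plain ASCII, and Python's
built-in "idna" codec stands in for ACE (it implements the Punycode
form that was standardised after this discussion).

```python
from email.header import decode_header, make_header


def decode_display_name(text: str) -> str:
    # Decoder 1: RFC 2047 encoded-words (quoted-printable or base64).
    return str(make_header(decode_header(text)))


def decode_domain(domain: str) -> str:
    # Decoder 2: ACE labels. Assumption: the "idna" codec is used
    # as a stand-in for whichever ACE is finally chosen.
    return domain.encode("ascii").decode("idna")


def decode_mixed(display: str, local: str, domain: str) -> str:
    # Mixed encodings: the line must already be split into parts,
    # and each part needs its own decoder.
    name = decode_display_name(display)
    return f"To: {name} <{local}@{decode_domain(domain)}>"


def decode_utf8_only(raw: bytes) -> str:
    # One encoding: a single conversion covers the entire line.
    return raw.decode("utf-8")


mixed = decode_mixed("=?iso-8859-1?q?J=F6rg?=", "jorg",
                     "xn--bcher-kva.example")
plain = decode_utf8_only(
    "To: J\u00f6rg <jorg@b\u00fccher.example>".encode("utf-8"))

print(mixed == plain)  # both routes give the same text
```

The mixed route also needs the parsing step that splits the line into
its parts, which is left out here; the UTF-8 route needs none.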

We must think outside just DNS: domain names will appear in many places
together with other elements. If each element has its own encoding,
it will be very difficult to handle. It would be nicest if ACE could
be avoided altogether, but I think we need it for backward compatibility.

   Dan