
Re: [idn] Re: converter page?



Hello Simon,

Many thanks for your quick reply.

At 23:00 03/03/08 +0100, Simon Josefsson wrote:
Martin Duerst <duerst@w3.org> writes:

> Hello Simon,
>
> Very nice to put up such a script.

I believe I have fixed the problems you mention, thanks for taking the
time to point them out.

Well, my impression is that you have fixed some, but not all.


> It would be great if the default page was served as UTF-8.
> That way, on any recent browser, any user can just copy/paste
> or type in their idn and submit the query, without having to
> worry about encoding issues.

The page is served in the charset you select.

No. When I went to http://josefsson.org/idn.php, I didn't
select iso-8859-1. In fact, my main browser (Netscape 7) sends you
Accept-Charset: UTF-8, *
in its HTTP headers. For the others I used, Opera7 sends
Accept-Charset: windows-1252,utf-8,utf-16,iso-8859-1;q=0.6,*;q=0.1
whereas IE6 and Tango don't send anything. Both Netscape 7
and Opera clearly express a preference for UTF-8 over iso-8859-1.
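
To make the point concrete, here is a rough sketch of how a server could
honour these headers. It is in Python rather than the PHP your page uses,
the function names are made up for this example, and the handling of a
missing "*" follows my reading of RFC 2616 section 14.2, so please treat
it as an illustration only:

```python
def parse_accept_charset(header):
    """Parse an Accept-Charset header value into a {charset: q} dict."""
    prefs = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, params = item.partition(";")
        q = 1.0  # a charset listed without a q parameter has q=1
        for param in params.split(";"):
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q value as "not acceptable"
        prefs[name.strip().lower()] = q
    return prefs


def choose_charset(header, supported=("utf-8", "iso-8859-1"), default="utf-8"):
    """Pick the supported charset the client prefers most.

    With no header at all (as with IE6), fall back to `default`.
    Ties go to whichever charset comes first in `supported`.
    """
    if not header:
        return default
    prefs = parse_accept_charset(header)
    wildcard = prefs.get("*")
    best, best_q = None, 0.0
    for charset in supported:
        if charset in prefs:
            q = prefs[charset]
        elif wildcard is not None:
            q = wildcard
        else:
            # Assumption (RFC 2616): without "*", unlisted charsets get q=0,
            # except ISO-8859-1, which gets q=1.
            q = 1.0 if charset == "iso-8859-1" else 0.0
        if q > best_q:
            best, best_q = charset, q
    return best or default
```

With the Netscape 7 and Opera 7 headers quoted above, and with no header
at all, this sketch ends up choosing UTF-8.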


Choose UTF-8 if you want
UTF-8.  Only supporting UTF-8 would restrict the page's usefulness.

Yes, slightly, because of some old browsers.
But I never said that you have to support only UTF-8.


Standards compliant browsers handle charset conversions in copy/paste.

Well, yes, they handle character encoding conversion in copy/paste.
They convert from the encoding used in the clipboard to their
internal (unicode-based) encoding. That's why
you should avoid confusing the user with 'charset' stuff.


> Using various different encodings the way you do is exposing
> your system internals in a way the Web was designed (and is
> implemented) to abstract from.
>
> The 'force charset to' drop-down menu is particularly dangerous,
> because it does not force the browser to send the characters
> that the user has pasted or input to the server in that encoding,
> it just forces the server to MISinterpret the octets that the
> browser sent.
>
> At the top of the page, you write:
>     Report problems to bug-libidn@gnu.org, but first please make sure your
>     browser really is encoding the data you type in the charset you select.
>     If not, incorrect output or an error is the proper response.
>
> This is heavily backwards. The browser will do the right thing if
> you just allow it to do so, and don't allow the user to mess
> around with it.

I have tried to make the intended behaviour more clear.  You must type
characters in the charset the page uses.  If you want to use another
charset, it is a two step process: first change charset, then enter
new data.

Are you referring to "The following string must be in ISO-8859-1.
If you wish to use another charset you must select it below, submit
the form and wait for a new page, and then enter your string." ?

This shows a confusion of two concepts: The set (also called repertoire)
of characters covered by an encoding, and the actual encoding of these
characters into bytes.

The term 'charset' refers to the latter rather than the former.
But the user of a standard browser has no way to input something
in a particular encoding, because the browser takes care of
doing the right conversions. So what you wanted to say was
something like:

"The following string must only contain characters that can be
represented in ISO-8859-1."
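
To illustrate the difference between repertoire and encoding, here is a
small sketch in Python (my choice for illustration only; the helper name
in_repertoire is made up). It also shows what "forcing" a charset on the
server side really does to the octets:

```python
u_umlaut = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS

# Encoding: one character, two different byte sequences.
latin1_bytes = u_umlaut.encode("iso-8859-1")  # b'\xfc'
utf8_bytes = u_umlaut.encode("utf-8")         # b'\xc3\xbc'

# Reading UTF-8 octets as ISO-8859-1 does not convert anything; it just
# misinterprets them as two characters, U+00C3 followed by U+00BC.
misread = utf8_bytes.decode("iso-8859-1")


def in_repertoire(text, charset):
    """Return True if every character of `text` can be encoded in `charset`."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


# Repertoire: u-umlaut is covered by ISO-8859-1, a kanji such as U+65E5 is not.
print(in_repertoire(u_umlaut, "iso-8859-1"))
print(in_repertoire("\u65e5", "iso-8859-1"))
print(in_repertoire("\u65e5", "utf-8"))
```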

Obviously, such a warning is not at all needed for UTF-8.
So the best thing is to start with UTF-8, and let the user
just input her characters, and then have some fallback
page (e.g. triggered by checking whether you really get
UTF-8 back) so users with outdated browsers also have a chance.
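
As a rough sketch of such a check (again in Python, with a made-up
function name, and assuming the script can get at the raw octets of the
submitted form field), something like the following would do. The
choice of ISO-8859-1 as the fallback is my assumption; note also that a
short ISO-8859-1 string can occasionally happen to be valid UTF-8, so
the check is a heuristic rather than a proof:

```python
def decode_form_value(raw, fallback="iso-8859-1"):
    """Return (text, charset_used) for the raw octets of a form field.

    Try strict UTF-8 first; only if the octets are not valid UTF-8,
    assume an outdated browser and decode with the fallback charset.
    """
    try:
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        return raw.decode(fallback, errors="replace"), fallback


# Octets as a UTF-8 browser would send them for "d<u-umlaut>rst":
print(decode_form_value(b"d\xc3\xbcrst"))
# Octets as a browser sending ISO-8859-1 would send them:
print(decode_form_value(b"d\xfcrst"))
```

When the second element of the result is not "utf-8", the script could
serve the fallback page instead of silently guessing.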


> Also, some browsers tend to send named or numeric character references
> when characters in a text field are outside of the encoding of the
> page. That as such is non-standard, and you don't necessarily
> have to deal with it. However, you should make sure that the
> output you send back is properly escaped. For example not
>
> $ echo 'D&uuml;rst.josefsson.org' | /usr/local/bin/idn --idna-to-ascii 2>&1
>
> but
>
> $ echo 'D&amp;uuml;rst.josefsson.org' | /usr/local/bin/idn --idna-to-ascii 2>&amp;1

Since it is non-standard, I'll deal with it using the garbage in
garbage out philosophy.  Someone might even find the current behaviour
useful.

But there are cases where you produce garbage on your own.
For example, if I input &uuml;<u">.josefsson.org (that is, the literal
text &uuml; followed by an actual u-umlaut, written here as <u">),
and switch on UseSTD3ASCIIRules, I get:
/usr/local/bin/idn: idna_to_ascii_from_locale() failed with error 3.
which I guess means IDNA_CONTAINS_LDH = 3.
Now if I switch off UseSTD3ASCIIRules and use the same input,
what I see as a result is xn--<u">-8ya.josefsson.org. The correct
result is of course xn--&uuml;-8ya.josefsson.org, which is in the
source, but not visible. So you have to fix the source to
be xn--&amp;uuml;-8ya.josefsson.org.
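
The fix amounts to escaping every string before it is written into the
HTML. A minimal sketch in Python (your script is PHP, so this only shows
the idea; render_result and the <pre> wrapper are made up for the
example, while html.escape is the standard library function):

```python
import html


def render_result(ace_name):
    """Wrap a conversion result in HTML, escaping &, <, > and quotes."""
    return "<pre>" + html.escape(ace_name, quote=True) + "</pre>"


# The label from the example above, as returned by the converter:
print(render_result("xn--&uuml;-8ya.josefsson.org"))
```

The same escaping should be applied to the command line that the page
echoes back, not only to the result.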



> I tested this with several browsers. With IE, there were difficulties
> to interpret the encoding of your page correctly in the first place.
> My current guess is that this is due to the fact that you use additional
> double quotes in
> <meta http-equiv='Content-Type' content='text/html; charset="ISO-8859-1"' />,
> instead of simply
> <meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1' />
> I might be wrong, but other than that, I can't see any reason at the moment.

I don't see anything wrong with the code, and I don't have access to
IE to test this further. If you, or someone else, wants to
investigate this further, it would be appreciated.

Strictly speaking,
<meta http-equiv='Content-Type' content='text/html; charset="ISO-8859-1"' />
is okay. But I have confirmed that IE (version 6, on Win2000) works well
without the additional quotes, but not with them.


Regards,    Martin.