
Re: [idn] Re: converter page?



On the ICU site there is a page that you might find helpful. For example, if
you input text with non-ASCII characters, like

     Počítače; Οι ηλεκτρονικοί.

and set Compound 1 to

     \P{ascii}  hex

you'll get

Po\u010D\u00EDta\u010De; \u039F\u03B9
\u03B7\u03BB\u03B5\u03BA\u03C4\u03C1\u03BF\u03BD\u03B9\u03BA\u03BF\u03AF.

You can also use "\P{ascii} hex/unicode" to get this format:

PoU+010DU+00EDtaU+010De; U+039FU+03B9
U+03B7U+03BBU+03B5U+03BAU+03C4U+03C1U+03BFU+03BDU+03B9U+03BAU+03BFU+03AF.
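
For readers without access to the ICU page, here is a minimal Python sketch of
the same idea. It is not ICU's implementation: the function name
escape_non_ascii and the style names are made up for illustration, and the
\UXXXXXXXX form used for code points above U+FFFF is an assumption rather than
something taken from the ICU output above. The ASCII check plays the role of
the \P{ascii} filter.

```python
def escape_non_ascii(text, style="hex"):
    """Replace every non-ASCII character with an escape; ASCII passes through.

    style="hex"     -> \\uXXXX (assumed \\UXXXXXXXX above U+FFFF)
    style="unicode" -> U+XXXX
    """
    if style not in ("hex", "unicode"):
        raise ValueError(f"unknown style: {style!r}")
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Plays the role of the \P{ascii} filter: leave ASCII untouched.
            out.append(ch)
        elif style == "hex":
            out.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
        else:
            out.append(f"U+{cp:04X}")
    return "".join(out)


if __name__ == "__main__":
    sample = "Počítače; Οι ηλεκτρονικοί."
    print(escape_non_ascii(sample, "hex"))
    print(escape_non_ascii(sample, "unicode"))
```

If the sample text is in precomposed form (NFC), the two printed lines should
correspond to the two outputs shown above.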

And if you want the character names, you can use "\P{ascii} name":

Po\N{LATIN SMALL LETTER C WITH CARON}\N{LATIN SMALL LETTER I WITH
ACUTE}ta\N{LATIN SMALL LETTER C WITH CARON}e; \N{GREEK CAPITAL LETTER
OMICRON}\N{GREEK SMALL LETTER IOTA} \N{GREEK SMALL LETTER ETA}\N{GREEK SMALL
LETTER LAMDA}\N{GREEK SMALL LETTER EPSILON}\N{GREEK SMALL LETTER
KAPPA}\N{GREEK SMALL LETTER TAU}\N{GREEK SMALL LETTER RHO}\N{GREEK SMALL
LETTER OMICRON}\N{GREEK SMALL LETTER NU}\N{GREEK SMALL LETTER IOTA}\N{GREEK
SMALL LETTER KAPPA}\N{GREEK SMALL LETTER OMICRON}\N{GREEK SMALL LETTER IOTA
WITH TONOS}.

(The "\P{ascii}" is a filter, added so that none of the above affect the
ASCII contents.)
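
A rough Python equivalent of the name form, using the standard unicodedata
module. This is only a sketch: the function name name_non_ascii is
hypothetical, and falling back to a \uXXXX-style escape for characters that
have no name is my own choice, not something the ICU page is known to do.

```python
import unicodedata


def name_non_ascii(text):
    """Replace every non-ASCII character with \\N{NAME}; ASCII passes through."""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            # Same role as the \P{ascii} filter: ASCII is left alone.
            out.append(ch)
            continue
        name = unicodedata.name(ch, None)
        if name is None:
            # Assumption: unnamed characters fall back to a hex escape.
            out.append(f"\\u{ord(ch):04X}")
        else:
            out.append(f"\\N{{{name}}}")
    return "".join(out)


if __name__ == "__main__":
    print(name_non_ascii("Počítače; Οι ηλεκτρονικοί."))
```

Note that the Unicode name of U+03BB really is spelled LAMDA, so that is what
unicodedata returns as well.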

Or use "Latin" to get

Počítače? Oi ēlektronikoí.

You could also use "Latin; nfd; \p{mark} remove; nfc" to strip accents,
getting:

Pocitace? Oi elektronikoi.
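
The "nfd; \p{mark} remove; nfc" part of that chain can be sketched in Python
with the standard unicodedata module. The function name strip_marks is
hypothetical, and the sketch does not cover the "Latin" transliteration step,
which has no standard-library equivalent; it starts from text that is already
in Latin script. Treating \p{mark} as "general category starting with M" (Mn,
Mc, Me) is the usual reading of that property.

```python
import unicodedata


def strip_marks(text):
    """Decompose (NFD), drop all combining marks, then recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", text)
    # \p{mark} corresponds to general categories Mn, Mc and Me.
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    return unicodedata.normalize("NFC", kept)


if __name__ == "__main__":
    print(strip_marks("Počítače? Oi ēlektronikoí."))
```

Because the text is decomposed first, it does not matter whether the input
arrives in precomposed or decomposed form.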

Mark
________
mark.davis@jtcsv.com
IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
(408) 256-3148
fax: (408) 256-0799

----- Original Message -----
From: "John C Klensin" <klensin@jck.com>
To: "Martin Duerst" <duerst@w3.org>; "Simon Josefsson" <jas@extundo.com>
Cc: "IDN" <idn@ops.ietf.org>
Sent: Saturday, March 08, 2003 13:59
Subject: Re: [idn] Re: converter page?


> Simon,
>
> Let me make one additional suggestion, which is sort of
> orthogonal to Martin's...  It would be useful, as an alternative
> to UTF-8 and the other encodings you support, to be able to put
> in a string of characters as a list of items in U+nnnn form.
> You show that form in your debugging option, but, if the
> characters going in don't match what you produce, there is no
> obvious way to provide them.  I'm particularly concerned here
> about characters my browser has no way to render (e.g.,
> appropriate fonts not installed, etc.)
>
> The script/web page itself is much appreciated.
>
> thanks,
>     john
>
>
> --On Saturday, 08 March, 2003 15:31 -0500 Martin Duerst
> <duerst@w3.org> wrote:
>
> > Hello Simon,
> >
> > Very nice to put up such a script.
> >
> > It would be great if the default page was served as UTF-8.
> > That way, on any recent browser, any user can just copy/paste
> > or type in their idn and submit the query, without having to
> > worry about encoding issues.
> >
> > Using various different encodings the way you do is exposing
> > your system internals in a way the Web was designed (and is
> > implemented) to abstract from.
> >
> > The 'force charset to' drop-down menu is particularly
> > dangerous, because it does not force the browser to send the
> > characters that the user has pasted or input to the server in
> > that encoding, it just forces the server to MISinterpret the
> > octets that the browser sent.
> >
> > At the top of the page, you write:
> >     Report problems to bug-libidn@gnu.org, but first please
> >     make sure your browser really is encoding the data you
> >     type in the charset you select. If not, incorrect output
> >     or an error is the proper response.
> >
> > This is heavily backwards. The browser will do the right thing
> > if you just allow it to do so, and don't allow the user to mess
> > around with it.
> >
> > Also, some browsers tend to send named or numeric character
> > references when characters in a text field are outside of the
> > encoding of the page. That as such is non-standard, and you
> > don't necessarily have to deal with it. However, you should
> > make sure that the output you send back is properly escaped.
> > For example not
> >
> > $ echo 'D&uuml;rst.josefsson.org' | /usr/local/bin/idn --idna-to-ascii 2>&1
> >
> > but
> >
> > $ echo 'D&amp;uuml;rst.josefsson.org' | /usr/local/bin/idn --idna-to-ascii 2>&amp;1
> >
> >
> >
> > Regards,    Martin.
> >
> > P.S.:
> >
> > I tested this with several browsers. IE had difficulty
> > interpreting the encoding of your page correctly
> > in the first place. My current guess is that this is due to
> > the fact that you use additional double quotes in
> > <meta http-equiv='Content-Type' content='text/html; charset="ISO-8859-1"' />
> > instead of simply
> > <meta http-equiv='Content-Type' content='text/html; charset=ISO-8859-1' />
> > I might be wrong, but other than that,
> > I can't see any reason at the moment. (you should also make
> > sure that you properly escape the '&' in things such as
> > "&mode=toascii&charset=UTF-8").
> >
> >
> >
> > At 01:10 03/03/02 +0100, Simon Josefsson wrote:
> >> "Eric A. Hall" <ehall@ehsco.com> writes:
> >>
> >> > Anybody know of a web form that does IDNA conversion
> >> > on-the-fly? Something that will let me enter the domain
> >> > name and get the IDNA encoded form back. I find myself
> >> > needing to do some quick conversions periodically.
> >>
> >> <http://josefsson.org/idn.php>
> >
> >