OT - Re: [idn] URL encoding in html page
From my experience talking with customers in the
field, the main reason that people are not serving up UTF-8 pages is not the
bandwidth, it is the fact that there are still some browsers out in the field
that do not yet handle it correctly. While they are dying off fairly quickly, it
is not quite at the point where people are willing to write them
off.
As far as size goes, it is worthwhile looking at
some data samples. The following are from a page on the Unicode site that is
translated into different languages, so it has essentially the same information
on each page.
 Size | Page
------+------------------
 8882 | s-chinese.html
 8946 | t-chinese.html
 9347 | esperanto.html
 9498 | maltese.html
 9739 | icelandic.html
 9833 | czech.html
 9944 | welsh.html
10064 | danish.html
10109 | swedish.html
10127 | polish.html
10219 | interlingua.html
10221 | italian.html
10297 | spanish.html
10308 | portuguese.html
10312 | lithuanian.html
10329 | german.html
10376 | romanian.html
10401 | korean.html
10506 | french.html
10726 | japanese.html
10953 | hebrew.html
11192 | arabic.html
13292 | greek.html
13870 | russian.html
13892 | persian.html
14549 | hindi.html
15337 | georgian.html
15853 | deseret.html
So the best case (8882 bytes) is about 56% of the worst case (15853 bytes). Some of this is due to the
encoding, and some is due to different languages just using different numbers of
characters. However, when you look at web pages in general use, the amount of
text (in bytes) is really swamped by graphics, Javascript, HTML code, and so on.
So fundamentally, even the variations above are not that important in
practice.
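
For anyone who wants to check the encoding part of this on their own text, here is a minimal Python sketch that compares the byte length of a string in a legacy encoding with its length in UTF-8. The sample strings, the choice of legacy codecs, and the names `SAMPLES` and `byte_sizes` are illustrative assumptions, not data from the pages measured above.

```python
# Sketch: compare legacy-encoding size with UTF-8 size for short samples.
# The samples and codec choices below are assumptions for illustration only.

SAMPLES = {
    "English": ("Unicode", "ascii"),
    "Korean": ("한국어", "euc_kr"),
    "Japanese": ("日本語", "shift_jis"),
    "Russian": ("Привет", "koi8_r"),
    "Greek": ("Ελληνικά", "iso8859_7"),
}


def byte_sizes(text, legacy_codec):
    """Return (legacy_bytes, utf8_bytes, utf8/legacy ratio) for text."""
    legacy = len(text.encode(legacy_codec))
    utf8 = len(text.encode("utf-8"))
    return legacy, utf8, utf8 / legacy


for language, (text, codec) in SAMPLES.items():
    legacy, utf8, ratio = byte_sizes(text, codec)
    print(f"{language:9} {codec:10} legacy={legacy:3} utf8={utf8:3} ratio={ratio:.2f}")
```

On pure text like these samples, Korean and Japanese come out 1.5 times larger in UTF-8, Cyrillic and Greek letters take two bytes each instead of one, and ASCII is unchanged. Real pages carry a lot of ASCII markup, so the overall difference is smaller than these per-character ratios suggest.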
BTW This is getting way off
topic.
Mark
—————
----- Original Message -----
Sent: Friday, March 22, 2002 08:16
Subject: Re: [idn] URL encoding in html page
>
> ----- Original Message -----
> From: "Mark Davis" <mark@macchiato.com>
> To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group" <idn@ops.ietf.org>
> Sent: Saturday, March 23, 2002 12:18 AM
> Subject: Re: [idn] URL encoding in html page
>
>
> > Compliant browsers already have to handle Unicode, since NCRs (e.g.
> > &#x1234;) are always Unicode code points. All XML parsers also have
> > to handle Unicode (UTF-8 and UTF-16).
>
> Right, already.
> MS IE and Netscape have already been supporting UNICODE
> for several years, but most homepages are still in legacy encodings.
> MS WORD (already unicode based) has features to produce (from
> unicode-based .doc files) legacy encoded .html files for web publishing.
>
> Korean/Japanese/Chinese texts in UTF8 are 50% bigger than legacy ones.
> 50% more disk space and bandwidth will be required.
> Each Cyrillic letter in legacy code occupies one octet, while in UTF8
> it requires 3 octets. 200% more space is needed.
> I cannot imagine all Russians making the transition to UTF8.
> Legacy encodings are more space efficient than UNICODE.
>
> Legacy-to-legacy conversions like BIG5->KSX1001 are really being
> implemented as two steps of BIG5->UNICODE and UNICODE->KSX1001. UNICODE
> is actively used as such an intermediate encoding, but it is still not
> used and entered directly by end users so actively. Rather, UNICODE
> may be a hub to facilitate interchange of information in different
> legacy encodings, or font sharing for differently legacy-encoded chars.
>
> I regard UNICODE as a substrate (not as a competitor) upon which
> legacy encodings are built.
>
> > > Legacy encodings
> > > will dominate even in the future, because they are compact and
> > > inexpensive.
> >
> > While I do expect the transition to Unicode to take some time, once
> > some of the older browsers die off it may shift more rapidly than we
> > think.
>
> I am neither a UNICODE expert nor a character expert. But every day I feel
> the strong inertia toward legacy encodings in our local language communities.
> A language-tagging-enabled text format like HTML will lengthen the lifespan
> of legacy encodings by a great amount and allow legacy-coded HTML texts
> to be interchanged internationally without problems.
>
> Soobok Lee
>
> >
> > Mark
> > —————
> >
> > Γνῶθι σαυτόν — Θαλῆς
> > [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
> >
> > http://www.macchiato.com
> >
> > ----- Original Message -----
> > From: "Soobok Lee" <lsb@postel.co.kr>
> > To: "IETF idn working group" <idn@ops.ietf.org>
> > Sent: Friday, March 22, 2002 02:04
> > Subject: Re: [idn] URL encoding in html page
> >
> >
> > > ----- Original Message -----
> > > From: "Bruce Thomson" <bthomson@fm-net.ne.jp>
> > > To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group" <idn@ops.ietf.org>
> > > Sent: Friday, March 22, 2002 6:29 PM
> > > Subject: Re: [idn] URL encoding in html page
> > >
> > > > > What if all the html viewable text is in english, but, only the
> > > > > href url contains legacy (korean) encoded hostnames? chinese
> > > > > visitors would see clean english homepage, but fail to click
> > > > > through the korean link.
> > > > >
> > > > Well, that could happen, but a META tag would solve that so
> > > > easily. Personally I often use a simple text editor to deal with
> > > > HTML, and would find it easier to use legacy encodings or UTF-8
> > > > than cut-and-paste ACE from somewhere.
> > > > Of course the user could do it either way and it would work.
> > >
> > > Yes. Charset META tags help. But, many homepages have assumptions
> > > on the main audience's default char encodings and very often omit
> > > the META tag for the encoding like :
> > > <meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
> > >
> > > Moreover, IDN url would be used in a pure FRAMESET document that
> > > defines frame URLs and contains no viewable texts. Such FRAMESET
> > > documents often omit charset META tags.
> > > (look into the html source of http://www.freeway.co.kr/ )
> > >
> > > AFAIK, 99.99999% of korean homepages have implicit/explicit
> > > legacy korean encoding (KS_C_5601-1987 or euc-kr). So do most
> > > japanese/chinese homepages.
> > > UTF8/UCS-2 encodings are rarely used in global WEB publishing.
> > > Legacy encodings will dominate even in the future, because they
> > > are compact and inexpensive.
> > >
> > > If we want to make IDN truly internationally interoperable, all
> > > IDN-aware web browsers/applications should contain libraries of
> > > all kinds of legacy-to-Unicode conversion routines. That will
> > > put too much memory load on handheld devices like PDAs.
> > >
> > > Moreover, legacy encodings are revised separately from unicode. We
> > > may face versioning problems as tough as the ones we had with
> > > stringprep/nameprep versioning for newly added unicode code points.
> > > How can we guarantee the stability and integrity of IDN operations
> > > across all combinations of the numerous kinds and versions of
> > > IDN-aware applications and legacy encodings?
> > >
> > > Soobok Lee
> > >