
OT - Re: [idn] URL encoding in html page




From my experience talking with customers in the field, the main reason that people are not serving up UTF-8 pages is not the bandwidth, it is the fact that there are still some browsers out in the field that do not yet handle it correctly. While they are dying off fairly quickly, it is not quite at the point where people are willing to write them off.
 
As far as size goes, it is worthwhile looking at some data samples. The following are from a page on the Unicode site that is translated into different languages, so it has essentially the same information on each page.
 
Size (bytes)  Page
        8882  s-chinese.html
        8946  t-chinese.html
        9347  esperanto.html
        9498  maltese.html
        9739  icelandic.html
        9833  czech.html
        9944  welsh.html
       10064  danish.html
       10109  swedish.html
       10127  polish.html
       10219  interlingua.html
       10221  italian.html
       10297  spanish.html
       10308  portuguese.html
       10312  lithuanian.html
       10329  german.html
       10376  romanian.html
       10401  korean.html
       10506  french.html
       10726  japanese.html
       10953  hebrew.html
       11192  arabic.html
       13292  greek.html
       13870  russian.html
       13892  persian.html
       14549  hindi.html
       15337  georgian.html
       15853  deseret.html

 

So the best case is a little over half of the worst case (8882 vs. 15853 bytes, about 56%). Some of this is due to the encoding, and some is due to different languages just using different numbers of characters. However, when you look at web pages in general use, the amount of text (in bytes) is really swamped by graphics, JavaScript, HTML code, and so on. So fundamentally, even the variations above are not that important in practice.
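
To separate the encoding effect from the language effect, one rough way is to encode the same string both ways and compare byte counts. The sketch below is only an illustration: the sample strings and the choice of legacy codecs (euc_kr, koi8_r, iso8859_7, latin_1) are assumptions, not taken from the Unicode pages listed above.

```python
# Sample strings and codec choices are illustrative assumptions,
# not data from the pages in the table.
SAMPLES = {
    "Korean": ("한국어", "euc_kr"),
    "Russian": ("текст", "koi8_r"),
    "Greek": ("ελληνικά", "iso8859_7"),
    "English": ("text", "latin_1"),
}


def size_ratio(text, legacy_codec):
    """Return (legacy_bytes, utf8_bytes, utf8/legacy) for one string."""
    legacy = len(text.encode(legacy_codec))
    utf8 = len(text.encode("utf-8"))
    return legacy, utf8, utf8 / legacy


for language, (text, codec) in SAMPLES.items():
    legacy, utf8, ratio = size_ratio(text, codec)
    print(f"{language:8} {codec:10} legacy={legacy:3} utf8={utf8:3} ratio={ratio:.2f}")
```

On real pages the ratio for the whole file will be smaller than for the bare text, since markup, URLs, and scripts are ASCII and cost one byte per character in either encoding.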

BTW, this is getting way off topic.
 
Mark
—————
 
Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
 
http://www.macchiato.com
----- Original Message -----
From: "Soobok Lee" <lsb@postel.co.kr>
To: "Mark Davis" <mark@macchiato.com>; "IETF idn working group" <idn@ops.ietf.org>
Sent: Friday, March 22, 2002 08:16
Subject: Re: [idn] URL encoding in html page

>
> ----- Original Message -----
> From: "Mark Davis" <mark@macchiato.com>
> To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group" <idn@ops.ietf.org>
> Sent: Saturday, March 23, 2002 12:18 AM
> Subject: Re: [idn] URL encoding in html page
>
>
> > Compliant browsers already have to handle Unicode, since NCRs (e.g.
> > &#x1234; ) are always Unicode code points. All XML parsers also have
> > to handle Unicode (UTF-8 and UTF-16).
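
As a small illustration of the point about NCRs, the sketch below resolves a numeric character reference with Python's standard `html.unescape` and shows that the result does not depend on the page's own charset. The helper name `ncr_to_char` and the sample reference `&#xD55C;` are assumptions made for the example.

```python
import html


def ncr_to_char(ncr: str) -> str:
    """Resolve a numeric character reference to the character it names."""
    return html.unescape(ncr)


# The same ASCII-only markup, delivered in two different page encodings.
markup = "&#xD55C;"
for page_charset in ("euc_kr", "utf-8"):
    raw = markup.encode(page_charset)          # bytes as sent on the wire
    decoded = ncr_to_char(raw.decode(page_charset))
    print(page_charset, f"U+{ord(decoded):04X}")
```

In both cases the reference names the same Unicode code point, which is why a browser that handles NCRs correctly already needs a Unicode view of the text.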
>
> Right, already.
> MS IE and Netscape have supported Unicode for several years,
> but most homepages are still in legacy encodings.
> MS Word (already Unicode-based) has a feature to produce legacy-encoded
> .html files (from Unicode-based .doc files) for web publishing.
>
> Korean/Japanese/Chinese texts in UTF-8 are 50% bigger than legacy ones,
> so 50% more disk space and bandwidth will be required.
> Each Cyrillic letter occupies one octet in a legacy code, while in UTF-8
> it requires 2 octets, so 100% more space is needed.
> I cannot imagine all Russians making the transition to UTF-8.
> Legacy encodings are more space-efficient than Unicode.
>
> Legacy-to-legacy conversions like BIG5->KSX1001 are actually implemented
> as two steps: BIG5->Unicode and Unicode->KSX1001. Unicode is actively
> used as such an intermediate encoding, but it is still not used or entered
> directly by end users very much. Rather, Unicode may be a hub that facilitates
> the interchange of information in different legacy encodings, or font sharing
> for characters in different legacy encodings.
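
The two-step conversion described here can be sketched in a few lines of Python, using Unicode strings as the pivot. This is only a minimal illustration: the function name `convert_via_unicode` is hypothetical, and Python's `euc_kr` codec is used as a stand-in for a KS X 1001 based encoding.

```python
def convert_via_unicode(data: bytes, source: str, target: str) -> bytes:
    """Convert between two legacy encodings with Unicode as the hub."""
    text = data.decode(source)   # step 1: legacy -> Unicode
    return text.encode(target)   # step 2: Unicode -> legacy


# Hanja that exist in both Big5 and KS X 1001.
big5_bytes = "中文".encode("big5")
euc_kr_bytes = convert_via_unicode(big5_bytes, "big5", "euc_kr")
print(euc_kr_bytes.decode("euc_kr"))
```

With the default strict error handling, step 2 raises `UnicodeEncodeError` when the target encoding has no code for a character, which is the practical limit of any legacy-to-legacy conversion.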
>
> I regard Unicode as a substrate (not as a competitor) upon which legacy encodings are built.
>
> >
> > > Legacy encodings
> > > will dominate even in the future, because they are compact and
> > > inexpensive.
> >
> > While I do expect the transition to Unicode to take some time, once
> > some of the older browsers die off it may shift more rapidly than we
> > think.
>
> I am neither a Unicode expert nor a character expert. But every day I feel
> the strong inertia toward legacy encodings in our local language communities.
> Language-tagging-enabled text formats like HTML will lengthen the lifespan
> of legacy encodings by a great amount and allow legacy-coded HTML texts
> to be interchanged internationally without problems.
>
> Soobok Lee
>
> >
> > Mark
> > —————
> >
> > Γνῶθι σαυτόν — Θαλῆς
> > [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
> >
> > http://www.macchiato.com
> >
> > ----- Original Message -----
> > From: "Soobok Lee" <lsb@postel.co.kr>
> > To: "IETF idn working group" <idn@ops.ietf.org>
> > Sent: Friday, March 22, 2002 02:04
> > Subject: Re: [idn] URL encoding in html page
> >
> >
> > >
> > > ----- Original Message -----
> > > From: "Bruce Thomson" <bthomson@fm-net.ne.jp>
> > > To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group" <idn@ops.ietf.org>
> > > Sent: Friday, March 22, 2002 6:29 PM
> > > Subject: Re: [idn] URL encoding in html page
> > >
> > >
> > > > > What if all the viewable HTML text is in English, but only the href URL contains
> > > > > legacy (Korean) encoded hostnames? Chinese visitors would see a clean English homepage,
> > > > > but fail to click through the Korean link.
> > > > >
> > > > Well, that could happen, but a META tag would solve that so easily. Personally
> > > > I often use a simple text editor to deal with HTML, and would find it easier to
> > > > use legacy encodings or UTF-8 than cut-and-paste ACE from somewhere.
> > > > Of course the user could do it either way and it would work.
> > >
> > > Yes, charset META tags help. But many homepages make assumptions about the main audience's
> > > default character encoding and very often omit the META tag for the encoding, like:
> > >   <meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
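
To make the problem with omitted META tags concrete, here is a minimal sketch of how a client might look for the declared charset and fall back to a guess when the tag is missing. The class `MetaCharsetFinder`, the function `declared_charset`, and the idea of a caller-supplied default are assumptions for illustration, not a description of how any particular browser works.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the charset from the first Content-Type META tag, if any."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() != "content-type":
            return
        # content looks like: text/html; charset=euc-kr
        for part in (attr.get("content") or "").split(";"):
            name, _, value = part.strip().partition("=")
            if name.strip().lower() == "charset" and value:
                self.charset = value.strip()


def declared_charset(page_bytes, default=None):
    """Return the META-declared charset, or `default` when none is declared."""
    finder = MetaCharsetFinder()
    # The tag itself is ASCII, so a lossy ASCII decode is enough to find it.
    finder.feed(page_bytes.decode("ascii", errors="replace"))
    return finder.charset or default


tagged = b'<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">'
print(declared_charset(tagged))
print(declared_charset(b"<frameset></frameset>", default="euc-kr"))
```

The second call is the case under discussion: with no declaration, the result is whatever the viewer's default happens to be, so a visitor with a different default would decode a legacy-encoded hostname incorrectly.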
> > >
> > > Moreover, an IDN URL could be used in a pure FRAMESET document that defines frame URLs
> > > and contains no viewable text. Such FRAMESET documents often omit charset META tags.
> > >  (Look at the HTML source of http://www.freeway.co.kr/ )
> > >
> > > AFAIK, 99.99999% of Korean homepages have an implicit/explicit
> > > legacy Korean encoding (KS_C_5601-1987 or euc-kr). So do most Japanese/Chinese homepages.
> > > UTF-8/UCS-2 encodings are rarely used in global web publishing. Legacy encodings
> > > will dominate even in the future, because they are compact and inexpensive.
> > >
> > > If we want to make IDN truly internationally interoperable, all IDN-aware web browsers/applications
> > > would have to contain libraries of all kinds of legacy-to-Unicode conversion routines. That would
> > > put too much of a memory burden on handheld devices like PDAs.
> > >
> > > Moreover, legacy encodings are revised separately from Unicode. We may face versioning problems
> > > as tough as the stringprep/nameprep versioning problems for newly added Unicode code points.
> > > How can we guarantee the stability and integrity of IDN operations across all combinations of
> > > the numerous kinds and versions of IDN-aware applications and legacy encodings?
> > >
> > > Soobok Lee
> > >
> > >
> > >
>
>
>