Dear Mark,
Unicode will keep gaining popularity as time goes by.
But that does not mean that legacy encodings will disappear
or be made obsolete by UTF-8.
There are at least two reasons why legacy encodings will stay with us.
1. Most legacy codes are standardized by local *governments*, which are
best qualified to find and reflect their local communities' character
needs. For example, the Korean government has been constantly revising
its KSX100x local legacy codes to include new graphic characters and
rarely used Chinese characters, even before Unicode decided to include
them. In other words, legacy codes are under the control of their
language communities, but Unicode is not; it has its own schedules,
principles, and motivations. It may be *politically* impossible for
legacy codes to be made obsolete by Unicode. Can we imagine the Korean
government publishing its laws and regulations in Unicode rather than
in KSX100x?
2. Legacy encodings are already internationally interoperable in popular
HTML/MIME content. There is no reason why owners of KSX100x-encoded
homepages, or senders of KSX100x-encoded messages, should abandon legacy
encodings and move to UTF-8 at the cost of additional space and
operational inefficiency, now or in the foreseeable future.
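The space cost mentioned above can be checked directly. The following is a minimal Python sketch, not a measurement from this thread: the sample strings and the choice of KOI8-R as a Cyrillic legacy encoding are assumptions for illustration only.

```python
# Compare encoded sizes of the same text in a legacy encoding and in UTF-8.
# The sample strings are illustrative assumptions, not data from the thread.
samples = {
    "euc_kr": "한국어 텍스트",    # Korean, legacy EUC-KR (KS X 1001)
    "koi8_r": "Русский текст",   # Russian, legacy KOI8-R
}


def encoded_sizes(text, legacy_codec):
    """Return (legacy_bytes, utf8_bytes) for the given text."""
    return len(text.encode(legacy_codec)), len(text.encode("utf-8"))


for codec, text in samples.items():
    legacy, utf8 = encoded_sizes(text, codec)
    print(f"{codec}: {legacy} bytes legacy, {utf8} bytes UTF-8")
```

In these samples each Hangul syllable takes 2 bytes in EUC-KR and 3 in UTF-8, each Cyrillic letter takes 1 byte in KOI8-R and 2 in UTF-8, and the ASCII space takes 1 byte everywhere. Real pages also carry ASCII markup, which is the same size in all of these encodings.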
I believe Unicode is now everywhere and will remain everywhere in the
future. At the same time, Unicode has given legacy encodings and codes
more opportunities to be interoperable at minimum cost.
Soobok Lee
----- Original Message -----
Sent: Saturday, March 23, 2002 3:21 AM
Subject: OT - Re: [idn] URL encoding in html page
From my experience talking with customers in the
field, the main reason that people are not serving up UTF-8 pages is not the
bandwidth, it is the fact that there are still some browsers out in the field
that do not yet handle it correctly. While they are dying off fairly quickly,
it is not quite at the point where people are willing to write them
off.
As far as size goes, it is worthwhile looking at
some data samples. The following are from a page on the Unicode site that is
translated into different languages, so it has essentially the same
information on each page.
Size (bytes) | Page
        8882 | s-chinese.html
        8946 | t-chinese.html
        9347 | esperanto.html
        9498 | maltese.html
        9739 | icelandic.html
        9833 | czech.html
        9944 | welsh.html
       10064 | danish.html
       10109 | swedish.html
       10127 | polish.html
       10219 | interlingua.html
       10221 | italian.html
       10297 | spanish.html
       10308 | portuguese.html
       10312 | lithuanian.html
       10329 | german.html
       10376 | romanian.html
       10401 | korean.html
       10506 | french.html
       10726 | japanese.html
       10953 | hebrew.html
       11192 | arabic.html
       13292 | greek.html
       13870 | russian.html
       13892 | persian.html
       14549 | hindi.html
       15337 | georgian.html
       15853 | deseret.html
So the best case is about 50% of the worst case. Some of this is due to the
encoding, and some is due to different languages just using different numbers
of characters. However, when you look at web pages in general use, the amount
of text (in bytes) is really swamped by graphics, Javascript, HTML code, and
so on. So fundamentally, even the variations above are not that important in
practice.
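For reference, the best-case/worst-case figure can be worked out from the smallest and largest entries in the table. This is only a small arithmetic sketch in Python; the variable names are made up for illustration.

```python
# Smallest and largest page sizes (in bytes) taken from the table above.
smallest = 8882    # s-chinese.html
largest = 15853    # deseret.html

# Fraction of the worst case that the best case occupies.
ratio = smallest / largest
print(f"best case is {ratio:.0%} of the worst case")
```

This prints 56%, which is in line with the rough "about 50%" estimate.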
BTW This is getting way off topic.
Mark
—————
----- Original Message -----
Sent: Friday, March 22, 2002 08:16
Subject: Re: [idn] URL encoding in html page
> ----- Original Message -----
> From: "Mark Davis" <mark@macchiato.com>
> To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group" <idn@ops.ietf.org>
> Sent: Saturday, March 23, 2002 12:18 AM
> Subject: Re: [idn] URL encoding in html page
>
> > Compliant browsers already have to handle Unicode, since NCRs (e.g.
> > &#x1234;) are always Unicode code points. All XML parsers also have
> > to handle Unicode (UTF-8 and UTF-16).
>
> Right, already.
> MS IE and Netscape have supported Unicode for several years,
> but most homepages are still in legacy encodings.
> MS Word (already Unicode-based) has a feature to produce legacy-encoded
> .html files from Unicode-based .doc files for web publishing.
>
> Korean/Japanese/Chinese text in UTF-8 is 50% bigger than in legacy
> encodings; 50% more disk space and bandwidth will be required.
> Each Cyrillic letter occupies one octet in a legacy code, while in UTF-8
> it requires 2 octets: 100% more space is needed.
> I cannot imagine all Russians making the transition to UTF-8.
> Legacy encodings are more space-efficient than Unicode.
>
> Legacy-to-legacy conversions like BIG5->KSX1001 are in practice
> implemented as two steps, BIG5->UNICODE and UNICODE->KSX1001.
> Unicode is actively used as such an intermediate encoding, but it is
> still not used or entered directly by end users as actively.
> Rather, Unicode may be a hub that facilitates the interchange of
> information between different legacy encodings, or font sharing for
> characters in different legacy encodings.
>
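The two-step conversion described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `convert_via_unicode` is a hypothetical helper name, the sample text is made up, and Python's `euc_kr` codec stands in for a KSX1001-based encoding.

```python
def convert_via_unicode(data: bytes, src: str, dst: str) -> bytes:
    """Convert between two legacy encodings using Unicode as the pivot."""
    text = data.decode(src)    # step 1: legacy -> Unicode
    return text.encode(dst)    # step 2: Unicode -> legacy


# Sample input (an assumption for illustration): two Chinese characters
# that exist in both Big5 and KS X 1001.
big5_bytes = "中文".encode("big5")
euc_kr_bytes = convert_via_unicode(big5_bytes, "big5", "euc_kr")
print(euc_kr_bytes.decode("euc_kr"))
```

A character that has no mapping in the target encoding makes the second step raise `UnicodeEncodeError`, so a real converter would need a policy for such characters.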
> I regard Unicode as a substrate (not as a competitor) upon
> which legacy encodings are built.
>
> > > Legacy encodings
> > > will dominate even in the future, because they are compact and
> > > inexpensive.
> >
> > While I do expect the transition to Unicode to take some time, once
> > some of the older browsers die off it may shift more rapidly than we
> > think.
>
> I am neither a Unicode expert nor a character expert. But every day I feel
> the strong inertia toward legacy encodings in our local language
> communities. A language-tagging-enabled text format like HTML will
> lengthen the lifespan of legacy encodings considerably and allow
> legacy-encoded HTML text to be interchanged internationally without
> problems.
>
> Soobok Lee
>
> > Mark
> > —————
> > Γνῶθι σαυτόν — Θαλῆς
> > [For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
> >
> > http://www.macchiato.com
> >
> > ----- Original Message -----
> > From: "Soobok Lee" <lsb@postel.co.kr>
> > To: "IETF idn working group" <idn@ops.ietf.org>
> > Sent: Friday, March 22, 2002 02:04
> > Subject: Re: [idn] URL encoding in html page
> >
> > > ----- Original Message -----
> > > From: "Bruce Thomson" <bthomson@fm-net.ne.jp>
> > > To: "Soobok Lee" <lsb@postel.co.kr>; "IETF idn working group" <idn@ops.ietf.org>
> > > Sent: Friday, March 22, 2002 6:29 PM
> > > Subject: Re: [idn] URL encoding in html page
> > >
> > > > > What if all the viewable HTML text is in English, but only the
> > > > > href URL contains legacy (Korean) encoded hostnames? Chinese
> > > > > visitors would see a clean English homepage, but fail to click
> > > > > through the Korean link.
> > > >
> > > > Well, that could happen, but a META tag would solve that so easily.
> > > > Personally I often use a simple text editor to deal with HTML, and
> > > > would find it easier to use legacy encodings or UTF-8 than
> > > > cut-and-paste ACE from somewhere.
> > > > Of course the user could do it either way and it would work.
> > >
> > > Yes, charset META tags help. But many homepages make assumptions
> > > about their main audience's default character encoding and very
> > > often omit the META tag for the encoding, such as:
> > > <meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
> > >
> > > Moreover, an IDN URL could be used in a pure FRAMESET document that
> > > defines frame URLs and contains no viewable text. Such FRAMESET
> > > documents often omit charset META tags.
> > > (Look at the HTML source of http://www.freeway.co.kr/ )
> > >
> > > AFAIK, 99.99999% of Korean homepages have an implicit or explicit
> > > legacy Korean encoding (KS_C_5601-1987 or euc-kr). So do most
> > > Japanese and Chinese homepages.
> > > UTF-8/UCS-2 encodings are rarely used in global web publishing.
> > > Legacy encodings will dominate even in the future, because they
> > > are compact and inexpensive.
> > >
> > > If we want to make IDN truly internationally interoperable, all
> > > IDN-aware web browsers and applications should contain libraries
> > > of all kinds of legacy-to-Unicode conversion routines. That would
> > > put too much memory load on handheld devices like PDAs.
> > >
> > > Moreover, legacy encodings are revised separately from Unicode. We
> > > may face versioning problems as tough as those we faced with
> > > stringprep/nameprep for newly added Unicode code points.
> > > How can we guarantee the stability and integrity of IDN operations
> > > across all combinations of the many kinds and versions of IDN-aware
> > > applications and legacy encodings?
> > >
> > > Soobok Lee