
Re: [idn] An ignorant question about TC<-> SC



Martin,

For better or worse, your explanation and history are somewhat
revisionist.   While I believe the UTC has done an outstanding
job in more recent years to try to clean things up and
generalize them --in the directions you suggest-- my strong
recollection of the battles over "UCS" and what became 10646
follows.   NOTE: most of the WG may not want (or need) to read
this.  It goes over ground that we've been over before, and I am
writing this primarily in the hope that we can avoid cycling
through it again, in pieces.  The conclusion is that
Unicode/10646 is what we have today, and what we are stuck with,
and that arguments about history and design criteria are largely
pointless.

For those who want the history, at least as I remember it, read
on...

The original UCS effort, leading to what was known as 10646 
DIS-1, was focused on an effort whose criteria included a strong
focus on comparability and generalized applications use.  That
effort included a rather extensive liaison/requirements
statement from the then-ISO TC97 subcommittee on programming
languages to the then TC97 SC2 that was accepted by SC2.  That
statement of requirements asked for a UCS that contained many of
the properties that we've discovered in the last few years would
be really helpful if they were present in Unicode, but aren't
there.  For example, there was to be no mixing of precomposed
characters and those that were built up as a sequence of code
points, using either non-spacing or overstrike ideas.   And
there was to be no overlaying of characters that looked alike
but were different.   

This is obviously the dream of someone doing string comparisons
and report-writing (which programming language people do, or
did, very often), since every character image seen on a page
would represent exactly the same string length in a computer,
i.e., 
  length-in-visual-units (string) =
      length-in-character-units (string)
always.
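As an illustration (my sketch, not part of the original requirements work), here is why that identity fails in Unicode as it stands: the same visible character can be encoded either as one precomposed code point or as a base character plus a combining mark.

```python
import unicodedata

# The same visible "e with acute accent" can be encoded two ways:
precomposed = "\u00e9"   # single precomposed code point
combining = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# The two forms are canonically equivalent (NFC maps one to the other)...
assert unicodedata.normalize("NFC", combining) == precomposed

# ...yet their lengths in code points differ, so
#   length-in-visual-units(string) != length-in-character-units(string)
print(len(precomposed))  # 1
print(len(combining))    # 2
```

Under the original UCS criteria (no mixing of precomposed and built-up characters), the two lengths would have been forced to agree.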

And, e.g., what I see as the hardest part of the TC <-> SC
problem would not have existed, since there would have been no
confusion between, e.g., a Chinese character and a
similar-looking Korean or Japanese one.

The result, coming into the DIS-1 vote, did not completely meet
those criteria (for example, it did contain provision for
overlaying glyphs), although it was arguably close to it.  It
was also large (requiring four octets) and, in some ways, fairly
unwieldy.  And, of course, by virtue of not having overlaying,
transposition from written text in an unknown language into a
coded character set representation was a far worse problem than
the one we struggle with today.  And, if I recall, the draft
hadn't preserved contiguity and ordering of important and
frequently used existing character sets as block subsets.

In the ISO ballot resolution process, supposedly to handle some
negative votes against DIS-1, the then-complete proposal was
replaced, as an editorial substitution, with the then-current
version of Unicode.  Doing it as an editorial substitution
prevented any further review by interests within most (or any?)
of the ISO Member Bodies, and made it impossible for the other
subcommittees that had stated requirements to SC2 to comment on
the new version.  

The Unicode standard of that time had come out of a consortium
of, essentially, printer manufacturers.  Their primary design
criteria apparently included preservation of existing and
heavily-used character sets as proper block subsets (to avoid
having to change ROMs); keeping to a minimum number of octets
per character, especially [then] frequently-used characters
(memory was expensive and printer memory (and "board real
estate") was even more expensive than computer memory); and, to
the extent that they didn't conflict with the above, "if they
look alike, they are alike" and "if it is possible to think of
it as a font difference, it is" rules.  And, using those rules,
they assumed that 16 bits would be enough, forever.

At this late date, the most reasonable thing that can be said
about the difference between the approaches represented by
Unicode and 10646 DIS-1 is that they started from different
criteria and made the tradeoffs differently.  Which approach is
"better" depends on where one starts and what one considers most
important.   And both represent compromises with their original
criteria, due to conflicts among their requirements, so neither
was (or would have been) completely consistent internally.

Any complex, international, character set effort that started
more or less from scratch would face similar tradeoffs, e.g.:

	* preservation of the properties and locality of
	existing, heavily-used character sets versus optimality
	for, e.g., collation (for example, in a mixed case
	language in which both cases are expected to be used
	regularly, it would be much more convenient to have the
	characters arranged as 
	  {Upper1, Lower1, Upper2, Lower2, ... }
	rather than
	  {Upper1, Upper2, ..., Lower1, Lower2, ... }

	* Accurate transcription from a printed form, especially
	in an unknown language, into the coded character set
	versus keeping languages distinct or scripts together.
	
	* String comparisons versus the compactness of a
	character set
	
	* Overall size versus avoiding "unification".

and so on.  
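To make the first tradeoff concrete (my own sketch, not anything from the standards discussions): Unicode inherited ASCII's grouped arrangement, with all upper-case letters before all lower-case ones, so a naive code-point sort separates the cases, while the interleaved arrangement would sort naturally.

```python
words = ["apple", "Zebra", "Banana"]

# Grouped arrangement {A, B, ..., Z, a, b, ...}: a naive code-point
# sort puts every upper-case word before every lower-case word.
print(sorted(words))  # ['Banana', 'Zebra', 'apple']

def interleaved_key(s):
    # Hypothetical key simulating the interleaved arrangement
    # {Upper1, Lower1, Upper2, Lower2, ...}: compare by letter first,
    # with upper case ordered just before its lower-case partner.
    return [(c.lower(), c.islower()) for c in s]

print(sorted(words, key=interleaved_key))  # ['apple', 'Banana', 'Zebra']
```

Real collation (e.g., the Unicode Collation Algorithm) solves this with tables in a separate layer, at the cost of exactly the kind of extra machinery the interleaved code-point arrangement would have avoided.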

And, in my personal opinion (then and now) the Unicode designers
were a little naive about the number of characters that a UCS
would ultimately need to accommodate and were a little too biased
in their thinking by their then-existing market.  That market
was mostly North American and Western European, their products
were deployed primarily in those markets, and backward
compatibility with existing devices and thinking probably
impacted optimization decisions about the character set's
organization.   But, as I said at the beginning, times have
changed -- the current UTC is clearly trying to work around the
design limitations imposed by that starting point.

The bottom line is that Unicode is what we have.  It is lots
better for our purposes than its 1.0 version.  ISO 10646 DIS-1
is an historical footnote (as is this entire note).  We _need_
to figure out how to make what we have work.  That implies
either abandoning other sorts of ideas, or ideas that would
depend on an entirely different coding structure.  Or we need to
move other sorts of ideas into additional layers where we can
create more flexibility than the DNS gives us.  Or, of course,
we can get used to the idea that reaching a conclusion on
internationalized domain names, or even internationalized access
to them, will take a _very_ long time.   

After all, it has taken us thousands of years to create the
large variety of languages and writing systems (some might say
"the total mess") we have in the world today.  Perhaps it is
hubris to believe that we can reconcile all of the differences
as part of a DNS effort.  But, since the alternative is giving
up, and the requirements are important, I think we need to focus
in on the important and solvable problems and try to move
forward, rather than getting stuck on the historical sequences
that got us here, or on interesting research ideas, or on
unsolvable problems.

     john




--On Friday, 26 October, 2001 16:31 +0900 Martin Duerst
<duerst@w3.org> wrote:

> At 02:37 01/10/25 -0700, liana Ye wrote:
> 
>> In another word, the fundimental work from UCS is the to table
>> glyphs based on their visual distinction.
> 
> That is simply not true. The fundamental work is to make
> sure that the result is best usable by average users for
> their average purpose (writing electronic texts).
> 
> There have been a lot of requests to use finer distinctions,
> e.g. from researchers or from people who publish dictionaries.
> They were rejected, because they would confuse the average
> user and the average use more than they would help.
>...