Unicode/10646 History (was Re: [idn] An ignorant question about TC<-> SC)



The history of the Unicode/10646 merger is really irrelevant to TC<->SC, but
there are some inaccuracies that should not be left hanging.


> This is obviously the dream of someone ...

Your dream of having every character image seen on a page representing
exactly the same string length in a computer basically only works for ASCII.
Most of the properties that you find unhelpful in Unicode are a property of
human scripts, and many were already in DIS-1 (the 10646 DIS that failed its
ballot). Scripts such as Arabic and Hebrew, and those used for most South
and Southeast Asian languages, require combining marks, so for them your
equation is broken. For
compatibility with the ISO bibliographic standards, combining marks are also
required for Latin, Cyrillic and Greek.
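
To make the point concrete in modern terms, here is a tiny Python
illustration (not part of the original discussion) of how the combining-mark
model breaks the "one visual unit = one code unit" equation:

    # One visual unit, two code points: "a" followed by a combining
    # acute accent displays as a single accented letter.
    s = "a\u0301"
    print(s)        # renders as one glyph in a capable terminal
    print(len(s))   # 2 -- length in code points, not in visual units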

What the Unicode standard did put into place is a programmatic mechanism for
*identifying* sequences that should always be visually indistinguishable and
should always be considered equivalent -- the equivalency issue happens even
if there are no precomposed characters simply because of the possibility of
different orderings of combining characters. So there is a well-defined,
standard mechanism, rather than leaving the identification up to each
individual implementer and ending up with thousands of implementations,
each with its own rules.
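
As an illustration of that mechanism, here is a minimal sketch using
Python's standard unicodedata module (a modern convenience, not part of the
history being described). Normalization identifies precomposed and combined
sequences, and arranges combining marks into a single canonical order:

    import unicodedata

    precomposed = "\u00E9"    # LATIN SMALL LETTER E WITH ACUTE
    decomposed  = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

    # The raw code point sequences differ...
    print(precomposed == decomposed)   # False
    # ...but their canonical (NFC) forms are identical.
    print(unicodedata.normalize("NFC", precomposed) ==
          unicodedata.normalize("NFC", decomposed))   # True

    # Different orderings of combining marks are identified as well:
    # dot-below and dot-above have distinct combining classes, so
    # canonical ordering puts them into one standard sequence.
    a = "q\u0323\u0307"   # q + dot below + dot above
    b = "q\u0307\u0323"   # q + dot above + dot below
    print(unicodedata.normalize("NFD", a) ==
          unicodedata.normalize("NFD", b))   # True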

DIS-1 suffered from a number of severe flaws:

* Since no code points were permitted where any byte was a C0 or C1 control,
there were only 35,271 possible code points on the BMP, instead of 64K.
Moreover, there were compatibility issues with C wide characters, since
zero-extended ASCII was not a valid DIS 10646 wide char.

* The 2-byte form in DIS-1 was variable -- one would use announcer sequences
to pick subsets in the 2-byte form. This is ISO 2022 all over again, and
horrible to process. Industry adoption would have been *severely*
compromised by the unpalatable choice of either quadrupling text memory (it
was hard enough to get vendors to move to 16 bits per character)
or having a 2022-like mess to deal with. There would still be no getting
around the multiple representations for what would otherwise be
indistinguishable characters shared by the CJK languages.

* There was no UTF-8: the first UTF (UTF-1) was developed by Unicode to
encapsulate the merged Unicode/10646 in a sequence of compatible bytes. That
UTF was superseded by the superior UTF-8, but the principles were the same.
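
For readers unfamiliar with the idea, here is a rough Python sketch of the
UTF-8 byte layout (the principle referred to above: ASCII passes through
unchanged, everything else becomes a lead byte plus continuation bytes; a
real codec also rejects surrogates and out-of-range values):

    def utf8_encode(cp):
        # Sketch only: omits the error handling a real codec needs.
        if cp < 0x80:                # ASCII: 1 byte, unchanged
            return bytes([cp])
        elif cp < 0x800:             # 2 bytes: 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6),
                          0x80 | (cp & 0x3F)])
        elif cp < 0x10000:           # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        else:                        # 4 bytes, up to U+10FFFF
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])

    assert utf8_encode(0x00E9) == "\u00E9".encode("utf-8")   # 0xC3 0xA9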

For these and other reasons, DIS-1 failed its ballot. The result of the
merger between Unicode and 10646 is far better, has been widely implemented,
and meets the requirements of many organizations: software vendors, W3C,
ECMA, and the national standards bodies of Japan, China, the Republic of
Korea, Denmark, Sweden, and many other countries that have adopted and
translated the ISO standard.

> Doing it as an editorial substitution
> prevented any further review by interests within most (or any?)
> of the ISO Member Bodies, nor was it possible for the other
> subcommittees who had stated requirement to SC2 to comment on
> the new version.

There was a rather long ballot period for the DIS, with ample time for
comment by both member bodies and other ISO subcommittees.

> The Unicode standard of that time had come out of a consortium
> of, essentially, printer manufacturers.

We have heard the "printer company" urban legend before. Some of the
companies originally involved might be thought of as "printer companies",
but most were not. For example, Microsoft and IBM are not usually thought of
as "printer companies"! From the beginning, the focus was on Unicode as an
encoding that would work as an internal process code in implementations, as
well as an encoding for interchange. Items that *look* like printer-ROM
hacks, like the addition of the Zapf dingbats, were pushed by software
companies like WordPerfect, and not by HP, Apple, or Xerox.

We are very curious as to the origin of this "printer company" story; where
did you hear this?

> Their primary design
> criteria apparently included

The design criteria were not at all as you state:

> preservation of existing and
> heavily-used character sets as proper block subsets (to avoid
> having to change ROMs); keeping to a minimum number of octets

Both DIS-1 and Unicode segmented scripts into blocks, and generally ordered
characters within a script to match the ordering in existing standards
unless there was reason to change it, and both preserved Latin-1; but that
was just the sensible thing to do. There was no notion of keeping printer
ROMs unchanged.

> keeping to a minimum number of octets per character, especially [then]
> frequently-used characters
> (memory was expensive and printer memory (and "board real
> estate") was even more expensive than computer memory);

Unicode was designed as a 16-bit standard. Before the merger, there was no
UTF-8. The design was precisely 2 octets per character, independent of
frequency.

>  and, to
> the extent that they didn't conflict with the above, "if they
> look alike, they are alike" and "if it is possible to think of
> it as a font difference, it is" rules.

The design criterion was to encode characters, not glyphs. For more on the
difference, see http://www.unicode.org/unicode/reports/tr17/

> And, using those rules,
> they assumed that 16 bits would be enough, forever.

Yes, under the original design principles of Unicode -- although not
matching your list -- 16 bits would have been enough for the commercially
significant characters of the world. (That depended on the use of IDS-like
mechanisms for uncommon CJK characters, using jamo sequences instead of
composed Hangul syllables, etc.) One of the changes brought about by the
merger was the recognition that in the new merged architecture -- and with
the increased number of presentation forms -- this was insufficient. This led
to the development of UTF-16.
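
A minimal sketch of the UTF-16 mechanism, in Python for concreteness: a
supplementary code point is offset by 0x10000 and its remaining 20 bits are
split across a high and a low surrogate.

    def to_surrogate_pair(cp):
        # Valid only for supplementary code points U+10000..U+10FFFF.
        assert 0x10000 <= cp <= 0x10FFFF
        v = cp - 0x10000
        return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

    # U+1D11E MUSICAL SYMBOL G CLEF
    assert to_surrogate_pair(0x1D11E) == (0xD834, 0xDD1E)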

> * preservation of the properties and locality of
> existing, heavily-used character sets versus optimality
> for, e.g., collation (for example, in a mixed case
> language in which both cases are expected to be used
> regularly, it would be much more convenient to have the
> characters arranged as
>           {Upper1, Lower1, Upper2, Lower2, ... }
> rather than
>   {Upper1, Upper2, ..., Lower1, Lower2, ... }

This actually does *not* solve the collation problem for any language that I
know of. Case (and usually accent) differences are at a different level than
base letter differences. They *cannot* be solved by a rearrangement of
codes.  See UCA (or the synchronized, equivalent ISO standard ISO/IEC 14651)
for more details (http://www.unicode.org/unicode/reports/tr10/).
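
A toy Python sketch of why levels matter (the weights below are invented
for illustration and are not UCA data): all primary (base letter)
differences are compared before any secondary (case) difference, across the
whole string, which no single per-character code assignment can reproduce.

    PRIMARY   = {"a": 1, "A": 1, "b": 2, "B": 2}   # base letter
    SECONDARY = {"a": 0, "A": 1, "b": 0, "B": 1}   # case

    def sort_key(s):
        # Compare every primary weight first, then the secondary weights.
        return ([PRIMARY[c] for c in s], [SECONDARY[c] for c in s])

    print(sorted(["ba", "Ab", "ab"], key=sort_key))   # ['ab', 'Ab', 'ba']
    print(sorted(["ba", "Ab", "ab"]))                 # ['Ab', 'ab', 'ba']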

However, you are right in terms of implementation that if upper/lowercase
differences had always been at a fixed offset, it would be somewhat easier
to build more compact mapping tables for upper/lower/titlecasing operations.
And there are other cases where in hindsight a rearrangement of codes would
have led to simpler implementations.
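
ASCII itself shows the fixed-offset trick: lowercase letters sit exactly
0x20 above their uppercase partners, so case mapping is one range test plus
an addition rather than a per-character table. A small Python sketch:

    def to_lower_ascii(cp):
        # 'A'..'Z' (0x41..0x5A) map to 'a'..'z' by adding 0x20.
        return cp + 0x20 if 0x41 <= cp <= 0x5A else cp

    assert chr(to_lower_ascii(ord("G"))) == "g"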

> * Accurate transcription from a printed form, especially
> in an unknown language, into the coded character set
> versus keeping languages distinct or scripts together.

I see what you are trying to get at, but *no* character standard attempts to
preserve printed forms (see the discussion of characters vs glyphs).

This is perhaps better stated as:

* Determining the set of underlying abstract characters -- and the
interactions among them -- that are to represent the wide variety of glyphs
used to write different languages.

> * String comparisons versus the compactness of a
> character set
>
> * Overall size versus avoiding "unification".

As in the last point -- *every* character standard "unifies" glyphs into
abstract characters. GB is unified. JIS is unified. The rules for
unification used by the IRG were developed from the original rules used in
unifying JIS. Even Latin-1 is unified: after all, we could also have encoded
different Ö, Ü, á, etc. for French, German, Polish, etc., since there are
different typographic standards in those countries for the placement and
shape of accents.

>
> and so on.
>
> And, in my personal opinion (then and now) the Unicode designers
> were a little naive about the number of characters which a UCS
> would ultimately need to accommodate and were a little too biased
> in their thinking by their then-existing market.  That market
> was mostly North American and Western European, their deployed
> products were primarily deployed in those markets, and backward
> compatibility with existing devices and thinking probably
> impacted optimization decisions about the character set's
> organization.   But, as I said at the beginning, times have
> changed -- the current UTC is clearly trying to work around the
> design limitations imposed by that starting point.

This is absolutely untrue. The developers of Unicode had worked
*extensively* with non-Roman markets. Although many of the developers were
Americans, many were not. And the companies forming Unicode had active
subsidiaries in a wide range of non-Roman markets -- including CJK, and
those subsidiaries certainly made their issues known.

Joe Becker was up front in 1989 that the number of "things" that people
would eventually want to represent was somewhere in the vicinity of 250,000:
that estimate has held up remarkably well, 12 years later. The naiveté was
not about the number of characters, but rather about the prospects for
constraining the architecture of the UCS (as discussed above) and for using
the Private Use Area for archaic scripts, special-use symbols, etc.

>
> The bottom line is that Unicode is what we have.  It is lots
> better for our purposes than its 1.0 version.  ISO 10646 DIS-1
> is an historical footnote (as is this entire note).  We _need_
> to figure out how to make what we have work. That implies
> either abandoning other sorts of ideas, or ideas that would
> depend on an entirely different coding structure.

Yes, Unicode/10646 is what we have now. It is not perfect -- those of us who
worked on it from the beginning know the warts the best! But most of the
complexities are simply due to the complexities of human scripts, and the
Unicode consortium continually works to enhance the standard, data tables,
properties, technical reports, technical standards, guidelines, and other
information. This information is absolutely vital for producing
interoperable implementations.

Unicode/10646 is in widespread use, has a large body of implementations and
is the foundation for any system that is to deal with text around the world.

> Or we need to
> move other sorts of ideas into additional layers where we can
> create more flexibility than the DNS gives us.  Or, of course,
> we can get used to the idea that reaching a conclusion on
> internationalized domain names, or even internationalized access
> to them, will take a _very_ long time.
>
> After all, it has taken us thousands of years to create the
> large variety of languages and writing systems (some might say
> "the total mess") we have in the world today.  Perhaps it is
> hubris to believe that we can reconcile all of the differences
> as part of a DNS effort.  But, since the alternative is giving
> up, and the requirements are important, I think we need to focus
> in on the important and solvable problems and try to move
> forward, rather than getting stuck on the historical sequences
> that got us here, or on interesting research ideas, or on
> unsolvable problems.

Your bottom line is core, and absolutely correct: "I think we need to focus
in on the important and solvable problems and try to move forward".

I only take the time to correct some of the items above because a mistaken
impression of the process of development of Unicode and ISO 10646 might lead
people to have a mistaken impression of the quality of Unicode and 10646, or
the organizations behind them.

>      john



Mark


—————

Δός μοι ποῦ στῶ, καὶ κινῶ τὴν γῆν ("Give me a place to stand, and I will move the earth" -- Archimedes)
[http://www.macchiato.com]

----- Original Message -----
From: "John C Klensin" <klensin@jck.com>
To: "Martin Duerst" <duerst@w3.org>
Cc: <idn@ops.ietf.org>
Sent: Friday, October 26, 2001 02:37
Subject: Re: [idn] An ignorant question about TC<-> SC


> Martin,
>
> For better or worse, your explanation and history is somewhat
> revisionist.   While I believe the UTC has done an outstanding
> job in more recent years to try to clean things up and
> generalize them --in the directions you suggest-- my strong
> recollection of the battles over "UCS" and what became 10646
> follows.   NOTE: most of the WG may not want (or need) to read
> this.  It goes over ground that we've been over before, and I am
> writing this primarily in the hope that we can avoid cycling
> through it again, in pieces.  The conclusion is that
> Unicode/10646 is what we have today, and what we are stuck with,
> and that arguments about history and design criteria are largely
> pointless.
>
> For those who want the history, at least as I remember it, read
> on...
>
> The original UCS effort, leading to what was known as 10646
> DIS-1, was focused on an effort whose criteria included a strong
> focus on comparability and generalized applications use.  That
> effort included a rather extensive liaison/ requirements
> statement from the then-ISO TC97 subcommittee on programming
> languages to the then TC97 SC2 that was accepted by SC2.  That
> statement of requirements asked for a UCS that contained many of
> the properties that we've discovered in the last few years would
> be really helpful if they were present in Unicode, but aren't
> there.  For example, there was to be no mixing of precomposed
> characters and those that were built up as a sequence of code
> points, using either non-spacing or overstrike ideas.   And
> there was to be no overlaying of characters that looked alike
> but were different.
>
> This is obviously the dream of someone doing string comparisons
> and report-writing (which programming language people do, or
> did, very often), since every character image seen on a page
> would represent exactly the same string length in a computer,
> i.e.,
>   length-in-visual-units (string) =
>       length-in-character-units (string)
> always.
>
> And, e.g., what I see as the hardest part of the TC <-> SC
> problem would not have existed, since there would have been no
> confusion between, e.g., a Chinese character and a
> similar-looking Korean or Japanese one.
>
> The result, coming into the DIS-1 vote, did not completely meet
> those criteria (for example, it did contain provision for
> overlaying glyphs), although it was arguably close to it.  It
> was also large (requiring four octets) and, in some ways, fairly
> unwieldy.  And, of course, by virtue of not having overlaying,
> transposition from written text in an unknown language into a
> coded character set representation was a far worse problem than
> the one we struggle with today.  And, if I recall, the draft
> hadn't preserved contiguity and ordering of important and
> frequently used existing character sets as block subsets.
>
> In the ISO ballot resolution process, supposedly to handle some
> negative votes against DIS-1, the then-complete proposal was
> replaced, as an editorial substitution, with the then-current
> version of Unicode.  Doing it as an editorial substitution
> prevented any further review by interests within most (or any?)
> of the ISO Member Bodies, nor was it possible for the other
> subcommittees who had stated requirement to SC2 to comment on
> the new version.
>
> The Unicode standard of that time had come out of a consortium
> of, essentially, printer manufacturers.  Their primary design
> criteria apparently included preservation of existing and
> heavily-used character sets as proper block subsets (to avoid
> having to change ROMs); keeping to a minimum number of octets
> per character, especially [then] frequently-used characters
> (memory was expensive and printer memory (and "board real
> estate") was even more expensive than computer memory); and, to
> the extent that they didn't conflict with the above, "if they
> look alike, they are alike" and "if it is possible to think of
> it as a font difference, it is" rules.  And, using those rules,
> they assumed that 16 bits would be enough, forever.
>
> At this late date, the most reasonable thing that can be said
> about the difference between the approaches represented by
> Unicode and 10646 DIS-1 is that they started from different
> criteria and made the tradeoffs differently.  Which approach is
> "better" depends on where one starts and what one considers most
> important.   And both represent compromises with their original
> criteria, due to conflicts among their requirements, so neither
> was (or would have been) completely consistent internally.
>
> Any complex, international, character set effort that started
> more or less from scratch would face similar tradeoffs, e.g.:
>
> * preservation of the properties and locality of
> existing, heavily-used character sets versus optimality
> for, e.g., collation (for example, in a mixed case
> language in which both cases are expected to be used
> regularly, it would be much more convenient to have the
> characters arranged as
>           {Upper1, Lower1, Upper2, Lower2, ... }
> rather than
>   {Upper1, Upper2, ..., Lower1, Lower2, ... }
>
> * Accurate transcription from a printed form, especially
> in an unknown language, into the coded character set
> versus keeping languages distinct or scripts together.
>
> * String comparisons versus the compactness of a
> character set
>
> * Overall size versus avoiding "unification".
>
> and so on.
>
> And, in my personal opinion (then and now) the Unicode designers
> were a little naive about the number of characters which a UCS
> would ultimately need to accommodate and were a little too biased
> in their thinking by their then-existing market.  That market
> was mostly North American and Western European, their deployed
> products were primarily deployed in those markets, and backward
> compatibility with existing devices and thinking probably
> impacted optimization decisions about the character set's
> organization.   But, as I said at the beginning, times have
> changed -- the current UTC is clearly trying to work around the
> design limitations imposed by that starting point.
>
> The bottom line is that Unicode is what we have.  It is lots
> better for our purposes than its 1.0 version.  ISO 10646 DIS-1
> is an historical footnote (as is this entire note).  We _need_
> to figure out how to make what we have work.  That implies
> either abandoning other sorts of ideas, or ideas that would
> depend on an entirely different coding structure.  Or we need to
> move other sorts of ideas into additional layers where we can
> create more flexibility than the DNS gives us.  Or, of course,
> we can get used to the idea that reaching a conclusion on
> internationalized domain names, or even internationalized access
> to them, will take a _very_ long time.
>
> After all, it has taken us thousands of years to create the
> large variety of languages and writing systems (some might say
> "the total mess") we have in the world today.  Perhaps it is
> hubris to believe that we can reconcile all of the differences
> as part of a DNS effort.  But, since the alternative is giving
> up, and the requirements are important, I think we need to focus
> in on the important and solvable problems and try to move
> forward, rather than getting stuck on the historical sequences
> that got us here, or on interesting research ideas, or on
> unsolvable problems.
>
>      john
>
>
>
>
> --On Friday, 26 October, 2001 16:31 +0900 Martin Duerst
> <duerst@w3.org> wrote:
>
> > At 02:37 01/10/25 -0700, liana Ye wrote:
> >
> >> In other words, the fundamental work of the UCS is to
> >> table glyphs based on their visual distinction.
> >
> > That is simply not true. The fundamental work is to make
> > sure that the result is best usable by average users for
> > their average purpose (writing electronic texts).
> >
> > There have been a lot of requests to use finer distinctions,
> > e.g. from researchers or from people who publish dictionaries.
> > They were rejected, because they would confuse the average
> > user and the average use more than they would help.
> >...
>