
Re: Unicode/10646 History (was Re: [idn] An ignorant question about TC<->SC)



Mark,

It probably isn't worth your time, my time, or especially that
of the WG to go through this in detail.  The bottom line is that
we would be stuck with Unicode if it were an act of complete
beauty that had resolved all tradeoffs to everyone's
satisfaction for all purposes, and we would be stuck with it if
it got many of those tradeoffs wrong from the standpoint of
IDN/DNS use (regardless of whether they were made optimally from
other perspectives).   And, while both Eric and I might be
faulted for not adopting a more positive and constructive tone
in discussions on this subject, his experience and observations
during the relevant period are consistent with mine and I won't
spend time and bandwidth repeating his observations.

That said, three comments that seem to need making:

(i) The impression that the original Unicode effort was driven
by printer vendors (and, as Eric suggests, printer-focused
groups within companies with broader interests) came from two
sources.  One was a series of statements from them to various
other ISO and ANSI committees and liaison groups that made that
quite clear.  The other was comments from standards
representatives from several of those companies who, when asked
about particular properties of Unicode, produced responses that
might be summarized by "the printer guys made us do it".

There are other seeming anachronisms in your version of the
story.  For example, the original design basis for 10646 was a
purely 32-bit character set, so a criticism on the basis of
what fit into "the BMP" is a little strange -- while there were
some early attempts to push 16-bit subsets (mostly from printer
vendors, if I recall), unless my memory is failing severely,
the concept of a "BMP" originated with the Unicode merger.

(ii) At no point did I mean to imply that DIS-1 was a perfect
solution (to anything).  As you point out, the effort had
abandoned the "one code point, one glyph-width" principle.  As
Jon points out, they had botched Hebrew and several other
things. And those are two examples among many.  Could it have
been "fixed"?  Perfectly, almost certainly not, although we
would have ended up with a different set of tradeoffs and
(intentional or not) optimizations (see below).  Was it closer
to its original design criteria than the Unicode version that
substituted for it?  Almost certainly (and not surprisingly)
yes.  I did go to some lengths to suggest that the UTC has done good
work to fix some of the difficulties Unicode inherited from its
early history -- it is significantly closer today to the balance
of design goals originally set in TC97 for the UCS than it was
when the substitution was made.

(iii) The real point of my raising those historical issues is
the one you seem to have missed, so let me assume I wasn't
clear and say it explicitly.  As I hope most of the
participants in
IDN have long ago figured out, this business of trying to create
a single "UCS" is one that involves many complex tradeoffs and
for which there are often no easy answers.  If one of those
tradeoffs is resolved in one way, some applications get easier
and others get harder.  There is even a case to be made that the
stated current design criteria for Unicode are not completely
orthogonal, leading to even more tradeoffs.

To give just a few examples:

	* Keeping scripts together and preserving widely-used
	earlier CCSs as blocks is A Good Thing.  But having
	exactly one code point associated with a given
	glyph/letter shape is also A Good Thing.  One can't have
	both (see the letter-shape sketch following this list).
	
	* Han unification provides many benefits.  Han
	unification also causes some problems (one of which, in
	the present instance, is that one appears to need
	metadata to map between TC and SC without doing nasty
	things to Kanji; see the TC->SC sketch following this
	list).  One cannot both do unification and not do
	unification.
	
	* There are several different ways to handle character
	presentation ordering (e.g., right-to-left/left-to-right),
	especially in the edge cases in which right-to-left and
	left-to-right (or one of them and top-to-bottom) scripts
	are intermixed.  Similarly, there are multiple possible
	ways to handle optional vowels, word-breaking indicators,
	tone and stress markers, and so on.  In each case,
	different techniques are better for different
	circumstances and conditions; none is optimal for all
	cases.  No matter what one chooses, it will be suboptimal
	for some problems.  And, if one does not choose, but
	incorporates several of the possibilities, one will be
	subject to accusations of excessive complexity and too
	many options.
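
To make the letter-shape conflict above concrete, here is a
rough Python sketch (the choice of characters is mine, purely
for illustration): three visually identical capital letters
that get three distinct code points precisely because each
script keeps its own block.

import unicodedata

# Latin A, Greek capital Alpha, Cyrillic capital A: one letter
# shape, three code points, because scripts are kept in their
# own blocks rather than unified by shape.
for ch in ("\u0041", "\u0391", "\u0410"):
    print("U+%04X  %s" % (ord(ch), unicodedata.name(ch)))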
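
And, to make the Han-unification point concrete, here is a toy
TC->SC sketch in Python.  The mapping table below is
illustrative only, not real conversion data; the point is that
a purely code-point-level substitution has no way of knowing
whether it is looking at Chinese or Japanese text, which is
exactly where the metadata requirement comes from.

# Toy TC->SC table; real tables are far larger and still
# contain many-to-one mappings.
TC_TO_SC = {
    "\u767c": "\u53d1",  # 發 (fa1, "to issue") -> 发
    "\u9aee": "\u53d1",  # 髮 (fa4, "hair")     -> 发  (many-to-one)
    "\u98db": "\u98de",  # 飛 -> 飞
    "\u6a5f": "\u673a",  # 機 -> 机
}

def naive_tc_to_sc(text):
    # Blind per-code-point substitution, no language metadata.
    return "".join(TC_TO_SC.get(ch, ch) for ch in text)

# Chinese 理髮 ("haircut") comes out as 理发, as intended.
print(naive_tc_to_sc("\u7406\u9aee"))
# Japanese 飛行機 ("airplane") uses the same unified code points
# and gets "simplified" to 飞行机, which is wrong for Japanese.
print(naive_tc_to_sc("\u98db\u884c\u6a5f"))
# The reverse direction is worse: SC 发 could be TC 發 or 髮, so
# SC->TC is not even a function without extra context.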

In each case, and for many others, there are engineering
tradeoffs involved.  And, for each case, second-guessing the
decisions in the light of a particular application is a popular
sport, but one that accomplishes very little -- I don't see
alternatives to Unicode/10646, and the IETF is certainly not
going to invent one.  But I believe it is equally important that
all of us, even the members of the UTC, understand and remember
that Unicode _is_, ultimately, a product of choices among
engineering tradeoffs (which include its history).  And those
choices impose other engineering constraints that we need to
learn to live with and/or work around.  Unicode is not
divinely-inspired Received Wisdom, and we need to avoid the
kind of thinking that assumes it is.

regards,
    john

--On Wednesday, 31 October, 2001 12:00 -0800 Mark Davis
<mark@macchiato.com> wrote:

> The history of the Unicode/10646 merger is really irrelevant
> to TC<->SC, but there are some inaccuracies that should not be
> left hanging.
>...