Re: Unicode/10646 History (was Re: [idn] An ignorant question about TC<->SC)



John,

> It probably isn't worth your time, my time, or especially that
> of the WG to go through this in detail.  The bottom line is that
> we would be stuck with Unicode if it were an act of complete
> beauty that had resolved all tradeoffs to everyone's
> satisfaction for all purposes, and we would be stuck with it if
> it got many of those tradeoffs wrong from the standpoint of
> IDN/DNS use (regardless of whether they were made optimally from
> other perspectives).

As with Mark, I am in agreement with this assessment.

> That said, three comments that seem to need making:
> 
...

But in turn, I find that I still need to respond to some of
your comments.

> 
> There are other seeming anachronisms in your version of the
> story (e.g., the original design base for 10646 was purely as a
> 32-bit character set, so a criticism on the basis of what fit
> into "the BMP" is a little strange -- while there were some
> early attempts to push 16-bit subsets (mostly from printer
> vendors, if I recall), unless my memory is failing severely, the
> concept of a "BMP" originated with the Unicode merger.

This is the kind of reconstructed reality that Mark was objecting
to in his initial post. Memory is a tricky beast.

DP 10646 (which preceded DP2 10646 and DIS-1 10646, let alone
the result of the Unicode merger, which was DIS-2 10646, dated
26-December-1991) *already* had the concept *and* the exact
term "Basic Multilingual Plane". And incidentally, it was
Group 032 Plane 032, to be exact. SPACE was encoded as
G=032 P=032 R=032 C=032, *not* as U-00000020, which we have so
conveniently gotten used to in 10646 now.

10646 was *never* "purely...a 32-bit character set". It had an
architecture, from the start, which consisted of cells, rows,
planes, and groups, that together constituted a 32-bit encoding
space, but it was always a multiple-octet character set.
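
To make the cell/row/plane/group layout concrete, here is a minimal
Python sketch of the four-octet arithmetic. It assumes that the "032"
values above are decimal octet values (so 032 is 0x20); the function
names split_octets and join_octets are made up for illustration and
come from no draft of the standard.

```python
from typing import Tuple


def split_octets(code: int) -> Tuple[int, int, int, int]:
    """Split a 32-bit value into (group, plane, row, cell) octets."""
    if not 0 <= code <= 0xFFFFFFFF:
        raise ValueError("value does not fit in four octets")
    return (code >> 24) & 0xFF, (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF


def join_octets(group: int, plane: int, row: int, cell: int) -> int:
    """Combine (group, plane, row, cell) octets into one 32-bit value."""
    for octet in (group, plane, row, cell):
        if not 0 <= octet <= 0xFF:
            raise ValueError("each component must be a single octet")
    return (group << 24) | (plane << 16) | (row << 8) | cell


# SPACE as described for DP 10646, reading "032" as decimal 32 (assumption).
dp_space = join_octets(32, 32, 32, 32)

# SPACE after everything was rearranged down to the origin: U-00000020.
ucs4_space = 0x00000020

print(f"DP 10646 SPACE: {dp_space:08X}")                    # 20202020
print(f"U-00000020 as octets: {split_octets(ucs4_space)}")  # (0, 0, 0, 32)
```

Under that reading, the same character sits at 20202020 (hex) in the
DP layout and at 00000020 in the later one.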

DP 10646 had 7 forms of use: 1, 1A, 2, 2A, 3A, 4 and 5. The numbers
1 through 5 denoted single-byte, double-byte, triple-byte,
quadruple-byte, and a (limited) form of mixed-byte use, respectively,
with the "A" forms also allowing use of a SINGLE GRAPHIC CHARACTER
INTRODUCER (SGCI) byte. With the exception of form of use 4, these
forms were often referred to at the time as "compaction methods".

U.S. ballot comments on DP 10646 (dated 28 April 1989, Doc. X3L2/89-76)
requested that this be reorganized to 5 forms of use: 1, 2, 3, 4, 5
(as above), with two levels corresponding to use or non-use of
the SGCI. (Needless to say, that was before the Unicode advocates
had much influence on the wording of U.S. ballot comments on 10646.)

All that was drastically simplified later in the DIS-2 draft
(the product of the Unicode/10646 merger talks),
which dropped the 1-, 3-, and mixed-byte forms of use, the SGCI,
and the HOP (which had allowed in-stream announcements of
the forms of use). DIS-2 had just two forms of use: UCS-2 and
UCS-4 -- and that continues through to today in ISO/IEC 10646-1:2000.
UCS-2 was, of course, the (then) Unicode-compatible way of
using 10646. DIS-2 also dropped the C0/C1 restrictions on octets
and rearranged everything down to the origin point, so that
UCS-4 could be used as a wchar_t implementation, among other things.
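
As a rough illustration of the two surviving forms of use, the
following Python sketch serializes one code value as two octets and
as four octets. The helper names are hypothetical, and the
most-significant-octet-first ordering is an assumption made for the
example, not something stated above.

```python
import struct


def to_ucs2(code: int) -> bytes:
    """Serialize a code value as two octets (only values up to FFFF fit)."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("value does not fit in two octets")
    return struct.pack(">H", code)  # big-endian order assumed


def to_ucs4(code: int) -> bytes:
    """Serialize a code value as four octets."""
    if not 0 <= code <= 0xFFFFFFFF:
        raise ValueError("value does not fit in four octets")
    return struct.pack(">I", code)  # big-endian order assumed


space = 0x20
print(to_ucs2(space).hex())  # 0020
print(to_ucs4(space).hex())  # 00000020
```

The two-octet form can only reach the values that fit in 16 bits,
while the four-octet form covers the whole encoding space.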

As for the push for "16-bit subsets", the following note may
be a useful counter-tonic. This is from the official Chinese
national body comments, dated May 29, 1991, in their disapproval 
of DIS-1 10646:

"The current DIS 10646 allocates Chinese Hanzi, Japanese Kanji
and Korean Hanja into different planes, while the three zones of
B.M.P., I-01, I-10 and I-11, have not been defined yet. There is
not even any explaination [sic] found in the DIS text for the
vacancy. Therefore, the plane 032 can not be regarded as a real
Basic Multilingual Plane because of the absence of ideographs.
Consequently, we would like to request to include Unified
CJK Ideographs into BMP based on the structure proposed in
the above item 1. [Item 1 refers to removal of the C0/C1
restriction. --kenw] In order to avoid unnecessary duplicate
work, China is willing and pleased to contribute the document
of the repertoire of HCS [Han Character Set --kenw] as the 
basis for discussion."

Earlier, in the official Chinese national body comments on
DP2 10646, dated February 16, 1990, China commented in
their negative ballot on that draft:

"In the second DP, the sparate [sic] assignment of C/J/K
characters not only wastes the valuable B.M.P. and excludes
the unsimplified Chinese characters which have important
practical value, but also directly violates the principle of
Character encoding by script rather than language/country which
is laid down in DP10646. And the situation of one character
with more than one codes [sic] resulting from this will
cause serious frustration to the future multi-lingual
applications. Therefore, The Chinese National Body does not
approve the arrangement of Han Characters as specified in
the second DP 10646."

I don't think the Chinese national standards body could
realistically be considered a "printer vendor". These
comments speak both to the then-general desire to have
the Basic Multilingual Plane be a usable international
subset of the entire construct of 10646, and to
the Chinese disaffection with language/country-specific
encoding of Han characters and their requirement for a
meaningful Han unification in 10646.

> 
> (iii) The real points of my raising those historical issues were
> the one you seem to have missed, so let me assume I wasn't clear
> and say it explicitly.  As I hope most of the participants in
> IDN have long ago figured out, this business of trying to create
> a single "UCS" is one that involves many complex tradeoffs and
> for which there are often no easy answers.

On this we are in agreement. The problem is in misrepresentations
of Unicode and/or 10646 history in service of making the point.

> 
> To give just a few examples,...
> 
> 	* Keeping scripts together and preserving widely-used
> 	earlier CCSs as blocks is A Good Thing.    But having
> 	exactly one code point associated with a given glyph/
> 	letter shape is also A Good Thing. One can't have both.

For a limited subset of Latin/Greek/Cyrillic, having one
code point associated with a given glyph has (by some) been
considered A Good Thing. But in the general context of
a Universal Character Set, it is clearly A Bad Thing. That
approach leads to total botches of Arabic or Indic processing,
for example. And the tradeoff here has little to do with
preserving the structure of earlier CCSs as blocks.

> 	* Han unification provides many benefits.  Han
> 	unification also causes some problems (one of which, in
> 	the present instance, is that one appears to need
> 	metadata to map between TC and SC without doing nasty
> 	things to Kanji).  One cannot both do unification and
> 	not do unification.

Han unification has nothing to do with the TC/SC problem. There
are tradeoffs, but they aren't *this* tradeoff. Han
unification neither created nor eliminated the TC/SC
distinctions.

Regards,

--Ken