
Re: [idn] UTC recommendations on TC/SC



Eric, I'll interleave some responses below.

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[http://www.macchiato.com]
----- Original Message -----
From: "Eric Brunner" <brunner@nic-naa.net>
To: "Mark Davis" <mark@macchiato.com>
Cc: <idn@ops.ietf.org>; "Paul Hoffman / IMC" <phoffman@imc.org>
Sent: Sunday, September 02, 2001 18:13
Subject: Re: [idn] UTC recommendations on TC/SC


> > Thanks for your note. I would be glad to try to clarify it. The UTC
> > position does disagree with the tsconv document, in maintaining that
> > nameprep does not require the addition of TC/SC folding.
>
> The first being the product of a vendor consortia, the second a product of
> the .CN and .TW NICs, and the subject matter being in the context of dns
> identifiers. This is doubly awkward.

It is a bit misleading to characterize the Unicode Consortium as a 'vendor
consortia'; it would be just as misleading to apply the same moniker to the
W3C or IEEE. The consortium includes more than just vendors; I'll include
the current membership here, in case people find it useful:

Corporate Members

Adobe Systems, Inc.; Apple Computer, Inc.; Basis Technology Corporation;
Compaq Computer Corporation; Government of India Ministry of Information
Technology; Government of Pakistan, National Language Authority;
Hewlett-Packard Company; IBM Corporation; Justsystem Corporation; Microsoft
Corporation; NCR Corporation; Oracle Corporation; PeopleSoft, Inc.; Progress
Software Corporation; The Research Libraries Group, Inc. (RLG); Reuters,
Ltd.; RWS Group, LLC; SAP AG; Sun Microsystems, Inc.; Sybase, Inc.;
Trigeminal Software, Inc.; Unisys Corporation.

Associate Members

Agfa Monotype Corporation; Beijing Zhong Yi Electronics Co.; BMC Software,
Inc.; Booz, Allen, & Hamilton, Inc.; Cable & Wireless HKT Limited;
CDAC-Centre for Development of Advanced Computing; China Electronic
Information Technology Ltd.; The Church of Jesus Christ of Latter-day
Saints; Columbia University; Data Research Associates; DecoType, Inc.;
Endeavor Information Systems, Inc.; eNIC Corporation; epixtech, Inc.;
Ericsson Mobile Communications; eTranslate, Inc.; Ex Libris, Inc.;
GlobalMentor, Inc.; GlobalSight Corporation; The Government of Tamil Nadu,
India; iDNS; i-EMAIL.net Pte Ltd; Innovative Interfaces, Inc.; Internet Mail
Consortium; Langoo.com; Language Analysis Systems, Inc.; Language Technology
Research Center; Netscape Communications; Nokia; Nortel Networks; Novell;
OCLC, Inc.; Openwave Systems, Inc.; Optio Software; Palm, Inc.; Production
First Software; The Royal Library, Sweden; Sagent Technology, Inc.; SAS
Institute, Inc.; SHARE; Siebel Systems; SIL International; SIRSI
Corporation; SLANGSOFT; Software AG; StarTV - Satellite Television Asia
Region Ltd.; Symbian, Ltd.; Uniscape, Inc.; Verisign Global Registry
Services; VTLS, Inc.; WALID, Inc.; WordWalla, Inc.; Yet Another Society

>
> > I believe the consensus view* in the UTC is that SC/TC folding -- when
> > done correctly -- is quite complex (cf
> > http://cjk.org/cjk/c2c/c2cbasis.htm), that there is no established
> > standard that precisely defines the conversion, and
>
> I'm interested in the consensus view in the PRC, and in Taiwan, also.

As are we. The UTC, like other organizations, works with the best knowledge
available to it. If there is evidence that it is common to mix SC and TC
arbitrarily within names, and that users widely expect such arbitrary
mixtures always to match, then that would certainly be useful information
to bring forward.

For example, samples of publications -- books, magazines, journals, etc. --
in which the same names or phrases are spelled with different mixtures of SC
and TC (not just all SC or all TC) would be valuable.

>
> This next bit, starting at "since" to the end of the sentence.
>
> > that it is not a required feature of nameprep, since multiple
> > registrations will reasonably meet the user requirements. The latter
> > point is based on there being typically only two variant spellings: one
> > TC and the other SC.
>
> Earlier someone proposed, and eventually withdrew, a proposal to allow
> a character-by-character, registrant election of equivalency, which has
> combinatorial scaling properties. A vendor may also have advocated this.
> I had dinner with their CTO once and got indigestion.
>
> In any event, I can't imagine how a conversation in a code-point standards
> body ended with a paraphrasing of "multiple (zone) registrations
> reasonably meet user requirements". It is as odd as an IETF conversation
> ending with the observation that "multiple code-point allocations
> reasonably meet the glyph <foo>'s requirements as a character in scripts
> <bar> and <baz>".

Perhaps it should have been better phrased (and remember, I was speaking for
myself, not for the UTC). The point is that if there is little arbitrary
mixing of SC and TC in the same names, then there will usually be two
variants.
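
To make the counting concrete, here is a minimal sketch of the difference
between arbitrary per-character mixing and whole-name spellings. The
function names and the per-character (TC, SC) pair list are hypothetical,
made up for this illustration; nothing here is taken from nameprep or
tsconv.

```python
from itertools import product

def mixed_spellings(pairs):
    """All spellings when each character may independently be TC or SC.

    `pairs` is a list of (tc, sc) tuples, one per character position.
    A position with no variant has tc == sc and contributes one option.
    """
    options = [tuple(dict.fromkeys(p)) for p in pairs]
    return ["".join(chars) for chars in product(*options)]

def whole_name_spellings(pairs):
    """Only the all-TC and all-SC spellings (at most two)."""
    all_tc = "".join(tc for tc, _ in pairs)
    all_sc = "".join(sc for _, sc in pairs)
    return list(dict.fromkeys([all_tc, all_sc]))

# Hypothetical two-character name in which both characters have variants.
name = [("國", "国"), ("學", "学")]
print(len(mixed_spellings(name)))       # 4: grows as 2**n with n variant characters
print(len(whole_name_spellings(name)))  # 2: stays at two regardless of length
```

If names are in practice written all-TC or all-SC, the second count is the
one that matters, and it does not grow with the length of the name.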

I do not presume to know the internals of the DNS; but if I were designing a
software library that matched names, and if the matching process were not
well understood, then having only a small number of variants would be a very
important fact to take into account in the design. I would at that point
very strongly consider just having the variants in the table, rather than
trying to devise software that produced a match (and did not match items
that shouldn't match).

Forgive me if I stepped over some line in the description.

>
> In this one note I'm troubled to see the UTC presume to know better than
> the sources available to the authors of tsconv, and to know better than
> the authors of some eventual STANDARD and/or BCP trace APPS or DNS IDs.

The UTC does not 'presume to know'; it is making a recommendation for a
solution to an issue that needs to be resolved, and resolved in a timely
manner. It is a recommendation to the idn committee; by no means a fiat!

>
> I probably misread your mail.
>
> I would like to know how the "typically only two variant spellings" rule
> is manifested, or could be expressed as an equivalence rule. We did work
> on the question of equivalency rule scope in the discussion of (former)
> requirement [30], to motivate an interest in SC/TC conversion at the
> protocol vs zone manager (nameprep or not) level.

I don't quite understand what you mean here. The point I was trying to make
is that if it is difficult to come up with a precise matching algorithm and
data, and there are typically only two variants, then you don't try to
institute an equivalence rule: you simply "register" (using this in a broad
sense, not in an IETF sense) both names.

Thus you don't try to programmatically predict that "theater" and "theatre"
should match but "maker" and "makre" shouldn't; that "aluminum" and
"aluminium" should match but not "lithium" and "lithum"; or that "catalog"
and "catalogue" should match but not "league" and "leag". (Let alone
"petrol" and "gasoline".) Instead, you simply add both sets of alternatives
to the internal word-list.

(Of course, British and American variants of English are not precisely
analogous to the case of TC and SC, but you see the point.)
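
As a rough illustration of the word-list approach, here is a minimal sketch
of a table in which each set of alternatives is entered explicitly. The
class and method names are hypothetical, and this is not a description of
how the DNS or any registry actually stores names; it only shows matching
by lookup instead of by a computed equivalence rule.

```python
class VariantTable:
    """Explicitly registered alternatives; no algorithmic folding at all."""

    def __init__(self):
        # Maps every registered spelling to one representative spelling.
        self._representative = {}

    def register(self, *spellings):
        """Enter a set of alternatives that should match one another."""
        representative = spellings[0]
        for spelling in spellings:
            self._representative[spelling] = representative

    def matches(self, a, b):
        """True only if a and b are identical or were registered together."""
        rep_a = self._representative.get(a, a)
        rep_b = self._representative.get(b, b)
        return rep_a == rep_b

table = VariantTable()
table.register("theater", "theatre")
table.register("catalog", "catalogue")
table.register("petrol", "gasoline")

print(table.matches("theater", "theatre"))  # True: both were entered
print(table.matches("maker", "makre"))      # False: never entered together
print(table.matches("league", "leag"))      # False: no rule guesses a match
```

The table never has to decide why two spellings are equivalent, so it
cannot produce a false match; the cost is that each pair must be entered by
whoever maintains the list.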

>
> > * I'm speaking for myself (not the UTC) in the material following the
> > asterisk, since this material was not an explicit part of the UTC
> > decision.
>
> Informed guessing is good, and thanks for responding. I hope to see you in
> DC next year.

I would enjoy that. These email discussions are never as interesting nor as
productive as discussions in person! Unfortunately, I probably will not be
in DC, so it will be another time.

>
> Eric
>
>