[idn] An ignorant question about TC<-> SC

To: idn@ops.ietf.org
Subject: [idn] An ignorant question about TC<-> SC
From: John C Klensin <klensin@jck.com>
Date: Tue, 23 Oct 2001 08:59:23 -0400

While reading David's NFC versus NFKC note, I had an odd thought.
I've been dissatisfied, as have many others, with the notion that
TC <-> SC mapping is analogous to case mapping in Roman-derived
alphabets.  Arguments about whether that analogy applies have
helped to make the discussion of what is, to me, a very difficult
topic even more obscure.

To quote the Unicode standard, "Serbo-Croatian is a single
language with paired alphabets".  This is a definition with which
native speakers of the language agree (although, when tensions in
the Balkans are high, I assume some of them are not completely
happy about it).  Would it be constructive to think about Chinese
as "one language, two alphabets"?  If it is, then by the same
reasoning nameprep or a related process ought also to be able to
map back and forth between the Roman-based characters usually
used in Croatian contexts and the Cyrillic characters usually
used in Serbian ones (people do this all the time, and certainly
expect the two to match).

Of course, the analogy is not exact (these things never are).
Perhaps partially because there are just fewer characters to deal
with, the Serbo-Croatian mappings have no cases of potential
ambiguity.  On the other hand, one problem is more severe than in
the Chinese case.  In the general case, a Serbo-Croatian string
written in Cyrillic cannot be distinguished, on a character
string basis, from uses of Cyrillic for other languages (e.g.,
Russian), which should not be mapped.  Similarly, a string
written in Roman-based characters cannot be distinguished, on a
character string basis, from the Roman-based characters of
another language (English?), which, again, should not be mapped.

In either case, the mapping becomes readily plausible if the
language, in addition to the content of the character string, is
known.  If the language is not known, it is hard to see how to
apply the mapping without causing side-effects in other
languages.
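
To make the point concrete, here is a minimal sketch of a
language-conditioned mapping.  It is an illustration only, not
anything nameprep specifies: the function name fold_label, the
MAPPED_LANGUAGES set, and the use of the tags "sr", "hr", and "sh"
are all assumptions, and only the Cyrillic-to-Roman direction is
shown.

```python
# Hypothetical sketch: fold Serbo-Croatian Cyrillic to Roman-based
# characters only when the caller states the language.

# The 30 lowercase letters of the Serbian Cyrillic alphabet and their
# usual Roman-based equivalents.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ",
    "е": "e", "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k",
    "л": "l", "љ": "lj", "м": "m", "н": "n", "њ": "nj", "о": "o",
    "п": "p", "р": "r", "с": "s", "т": "t", "ћ": "ć", "у": "u",
    "ф": "f", "х": "h", "ц": "c", "ч": "č", "џ": "dž", "ш": "š",
}

# Assumed language tags for which the script mapping is applied.
MAPPED_LANGUAGES = {"sr", "hr", "sh"}


def fold_label(label, language=None):
    """Lowercase a label; map Cyrillic to Roman only for known languages."""
    lowered = label.lower()
    if language not in MAPPED_LANGUAGES:
        # Language unknown or different (e.g. Russian): leave the
        # script alone so other languages see no side-effects.
        return lowered
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in lowered)


if __name__ == "__main__":
    print(fold_label("Београд", "sr"))  # mapped to Roman-based characters
    print(fold_label("Москва", "ru"))   # lowercased only, still Cyrillic
    print(fold_label("Москва"))         # no language given: not mapped
```

With the language supplied, the Cyrillic and Roman-based spellings
of the same name fold to the same string.  Without it, the function
has no safe basis for touching the script at all, which is the
difficulty described above.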

Is that helpful?
     john
