
[idn] The Business Card problem (was: Re: An experiment with UTF-8 domainnames)



--On Saturday, 06 January, 2001 18:01 +0000 "Brian W. Spolarich"
<briansp@walid.com> wrote:

> On Sat, 6 Jan 2001, J. William Semich wrote:
> 
> | >...but could also on my business card have (if the ACE ends
> | >up being RACE):
> | >   bq--abtoi3duon2hf5tn .com
> |...
>   I don't think this is in reality a requirement.  If you
>   decide to put an IDN on your business card, some people may
> have problems using it, but I would suspect rather quickly
> that most users, particularly non-English speakers, will
> become aware that they will need to update their software to
> use these names.

Brian, the problem really isn't updating software but updating
people -- an activity that we know, experimentally, tends to
take a few generations at least.  


Let me use myself as an example (this one is inherently
European-language-oriented, but I assume it has Asian and
African analogues that are as bad or worse).  If I'm handed a
business card with an IDN on it, in the "native" character glyph
set for that name, the next encoding step depends on my
knowledge and pattern-recognition and discrimination skills, and
not on any computer issues.  I need to be able to take that
collection of glyphs and somehow transcribe and enter them into
a computer.

Now, if the characters used are Latin-based, I can probably cope
(and I do today).  I'm going to recognize the basic structures.
Even if I have to consult a table, I can match what I see on the
card with what I see on the screen.  In the worst case --in
which I have no idea how to "type" the character-- I can then
enter the character position from the table into a software
program.  Interestingly, for that purpose, a simple hexadecimal
encoding of UCS-4 (or UCS-2) is "easier" than UTF-8 which is a
good deal "easier" than ACE.  While tedious, 0x000000E4 is quite
easy to figure out and write.  And selection of characters from
tablets is not exactly a new technology; I'd expect user
interface software to show up quickly and be widely deployed
that uses table-selection technologies to build up strings that
could then be pasted.
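
To make the "easier" comparison concrete, here is a minimal sketch,
assuming Python 3, of the hexified forms of a single character.  The
helper name hex_forms is hypothetical, and no ACE form is shown because
RACE is not available in the Python standard library.

```python
def hex_forms(ch):
    """Return (UCS-4, UCS-2, UTF-8) hex forms of one character."""
    cp = ord(ch)                      # the table position (code point)
    ucs4 = "0x%08X" % cp              # fixed width: 8 hex digits
    # UCS-2 only covers code points up to U+FFFF, so it may not exist.
    ucs2 = "0x%04X" % cp if cp <= 0xFFFF else None
    # UTF-8 is variable length; the bytes are not the table position.
    utf8 = ch.encode("utf-8").hex().upper()
    return ucs4, ucs2, utf8


print(hex_forms("\u00e4"))   # ('0x000000E4', '0x00E4', 'C3A4')
```

The UCS-4 and UCS-2 forms can be read straight off a code chart, while
the UTF-8 form (C3A4 here) has to be computed from the chart position.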

But that is the easy part and it is conditioned on my being able
to recognize Latin-based characters well enough to tell them
apart.  Extrapolating from it as an example is dangerous:

First, I can't tell European characters apart to the degree
needed by IDN.  If you hand me a card with a native-glyph IDN on
it, I need to know the language (or a surrogate for the
language) to pick the right glyphs from a table.   I simply
cannot tell U+0041 from U+0391 from U+0410 by looking at a glyph
written on a page (and those are, of course, not the only
examples).  I have to know the language context, or apply
heuristics, to pick one.  And, if I can't be guaranteed to
consistently get it right, we are going to need
character-matching rules much more subtle than anything yet
seriously contemplated.
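
As an illustration of those three code points, a short sketch (assuming
Python 3 and its standard unicodedata module; the function name
describe is hypothetical) lists their names and checks that
compatibility normalization does not fold them together:

```python
import unicodedata

LOOKALIKES = ("\u0041", "\u0391", "\u0410")


def describe(chars):
    """Return (code point label, Unicode name) for each character."""
    return [("U+%04X" % ord(c), unicodedata.name(c)) for c in chars]


for label, name in describe(LOOKALIKES):
    print(label, name)

# Even NFKC normalization leaves all three as distinct characters,
# so a lookup that normalizes first still sees three different names.
normalized = {unicodedata.normalize("NFKC", c) for c in LOOKALIKES}
print(len(normalized))   # 3
```

In most fonts all three render as the same shape, which is why the
language context, and not the glyph, has to decide which one is meant.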

But that isn't the serious problem either.  It turns out that I
can do just about the same pattern-matching job --looking at a
glyph, deducing its essential characteristics because I'm
familiar with the repertoire from which it comes, finding it in
a table and picking it or entering it by table position-- if
that card contains Greek, Cyrillic, or Hebrew glyphs.  That puts
me a bit ahead of most of my American and Western European
colleagues, for whom those sets of glyphs may look like
indistinguishable chicken-scratches.   I've even been around
this work long enough that I've learned to distinguish enough
kana that I might be able to successfully look them up in a
table (but not pronounce them), although it would take me a long
time since I don't know the table order.   But most Han, Kanji,
and Hangul are hopeless for me: there are lots of them, I can't
(in general) tell them apart, and I just don't have the pattern
recognition skills (or training in distinguishing radicals) to
be able to look them up, especially if the font-rendering on the
business card is different from that in my table.

That is not, of course, an alphabetic versus ideographic issue:
I'm equally hopeless with Arabic, and Devanagari, and Bengali,
and Thai, and dozens (if not hundreds) of languages and
character collections, most of which I've never even seen in
print.  
Anything I can't recognize well enough to match with a table is
something I can't copy from a business card.  Whether the DNS
interface is UTF-8, or an ACE, or UCS-4, or some other encoding
is irrelevant, and software isn't going to fix that.

So, for the business card, we need, I think, to start asking
different questions.  If I want an address on my business card to
be recognized and usable, after the translator leaves, by a Thai
native who lacks familiarity with Latin alphabets, I had best
figure out how to get a Thai rendering (or something else
recognizable) of my address on that card.   And, operationally,
I had better fix it so that rendering does the right thing when
looked up in the DNS.  Note that this is my problem and not
hers: she has a problem only if I don't do something reasonable
and she decides to try to communicate with me anyway.

Another important inference from this story is almost certainly
that a solution that works for "ASCII" and "other language" is,
at least in the long term, inadequate.  The same issues exist
between "other language 1" and "other language 2", and the
virtual business card will need to be many-sided.  And that may
imply that I may need many DNS identities, associated with
different character sets, to be accessible from all over the
world.

Or we can try to turn it into an ACE-like question, where what
goes on the business card is the "native" set of glyphs plus
some Latin-based encoding.  I'm not sure, as argued above, that
there is a "native" set of glyphs for the general case.  But,
more important, if one is going to put an encoding on a card
from which one can get back and forth from 10646/Unicode, it is
not clear to me that ACE is superior to hexified UTF-8.  And
neither may be as convenient as hexified UCS-4 (or UCS-2) or a
PGP-like biometric word list
encoding.  Any of these are going to require software to get
from the business card form to the DNS one; the correct answer
for encoding them may be different from the correct answer for
what should actually go into the DNS.
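
The "software to get from the business card form" step could be
sketched as follows, assuming Python 3.  The function names, the
space-separated card layout, and the sample label are all hypothetical;
this only shows that either hexified form round-trips to 10646/Unicode,
not what should go into the DNS.

```python
def to_hex_ucs4(label):
    """Card form: one 8-digit hex code point per character."""
    return " ".join("%08X" % ord(c) for c in label)


def from_hex_ucs4(text):
    """Recover the label from space-separated hex code points."""
    return "".join(chr(int(tok, 16)) for tok in text.split())


def to_hex_utf8(label):
    """Card form: the UTF-8 bytes of the label as hex digits."""
    return label.encode("utf-8").hex().upper()


def from_hex_utf8(text):
    """Recover the label from hexified UTF-8."""
    return bytes.fromhex(text).decode("utf-8")


label = "b\u00fccher"                 # hypothetical sample label
print(to_hex_ucs4(label))
print(to_hex_utf8(label))             # 62C3BC63686572
assert from_hex_ucs4(to_hex_ucs4(label)) == label
assert from_hex_utf8(to_hex_utf8(label)) == label
```

The UCS-4 form is longer but each group can be checked against a code
chart by eye; the UTF-8 form is shorter but opaque without software.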

>   Probably a common scenario will be a non-technical person
>   receiving a business card with an IDN on it.  They'll try to
> type the name into their browser, and it will fail to resolve.
> They'll contact their IT support staff, who will tell them
> what they need to do to use the IDN (and educate them a little
> bit on this 'new thing'), they'll update their software as
> required, and they'll move on.

	Sigh.
	
	JU:  Hello, IT support staff, this is Joe User.  I've
	got a business card in front of me with a URL on it, and
	I don't know how to type it in.
	
	ITSS: Just copy the letters from the card into the "Go
	to" box on the browser.  Haven't you learned that yet?
	
	JU: But I don't recognize any of the characters... they
	aren't on my keyboard.
	
	ITSS: We can send you out a keyboard for Latin-1, which
	covers Swedish, German, French,... and instructions for
	installing it.
	
	(A week later)
	
	JU: Hello, IT Support Staff, your technician put in the
	nice new keyboard and software.  Some of my applications
	now say "Guten Tag", rather than "Hello", but that is
	ok.  I still can't type that URL in, the characters
	still aren't on the keyboard.
	
	ITSS:  Huh?  What language did you say it was in?
	
	JU: Beats me.   Doesn't look much like either English or
	Klingon.
	
	ITSS:  (The next two or three exchanges are left as an
	exercise for the reader.  You may assume, if convenient,
	that the IT support staff speaks Klingon and has been
	studying the use of the Batlef (sp?))

Sorry Brian, I don't believe "a little education, update the
software, and move on" is going to do it.

     john