Re: [idn] How to match letters
Hi, Dan,
I have the same problem as you, since I do not speak all the
languages in the world. I suggest encoding alphabets to ASCII
based on equivalent glyphs for Latin and some other scripts, and
encoding the International Phonetic Alphabet based on a table of
similar sounds, so that any minor language I know nothing about
can be described easily. For other languages, for example Arabic,
we need bilingual linguists to help define what the equivalent
sounds are, or what a sensible way to organize the mapping for
users would be. If there is no input from such a source, I would
use the UNICODE table for a simple transcription, for the sake of
completeness, as is the case for Math symbols. By the way, does
anyone out there have comments on this?
For Greek or Cyrillic, I am using a 26x10 matrix in the following spirit:
0 a-z
1 a1-z1
2 a2-z2
3 a3-z3
4 other symbols, say Latin
9-4 often used other language symbols, say Latin capital letters
9-3 A3-Z3
9-2 A2-Z2
9-1 A1-Z1
9-0 A-Z
Without a full understanding of how the Latin script is used, I am
using a 26x10 matrix in the following spirit, and I hope it is good
enough as a general encoding for all Latin languages, although it
seems to waste a lot of code points on consonants.
0 a-z
1 tone mark
2 tone mark
3 tone mark
4 tone mark
9-4 tone mark
9-3 tone mark
9-2 tone mark
9-1 tone mark
9-0 A-Z
On the other hand, a Latin language code table may use the
following English definition table as a template.
For English, en--pn-, I am proposing the following assignment:
0 a-z
1 Greek a-z
2 subscripts
3 26 keyboard symbols
4 26 Math symbols
9-4 26 Dingbats (2 faces, 4 in weather, 4 music, 1 Yinyang, 4 hands,
    4 chess, 2 moon)
9-3 circled symbols
9-2 superscripts
9-1 Greek A-Z
9-0 A-Z
For math symbols, several ways of encoding may be needed.
1. A simple math-glyph transcription from the UNICODE table, as mentioned
above, for people (robots?) who speak math formulas. There are 242
symbols in the Unicode table, and StepCode has 260 spaces for them,
without including a-z. (A robot without variable perimeters?)
2. For me, speaking English, I would put the symbol into the middle
of an English phrase, for example A>B coded as
en-wb--agreaterthanb949
where greaterthan is codified in en-mathwebpage as ">", or
it may be a registered trademark.
3. For a symbol with a much larger user base, as in English above:
en-pn--abb949
where > uses code point b4 in en--pn above.
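The cell addressing used in these tables (for instance b4 for ">" in en--pn) could be sketched roughly as below. This is only an illustration under stated assumptions: the row label "9-k" is read as row 9 minus k, the function names are hypothetical, and the table holds only the one symbol cell that this message names explicitly. It does not attempt to reproduce the full encoded strings such as en-pn--abb949.

```python
import string

ROWS = 10                          # the 26x10 matrix has rows 0..9
COLUMNS = string.ascii_lowercase   # 26 columns, one per letter a-z

# Only the cell named explicitly for en--pn in the message: b4 is ">".
# Every other symbol cell is left unfilled rather than guessed.
KNOWN_CELLS = {("b", 4): ">"}


def mirrored_row(k):
    """Row written as '9-k' in the tables, read here as 9 minus k (assumption)."""
    return 9 - k


def cell_label(column, row):
    """Return the label of a cell, e.g. column 'b', row 4 -> 'b4'."""
    if len(column) != 1 or column not in COLUMNS or not 0 <= row < ROWS:
        raise ValueError(f"no such cell: {column!r}, {row!r}")
    return f"{column}{row}"


def lookup(label):
    """Return the character stored in a labelled cell, or None if unknown."""
    column, row = label[0], int(label[1:])
    cell_label(column, row)        # validates column and row
    if row == 0:
        return column              # row 0 holds a-z
    if row == mirrored_row(0):
        return column.upper()      # row 9-0 holds A-Z
    return KNOWN_CELLS.get((column, row))
```

With this reading, lookup("b4") gives ">" and lookup("b9") gives "B", while cells the message does not spell out simply return None.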
So, I would like more input from the WG before I put in more work. Thanks.
Liana Ye
On Sun, 24 Jun 2001 11:24:41 +0200 (MEST) Dan <Dan.Oscarsson@trab.se>
writes:
>
> We have now and then discussed what letters are to be matched as
> equals. In the nameprep document some work is done and there is
> no other draft available. I could write one for the Latin based
> alphabets (this includes Greek and Russian), if that is wanted.
> (I cannot write how to do matching for other parts as I have
> no language knowledge about them).
>
> There are two important things I have come across related to
> matching names (I have worked a lot with LDAP/X.500 directories):
>
> 1) matching based on equivalent glyphs
> 2) matching based on equivalent sounds
>
> An example of 1) is that in UCS, Latin upper case A, Greek upper
> case Alpha and Cyrillic upper case A have the same glyph.
> Using my Swedish keyboard I could enter names in Greek or Cyrillic,
> but I do not have three equivalent looking "A"s on my keyboard.
> Instead I would use the same A for all names.
> From this I think name matching must treat all equivalent looking
> letters as the same - this also resulting in lower/upper case
> versions of those letters treated as the same even if their glyph
> does not match.
> Does anybody have any problems with this? Or is there some other way
> to do it?
>
> An example of 2) is Swedish ö (o with diaeresis) and Danish
> ø (o with stroke). They both represent the same vowel and
> somebody in Denmark would often enter a Swedish name using the
> Danish version of the letter, and the other way round from Sweden.
> So the letters ö and ø should match as the same.
> How many more of this kind are there? Any problems doing this kind
> of matching?
>
> Dan
>
>
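The two kinds of matching in the quoted message, by equivalent glyph and by equivalent sound, might be sketched as follows. This is a minimal illustration and not a proposed nameprep rule: the table names and functions are hypothetical, and the tables contain only the pairs mentioned in the message (Latin/Greek/Cyrillic upper case A, and the Swedish and Danish o variants).

```python
# Hypothetical equivalence tables; only the pairs named in the message are listed.
GLYPH_EQUIVALENTS = {
    "\u0391": "A",   # Greek capital Alpha looks like Latin capital A
    "\u0410": "A",   # Cyrillic capital A looks like Latin capital A
}
SOUND_EQUIVALENTS = {
    "\u00f8": "\u00f6",   # Danish o with stroke -> Swedish o with diaeresis
}


def match_key(name):
    """Fold a name so that glyph- and sound-equivalent letters compare equal."""
    folded = []
    for ch in name:
        ch = GLYPH_EQUIVALENTS.get(ch, ch)   # 1) equivalent glyphs
        ch = ch.lower()                      # upper/lower case treated as the same
        ch = SOUND_EQUIVALENTS.get(ch, ch)   # 2) equivalent sounds
        folded.append(ch)
    return "".join(folded)


def names_match(a, b):
    """True if two names are equal after folding."""
    return match_key(a) == match_key(b)
```

For example, names_match("S\u00f8ren", "S\u00f6ren") is true under these tables, and a name typed with Greek Alpha matches the same name typed with Latin A. A real table would need entries supplied by people who know each script.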