
Re: [idn] How to match letters



Hi, Dan,
	I have the same problem as you, since I do not speak all the
languages of the world either.  I suggest encoding alphabets to ASCII
based on equivalent glyphs for Latin and some other scripts, and
encoding the International Phonetic Alphabet based on a table of
similar sounds, so that minor languages I know nothing about can be
described easily.  For other languages, for example Arabic, we need
bilingual linguists to help define which sounds are equivalent, or
what a sensible way to organize the mapping for users would be.
If there is no input from such a source, I would fall back to a simple
transcription of the UNICODE table, for the sake of completeness, as
is the case for Math symbols.  By the way, does anyone out there have
comments on this?
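
To make the glyph-based and sound-based ideas concrete, here is a
minimal sketch in Python that uses only the examples from Dan's message
quoted below (Latin/Greek/Cyrillic A, and Swedish o with diaeresis
versus Danish o with stroke).  The table names GLYPH_FOLD and SOUND_FOLD
and the functions match_key and names_match are my own illustration,
not part of nameprep or any draft, and a real table would need input
from people who know each script.

```python
import unicodedata

# Hypothetical fold tables, filled in only with the examples from this thread.
# Keys are lower-case because case folding is applied before the lookup.
GLYPH_FOLD = {
    "\u03b1": "a",  # GREEK SMALL LETTER ALPHA -> Latin a
    "\u0430": "a",  # CYRILLIC SMALL LETTER A  -> Latin a
}
SOUND_FOLD = {
    "\u00f8": "\u00f6",  # Danish o with stroke -> Swedish o with diaeresis
}


def match_key(name: str) -> str:
    """Return a comparison key: case-fold, then glyph-fold, then sound-fold."""
    out = []
    for ch in unicodedata.normalize("NFC", name):
        ch = ch.casefold()             # upper and lower case compare equal
        ch = GLYPH_FOLD.get(ch, ch)    # equivalent-looking letters
        ch = SOUND_FOLD.get(ch, ch)    # equivalent-sounding letters
        out.append(ch)
    return "".join(out)


def names_match(a: str, b: str) -> bool:
    """Two names match when their comparison keys are identical."""
    return match_key(a) == match_key(b)
```

With these assumed tables, names_match("\u0391", "A") and
names_match("S\u00f8ren", "S\u00f6ren") both hold, while unrelated
letters still compare as different.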

	For Greek or Cyrillic, I am using a 26x10 matrix in the following spirit:
0	a-z
1	a1-z1
2	a2-z2
3	a3-z3
4	other symbols, say Latin
9-4	often used other language symbols, say Latin capital letters
9-3	A3-Z3
9-2	A2-Z2
9-1	A1-Z1
9-0	A-Z
	
	Without a full understanding of how the Latin script is used, I am
using a 26x10 matrix in the following spirit, and I hope it is good enough
as a general encoding for all Latin-script languages, although it seems
to waste a lot of code points on consonants.
0	a-z
1	tone mark
2	tone mark
3	tone mark
4	tone mark
9-4	tone mark
9-3	tone mark
9-2	tone mark
9-1	tone mark
9-0	A-Z

	On the other hand, a code table for a Latin-script language may use
the following English definition table as a template.

	For English,  en--pn-,  I am proposing the following assignment:
0	a-z
1	Greek a-z
2	subscripts
3	26 keyboard symbols
4	26 Math symbols 
9-4	26 Dingbats (2 faces, 4 in weather, 4 music,  1 Yinyang, 4 hands, 4 chess, 2 moon)
9-3	circled symbols
9-2	superscript
9-1	Greek A-Z
9-0	A-Z

	For math symbols, several ways of encoding may be needed.
1.  A simple math-glyph transcription of the UNICODE table, as mentioned
	above, for people (robots?) who speak math formulas. There are 242
	symbols in the Unicode table, and StepCode has 260 spaces for them,
	not including a-z. (A robot without variable parameters?)

2.  As an English speaker, I would put the symbol in the middle
	of an English phrase, for example: A>B  coded as 
		en-wb--agreaterthanb949 
	where greaterthan is defined in en-mathwebpage as ">", or
	it may be a registered trademark.

3. For a symbol with a much larger user base, as in the English table above:
		en-pn--abb949 
	where > uses code point b4 in en--pn above.
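
To show how a code point such as b4 reads against the English table,
here is a rough sketch in Python.  It assumes that a code point is
written as a column letter a-z followed by the row label exactly as it
appears in the table (for example b4, or b9-1); that spelling for the
9-x rows is my assumption for illustration only, and the names
EN_PN_ROWS and describe_code_point are hypothetical.  It does not
attempt the full StepCode string form such as abb949.

```python
import re

# Row labels and contents copied from the en--pn table above.
EN_PN_ROWS = {
    "0": "a-z",
    "1": "Greek a-z",
    "2": "subscripts",
    "3": "26 keyboard symbols",
    "4": "26 Math symbols",
    "9-4": "26 Dingbats",
    "9-3": "circled symbols",
    "9-2": "superscript",
    "9-1": "Greek A-Z",
    "9-0": "A-Z",
}

# 10 rows of 26 columns gives the 260 cells of the matrix.
CELL_COUNT = len(EN_PN_ROWS) * 26


def describe_code_point(code_point: str) -> tuple[str, str]:
    """Split an assumed '<letter><row label>' code point into column and row."""
    m = re.fullmatch(r"([a-z])(\d(?:-\d)?)", code_point)
    if m is None or m.group(2) not in EN_PN_ROWS:
        raise ValueError(f"not a code point in this table: {code_point!r}")
    column, row = m.groups()
    return column, EN_PN_ROWS[row]
```

Under that assumption, describe_code_point("b4") gives column b in the
row of 26 Math symbols, which is where > sits in the example above.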
	

So, I would like more input from the WG before I put in more work. Thanks.

Liana Ye


On Sun, 24 Jun 2001 11:24:41 +0200 (MEST) Dan <Dan.Oscarsson@trab.se>
writes:
> 
> We have now and then discussed what letters are to be matched as
> equals. In the nameprep document some work is done and there is
> no other draft available. I could write one for the Latin based
> alphabets (this includes Greek and Russian), if that is wanted.
> (I cannot write how to do matching for other parts as I have
> no language knowledge about them).
> 
> There are two important things I have come across related to
> matching names (I have worked a lot with LDAP/X.500 directories):
> 
> 1) matching based on equivalent glyphs
> 2) matching based on equivalent sounds
> 
> An example of 1) is that in UCS, Latin upper case A, Greek upper
> case Alpha and Cyrillic upper case A have the same glyph.
> Using my Swedish keyboard I could enter names in Greek or Cyrillic,
> but I do not have three equivalent looking "A"s on my keyboard.
> Instead I would use the same A for all names.
> From this I think name matching must treat all equivalent looking
> letters as the same - this also resulting in lower/upper case
> versions of those letters treated as the same even if their glyph
> does not match.
> Does anybody have any problems with this? Or is there some other way
> to do it?
> 
> An example of 2) is Swedish ö (o with diaeresis) and Danish
> ø (o with stroke). They both represent the same vowel and somebody
> in Denmark would often enter a Swedish name using the Danish version
> of the letter, and the other way round from Sweden.
> So the letters ö and ø should match as the same.
> How many more of this kind is there? Any problems doing this kind
> of matching?
> 
>    Dan
> 
>