
Re: [idn] Re: Back to work (Nameprep) (was: Re: Just send UTF-8 with nameprep (was: RE: [idn] Reality Check))



> Very low for lower case. Between Latin and Greek, or Cyrillic and Greek,
> it's only the 'o'.

This is inaccurate. For example, phi (φ, ϕ) and o-slash (ø) are confusable depending on the font, i.e., where the slash in the phi is at an angle. Rho (ρ) and p (p) are confusable in certain fonts, etc. And you can't forget all the extended Latin characters: there is Latin small letter open e (ɛ) and epsilon (ε). For Cyrillic and Greek, you are omitting many characters that are confusable (к, п, р, ф, є, ѡ, ѱ, ѳ, ...). For capitals, don't forget the extended Latin and Cyrillic characters, like O with middle tilde (Ɵ) vs. theta (Θ), or Latin letter Esh (Ʃ) and Sigma (Σ).
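
As a rough illustration of how such cross-script collisions could be checked mechanically, here is a minimal Python sketch. The FOLD table is a tiny hand-picked sample drawn from the pairs mentioned in this thread, not a complete or authoritative list, and the names FOLD, skeleton and may_collide are hypothetical, chosen only for this example.

```python
# Illustrative sample only: a real table would have to come from the
# font-by-font survey of confusable characters, which this does not attempt.
FOLD = {
    "\u03bf": "o",       # Greek small omicron
    "\u043e": "o",       # Cyrillic small o
    "\u03c1": "p",       # Greek small rho
    "\u0440": "p",       # Cyrillic small er
    "\u0430": "a",       # Cyrillic small a
    "\u0435": "e",       # Cyrillic small ie
    "\u0441": "c",       # Cyrillic small es
    "\u03b5": "\u025b",  # Greek small epsilon -> Latin small open e
}

def skeleton(label):
    """Replace each character by its look-alike representative, if listed."""
    return "".join(FOLD.get(ch, ch) for ch in label)

def may_collide(a, b):
    """True when two different labels fold to the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)

latin_oo = "oo"
greek_oo = "\u03bf\u03bf"      # all-Greek "oo"
cyrillic_oo = "\u043e\u043e"   # all-Cyrillic "oo"
print(may_collide(latin_oo, greek_oo))
print(may_collide(greek_oo, cyrillic_oo))
```

Whether two characters really look alike depends on the font, so any such table encodes a judgment, not a fact about the characters themselves.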
 
But there are also many confusable characters even within the same script, like b (b) and Latin letter tone six (ƅ), p (p) and Latin letter wynn (ƿ), and many other letters which can fool someone who isn't looking for them: k (k) vs. hooked k (ƙ), capital I or the digit one vs. the dental click (ǀ), variants like script g (ɡ), etc. Forbidding script mixing would not solve those.
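
To make the last point concrete, here is a small Python sketch of a script-mixing check. It assumes the script can be guessed from the first word of each letter's Unicode character name, which is only a rough proxy for the real script property; rough_scripts and is_single_script are hypothetical names used for this example.

```python
import unicodedata

def rough_scripts(label):
    """Guess the scripts in a label from the first word of each letter's name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER TONE SIX" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_single_script(label):
    """True when every letter in the label appears to be from one script."""
    return len(rough_scripts(label)) <= 1

mixed = "g\u03bf\u03bfgle"   # Latin letters with two Greek omicrons
same = "\u0185an\u0199"      # tone six, a, n, k with hook: all Latin

print(is_single_script(mixed))  # rejected by a no-mixing rule
print(is_single_script(same))   # accepted, although it can pass for "bank"
```

A rule that forbids mixing catches the first label but accepts the second, which is exactly the same-script case described above.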
 
It would take some real work to look at all the Unicode characters under a variety of common fonts at 9-12 point (normal body text) to determine which are confusable and which are not, and cross-check all the results. I'm not saying it can't be done -- and it would be a worthwhile effort in the long run -- but it is not trivial.
 
That is why I think the most practical solution is GUI support that allows for human judgment.

Mark
 
P.S. Depending on what font you are using, of course, the above examples may or may not be distinguishable.

—————

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

----- Original Message -----
From: "Martin Duerst" <duerst@w3.org>
To: "Keith Moore" <moore@cs.utk.edu>; "Soobok Lee" <lsb@postel.co.kr>
Cc: "Keith Moore" <moore@cs.utk.edu>; <idn@ops.ietf.org>
Sent: Thursday, July 19, 2001 00:04
Subject: Re: [idn] Re: Back to work (Nameprep) (was: Re: Just send UTF-8 with nameprep (was: RE: [idn] Reality Check))


> At 11:10 01/07/18 -0400, Keith Moore wrote:
> > > Now, let's think about another case of   all-Greek "oo.com" and all-Latin
> > > "oo.com":
> > > Either of the two consists of scripts from only single character sets.
> > > But the two still look very similar. Do you have any good idea about
> > this ?
> >
> >first, how likely is this in practice that a label of all Greek letters
> >will accidentally collide with a label of all Latin letters?   (as opposed
> >to a deliberate collision)
>
> Very low for lower case. Between Latin and Greek, or Cyrillic and Greek,
> it's only the 'o'.
> Between Latin and Cyrillic, it's the 'a', 'e', 'o', 'p', 'c', 'y', 'x',
> plus 's', 'i', and 'j' in some languages.
>
> For upper-case, it's more. Regards,   Martin.
>
>
>
> >for second-level domains at least, and for some third-level domains,
> >it would make sense for the registry to disallow labels whose appearances
> >collide with all-ASCII labels.
>
> Most probably yes.
>
>
> >there are already some such rules in place
> >even for ASCII, these would just be extended.
>
> Can you tell me more about these, or give some pointer?
>
> Regards,   Martin.
>
>