
Re: Matching and comparison



At 12:33 PM 1/21/00 +0900, Martin J. Duerst wrote:
>Paul - My Japanese version of Eudora doesn't let me read your
>variants, but I think I can guess them. I wouldn't know why
>I would want to register all these.

For the same reason that IBM would want to register "ibM.com" and 
"iBm.com" and so on: because they are similar to their "main" domain name. 
You were the one who brought it up, yes?

>  Maybe I would want to
>register Durst.com (because that's what somebody might type
>if they don't have an appropriate keyboard,

Keyboard? Why are we concerned about keyboards? Today, essentially no one 
has a keyboard that doesn't do lowercase. Further, what about people who 
enter domain names with non-keyboard entry systems such as pens or voice?

What I'm saying here is that capitalization is no different than many of 
the other issues that appear when we go outside of the restricted ASCII 
range. I do not think we should put in a requirement for one small part of 
the problem space. Either we fix it all, or we punt.
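
To illustrate the point, here is a minimal Python sketch. It is an assumption for discussion only, not part of any proposed protocol: the function names are hypothetical, and "D\u00fcrst.com" (with u-umlaut) is my guess at the kind of variant being discussed. Case folding alone collapses the "ibM.com" style variants but leaves the diacritic variants distinct, so a casing-only requirement covers just one slice of the problem.

```python
import unicodedata

def fold_case(name: str) -> str:
    """Hypothetical casing-only rule: Unicode case folding, nothing else."""
    return name.casefold()

def strip_marks(name: str) -> str:
    """Hypothetical extra rule: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Case variants compare equal under the casing-only rule...
assert fold_case("ibM.com") == fold_case("iBm.com")

# ...but a diacritic variant is still a different name.
assert fold_case("D\u00fcrst.com") != fold_case("Durst.com")

# Only a second, separate rule makes those two match.
assert strip_marks("D\u00fcrst.com") == "Durst.com"
```

Whether a rule like strip_marks is even desirable is a per-language judgement call, which is exactly why I would rather not single out casing in the requirements.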

> > We shouldn't pretend to fix the "too many similar names" problem by only
> > talking about capitalization.
>
>Definitely not. But we should not throw all 'similar names' problems
>in the same pot. Some of them are very productive (in particular
>casing), some are much less productive.

I don't see what you mean by "productive" here. "Solvable"?

>  Some are highly regular
>(e.g. casing, although it's not completely regular), some need
>much more human judgement (e.g. traditional/simplified Chinese).
>Some have the potential for spoofing on type-in (e.g. the
>Unicode canonical equivalences), others have less potential for
>spoofing.

In my mind, Latin capitalization has very low potential for spoofing: 
almost none of the capital letters look like their lower-case equivalents. 
Visually spoofing using similar-looking diacritics seems like a much bigger 
issue for Latin characters, and I believe that Arabic and Indic characters 
have similar problems.

Again, I think we need to put as few restrictions as possible in the 
requirements and let the protocol decide what to restrict or not restrict.

> > >Telling people that in an URI, domain names are case-insensitive,
> > >but file names are/may be case-sensitive is already hard. Telling
> > >them that a name is case-insensitive if it is ASCII only, and
> > >case-sensitive otherwise would be a really hard job.
> >
> > Indeed. Telling them about anything having to do with internationalization
> > will be.
>
>Not if done the right way. We are not trying to teach the Americans
>Chinese or Japanese. Chinese will understand Chinese, and so on.

But we will have to tell everyone (or at least developers) enough to help 
enter internationalized characters that end users don't understand. That 
is, if I see a URL with hiragana in it, I should at least have a chance of 
entering it correctly even if I don't understand Japanese.

>Some special ways of affecting conjunct formation from the
>character codes have to be looked at. But general conjunct
>formation is just a display issue.

Exactly right. And it needs to be dealt with.

> > and Tamil vowel splitting.
>
>Dealt with by Unicode TR #15.

Only if the protocol only uses Unicode. :-)

>  There is one problem, namely
>that Tamil letter LLA (U+0BB3) and Tamil AU length mark (U+0BD7)
>look the same, but this can be solved by disallowing U+0BD7,
>because in the cases where this can really appear, it will
>be removed by applying canonical normalization anyway.

Yes, exactly! Almost all of these problems will be dealt with fully by 
applying canonical normalization of each character set allowed in the 
domain name part. But that is not yet considered a requirement by this group.
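
As a rough sketch of how that could look for the Tamil case above, assuming the protocol uses Unicode and picks the composed form (NFC) from TR #15: the split vowel sign AU (U+0BC6 followed by U+0BD7) is recomposed to U+0BCC, so a legitimate name no longer contains U+0BD7 after normalization, and any U+0BD7 that survives can be rejected. The function names and the disallowed set are hypothetical, not anything this group has agreed on.

```python
import unicodedata

# Assumption: the only character disallowed here is TAMIL AU LENGTH MARK.
DISALLOWED = frozenset({"\u0bd7"})

def normalize_label(label: str) -> str:
    """Apply canonical composition (NFC) to a candidate name part."""
    return unicodedata.normalize("NFC", label)

def contains_disallowed(label: str) -> bool:
    """True if a disallowed character survives normalization."""
    return any(ch in DISALLOWED for ch in normalize_label(label))

split_au = "\u0b95\u0bc6\u0bd7"   # KA + vowel sign E + AU length mark
composed_au = "\u0b95\u0bcc"      # KA + vowel sign AU

# Both spellings normalize to the same string.
assert normalize_label(split_au) == normalize_label(composed_au) == composed_au

# The legitimate use of U+0BD7 disappears under NFC...
assert not contains_disallowed(split_au)

# ...while a stray U+0BD7 (which could pass for LLA, U+0BB3) is caught.
assert contains_disallowed("\u0b95\u0bd7")
```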

--Paul Hoffman, Director
--Internet Mail Consortium