
Re: Combining characters (was: Re: [idn] hostname historyhell)



Hi, Doug:

With a character set as large as UCS, mixed use of scripts
and look-alike characters is so prominent that some sort
of classification is unavoidable.

Should it be

> Spanish or Italian?  Should we care?

No, we don't need to care about these cases, since they
both involve Latin-script users.  Latin already covers many
different spoken languages, as do Cyrillic, Arabic and
Chinese.

On Mon, 26 Nov 2001 11:40:52 EST DougEwell2@cs.com writes:
> In a message dated 2001-11-26 0:31:52 Pacific Standard Time, 
> liana.ydisg@juno.com writes:
> 
> > Have you thought about " Mixed language URLs "
> > with language tags, for example:
> >
> > www.zh-china/mo-mogolia/zh-county/mybusiness.com
> >
> > shall be able to work?
> 
> I thought one of the fundamental characteristics of domain names, 
> host names, 
> URLs, etc. is that they were identifiers, not true names, and hence 
> they were 
> not intended to be language-tagged.
> 
> Just as an example, two popular search engines are teoma.com and 
> altavista.com.  What language is "Teoma"?  Is "Alta Vista" supposed 
> to be 
> Spanish or Italian?  Should we care?
> 
> -Doug Ewell
>  Fullerton, California

However, mixed script is a lot more complex.  For
example, Japanese uses Kanji, but it is not only phonetically
different from Chinese; its grammar is completely different
from Chinese as well.  The difference is so great that we have
to reflect them separately in structured data too.  My Chinese
address label in my previous message is an example of
such a difference.  My Mongolia and Chinese example is
another example of mixed use of structured labels, though
the Chinese group has not raised this in front of this group.
They have too much on their hands already :-(

The classification, whether by language tags or script tags,
must sometimes be used in URLs to deal with these issues.  I have
used "language tag" instead of "script tag", since
1) Different languages use the same script, such as
  CJK.
2) Language tags have been defined in [iso639], and
  some of the issues have been solved already.  For
  example:  Does Cantonese have a language tag or not?
3) From an engineering point of view, the IETF has a list of
  language requirements to consider.  That is, can we come up
  with a solution to cover these cases in DNS?

If in the future, down the line in IDN, someone challenges
us regarding diacritic marks between French and Dutch,
for example, then we have to be able to say this case is
covered by a Latin tag.  If someone wants more localized
features with a French tag, then we may question whether such a
feature can be accommodated with existing methods
before we launch into another tag.

I would suggest that the following language tags be
considered first:

CJK
Latin
Cyrillic
Arabic
Bengali
Greek

Although Greek has a smaller group of native users, it is
familiar to many Latin users, and can serve as a case
study in discussion.
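
As a rough illustration of how a label could be classified
under the six tags listed above, here is a minimal Python
sketch.  It is only an assumption about one possible approach,
not a proposal for the DNS itself: the names TAG_BY_NAME_PREFIX,
tags_in_label and is_mixed are hypothetical, the grouping of
kana and Hangul under "CJK" is my own guess, and the first word
of each Unicode character name is used as a crude stand-in for
a real script property.

```python
import unicodedata

# Hypothetical mapping from the first word of a Unicode character
# name to one of the coarse tags suggested in this message.
# Placing HIRAGANA, KATAKANA and HANGUL under "CJK" is an assumption.
TAG_BY_NAME_PREFIX = {
    "CJK": "CJK",
    "HIRAGANA": "CJK",
    "KATAKANA": "CJK",
    "HANGUL": "CJK",
    "LATIN": "Latin",
    "CYRILLIC": "Cyrillic",
    "ARABIC": "Arabic",
    "BENGALI": "Bengali",
    "GREEK": "Greek",
}


def tags_in_label(label):
    """Return the set of coarse tags whose characters occur in label.

    Characters that match none of the tags (digits, hyphens, dots,
    unnamed code points) are ignored.
    """
    tags = set()
    for ch in label:
        name = unicodedata.name(ch, "")  # "" if the code point is unnamed
        first_word = name.split(" ")[0]
        tag = TAG_BY_NAME_PREFIX.get(first_word)
        if tag is not None:
            tags.add(tag)
    return tags


def is_mixed(label):
    """True if the label draws characters from more than one tag."""
    return len(tags_in_label(label)) > 1


if __name__ == "__main__":
    # "\u0430" is CYRILLIC SMALL LETTER A, a look-alike of Latin "a".
    for label in ["altavista", "alt\u0430vista", "\u4e2d\u56fd"]:
        print(ascii(label), sorted(tags_in_label(label)), is_mixed(label))
```

A label that mixes a Latin "a" with its Cyrillic look-alike is
reported as mixed, which is the kind of case the classification
is meant to expose.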

Liana