
Re: Layer 2 and "idn identities" (was: Re: [idn] what are the IDN identifiers?)



We are discussing how registrars are
"to avoid registering mixed scripts 'names'".
Can you suggest any way of doing this, or
any feasible guidelines?
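
As a starting point for discussion, here is a rough Python sketch of one
check a registrar could run on a requested label. It is only a heuristic
under stated assumptions: the function names are hypothetical, the
"script" of a letter is guessed from the first word of its Unicode
character name rather than from the real Unicode Script property, and it
would flag legitimate mixtures (for example Japanese names that combine
Han and kana), so a real guideline would need an allow-list on top.

```python
import unicodedata


def scripts_in_label(label):
    """Return the set of script-name prefixes used by the letters in label."""
    scripts = set()
    for ch in label:
        # Only letters are classified; digits and hyphens are ignored here.
        if unicodedata.category(ch).startswith("L"):
            # Heuristic (an assumption of this sketch): the first word of the
            # character name, e.g. "LATIN", "ARMENIAN", "CJK".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(label):
    """True if the letters of label come from more than one script prefix."""
    return len(scripts_in_label(label)) > 1


# U+0585 ARMENIAN SMALL LETTER OH looks much like Latin "o".
print(is_mixed_script("example"))           # all Latin letters
print(is_mixed_script("g\u0585\u0585gle"))  # Latin mixed with Armenian
```

A registrar could reject or manually review any label for which
is_mixed_script() is true.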

On Wed, 5 Dec 2001 16:52:40 -0800 "Michel Suignard"
<michelsu@microsoft.com> writes:
> This has nothing to do with the CJK TC/SC issue. It should be a
> guideline to registrars to avoid registering mixed scripts 'names'
> when they don't make sense. And finding any similarities between
> Latin characters and Armenian characters is a far-fetched example.
> It doesn't take much when reading an Armenian document to see that
> it is not Latin; even a single word is sufficient. Thinking that
> Armenian, or Hebrew for that matter, could be confused with Latin
> characters is simply incorrect. It is a global issue of any large
> character set repertoire that you may have some confusion possible
> between 'look-alike' characters. Nameprep is there to simplify the
> issue somewhat; registrar guidelines are another part.
> 
> And I don't know what you are talking about when you say that you
> know the user's language from input processing. Globalized platforms
> have input mechanisms that are typically independent from language
> context.

Then what about your IMEs, are they independent from
language context?  What percentage of the code points
in UCS are entered through IMEs?

> The only thing that is known at input time is a code point. Upper
> layers may decide to apply heuristics to guess a language, but that
> is clearly beyond the scope of the input mechanism in modern input
> processing. And there is no such thing as the selection of a
> 'language symbol set'.
> 
> The only case where there is a reasonable determination of
> 'language' for input is East Asian Input Method Editors (IMEs), and
> it could be reasonable to assume that an application layer could
> offer some TC/SC services before feeding the code points to a DNS
> service in that case for CJK characters (but even that is not a
> simple case to solve because of the contextual ambiguity, as
> mentioned several times in this forum).
> 
> Frankly you are not helping by creating these unrelated 'analogies'
> with other languages and scripts that have nothing to do with the
> issue at hand.
> 
> Michel
> 
> (PS I should add that I am also a Unicode technical director and the
> project editor for ISO 10646)

Now, I got it.  I think that, because you are
not a DNS expert and you do not use IMEs much to select
code points to express yourself, there are
other dimensions you should look into before you make
accusations like

> And finding any similarities between Latin
> characters and Armenian characters is a far-fetched example.

Liana 

 
> -----Original Message-----
> From: liana Ye [mailto:liana.ydisg@juno.com] 
> Sent: Wednesday, December 05, 2001 11:29 AM
> To: Michel Suignard
> Cc: idn@ops.ietf.org; maynard@pobox.org.sg; bthomson@fm-net.ne.jp;
> DougEwell2@cs.com
> Subject: Re: Layer 2 and "idn identities" (was: Re: [idn] what are
> the IDN identifiers?)
> 
> 
> It doesn't matter for input processing,
> because you know the user's language. It is
> defined by the user when they select the
> language symbol set.
>
> But when you turn those symbols into code points
> and strip the language context, and then compare
> them in the languageless context of IDN, the problem
> arises.  That is the reason I have to put Latin
> together with Armenian, to make the CJK problem
> a little easier for people who do not know CJK.  There are
> 6 Armenian lower case letters similar to Latin [a-z].
> 
> Liana
> 
> On Wed, 5 Dec 2001 12:09:00 -0800 "Michel Suignard"
> <michelsu@microsoft.com> writes:
> > Liana, stop associating Armenian with Latin in your explanation.
> > All writing systems based on Latin and the single writing system
> > using Armenian share nothing except maybe some punctuation. It
> > doesn't make sense to make a parallel between
> > (Latin + Armenian + Cyrillic + Hebrew) and (CJK), because in the
> > first group no writing systems share characters between the
> > subsets (although I would even object to creating such a 'logical'
> > collection of largely unrelated scripts), while in CJK all writing
> > systems use characters in the same CJK blocks.
> > 
> > It doesn't help to explain the issue of TC/SC, which is a valid
> > concern for CJK users, by using a flawed analogy to a non-existent
> > model.
> >
> > I myself see a need to help Chinese users deal with TC/SC, but I
> > don't see it as belonging in the scope currently covered by IDN.
> > 
> > Michel
> > 
> > -----Original Message-----
> > From: liana Ye [mailto:liana.ydisg@juno.com]
> > Sent: Wednesday, December 05, 2001 8:14 AM
> > To: DougEwell2@cs.com
> > Cc: idn@ops.ietf.org; maynard@pobox.org.sg; bthomson@fm-net.ne.jp
> > Subject: Re: Layer 2 and "idn identities" (was: Re: [idn] what are
> > the IDN identifiers?)
> > 
> > 
> > 
> > On Wed, 5 Dec 2001 01:17:36 EST DougEwell2@cs.com writes:
> > > In a message dated 2001-12-04 20:11:19 Pacific Standard Time, 
> > > maynard@pobox.org.sg writes:
> > > 
> > > >> SC/TC equivalence itself is far simpler than the "four winds,
> > > >> two eggs" equivalences, and has quite a bit of merit. I won't
> > > >> express any real opinion on it until I study it further.
> > > >
> > > > It is not so simple as to be able to be done _accurately_ by a
> > > > code-based 1-1 bit-string matching process. There are semantic,
> > > > syntactic and contextual considerations that require at the very
> > > > least a morphological analysis process in order for TC/SC to be
> > > > done with a reasonable amount of accuracy (i.e.
> > > > orthographically).
> > > 
> > > Thanks for saying with some authority what I have apparently been
> > > unable to communicate effectively, namely that TC/SC is not merely
> > > a 1-1 operation comparable to Latin case folding.
> > > 
> > > -Doug Ewell
> > >  Fullerton, California
> > > 
> > 
> > Excuse me for jumping in. I have kept silent on this
> > view, and I'd like to comment on this issue now.
> > 
> > TC/SC is not merely a 1-1 operation, if you only compare it
> > with Latin case folding in what the names imply:
> > 
> > TC/SC is a subset of Han, and Han is a subset of C,J,K.
> > Latin is a superset of English, French,....
> > 
> > Can you see the flaw in such a comparison?
> > 
> > So when you look at Latin in the context of UCS code points,
> > since UCS is the set we are hoping to use across the board in IDN,
> > Latin is a subset of (Latin + Armenian + Cyrillic + Hebrew), since
> > I think this is the area where Latin is most likely to be used
> > too.
> > 
> > So this means that if you compare only the 1-1 cases of TC/SC,
> > then Latin is 1-1.
> > 
> > If you compare TC/SC with 1-n, n-1, 1-1, that is, in Chinese,
> > then Latin should be put into UCS Planes 0, 1, 2 too. 
> > So this Latin is n-1, 1-n too.
> > 
> > If you compare TC/SC in the sense of C,J,K block,
> > then Latin + Armenian is the minimum case to think about.
> > 
> > Cheers.
> > 
> > Liana
> >
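
For readers who do not know CJK, the point made earlier in this thread
that TC/SC is not a 1-1 operation comparable to Latin case folding can
be illustrated with a small Python sketch. The four-entry table is only
an illustrative sample chosen for this note, not a real conversion
table, and the helper name is hypothetical. It shows that folding
traditional to simplified is n-1, so the reverse direction is 1-n and
cannot be resolved by code point matching alone.

```python
import string

# Illustrative sample only; a real TC/SC table is far larger.
TRAD_TO_SIMP = {
    "發": "发",  # "to emit"
    "髮": "发",  # "hair"
    "後": "后",  # "after"
    "后": "后",  # "queen" (same form in both)
}


def simp_to_trad_candidates(table):
    """Invert a traditional->simplified table into simplified->[traditional]."""
    inverse = {}
    for trad, simp in table.items():
        inverse.setdefault(simp, []).append(trad)
    return inverse


inverse = simp_to_trad_candidates(TRAD_TO_SIMP)

# Simplified characters with more than one traditional candidate.
ambiguous = {s: t for s, t in inverse.items() if len(t) > 1}
print(ambiguous)

# By contrast, ASCII case folding is 1-1: 26 upper case letters map to
# 26 distinct lower case letters, so it can be undone without context.
print(len({c.lower() for c in string.ascii_uppercase}))
```

Choosing among the candidates for each ambiguous character needs the
surrounding word, which is the contextual analysis the quoted messages
refer to.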