
Re: [idn] what are the IDN identifiers?



Liana ruminated:

>   We have [STD13] defines that LDH are the DNS identifiers, 
> then what are the IDN identifiers?  UCS is too big and contains 
> many semantically equivalent characters for IDN.  Should we 
> ask for a table of semantically equivalent character sets 
> definition table from Unicode Consortium?

In my opinion, no. This concept of "semantically equivalent
character sets" is way too imprecisely defined to make sense.

What the Unicode Consortium provides is a large number of
precisely defined data tables, giving various properties
of the entire set of characters in the UCS. It is then
up to a group such as this, in the context of its particular
requirements (as for IDN identifiers), to make use of those
property tables to pick and choose among the characters
as appropriate to its application(s). (As has been done
for nameprep.)

> 
> If we are agree on the first RFC in Dan's list,
> I suggest to ask Unicode group to provide a table of
> "Semantically equivalent chatacters of UCS", where
> we can define which characters are used for 
> 1) label separators, ie puncturations and formating marks
> 2) structured data indicators, ie. $/%/& ...
> 3) unstructured data identifiers, ie. alphabet, CJKs, 
>  sound marks...
> "IDN identifiers" should be subset of such a table,
> to determine IDN nomalization protocol in the RFC.

The Unicode Consortium's take on identifiers is already
published in section 5.16 of the Unicode Standard, Version
3.0. Updated summary table information, covering the
repertoire of Unicode 3.1.1, can be found in:

ftp://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt

Look there for the ID_Start and ID_Continue properties.
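
As a rough illustration of how such a property table can be
consumed, here is a minimal Python sketch. The SAMPLE text is
only a short excerpt in the file's line format, the helper
names (load_property, is_identifier) are made up for this
example, and the first-character/rest-of-string rule is an
assumption about how one might apply the two properties, not
a statement of what IDN should do.

```python
import re

# Abridged excerpt in the DerivedCoreProperties.txt line format:
#   code point or range ; property name # comment
# This is NOT the full file, just enough lines to exercise the parser.
SAMPLE = """\
0041..005A    ; ID_Start # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; ID_Start # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0030..0039    ; ID_Continue # Nd  [10] DIGIT ZERO..DIGIT NINE
005F          ; ID_Continue # Pc       LOW LINE
"""

LINE = re.compile(r"^([0-9A-F]+)(?:\.\.([0-9A-F]+))?\s*;\s*(\w+)")


def load_property(text, wanted):
    """Collect the code points that the data lines assign to `wanted`."""
    members = set()
    for line in text.splitlines():
        m = LINE.match(line)
        if m and m.group(3) == wanted:
            first = int(m.group(1), 16)
            last = int(m.group(2) or m.group(1), 16)
            members.update(range(first, last + 1))
    return members


id_start = load_property(SAMPLE, "ID_Start")
# The excerpt omits the letter ranges under ID_Continue, so they are
# added back here by hand (an assumption made for this sketch only).
id_continue = load_property(SAMPLE, "ID_Continue") | id_start


def is_identifier(s):
    """First character must be ID_Start, the rest ID_Continue."""
    return (bool(s)
            and ord(s[0]) in id_start
            and all(ord(c) in id_continue for c in s[1:]))
```

With the excerpt above, is_identifier("abc_1") holds and
is_identifier("1abc") does not.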

By the way, "format[t]ing marks" are typically not label
separators. They are mostly ignorable for identifier
formation: they can either be omitted from your identifier
repertoire or, if they are included, be silently ignored.
In either case they do not *delimit* identifiers.
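
To make the "ignorable" point concrete, a small Python sketch
follows. It assumes that "formatting marks" can be approximated
by the general category Cf, and drop_format_chars is a
hypothetical helper name, not part of any IDN draft.

```python
import unicodedata


def drop_format_chars(label):
    """Remove general-category Cf (format) characters from a label."""
    return "".join(c for c in label if unicodedata.category(c) != "Cf")


# U+200D ZERO WIDTH JOINER is a Cf character. It disappears from the
# label instead of splitting it into two labels.
cleaned = drop_format_chars("ab\u200dcd")
```

The result is the single label "abcd", not two labels.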

> "Semantically equivalent chatacters of UCS" means
> characters are equivalent to be used as an IDN identifier 
> when they are 
> 1)case insensitive, 
> 2)size or width insensitive,
> 3)font insensitive (include majority of TC/SC)
> 4)language insensitive (include CJK), 
> 5)combination insensitive(regardless NFC or KNFC). 
> 
>   Case, size, font insensitive is easy to understand,
> and have been addressed. 

What you seem to be aiming at here is a collection of various
kinds of character foldings. Character foldings -- even case
foldings -- are a rather murky area. The UTC position on
case folding is summarized in:

ftp://www.unicode.org/Public/UNIDATA/CaseFolding.txt

Regarding other kinds of foldings, the UTC is currently working
on a Unicode Technical Report on the subject. Nameprep involves
a number of foldings -- but these issues are not, in fact, all
that easy to understand.
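
One small illustration of why even case folding is murky,
sketched in Python. It relies on the built-in str.lower() and
str.casefold() methods, which use whatever Unicode version
ships with the interpreter (not necessarily the tables
discussed here), so treat it as a demonstration of the idea
only.

```python
# Three spellings that a user might expect to match.
samples = ["Straße", "STRASSE", "strasse"]

# Simple lowercasing leaves U+00DF LATIN SMALL LETTER SHARP S alone,
# so two distinct strings remain.
lowered = {s.lower() for s in samples}

# Full case folding maps U+00DF to "ss", so all three collapse to one.
folded = {s.casefold() for s in samples}
```

Here lowered has two members while folded has one, so the
choice of folding changes which labels compare equal.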

> TC/SC shall be under font 
> category, which is not addressed in Unicode. 

That you characterize TC/SC as a font folding illustrates part
of the problem. It is not a font folding, and cannot be handled
that way, except in the grossest manner.

> But 
> language and combination insensitive are the ones I'd 
> like to explain.
> 
>   Language insensitive: ie. circled numbers, circled
> Han numerals, Dingbats, subset of CJKs.  But other
> subset of CJK will be different semantically for each 
> languages, then we have to have separated tables to 
> work with for each or them.

Even with your examples, it isn't clear what you are talking
about here with the term "language insensitive".

> I think we are designing future IDN, we 
> assume all IDN has to be loaded somehow.   If Japanese
> agree on the semantic equivalence on the symbol to be 
> used in IDN, then we can ask  if  the current <business2> 
> handled by existing JIS local system can stay local without 
>  leaking into new IDN, and let <business2> be in 
> the semantically equivalent set for globle communication. 
>  Unicode group has to make such a choice for IETF.

Why? Language use and country conventions are not areas
in which the Unicode Consortium holds expertise or wishes
to establish standards.

> Case study 3):
>   Armenian samll n should be in with Latin n or not 
> is depending on the users' decision, that is we 
> take Unicode group advice on this, since they are the 
> language usage experts to make such a decision. 

Well, we aren't language use experts. But if you want
an expert opinion on *character* identity, U+0576 ARMENIAN
SMALL LETTER NOW is not and never will be grouped with,
confused with, or interchanged with U+006E LATIN SMALL
LETTER N, any more than it would be with U+30F3 KATAKANA
LETTER N or for that matter the Han character U+5C3C ni2,
used in Chinese transliteration of Nepal, Nero, Nile, nylon,
nicotine, Nicaragua, Nietzsche, Nice, and Nixon.
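
The character identities can be checked mechanically. The
following Python sketch uses the standard unicodedata module;
the NFKC-plus-casefold comparison is just one plausible folding
chosen for illustration, not the nameprep procedure.

```python
import unicodedata

# Armenian now, Latin n, Katakana n.
candidates = ["\u0576", "\u006e", "\u30f3"]

names = [unicodedata.name(c) for c in candidates]

# Apply compatibility normalization and case folding to each one and
# collect the distinct results.
folded = {unicodedata.normalize("NFKC", c).casefold() for c in candidates}
```

All three characters remain distinct after folding, so none of
the standard tables groups them together.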

> If 
> the Armenian samll n is in with Latin, then we have 
> another case similar with CJK unification case 1).
> If they are not in with Latin, then we have another 
> case of Bengali and the alikes.

?? I presume this is an allusion to your concern that
U+09EA BENGALI DIGIT FOUR is confusable in appearance (out
of context) with U+0038 DIGIT EIGHT. But Armenian
doesn't look the slightest bit like Latin, so it isn't
clear what you are on about here.
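
For what it is worth, the character data keeps those two digits
apart regardless of how the glyphs look. A minimal Python check
with the standard unicodedata module:

```python
import unicodedata

bengali_four = "\u09ea"
ascii_eight = "\u0038"

# Character names and decimal values come from the Unicode Character
# Database, not from glyph shapes.
bengali_name = unicodedata.name(bengali_four)
bengali_value = unicodedata.decimal(bengali_four)
ascii_value = unicodedata.decimal(ascii_eight)
```

The values are 4 and 8 respectively; visual similarity is a
separate question from character identity.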

>   
> Combination insensitive: <i><acute on top>,<i><acute>
> <acute on top><i> shall be the same,

Well, these aren't all the same. This is a fundamental
misunderstanding of how combining characters work in
Unicode.
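
A short Python sketch of the point, assuming that "acute on
top" means U+0301 COMBINING ACUTE ACCENT. A combining mark
applies to the base character that precedes it, so the order
of the two code points matters:

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# Base letter followed by the mark: NFC composes the pair into
# U+00ED LATIN SMALL LETTER I WITH ACUTE.
base_then_mark = unicodedata.normalize("NFC", "i" + ACUTE)

# Mark followed by the letter: the mark has no preceding base to
# attach to, so nothing composes and the sequence is left as it was.
mark_then_base = unicodedata.normalize("NFC", ACUTE + "i")
```

The first result is a single code point; the second is still
two code points in the original order.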

> all in Set
> <i+acute on top>.  This is the base for normalizing 
> from either a table (TC/SC like) or by a procedure 
> (NFC or KNFC like).

The Unicode Normalization Algorithm (which defines NFC
and NFKC -- not "KNFC") is based on tables, too.

And it is not done by some vague notion of assembling all
the sets that we think ought to be "the same". It is
done by applying the algorithm rigorously to the defining
tables (of decompositions and of composition exclusions).
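
The table-driven nature of the algorithm can be seen from
Python's standard unicodedata module, which exposes the
decomposition mappings directly. This is only a sketch of a
few lookups, using the Unicode version bundled with the
interpreter:

```python
import unicodedata

# Canonical decomposition: i with acute -> i + combining acute.
canonical = unicodedata.decomposition("\u00ed")

# Compatibility decomposition: the "fi" ligature -> f + i.
compat = unicodedata.decomposition("\ufb01")

# NFC applies canonical mappings only, so the ligature survives;
# NFKC also applies compatibility mappings.
lig_nfc = unicodedata.normalize("NFC", "\ufb01")
lig_nfkc = unicodedata.normalize("NFKC", "\ufb01")

# U+0958 DEVANAGARI LETTER QA is a composition exclusion: it has a
# canonical decomposition, and NFC does not recompose it.
qa_nfc = unicodedata.normalize("NFC", "\u0958")
```

Every one of these outcomes follows from an entry in the
published tables rather than from a judgment about which
characters "ought to be the same".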

> 
> So the format is something like:
> <i>:           <I>,<tilt i>,<fat i>,<Greek i>,<Greek I>,...

Wrong from the start. Trying to mix scripts together like
that is completely unextensible.

> <i with acute>:<I with acute>,<i><acute>,<i><acute on top>,<I><acute>...
>  
> For reasonable request, I suggest we limit our scope
> to UCS Plain 0 characters. And we will end up with a 
> nicely display on the Web for us to read and for the 
> public to judge, instead of ieft draft with all the 
> U+E456.. which is meant for forting data and spotting
> checks.

The relevant data tables regarding Unicode normalization are
all already posted and available for the public to judge.

Beyond that, this effort to get someone (who?) to define all
the "semantically equivalent characters" just seems like
an ill-destined detour from actually getting the work on
IDN accomplished.

--Ken