
[idn] what are the IDN identifiers?



Hi, James,

  [STD13] defines LDH (letters, digits, hyphen) as the DNS 
identifiers, so what are the IDN identifiers?  UCS is too big and 
contains many semantically equivalent characters for IDN.  Should we 
ask the Unicode Consortium for a table defining sets of 
semantically equivalent characters?

If we agree on the first RFC in Dan's list,
I suggest we ask the Unicode group to provide a table of
"Semantically equivalent characters of UCS", in which
we can define which characters are used as
1) label separators, i.e. punctuation and formatting marks
2) structured data indicators, i.e. $/%/& ...
3) unstructured data identifiers, i.e. alphabets, CJK, 
 sound marks...
"IDN identifiers" should be a subset of such a table,
which would determine the IDN normalization protocol in the RFC.

"Semantically equivalent characters of UCS" means
characters that are equivalent when used as an IDN identifier,
that is, when matching is
1) case insensitive,
2) size or width insensitive,
3) font insensitive (including the majority of TC/SC,
   i.e. Traditional/Simplified Chinese),
4) language insensitive (including CJK),
5) combination insensitive (regardless of NFC or NFKC).
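
As a rough illustration of criteria 1), 2), part of 3) and 5), the
sketch below folds a label with Python's standard unicodedata module.
The function name fold_label is made up for this example, and NFKC
plus case folding is only an approximation of the proposed table: it
does not cover TC/SC or the language-dependent CJK cases.

```python
import unicodedata


def fold_label(label: str) -> str:
    """Hypothetical folding: NFKC, then case folding, then NFKC again."""
    # NFKC removes width and font variants and composes combining sequences.
    folded = unicodedata.normalize("NFKC", label)
    # casefold() removes case distinctions.
    folded = folded.casefold()
    # Case folding can leave text unnormalized, so normalize once more.
    return unicodedata.normalize("NFKC", folded)


print(fold_label("\uff29"))   # input: FULLWIDTH LATIN CAPITAL LETTER I
print(fold_label("\u2139"))   # input: INFORMATION SOURCE (font variant of i)
print(fold_label("\u2460"))   # input: CIRCLED DIGIT ONE
```

Under these assumptions the fullwidth and font variants of "i" fold
to the same identifier, and a circled digit folds to the plain digit.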

  Case, size and font insensitivity are easy to understand
and have been addressed.  TC/SC should fall under the font 
category, which Unicode does not address.  Language and 
combination insensitivity are the ones I'd like to 
explain.

  Language insensitive: e.g. circled numbers, circled
Han numerals, Dingbats, and a subset of CJK.  But another
subset of CJK differs semantically from language to 
language, so we have to have separate tables to 
work with for each of them. 

Case study 1): 
Kanji <business> has three forms, 
  <business1> <business1'> and <business2>, 
which are the same as the Chinese 
  <business1> <business1'> and <business2>. 

 Chinese uses <business2> as the IDN id for all three.
Japanese agrees to put <business1> <business1'> together, but 
wants <business2> to be a different semantic 
set, since they are semantically different in their
accounting database.  

The issue is which class Kanji <business2> should
be in.  The current [TSconv] takes it out of the table, so it is
undecided.   Since we are designing the future IDN, I think we can 
assume every IDN has to be loaded somehow.   If the Japanese
agree on the semantic equivalence of the symbol as 
used in IDN, then we can ask whether the current <business2> 
handled by existing local JIS systems can stay local without 
leaking into the new IDN, and let <business2> be in 
the semantically equivalent set for global communication. 
 The Unicode group has to make such a choice for the IETF.

Case study 2):
If there is a <business3> in Kanji but not
in Chinese, then <business3> is a set by itself. 

Case study 3):
  Whether Armenian small n should be grouped with Latin n 
depends on the users' decision; that is, we 
take the Unicode group's advice on this, since they are the 
language usage experts who can make such a decision.  If 
the Armenian small n is grouped with Latin, then we have 
another case similar to the CJK unification in case 1).
If it is not grouped with Latin, then we have another 
case like Bengali and the like.
  
Combination insensitive: <i><acute on top>, <i><acute> and
<acute on top><i> shall be the same, all in the set
<i+acute on top>.  This is the basis for normalizing 
either from a table (TC/SC like) or by a procedure 
(NFC or NFKC like).
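
To make the procedural option concrete, here is a small check using
Python's standard unicodedata module.  It shows that NFC composes a
base letter followed by a combining acute, but does not compose a
combining mark placed before the base, so that ordering would have
to be handled by a table or an extra rule.  The variable names are
only for illustration.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"   # COMBINING ACUTE ACCENT
I_ACUTE = "\u00ed"           # LATIN SMALL LETTER I WITH ACUTE

base_then_mark = "i" + COMBINING_ACUTE   # <i><acute on top>
mark_then_base = COMBINING_ACUTE + "i"   # <acute on top><i>


def nfc(text: str) -> str:
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# NFC composes a base followed by its combining mark into one character.
print(nfc(base_then_mark) == I_ACUTE)
# A mark preceding the base is not composed with it by NFC.
print(nfc(mark_then_base) == I_ACUTE)
```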

So the format is something like:
<i>:           <I>,<tilt i>,<fat i>,<Greek i>,<Greek I>,...
<i with acute>:<I with acute>,<i><acute>,<i><acute on top>,<I><acute>...
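
A table in this format could be applied roughly as sketched below.
The entries are illustrative guesses, not an agreed table: fullwidth
i stands in for <fat i>, and <tilt i> is left out.  The names
EQUIVALENCE_TABLE and canonical_label are hypothetical.  Combining
sequences are composed with NFC first, so the table only needs
single-character entries.

```python
import unicodedata

# Canonical identifier character -> characters treated as equivalent to it.
EQUIVALENCE_TABLE = {
    "i": ["I", "\uff49", "\u03b9", "\u0399"],  # I, fullwidth i, Greek iota, Greek Iota
    "\u00ed": ["\u00cd"],                      # i with acute <- I with acute
}

# Invert the table so each variant points at its canonical character.
TO_CANONICAL = {
    variant: canonical
    for canonical, variants in EQUIVALENCE_TABLE.items()
    for variant in variants
}


def canonical_label(label: str) -> str:
    """Compose combining sequences, then map each character via the table."""
    composed = unicodedata.normalize("NFC", label)
    return "".join(TO_CANONICAL.get(ch, ch) for ch in composed)


print(canonical_label("I\u0301") == "\u00ed")
print(canonical_label("\u0399") == "i")
```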
 
To keep the request reasonable, I suggest we limit our scope
to UCS Plane 0 characters. We will then end up with a 
nice display on the Web for us to read and for the 
public to judge, instead of an IETF draft full of 
U+E456.. code points, which are meant for sorting data and 
spot checks.
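
The Plane 0 restriction is easy to test mechanically.  A minimal
sketch, with in_plane_0 as a made-up helper name:

```python
def in_plane_0(label: str) -> bool:
    """Return True if every code point is in Plane 0 (U+0000..U+FFFF)."""
    # A Python str is a sequence of code points, so ord() gives each one.
    return all(ord(ch) <= 0xFFFF for ch in label)


print(in_plane_0("caf\u00e9"))    # Latin letters, all in Plane 0
print(in_plane_0("\U00020000"))   # a CJK Extension B ideograph, in Plane 2
```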

Regards,

Liana