Re: [idn] what are the IDN identifiers?
On Wed, 28 Nov 2001 16:00:29 -0800 (PST) Kenneth Whistler
<kenw@sybase.com> writes:
> Liana ruminated:
>
> > [STD13] defines LDH as the DNS identifiers, so
> > what are the IDN identifiers? UCS is too big and contains
> > many semantically equivalent characters for IDN. Should we
> > ask the Unicode Consortium for a definition table of
> > semantically equivalent character sets?
>
> In my opinion, no. This concept of "semantically equivalent
> character sets" is way too imprecisely defined to make sense.
>
> What the Unicode Consortium provides is a large number of
> precisely defined data tables, giving various properties
> of the entire set of characters in the UCS. It is then
> up to a group such as this, in the context of their particular
> requirements (as for IDN identifiers) to make use of those
> property tables to pick and choose among the characters
> as appropriate to their application(s). (As has been done
> for nameprep.)
>
So now I think we need something less precise, built on
their precise definitions, to group characters into larger
equivalence sets for IDN comparison purposes.
> >
> > If we agree on the first RFC in Dan's list,
> > I suggest asking the Unicode group to provide a table of
> > "Semantically equivalent characters of UCS", where
> > we can define which characters are used for
> > 1) label separators, i.e. punctuation and formating marks
> > 2) structured data indicators, i.e. $/%/& ...
> > 3) unstructured data identifiers, i.e. alphabets, CJK,
> > sound marks...
> > "IDN identifiers" should be a subset of such a table,
> > to determine the IDN normalization protocol in the RFC.
>
> The Unicode Consortium's take on identifiers is already
> published in section 5.16 of the Unicode Standard, Version
> 3.0. Updated summary table information, covering the
> repertoire of Unicode 3.1.1 can be found in:
>
> ftp://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
>
> and look for the ID_Start and ID_Continue properties.
>
Thanks for the pointer. These entries
show the problem:
3400..4DB5 ; ID_Start # Lo [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FA5 ; ID_Start # Lo [20902] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FA5
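As a rough illustration of how such property entries can be consumed, here is a minimal Python sketch that parses the two ID_Start lines quoted above and tests code points against them. The names SAMPLE, parse_ranges and has_property are made up for this example, and the sample holds only those two entries, not the full DerivedCoreProperties.txt file.

```python
import re

# Only the two entries quoted in this message; the real file has many more.
SAMPLE = """\
3400..4DB5 ; ID_Start # Lo [6582] CJK UNIFIED IDEOGRAPH-3400..CJK UNIFIED IDEOGRAPH-4DB5
4E00..9FA5 ; ID_Start # Lo [20902] CJK UNIFIED IDEOGRAPH-4E00..CJK UNIFIED IDEOGRAPH-9FA5
"""

# "XXXX" or "XXXX..YYYY", then ";", then the property name.
LINE_RE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)")


def parse_ranges(text, prop):
    """Return (first, last) code point pairs listed for property `prop`."""
    ranges = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m and m.group(3) == prop:
            first = int(m.group(1), 16)
            last = int(m.group(2) or m.group(1), 16)  # single code point if no ".."
            ranges.append((first, last))
    return ranges


def has_property(cp, ranges):
    """True if code point `cp` falls inside any parsed range."""
    return any(first <= cp <= last for first, last in ranges)


id_start = parse_ranges(SAMPLE, "ID_Start")
print(has_property(0x4E2D, id_start))  # a CJK unified ideograph in the second range
print(has_property(0x0041, id_start))  # Latin A is absent from this two-line sample
```

The range sizes computed from the parsed pairs agree with the bracketed counts in the entries (6582 and 20902), which is a cheap sanity check on the parse.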
> By the way, "format[t]ing marks" are typically not label
> separators. They are mostly ignorable for identifier
> formation -- and can either be omitted from your identifier
> repertoire or can be ignored (included silently) if not
> omitted from your identifier repertoire. In either case they
> do not *delimit* identifiers.
>
> > "Semantically equivalent characters of UCS" means
> > characters that are equivalent for use as an IDN identifier
> > when they are
> > 1) case insensitive,
> > 2) size or width insensitive,
> > 3) font insensitive (including the majority of TC/SC),
> > 4) language insensitive (including CJK),
> > 5) combination insensitive (regardless of NFC or KNFC).
> >
> > Case, size, and font insensitivity are easy to understand,
> > and have been addressed.
>
> What you seem to be aiming at here is a collection of various
> kinds of character foldings. Character foldings -- even case
> foldings -- are a rather murky area. The UTC position on
> case folding is summarized in:
>
> ftp://www.unicode.org/Public/UNIDATA/CaseFolding.txt
>
> Regarding other kinds of foldings, the UTC is currently working
> on a Unicode Technical Report on the subject. Nameprep involves
> a number of foldings -- but these issues are not, in fact, all
> that easy to understand.
>
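To make the point about case folding concrete, here is a small Python sketch. It relies on str.casefold(), which applies Unicode full case folding using whatever Unicode version ships with the interpreter (not the 3.1.1 data cited above), and it is not nameprep. The helper name fold_equal is invented for this example.

```python
def fold_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode full case folding."""
    return a.casefold() == b.casefold()


# Folding is not the same as lowercasing: sharp s folds to "ss".
print(fold_equal("STRASSE", "stra\u00DFe"))      # True
print("\u00DF".lower() == "\u00DF".casefold())   # False

# Greek final sigma folds to the ordinary small sigma.
print("\u03C2".casefold() == "\u03C3")           # True
```

Even this one folding changes string length and merges distinct letters, which is part of why foldings are harder than they first look.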
We have discussed the narrowness of this document.
But it is great that the problem is being worked on, and
my suggestion is too late to be of any use here. Can you
provide some insight into the work?
> > TC/SC shall be under font
> > category, which is not addressed in Unicode.
>
> That you characterize TC/SC as a font folding illustrates part
> of the problem. It is not a font folding, and cannot be handled
> that way, except in the grossest manner.
>
You can complain about how gross the term I use is, but
I am talking about code points to be treated in IDN as equivalent
to another code point when we are doing IDN matching.
For example TC/SC, in addition to Kanji TC, Kanji SC, and Hanja
TC. I am NOT talking about folding, sorry.
> > But
> > language and combination insensitive are the ones I'd
> > like to explain.
> >
> > Language insensitive: e.g. circled numbers, circled
> > Han numerals, Dingbats, a subset of CJK. But other
> > subsets of CJK will differ semantically for each
> > language, so we have to have separate tables to
> > work with for each of them.
>
> Even with your examples, it isn't clear what you are talking
> about here with the term "language insensitive".
>
> > I think we are designing the future IDN, and we
> > assume all IDN has to be loaded somehow. If the Japanese
> > agree on the semantic equivalence of the symbol to be
> > used in IDN, then we can ask if the current <business2>
> > handled by the existing JIS local system can stay local without
> > leaking into the new IDN, and let <business2> be in
> > the semantically equivalent set for global communication.
> > The Unicode group has to make such a choice for the IETF.
>
> Why? Language use and country conventions are not areas
> in which the Unicode Consortium holds expertise or wishes
> to establish standards.
>
Are you pushing the codepoint issue back to JET? And aren't we
only dealing with codepoints that come from the UTC, remember?
> > Case study 3):
> > Whether Armenian small n should be in with Latin n or not
> > depends on the users' decision; that is, we
> > take the Unicode group's advice on this, since they are the
> > language usage experts to make such a decision.
>
> Well, we aren't language use experts. But if you want
> an expert opinion on *character* identity, ARMENIAN SMALL
> LETTER NOW is not and never will be grouped with,
> confused with, interchanged with LATIN SMALL LETTER N, any
> more than it would be with U+30F3 KATAKANA LETTER N or
> for that matter the Han character U+5C3C ni2, used in
> Chinese transliteration of Nepal, Nero, Nile, nylon,
> nicotine, Nicaragua, Nietzsche, Nice, and Nixon.
>
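The character identities named here can be checked against the Unicode Character Database bundled with Python. This is only a sketch: match_key is a hypothetical comparison key (NFKC followed by case folding) chosen for illustration, not nameprep and not any agreed IDN matching rule.

```python
import unicodedata

# Latin n, Armenian now, Katakana n
CANDIDATES = ["\u006E", "\u0576", "\u30F3"]


def match_key(ch: str) -> str:
    """Hypothetical comparison key: NFKC normalization, then case folding."""
    return unicodedata.normalize("NFKC", ch).casefold()


for ch in CANDIDATES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# None of the standard mappings used here merges the three characters.
distinct_keys = {match_key(ch) for ch in CANDIDATES}
print(len(distinct_keys))  # 3
```

Treating any of these as the same label character would therefore need an extra table defined outside the Unicode normalization and case folding data.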
So, according to your assertion (I'd like the UTC's confirmation
on this of course, can you give me a pointer?) this is no longer
a question. Then how do you propose to deal
with an n from Armenian as opposed to the n from Latin in a label?
> > If
> > the Armenian small n is in with Latin, then we have
> > another case similar to CJK unification case 1).
> > If it is not in with Latin, then we have another
> > case like Bengali and the like.
>
> ?? I presume this is an allusion to your concern that
> U+09EA BENGALI DIGIT FOUR is confusable (out of context)
> with the appearance of U+0038 DIGIT EIGHT. But Armenian
> doesn't look the slightest bit like Latin, so it isn't
> clear what you are on about here.
>
It seems you have a way to handle context. Then please
describe how your context is saved from input, passed
on to IDN, sent through DNS, and brought out at the other end
correctly.
> >
> > Combination insensitive: <i><acute on top>,<i><acute>
> > <acute on top><i> shall be the same,
>
> Well, these aren't all the same. This is a fundamental
> misunderstanding of how combining characters work in
> Unicode.
Okay, give me another pointer so that I can learn.
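One way to see why the three sequences are not all the same is to run them through NFC with Python's unicodedata module. This sketch assumes "<acute on top>" means U+0301 COMBINING ACUTE ACCENT and "<acute>" means the spacing U+00B4 ACUTE ACCENT; that reading of the notation is a guess.

```python
import unicodedata

base_then_mark = "i\u0301"      # i + COMBINING ACUTE ACCENT
base_then_spacing = "i\u00B4"   # i + ACUTE ACCENT (a spacing character)
mark_then_base = "\u0301i"      # combining mark placed before the letter
precomposed = "\u00ED"          # LATIN SMALL LETTER I WITH ACUTE


def nfc(s: str) -> str:
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


print(nfc(base_then_mark) == precomposed)     # True: the mark follows its base
print(nfc(base_then_spacing) == precomposed)  # False: spacing accent does not combine
print(nfc(mark_then_base) == precomposed)     # False: the mark has no preceding base
```

A combining mark applies to the base character before it, so only the first sequence is canonically equivalent to the precomposed letter.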
>
> > all in Set
> > <i+acute on top>. This is the base for normalizing
> > from either a table (TC/SC like) or by a procedure
> > (NFC or KNFC like).
>
> The Unicode Normalization Algorithm (which defines NFC
> and NFKC -- not "KNFC") is based on tables, too.
>
> And it is not done by some vague notion of assembling all
> the sets that we think ought to be "the same". It is
> done by applying the algorithm rigorously to the defining
> tables (of decompositions and of composition exclusions).
>
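The table-driven nature of normalization can be glimpsed through Python's unicodedata module, which exposes the decomposition mappings. This is a sketch against the Unicode version bundled with the interpreter, not the 3.1.1 data; the three sample characters are chosen only for illustration.

```python
import unicodedata

# Canonical decomposition: i with acute -> i + combining acute
print(unicodedata.decomposition("\u00ED"))   # 0069 0301

# Compatibility decomposition: fullwidth A -> A, tagged <wide>
print(unicodedata.decomposition("\uFF21"))   # <wide> 0041

# NFC applies only canonical mappings; NFKC also applies compatibility ones.
fullwidth_nfc = unicodedata.normalize("NFC", "\uFF21")
fullwidth_nfkc = unicodedata.normalize("NFKC", "\uFF21")

# U+0958 DEVANAGARI LETTER QA is a composition exclusion:
# NFC leaves it decomposed instead of recomposing it.
qa_nfc = unicodedata.normalize("NFC", "\u0958")
print(qa_nfc == "\u0915\u093C")              # True
```

The results follow mechanically from the decomposition and exclusion tables, with no judgment about which characters "ought" to be the same.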
What is the purpose of these decompositions and
compositions? For input matching? For DNS matching?
For IDN folding? I am looking for IDN matching. Are we
talking about the same thing?
> >
> > So the format is something like:
> > <i>: <I>,<tilt i>,<fat i>,<Greek i>,<Greek I>,...
>
> Wrong from the start. Trying to mix scripts together like
> that is completely unextensible.
>
You are right. This is an example of characters from different
scripts that may not look alike but serve the same purpose in
IDN and should be treated as equivalent, for example the Chinese
period and the ASCII period.
> > <i with acute>:<I with acute>,<i><acute>,<i><acute on top>,
> > <I><acute>...
> >
> > As a reasonable request, I suggest we limit our scope
> > to UCS Plane 0 characters. Then we will end up with a
> > nice display on the Web for us to read and for the
> > public to judge, instead of an IETF draft with all the
> > U+E456.. which is meant for forting data and spotting
> > checks.
>
> The relevant data tables regarding Unicode normalization are
> all already posted and are public to judge.
>
> Beyond that, this effort to get someone (who?) to define all
> the "semantically equivalent characters" just seems like
> an ill-destined detour from actually getting the work on
> IDN accomplished.
>
> --Ken
>
I wonder, what work on IDN are you referring to?
Liana