[idn] new I-D: Safely Encoding of likeness information into ACE label version 0.2
Hi,
I am posting this new I-D to the mailing list for review before
14 August. It is still somewhat half-baked, but it helps to solve
the problem of look-alike characters, at some cost.
I will repost a revised version next week.
Criticism and support are both welcome for further discussion.
Regards,
Soobok Lee, lsb@postel.co.kr
=========================================================================
Internet Draft Soobok Lee
draft-lsb-lookalike-00.txt Postel Services, Inc
28 Jul, 2001
Expires in six months
Safely Encoding of likeness information into ACE label
version 0.2
Status of this Memo
This document is an Internet-Draft and is in full conformance with
all provisions of Section 10 of RFC2026.
Distribution of this document is unlimited. Please send comments to
the author lsb@postel.co.kr
Abstract
For a Unicode character that has one or more look-alike characters,
we define both its look-alike normalization form and a likeness index, and
suggest a new ACE prefixing rule in which likeness indices are encoded as
a sequence of mixed-case Latin letters. The likeness index is used to
restore the pre-normalization form of the ACE-decoded label and, at the same
time, does not affect case-insensitive label comparison
in existing applications and DNS servers.
Contents
Overview
ACE comparison
Maximum length of ACE label
Implication of restoring pre-normalization forms
More work to be done
Security considerations
References
Author
Overview
The Unicode Standard contains many sets of look-alike characters that are not
yet documented in detail. Introducing the Unicode character set
into IDN, the primary Internet identifier, requires rigorous work
in this area in the near future for security reasons.
ACE algorithms such as [DUDE02] and [AMCACEZ] preserve the case information
of the original IDN label by uppercasing selected base32 letters of the
ACE label. This does not affect case-insensitive label comparison
in applications and DNS servers.
This draft builds on the case-preserving and case-insensitive nature
of the IDNA architecture [IDNA] and extends it to fulfill the need to map
look-alike letters into a unified one while retaining the information
about the pre-unification letter, so that ACE-decoded labels can later be
rendered correctly for end users.
Let's assume that a future version of the Unicode Standard defines a
look-alike normalization NFLA.
Consider Unicode code points a1, a2, a3 that satisfy
NFLA(a1) == NFLA(a2) == NFLA(a3) == a1.
Let's define a function LA_SET(uc) so that:
LA_SET(a1)={a1,a2,a3}
In this case the size of LA_SET(a1) is 3, which is called the
'current minimal likeness size'.
We should also define a 'maximal likeness size' for each Unicode code point.
For that, we define a function LikeSize(uc) so that
LikeSize(a1) = 4, LikeSize(a2) = 4, LikeSize(a3) = 4.
It is wise to choose 4 instead of 3 to make room for future additions
to the Unicode repertoire of scripts that may contain new look-alike characters.
We need to look carefully into proposed or approved new scripts that will be
added to the Unicode Standard in the near future.
We can express a1, a2, a3 as these pairs (2-tuples):
a1=(a1,0)
a2=(a1,1)
a3=(a1,2)
The second element of each of these tuples is called the 'likeness index',
for which we define a function so that
LikeIndex(a1) = 0,
LikeIndex(a2) = 1,
LikeIndex(a3) = 2.
Let's define a function
LA_TUPLE(uc) = ( NFLA(uc), LikeIndex(uc), LikeSize(uc) ).
Let's define a restoring function LA_CHAR(uc,i) so that:
LA_CHAR(a1,0)=a1,
LA_CHAR(a1,1)=a2,
LA_CHAR(a1,2)=a3.
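The definitions above can be sketched in Python as follows. This is only an illustration: no NFLA exists in Unicode, so the table LA_TABLE, its members, their ordering and the sizes 4 and 8 are assumptions chosen to match the Cyrillic example used later in this draft (Greek omicron is added only to fill index 1). Characters without look-alikes are assumed to have likeness size 1.

```python
# Hypothetical look-alike table (illustrative only, not taken from Unicode).
# canonical character -> (members ordered by likeness index, maximal likeness size)
LA_TABLE = {
    "a": (["a", "\u0430"], 4),            # Latin a, Cyrillic a
    "o": (["o", "\u03bf", "\u043e"], 8),  # Latin o, Greek omicron, Cyrillic o
}

# Reverse map: any member -> its canonical (normalized) character.
_CANON = {m: canon for canon, (members, _) in LA_TABLE.items() for m in members}


def nfla(ch):
    """Look-alike normalization of one character (identity if it has no look-alikes)."""
    return _CANON.get(ch, ch)


def la_set(ch):
    """All currently known look-alikes of ch, ordered by likeness index."""
    canon = nfla(ch)
    return list(LA_TABLE[canon][0]) if canon in LA_TABLE else [ch]


def like_index(ch):
    """Position of ch inside its look-alike set."""
    return la_set(ch).index(ch)


def like_size(ch):
    """Maximal likeness size reserved for ch (1 means: assumed unambiguous)."""
    canon = nfla(ch)
    return LA_TABLE[canon][1] if canon in LA_TABLE else 1


def la_tuple(ch):
    return (nfla(ch), like_index(ch), like_size(ch))


def la_char(canon, i):
    """Restore the original character from its normalized form and likeness index."""
    return la_set(canon)[i]
```

With this table, la_tuple applied to Cyrillic 'o' (U+043E) gives ("o", 2, 8), and la_char("a", 1) gives Cyrillic 'a' (U+0430).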
The main idea of this draft is to incorporate the ASCII-encoded likeness
index and likeness size of each ambiguous character of an IDN label
as a sequence of uppercase and lowercase Latin letters inserted after
the ACE prefix.
For example, for a Cyrillic IDN label <cyrillic a><cyrillic zhe><cyrillic o>,
we have an ACE label without look-alike normalization:
dq--{<cyrillic a>}{<cyrillic zhe>}{<cyrillic o>}
Only Cyrillic 'a' and 'o' have Latin look-alikes, so we can make a
new ACE label WITH look-alike normalization:
dq--aAcCc--{<latin a>}{<cyrillic zhe>}{<latin o>}
The uppercase letters in "aAcCc" denote bit '1' and the
lowercase letters denote bit '0'. Each run of uppercase and lowercase
forms of one letter forms a bitstring that denotes the likeness index.
The letter itself (a-z) denotes the offset of the corresponding Unicode
character in the IDN label character sequence.
In the example above, "aA" denotes a likeness index of 1 and a likeness
size of 4; "cCc" denotes a likeness index of 2 and a likeness size of 8.
"aA" supplements <latin a> to form <cyrillic a>, and
"cCc" supplements <latin o> to form <cyrillic o>.
<cyrillic zhe> carries no likeness information, since it is assumed to have no
look-alike.
2^(the number of repetitions of a letter) equals the likeness size, which
shall be large enough that it need not change in future versions of the Unicode
look-alike normalization.
If the offset index in a long IDN label needs to reach 26 or more, we can insert
the unused digit '9' to mark a milestone from which 26 is added to the offset
and the letters begin with 'a' again.
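A minimal sketch of this encoding step, assuming the bits are written most significant first (which is what "aA" = 1 implies) and that each character of the label is described by a (likeness index, likeness size) pair, with size 1 meaning "no look-alike". The function name encode_likeness and its input format are illustrative and not part of the draft.

```python
import string


def encode_likeness(entries):
    """Build the mixed-case likeness sequence for a label.

    entries: one (like_index, like_size) pair per label character;
             like_size must be a power of two, and 1 means unambiguous.
    """
    out = []
    milestones = 0  # number of '9' markers emitted so far
    for offset, (index, size) in enumerate(entries):
        if size <= 1:
            continue  # unambiguous characters contribute nothing
        if not 0 <= index < size:
            raise ValueError("likeness index out of range")
        # Emit '9' until the current 26-letter window covers this offset.
        while offset >= 26 * (milestones + 1):
            out.append("9")
            milestones += 1
        letter = string.ascii_lowercase[offset - 26 * milestones]
        nbits = size.bit_length() - 1          # size 4 -> 2 bits, size 8 -> 3 bits
        bits = format(index, "0{}b".format(nbits))
        # Uppercase encodes bit 1, lowercase encodes bit 0.
        out.append("".join(letter.upper() if b == "1" else letter for b in bits))
    return "".join(out)
```

For the Cyrillic example, encode_likeness([(1, 4), (0, 1), (2, 8)]) yields "aAcCc", the sequence placed between "dq--" and the second "--".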
As long as applications and DNS servers do not case-fold the ACE label, this
look-alike information is retained throughout the process, so that the
pre-normalization native-script label can be restored.
The restoring process is as follows:
First,
we ACE-decode the label part of our new ACE label and get
a sequence of Unicode code points, LABEL.
Second,
from the likeness indices part of our new ACE label, we construct
an array of pairs, each holding a likeness index and the offset of
the target character in the decoded label LABEL.
Third,
for each pair (i, offset), we replace the target
code point in the decoded label LABEL with the restored original
character LA_CHAR(LABEL[offset], i).
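The second and third steps can be sketched as below. The first step (ACE decoding) is taken as already done, so the function receives the decoded, look-alike normalized label. LA_MEMBERS is a hypothetical look-alike table, and the names parse_likeness and restore are illustrative only.

```python
# Hypothetical table: normalized character -> look-alikes ordered by likeness index.
LA_MEMBERS = {
    "a": ["a", "\u0430"],            # Latin a, Cyrillic a
    "o": ["o", "\u03bf", "\u043e"],  # Latin o, Greek omicron, Cyrillic o
}


def parse_likeness(seq):
    """Turn a sequence such as 'aAcCc' into (likeness index, offset) pairs."""
    pairs = []
    base = 0  # grows by 26 at every '9' milestone
    i = 0
    while i < len(seq):
        if seq[i] == "9":
            base += 26
            i += 1
            continue
        letter = seq[i].lower()
        if not "a" <= letter <= "z":
            raise ValueError("unexpected character in likeness sequence")
        value = 0
        # Consume the whole run of this letter; uppercase = bit 1, lowercase = bit 0.
        while i < len(seq) and seq[i].lower() == letter:
            value = (value << 1) | (1 if seq[i].isupper() else 0)
            i += 1
        pairs.append((value, base + ord(letter) - ord("a")))
    return pairs


def restore(label, seq):
    """Replace each target character of the decoded label by LA_CHAR(label[offset], i)."""
    chars = list(label)
    for index, offset in parse_likeness(seq):
        chars[offset] = LA_MEMBERS[chars[offset]][index]
    return "".join(chars)
```

Note that a case-folded sequence such as "aaccc" restores to the normalized label itself, which is the behaviour one would expect when the likeness information has been lost in transit.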
ACE comparison
We have three IDN labels like these:
IDN1 = <cyrillic a><cyrillic zhe><cyrillic o>
IDN2 = <latin a><cyrillic zhe><cyrillic o>
IDN3 = <latin a><cyrillic zhe><latin o>
ACE(IDN1) = dq--aAcCc--{<latin a>}{<cyrillic zhe>}{<latin o>}
ACE(IDN2) = dq--aacCc--{<latin a>}{<cyrillic zhe>}{<latin o>}
ACE(IDN3) = dq--aaccc--{<latin a>}{<cyrillic zhe>}{<latin o>}
IDN1, IDN2 and IDN3 are equivalent modulo look-alike normalization.
Their three ACE labels have differently cased likeness indices
but are regarded as the same domain in case-insensitive comparison
in applications and DNS servers.
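The comparison property can be checked with a few lines of Python. The string "xyz" is a hypothetical stand-in for the ACE encoding of the normalized label, which the draft does not spell out; same_domain is an illustrative helper that mimics ASCII case-insensitive label comparison.

```python
# "xyz" is a hypothetical stand-in for the ACE-encoded normalized label
# {<latin a>}{<cyrillic zhe>}{<latin o>}; the draft does not give its real value.
BODY = "xyz"

ACE_IDN1 = "dq--aAcCc--" + BODY
ACE_IDN2 = "dq--aacCc--" + BODY
ACE_IDN3 = "dq--aaccc--" + BODY


def same_domain(x, y):
    """ASCII case-insensitive comparison of two labels."""
    return x.lower() == y.lower()


# The labels differ byte for byte, yet compare equal when case is ignored.
all_equal = same_domain(ACE_IDN1, ACE_IDN2) and same_domain(ACE_IDN2, ACE_IDN3)
```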
If an IDN has no ambiguous characters, we can omit the additional '--' in the ACE label.
IDN4 = <cyrillic zhe><cyrillic zhe>
ACE(IDN4) = dq--{<cyrillic zhe>}{<cyrillic zhe>}
If an IDN is look-alike normalized into an all-Latin LDH label,
it should be registered not as an IDN but as an LDH domain, and in
this case we cannot provide likeness preservation.
Maximum length of ACE label
This new scheme adds some overhead to the ACE label:
the additional "--" and the encoded likeness information.
If we assume the average likeness size to be 2 (1 bit):
overhead = 2 + 1 * (number of ambiguous characters in a label)
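The formula generalizes to likeness sizes other than 2: each ambiguous character costs log2(likeness size) letters. The sketch below assumes power-of-two sizes, ignores any '9' milestone digits, and treats a label with no ambiguous characters as having no overhead; the function name likeness_overhead is illustrative.

```python
def likeness_overhead(sizes):
    """Extra ACE characters needed for a label.

    sizes: the likeness size of every character in the label
           (1 for characters assumed to have no look-alike).
    """
    ambiguous = [s for s in sizes if s > 1]
    if not ambiguous:
        return 0  # no likeness sequence and no additional "--"
    # 2 for the additional "--", plus log2(size) letters per ambiguous character.
    return 2 + sum(s.bit_length() - 1 for s in ambiguous)
```

For the Cyrillic example with sizes 4, 1 and 8, the result is 7, the length of "aAcCc--". With three ambiguous characters of size 2 it is 5, in line with the formula above.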
Since most Han/Hangeul letters have no look-alike characters,
the overall ACE label efficiency for Han/Hangeul would not be affected.
Latin, Greek, Cyrillic, Katakana and many Indian scripts have
many look-alike characters.
An efficient ACE encoding of IDN labels is required for this scheme.
Implication of restoring pre-normalization forms
This new ACE prefixing scheme retains the look-alike information,
so that we can restore the original native-script labels before
look-alike normalization even when they contain look-alike characters
across scripts.
For example, Katakana 'ka' (U+30AB) and the Chinese character 'power' (U+529B)
look the same. We can assign likeness indices 0 and 1 to 'ka' and 'power',
respectively.
If we normalized 'ka' into 'power' without encoded likeness information,
we could no longer restore 'ka', and we could not avoid font-rendering
problems and conflicts of interest between the countries concerned.
With this encoded likeness information, we can restore 'ka'
and we have no such problem.
More work to be done
Some sequences of characters look similar to a single character or
to other sequences of characters.
Most of these sequences are normalized and unified by the KC
normalization in NAMEPREP [NAMEPREP03], but some visual similarities
are still not completely eliminated.
This subject needs further elaboration.
Security considerations
This scheme suggests that ACE labels be prefixed with additional
look-alike information encoded in sequences of cased letters.
It does not introduce any security hole into IDN.
References
[UNICODE] The Unicode Consortium, "The Unicode Standard",
http://www.unicode.org/unicode/standard/standard.html.
[IDNA] Patrik Faltstrom, Paul Hoffman, "Internationalizing Host
Names In Applications (IDNA)", draft-ietf-idn-idna-03.
[NAMEPREP03] Paul Hoffman, Marc Blanchet, "Preparation
of Internationalized Host Names", July 19, 2001
draft-ietf-idn-nameprep-05.
[DUDE02] Mark Welter, Brian Spolarich, Adam Costello,
"DUDE: Differential Unicode Domain Encoding", 2001-May-31,
draft-ietf-idn-dude-02.
[AMCACEZ] Adam Costello, "AMC-ACE-z version 0.2.1",
2001-May-31, draft-ietf-idn-amc-ace-z-00, latest version at
http://www.cs.berkeley.edu/~amc/charset/amc-ace-z.gz
Author
Soobok Lee <lsb@postel.co.kr>
Postel Services, Inc.
http://www.postel.co.kr
Tel: +82-11-9774-2737