[idn] rederivation of an IDN architecture



idn-arch-deriv 0.0.0 (2001-Aug-09-Thu)
Adam M. Costello <amc@cs.berkeley.edu>

This is a derivation of an Internationalized Domain Name architecture
that is extremely similar (if not perfectly identical) to IDNA.  My
purpose in composing this derivation is to expose dependencies between
requirements, goals, and assumptions, in order to facilitate fruitful
discussions.  For example, if a certain requirement is necessary to
achieve a certain goal, it is pointless to attack the requirement when
arguing with someone who believes in the goal; one needs to argue
against the goal itself, and perhaps against the assumptions underlying
the goal.

The bulk of this derivation is definitions.  Readers may wish to skip
the definitions the first time through, and refer back to them when
necessary.

Since there is a preexisting standard for domain name syntax, it is
useful to have some definitions for talking about that syntax:

    basic character:  Any of the 128 characters in the ASCII repertoire
    (not to be confused with the integers or octets mapped to those
    characters in the ASCII coded character set).

    basic string:  A sequence of basic characters.

    tolower():  A function from a basic string to a basic string.  It
    converts A-Z to a-z respectively, and leaves the other characters
    unchanged.  [Observation: tolower() is idempotent, that is, for any
    basic string B, tolower(B) = tolower(tolower(B)).]

    equivalent basic strings:  Basic strings for which tolower()
    gives identical results.  [Observation:  Because tolower() is
    idempotent, a basic string and its tolower() image are equivalent
    basic strings.]

    basic label:  A basic string containing from 1 to 63 characters.
    [This is the preexisting standard for textual domain labels.]

    host restrictions (applicable to any sequence of characters):
    Contains no basic characters other than the letters, digits, and
    hyphen-minus; and neither begins nor ends with hyphen-minus.

    basic host label:  A basic label that satisfies the host
    restrictions.  [This is the preexisting standard for host labels.]

    equivalent basic labels:  Basic labels that are equivalent basic
    strings.  [This is the preexisting standard for equivalence of
    textual domain labels.]

    basic domain name:  A sequence of basic labels (typically separated
    by full-stop characters).  [This is the preexisting standard for
    textual domain names.]

    basic host name:  A basic domain name for which every label is a
    basic host label.  [This is the preexisting standard for internet
    host names.]

    equivalent basic domain names:  Basic domain names that have the
    same number of labels and for which corresponding labels are
    equivalent basic labels.  [This is the preexisting standard for
    equivalence of textual domain names.]

    domain name slot:  A field or sub-field in a protocol message, or a
    function argument in an interface, etc., explicitly designated for
    carrying a domain name.
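
The definitions above can be sketched in code.  What follows is only an
illustration, not part of the derivation:  the function names other than
tolower() are invented here, and splitting a textual name on full stops
is an assumption about how labels are separated.

```python
import string

# Letters, digits, and hyphen-minus: the only basic characters that the
# host restrictions allow.
HOST_CHARS = set(string.ascii_letters + string.digits + "-")


def is_basic_string(s):
    """True if every character is in the 128-character ASCII repertoire."""
    return all(ord(ch) < 128 for ch in s)


def tolower(b):
    """Convert A-Z to a-z and leave all other characters unchanged."""
    return "".join(chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in b)


def equivalent_basic_strings(b1, b2):
    return tolower(b1) == tolower(b2)


def is_basic_label(s):
    return is_basic_string(s) and 1 <= len(s) <= 63


def satisfies_host_restrictions(s):
    """Applicable to any sequence of characters, basic or not."""
    # Only basic characters are restricted; others are left alone here.
    if any(ord(ch) < 128 and ch not in HOST_CHARS for ch in s):
        return False
    return not (s.startswith("-") or s.endswith("-"))


def is_basic_host_label(s):
    return is_basic_label(s) and satisfies_host_restrictions(s)


def split_labels(name):
    """Assumption: labels are separated by full stops, no trailing dot."""
    return name.split(".")


def is_basic_domain_name(name):
    return all(is_basic_label(label) for label in split_labels(name))


def is_basic_host_name(name):
    return all(is_basic_host_label(label) for label in split_labels(name))


def equivalent_basic_domain_names(n1, n2):
    if not (is_basic_domain_name(n1) and is_basic_domain_name(n2)):
        return False
    l1, l2 = split_labels(n1), split_labels(n2)
    # Same number of labels, and corresponding labels are equivalent.
    return len(l1) == len(l2) and all(
        equivalent_basic_strings(a, b) for a, b in zip(l1, l2)
    )
```

For example, equivalent_basic_domain_names("WWW.Example.COM",
"www.example.com") holds, while "a_b" is a basic label but not a basic
host label.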

Fact 1:  Many existing protocols, interfaces, etc. have domain name
slots and require the names in those slots to conform to the syntax and
equivalence rules for basic host names (or at least basic domain names).

Fact 2:  Even when existing protocols, interfaces, etc. do not
require domain names to conform to the preexisting syntax, different
implementations of the same protocol/interface will sometimes behave
differently if presented with domain names that do not conform, because
there has been no preexisting standard for how such domain names should
be handled.  (For example, some implementations will reject the names,
some will alter the names, and some will pass them through unchanged.
There is similar uncertainty when names are compared.)

Fact 3:  Many people would like to use domain names containing
additional characters not allowed by the preexisting syntax.

Conjecture 1:  Many people will be reluctant to create domains using
additional characters if those domains would be inaccessible to
preexisting protocols, interfaces, software, etc.

Goal 1 (motivated by facts 1-3 and conjecture 1):  Allow domain names to
contain additional characters, while still allowing all domains to be
referred to by all preexisting protocols, interfaces, software, etc.

Requirement 1 (follows from goal 1):  Part of the preexisting namespace
must be used to represent names containing additional characters,
which requires altering the semantics of that part of the preexisting
namespace.

Goal 2:  The part of the preexisting namespace that is repurposed should
be a part that is not currently in use, or at least is used to an
extremely small extent, so that disruption is minimized.

Goal 3:  The part of the preexisting namespace that is repurposed
should be easily identifiable and easily avoidable (both by humans and
machines).

Fact 4:  Many protocols and standards split domain names into labels,
re-join them, count the number of labels, and test whether two domain
names agree on individual labels.

Goal 4 (motivated by fact 4):  The way that preexisting names are used
to represent names containing additional characters should not disrupt
operations that treat labels individually.

Requirement 2 (follows from goals 2-4):  The representation of
previously disallowed names by previously allowed names must be done
independently per label, and labels using this representation must have
a well-known prefix or suffix that was almost never in use.

Fact 5:  Internet protocols usually support US-ASCII, and future
internet protocols are encouraged to support UTF-8.

Fact 6:  UTF-8 is an extension of US-ASCII, that is, characters in the
ASCII repertoire are represented by the same bytes in UTF-8 as in ASCII.
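
Fact 6 can be checked directly.  The snippet below is a small
illustration using an arbitrary example name; the variable names are
invented here.

```python
# A basic domain name has the same bytes in ASCII and in UTF-8.
text = "www.example.com"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# A character outside the ASCII repertoire is encoded in UTF-8 using
# only bytes with the high bit set, so it cannot be mistaken for ASCII.
non_basic = "b\u00fccher".encode("utf-8")   # "bucher" with u-umlaut
high_bit_bytes = [byte for byte in non_basic if byte >= 0x80]
```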

Goal 5 (motivated by facts 5 and 6):  Given a protocol that uses ASCII
to encode domain names, it should be easy to design an updated version
of the protocol that uses UTF-8 to encode domain names.

Requirement 3 (follows from goal 5):  The preexisting restrictions on
what ASCII characters are allowed in host names, and in what positions,
must be retained in the expanded namespace.

Requirements 1, 2, and 3 motivate the following definitions for the
expanded namespace:

    international character:  A character in the Unicode repertoire (not
    to be confused with the integers or byte sequences used to represent
    those characters in the various encodings defined by the Unicode
    standard).

    international string:  A sequence of international characters.
    [Observation:  Every basic string is an international string.]

    canon():  A function (defined elsewhere) from an international
    string to an international string.  It is an extension of tolower(),
    that is, for any basic string B, canon(B) = tolower(B).  Like
    tolower(), canon() is defined everywhere and is idempotent (canon(S)
    = canon(canon(S)) for any international string S), but unlike
    tolower(), canon(S) and S might have different lengths.

    equivalent international strings:  International strings for which
    canon() gives identical results.  [This is consistent with the
    definition of equivalence of basic strings, because canon() behaves
    like tolower() for basic strings.  Observation:  Because canon()
    is idempotent, an international string and its canon() image are
    equivalent international strings.]

    degenerate international string:  An international string that
    is equivalent to a basic string.  [The definitions of equivalent
    international strings and canon() imply that an international string
    S is degenerate iff canon(S) is a basic string.]

    ACE prefix:  A particular sequence of basic characters (defined
    elsewhere).  [A suffix would work as well as a prefix.  The phrase
    "begins with the ACE prefix" should be understood to mean "ends with
    the ACE suffix" if a suffix is used instead.]

    compat():  A function (defined elsewhere) from an international
    string to a basic label.  For some international strings S,
    compat(S) does not exist, either for policy reasons, or because
    of the length restriction on basic labels.  The compat() function
    satisfies the following properties:

        Stability:  For any basic string B, compat(B) = B.

        Preservation of equivalence:  For any equivalent international
        strings S1 and S2, if compat(S1) exists, then compat(S2) exists,
        and compat(S1) and compat(S2) are equivalent basic labels.

        Preservation of non-equivalence:  For any nondegenerate
        international strings S1 and S2 for which compat(S1) and
        compat(S2) exist, if S1 and S2 are not equivalent international
        strings, then compat(S1) and compat(S2) are not equivalent basic
        labels.

        Preservation of host restrictions:  For any international string
        S, if S satisfies the host restrictions and compat(S) exists,
        then compat(S) satisfies the host restrictions.

        Tagging:  For any nondegenerate international string S, if
        compat(S) exists then it begins with the ACE prefix.

        Non-recursion of tagging:  For any nondegenerate international
        string S that begins with the ACE prefix, compat(S) does not
        exist.

    [Observation:  The preservation of equivalence and non-equivalence
    implies that the definition of compat() will include the definition
    of canon().]

    international label:  An international string S for which compat(S)
    exists.  [Observation:  Every basic label is an international
    label.]

    equivalent international labels:  International labels for which
    compat() yields equivalent basic labels.  [This is consistent with
    the definition of equivalent basic labels, because compat(B) = B for
    any basic label B.  Observation:  For any international labels L1
    and L2 for which canon(L1) and canon(L2) do not begin with the ACE
    prefix, L1 and L2 are equivalent international labels iff they are
    equivalent international strings.]

    international domain name:  A sequence of international labels
    (typically separated by full-stop characters).  [Observation:  Every
    basic domain name is an international domain name.]

    equivalent international domain names:  International domain names
    that have the same number of labels and for which corresponding
    labels are equivalent international labels.  [This is consistent
    with the definition of equivalent basic domain names, because
    equivalence of international labels is consistent with equivalence
    of basic labels.  Observation:  For any international domain name
    there exists a basic domain name equivalent to it, which can be
    obtained by applying compat() to each label.]

    decompat():  A function from basic labels to international labels.
    For any basic label B, if there exists a nondegenerate international
    string S such that compat(S) and B are equivalent basic labels, then
    S and decompat(B) are equivalent international labels; otherwise
    decompat(B) = B.

    simple label:  A basic label that does not begin with the ACE
    prefix.  [Observation:  Simple labels have the same semantics as in
    the preexisting standard.]

    ACE label:  A basic label that is altered by decompat().
    [Observation:  All ACE labels begin with the ACE prefix.  ACE labels
    have new semantics not present in the preexisting standard.]

    invalid label:  A basic label that begins with the ACE prefix but is
    not altered by decompat().  [Observation:  Every basic label that is
    neither a simple label nor an ACE label is an invalid label.  The
    semantics of these labels are dubious in a way that was not possible
    under the preexisting standard.]
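
To make the definitions above concrete, here is a toy sketch.  It is not
the definition of canon() or compat(), which are defined elsewhere.  It
assumes a placeholder ACE prefix "zz--", uses NFKC normalization plus
lowercasing as a stand-in for canon(), uses Python's built-in punycode
codec as a stand-in for the encoding inside compat(), and matches the
prefix case-insensitively.  The helper names is_basic() and classify()
are invented here.

```python
import unicodedata

ACE_PREFIX = "zz--"   # hypothetical placeholder, not the real prefix
MAX_LABEL = 63        # length limit on basic labels


def is_basic(s):
    return all(ord(ch) < 128 for ch in s)


def tolower(b):
    return "".join(chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in b)


def canon(s):
    """Toy canon(): NFKC + lowercase, repeated until nothing changes,
    so that canon(canon(s)) == canon(s)."""
    prev = None
    while s != prev:
        prev, s = s, unicodedata.normalize("NFKC", s).lower()
    return s


def compat(s):
    """Return a basic label, or None where compat(s) does not exist."""
    if is_basic(s):                       # stability: compat(B) = B
        return s if 1 <= len(s) <= MAX_LABEL else None
    c = canon(s)
    if is_basic(c):                       # degenerate string
        return c if 1 <= len(c) <= MAX_LABEL else None
    if c.startswith(ACE_PREFIX):          # non-recursion of tagging
        return None
    label = ACE_PREFIX + c.encode("punycode").decode("ascii")  # tagging
    return label if len(label) <= MAX_LABEL else None


def decompat(b):
    """Undo compat() for a basic label b; return b itself otherwise."""
    low = tolower(b)
    if not low.startswith(ACE_PREFIX):
        return b
    try:
        s = low[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except UnicodeError:
        return b
    # Accept s only if it is nondegenerate and maps back to this label.
    if is_basic(s) or compat(s) != low:
        return b
    return s


def classify(b):
    """Classify a basic label as 'simple', 'ACE', or 'invalid'."""
    if not tolower(b).startswith(ACE_PREFIX):
        return "simple"
    return "ACE" if decompat(b) != b else "invalid"
```

Under these assumptions, compat("b\u00fccher") yields "zz--bcher-kva",
decompat() maps that label back, and a label such as "zz--example-"
(which decodes to a purely basic string) is classified as invalid.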

Fact 7:  Software that was not deliberately written to handle
international domain names cannot properly compare two international
domain names.

Fact 8:  Even though many software entities do not compare domain
names, it is common for them to pass domain names on to other software
entities, which might need to compare them.

Goal 6 (motivated by facts 2, 7, and 8):  Whenever possible, keep
non-basic international domain names out of the hands of software that
was not deliberately written to handle them.

The following definition is motivated by goal 6:

    international domain name slot:  A domain name slot explicitly
    designated to carry an international domain name.  The designation
    may be static (for example, in the specification of the protocol or
    interface) or dynamic (for example, as a result of negotiation in an
    interactive session).

Requirement 4 (follows from goal 6):  Non-basic international domain
names must not be put into non-international domain name slots.  (But of
course the equivalent basic domain name, obtainable via compat(), can
be put into the slot instead.)
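
As an illustration of requirement 4, the sketch below converts a name
label by label before it is placed in a non-international slot.  It uses
Python's built-in "idna" codec, which implements the later RFC 3490
conversion with the prefix "xn--", purely as a stand-in for compat();
the function name is invented here, and splitting on full stops is an
assumption.

```python
def to_basic_domain_name(name):
    """Apply a compat()-like mapping to each label independently."""
    labels = name.split(".")
    # Each label is converted on its own, so the number of labels and
    # the label boundaries are unchanged.
    return ".".join(label.encode("idna").decode("ascii") for label in labels)


basic = to_basic_domain_name("b\u00fccher.example")
```

Labels that are already basic pass through unchanged, so the same
function can be applied to every name before it enters such a slot.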

Observation 1:  Obviously, requirement 4 can apply only to IDN-aware
entities.  A non-IDN-aware entity is not normally supposed to have any
non-basic international domain names in the first place, but if it
somehow obtains one, perhaps via manual entry, then of course it cannot
be expected to obey a requirement that it knows nothing about.  It will
obey whatever requirements it was written to obey.

Observation 2:  DNS is a protocol with non-international domain name
slots (in both directions).  Therefore requirement 4 applies to those
slots.  Extensions to DNS could of course define new fields that are
international domain name slots.

Observation 3:  Existing resolver interfaces have non-international
domain name slots (in both directions).  Therefore requirement 4 applies
to those slots (regardless of whether the resolver is IDN-aware, and
regardless of whether it makes use of any DNS extensions).  New resolver
interfaces could of course define new arguments that are international
domain name slots.

Goal 7:  Whenever possible, show users the meaningful form of an
international domain name rather than the ACE form (unless they request
otherwise).

Requirement 5 (follows from goal 7):  IDN-aware entities should by
default apply decompat() to all domain labels before displaying them
to the user, and before putting the domain name into an international
domain name slot likely to be seen by a user.
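
Requirement 5 can be sketched in the same way.  Again Python's built-in
"idna" codec (RFC 3490, prefix "xn--") serves only as a stand-in for
decompat(), and the function name is invented here.

```python
def to_display_form(name):
    """Apply a decompat()-like mapping to each label before display."""
    labels = name.split(".")
    return ".".join(label.encode("ascii").decode("idna") for label in labels)


shown = to_display_form("xn--bcher-kva.example")
```

Labels without the prefix are returned as they are, so applying the
function to every name by default does not disturb basic domain names.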

End of idn-arch-deriv.