
Re: [idn] naming syntax rules



Thanks Eric. This is definitely more comprehensive than what I expected!
I was only thinking of a simple summary of the whole series of
discussions on internationalized domain names and hostnames, so we know
how to capture that properly in Nameprep, ACE, IDNA, DM, or any other
relevant document.

Anyway, this is great stuff, perhaps sufficient to be a draft on its
own. I am not sure yet how this fits into the IDN WG, but it is
something we should discuss further.

-James Seng

----- Original Message -----
From: "Eric A. Hall" <ehall@ehsco.com>
To: "IDN" <idn@ops.ietf.org>
Sent: Wednesday, December 05, 2001 4:08 PM
Subject: [idn] naming syntax rules


>
> James has been needling me to put together a second summary of the
> naming rules in time for the IETF meeting. I have been extremely busy
> lately (it is 3 am in a strange hotel room), but I wanted to at least
> scratch together enough material for the concepts to be tangible.
>
> The following text is by no means complete. It's hardly just begun.
> However, it illustrates what the scope will be, exposes some of the
> open issues, and may be usable as a touchstone to see if this whole
> IDN thing is going to work or not.
>
> What I mean by that is, if systems are going to work with IDNs in
> their raw form (not encoded form, but raw form) these are the rules
> they will have to work with. If these rules are too complex, the whole
> approach has to be reconsidered. That will affect a lot of other
> things, including ACE.
>
> The basic idea here is to declare formal data-types for labels, and to
> incorporate the data-types into syntaxes for applications and
> protocols to use when they need to interact with domain names.
>
>
> 1. Summary
>
> This memo describes two sets of definitions which are necessary for
> the consistent and reliable use of internationalized domain names
> across the Internet. First and foremost, this memo specifies the rules
> which govern the structure and syntax of internationalized domain
> names in various scenarios, and also describes their legitimate
> characters and any normalization which may be required. Secondarily,
> this memo also clarifies and extends usage rules of common resource
> records so that internationalized domain names can be stored and
> exchanged (either as resource record owner domain names or as resource
> record data) in a form which is consistent across all usage
> environments.
>
>
> 2. Introduction
>
> There are many issues which affect the characters that are desirable
> for use in DNS domain names. Among these considerations are obvious
> aspects such as breadth, as well as less-obvious aspects such as
> normalized forms of particular character sequences, comparison
> efficiencies, and more.
>
> The general consensus of the IDN working group is that domain names
> should use a mildly-restricted subset of the character codes and
> arrangement sequences which are documented in the UCS for use with
> languages, as this subset excludes non-verbal symbols and spurious
> punctuation which are likely to be problematic, while still allowing
> international domain names to be created. Furthermore, the consensus
> is that these character sequences should be normalized and converted
> to lowercase [in that order?] wherever this is possible, since this
> will provide the tightest syntactical representation of the supported
> characters with the least amount of ambiguity.
>
> While both of those objectives are highly desirable (and are met in
> most of the scenarios), there are many instances where these
> objectives are incompatible with existing practice. For example,
> existing (STD13-compliant) DNS implementations are allowed to use
> domain names which contain any eight-bit character code (0x00 through
> 0xFF), and some protocol models specifically require the use of
> punctuation (SRV requires underscore, for example). Some resource
> records can contain domain names that combine both of these elements
> (SOA and RP both provide email addresses as domain name labels, and
> those labels can contain punctuation or case-specific US-ASCII
> letters).
>
> In order to facilitate these divergent requirements, this memo
> describes multiple types of domain name labels, including their valid
> characters, any case-conversions and/or normalizations which may be
> required, and so forth.
>
> Furthermore, in order to ensure that these rules are consistently
> implemented (and to minimize damage when they are not), this memo also
> states which label data-types are valid for use with many of the
> common resource records.
>
> Cumulatively, this means that a system which attempts to use an
> internationalized domain name for a specific purpose will have to be
> aware of the rules which govern the resource record which provides
> that service, and will have to be aware of the rules which govern the
> domain name data-types which are valid for that resource record. For
> example, if an application knows that an internationalized domain name
> will be used for a forward lookup, it will have to be aware of the
> label data-types that are usable with A (or AAAA) resource records,
> and must ensure that the domain name is processed (normalized and
> lower-cased, in this example) before it is used.
>
> NOTE: Legacy systems which use a backwards-compatible encoding scheme
> for access to resources with internationalized domain names will not
> be required to perform any of these tests. However, systems which
> embrace internationalized domain names as specific data (EG, any
> system which encodes or decodes an internationalized domain name as
> explicit data) will need to be aware of these issues and will likely
> be required to enforce their usage.
>
>
> 3. Domain Names and Label Data-Types
>
> An internationalized domain name is a sequence of labels which are
> encapsulated in a message. The message may provide the labels as
> separate units of data (as is the case with DNS), or may provide them
> as a series of dot-separated textual strings (as is the case when
> domain names are "written-out" in protocol or application data
> streams).
>
> In global terms, an internationalized domain name has the following
> characteristics:
>
> * Series of labels (1*label)
>
> * Maximum cumulative length of 255 UCS character codes (not
> necessarily codes with matching characters, and most definitely not
> octets or any encoded representation). This limit includes any
> separators which may be provided (such as the full-stop character
> commonly used as a separator when the domain name is written), and
> also includes one character for the root domain (the trailing dot).
>
> The labels that make up a domain name will vary according to the
> contextual use of the domain name.
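
[Inline note: the global length rule above can be sketched in Python, where a str is already a sequence of code points, so len() counts UCS codes rather than octets. The function names are hypothetical, and the sketch assumes a written-out name with no escaped full stops inside labels.]

```python
def split_labels(name: str) -> list[str]:
    """Split a written-out name on full stops, ignoring one trailing root dot."""
    if name.endswith("."):
        name = name[:-1]
    return name.split(".") if name else []


def name_length(name: str) -> int:
    """Length in code points: label characters, separators, and the root."""
    labels = split_labels(name)
    separators = max(len(labels) - 1, 0)
    # The final +1 is the one character counted for the root (trailing dot).
    return sum(len(label) for label in labels) + separators + 1


def has_valid_length(name: str) -> bool:
    """Apply the two global rules: at least one label, at most 255 codes."""
    labels = split_labels(name)
    # all(labels) rejects empty labels such as the one in "a..b".
    return bool(labels) and all(labels) and name_length(name) <= 255
```

Because the root is always counted, "example.com" and "example.com." have the same length under this sketch.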
>
>
> 3.1. Opaque Labels
>
> Some functions can use domain names which consist of unstructured or
> unknown labels. For example, a TXT resource record can describe
> anything, and as such, it can use any sequence of UCS characters for
> its owner domain name.
>
> Opaque labels require no processing on the part of the application
> which is using the domain name. It is the responsibility of the user
> to provide the domain name to the application in its correct case
> and/or normalization form.
>
> Opaque labels have the following characteristics:
>
> * Any valid UCS character code (not necessarily a valid UCS
> character).
>
> * Minimum length of one UCS character code.
>
> * Maximum length of 63 UCS character codes.
>
> NOTE: Even though a domain name may sometimes consist of a variable
> number of opaque labels, most domain names will also contain at least
> some host labels. In those cases, the entire domain name should be
> provided as a series of opaque labels, and the host labels should be
> determined beforehand. For example, a CNAME resource record can
> reference anything, including an A RR that consists entirely of host
> labels, or a TXT RR that consists of a mixture of opaque and host
> labels. As such, it will depend on the formats in use by the alias
> target, and will inherit those attributes.
>
>
> 3.2. Host Identifier Labels
>
> Most functions will use domain names to identify a host, either
> directly or indirectly. For example, a host may be identified by a
> relative domain name which consists of only a local label, or by an
> FQDN which contains a series of host labels. Since all forms must be
> supportable, all namespace delegation functions also use the host
> label syntax.
>
> The UCS characters provided in host labels are required to be
> converted to lowercase and normalized according to the rules in
> [nameprep] before they are processed. Servers are likely to treat such
> labels as exact matches of the encoded data, so it is imperative that
> applications perform this work before they encode the label into a DNS
> query.
>
> Host labels are used for any lookups, protocol actions, or message
> formats which specifically make use of internationalized domain names
> for host identification purposes.
>
> Host labels have the following characteristics:
>
> * UCS characters from the following ranges:
>
> "letters" [need a property]
>
> characters with number property [?]
>
> characters with diacritical mark property [?]
>
> hyphen-minus (U+002D)
>
> * MUST be converted to lowercase according to [nameprep].
>
> * MUST be normalized according to [nameprep].
>
> * First and last characters in the label MUST NOT be a diacritical
> mark or hyphen-minus.
>
> * Minimum length of two characters.
>
> * Maximum length of 63 characters.
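
[Inline note: a rough Python sketch of the host label rules above. The draft leaves the exact character properties open, so this assumes the Unicode general categories L* (letters), N* (numbers), and M* (marks) stand in for them. It also assumes case folding followed by NFKC as a stand-in for [nameprep], which is not the same thing, and the order of the two steps is still an open question in the text. Function names are hypothetical.]

```python
import unicodedata


def prepare_host_label(label: str) -> str:
    """Stand-in for [nameprep]: case-fold, then apply NFKC normalization."""
    return unicodedata.normalize("NFKC", label.casefold())


def is_host_label(label: str) -> bool:
    """Check an already-prepared label against the host label rules."""
    if not 2 <= len(label) <= 63:
        return False

    def allowed(ch: str) -> bool:
        # Hyphen-minus, or a letter, number, or mark (assumed categories).
        return ch == "-" or unicodedata.category(ch)[0] in "LNM"

    if not all(allowed(ch) for ch in label):
        return False
    # First and last characters must not be a mark or hyphen-minus.
    for ch in (label[0], label[-1]):
        if ch == "-" or unicodedata.category(ch).startswith("M"):
            return False
    return True
```

An application would call prepare_host_label() first and then test the result, since a label ending in a combining mark can become valid once normalization composes it.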
>
>
> 3.3. ASCII Labels
>
> Some functions require labels that contain extended punctuation, but
> which also require case-neutral comparisons. The most readily apparent
> of these usages is the SRV resource record, which makes use of the
> underscore character (U+005F) and case-neutral US-ASCII in the owner
> labels.
>
> ASCII labels have the following characteristics:
>
> * Any printable character from US-ASCII (0x21 through 0x7E,
> inclusive).
>
> * SHOULD be converted to lowercase as specified in [nameprep] (note
> that servers are required to perform case-neutral comparisons, but
> certain tools will likely prefer to generate and use lower-case
> wherever possible, so lowercase is the preferred form). All comparison
> operations on these domain names MUST be performed in a case-neutral
> form.
>
> * Minimum length of one character.
>
> * Maximum length of 63 characters.
>
> NOTE: some resource records may define tighter restrictions.
>
> NOTE: Even though a domain name may sometimes consist of a variable
> number of ASCII labels, most domain names will also contain at least
> some host labels. In those cases, the entire domain name should be
> provided as a series of opaque labels, and the ASCII and host labels
> should be determined beforehand.
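
[Inline note: the ASCII label rules are simple enough to sketch directly. The function names below are hypothetical, and str.lower() is used only because the labels have already been restricted to US-ASCII.]

```python
def is_ascii_label(label: str) -> bool:
    """1 to 63 printable US-ASCII characters (0x21 through 0x7E)."""
    return 1 <= len(label) <= 63 and all(0x21 <= ord(ch) <= 0x7E for ch in label)


def ascii_labels_equal(a: str, b: str) -> bool:
    """Case-neutral comparison, valid only for two ASCII labels."""
    if not (is_ascii_label(a) and is_ascii_label(b)):
        raise ValueError("both arguments must be ASCII labels")
    return a.lower() == b.lower()
```

Under this sketch "_LDAP" and "_ldap" compare equal, which is the behaviour the SRV owner labels need.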
>
>
> 3.4. Mailbox Labels
>
> Some functions provide SMTP mailboxes as labels within domain names.
> For example, the SOA and RP resource records both provide email
> addresses, with the first label providing a mailbox (local-part) of
> the address, and with the remainder of the labels providing the
> delivery domain of the address.
>
> In order for these resources to be accessible, applications must
> process labels which are known to contain email addresses through
> these rules. This means that data must be provided in a
> non-normalized, non-lowercased form, and must be restricted to the
> range of characters which are valid, as specified in section XX of RFC
> 2822. Until RFC 2822 is deprecated or until such a time as UCS
> characters can be stored in the mailbox portion of Internet standard
> email addresses, the mailbox label is to be processed according to the
> rules set forth in RFC 2822.
>
> There are two additional rules which govern this data-type:
>
> * Minimum length of one character.
>
> * Maximum length of 63 characters.
>
> NOTE: mailbox labels can contain a large number of special characters
> such as spaces or full-stop. These characters may require escaping as
> described in section XX of this document.
>
> NOTE: Mailbox labels are NOT a subset of the ASCII labels. Mailbox
> labels are case-sensitive, while ASCII labels are case-neutral.
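
[Inline note: to make the "first label is the mailbox" convention concrete, here is a small Python sketch that turns a list of labels, as they would arrive as separate units in a DNS message, back into an email address. The function name is hypothetical, and the RFC 2822 character checks are deliberately left out.]

```python
def mailbox_name_to_address(labels: list[str]) -> str:
    """Join a mailbox label and its delivery-domain labels into an address."""
    if len(labels) < 2:
        raise ValueError("need a mailbox label and at least one domain label")
    mailbox, *domain = labels
    if not 1 <= len(mailbox) <= 63:
        raise ValueError("mailbox label must be 1 to 63 characters")
    # The mailbox is case-sensitive, so it is passed through untouched.
    return mailbox + "@" + ".".join(domain)
```

Because the labels are handled as separate units here, a full stop inside the mailbox label (as in "john.doe") needs no escaping. Escaping only becomes an issue when the whole name is written out as one dot-separated string.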
>
>
> 4. Resource Records
>
> The following structure is used to describe resource records and their
> usage of internationalized domain names and labels.
>
> <owner domain name labels> <mnemonic> <[data] [data] [...]>
>
> A, always provides a host identifier
>
> <1*host> <A> <[IPv4 address]>
>
>
> AAAA, always provides a host identifier
>
> <1*host> <AAAA> <[IPv6 address]>
>
>
> CNAME, can reference anything, can target anything
>
> <1*opaque> <CNAME> <[1*opaque]>
>
>
> NS, references a host, provides a host identifier
>
> <1*host> <NS> <[1*host]>
>
>
> SOA, references a host (delegation), provides host identifier, email
> address, and custom data
>
> <1*host> <SOA> <[1*host] [1mailbox (*host)] [serial] [refresh] [retry]
> [expire] [ttl]>
>
>
> WKS, always provides a host identifier
>
> <1*host> <WKS> <[XX] [XX]>
>
>
> PTR, can reference anything, must inherit target attributes
>
> <1*opaque> <PTR> <[1*opaque]>
>
>
> HINFO, references a host, provides RR-specific data
>
> <1*host> <HINFO> <[hardware] [opsys]>
>
>
> MX, references a host, provides a preference and a host identifier
>
> <1*host> <MX> <[preference] [1*host]>
>
>
> TXT, can reference anything, provides free-text data
>
> <1*opaque> <TXT> <[text]>
>
>
> RP, can reference anything, provides email address and a pointer to a
> TXT RR
>
> <1*opaque> <RP> <[1mailbox (*host)]> <1*opaque>
>
>
> SRV, references a protocol (which is specified using the ASCII
> data-type), provides preference values and a host identifier
>
> <1*ASCII> <SRV> <[priority] [weight] [port] [1*host]>
>
> [NOTE: cannot define <2ASCII *HOST> because not all SRV protocol
> labels are just _service._transport]
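
[Inline note: an application could carry the resource record rules as a lookup table. The Python sketch below covers only some of the records listed above and is not a complete registry. The table layout, the "mailbox+host" shorthand for <1mailbox (*host)>, and the function name are all assumptions made for illustration.]

```python
# RR mnemonic -> (owner label data-type, label data-types of any
# domain names carried in the record data, in order).
RR_LABEL_TYPES = {
    "A":     ("host",   ()),
    "CNAME": ("opaque", ("opaque",)),
    "NS":    ("host",   ("host",)),
    "MX":    ("host",   ("host",)),
    "SOA":   ("host",   ("host", "mailbox+host")),
    "RP":    ("opaque", ("mailbox+host", "opaque")),
    "SRV":   ("ASCII",  ("host",)),
}


def label_types_for(rr_type: str) -> tuple[str, tuple[str, ...]]:
    """Return (owner type, data name types); raises KeyError if unlisted."""
    return RR_LABEL_TYPES[rr_type.upper()]
```

A resolver front end could consult this before a lookup to decide whether the owner name needs the host label processing or must be passed through untouched.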
>
>