
[idn] naming syntax rules




James has been needling me to put together a second summary of the naming
rules in time for the IETF meeting. I have been extremely busy lately (it
is 3 am in a strange hotel room), but I wanted to at least scratch
together enough material for the concepts to be tangible.

The following text is by no means complete. It's hardly just begun.
However, it illustrates what the scope will be, exposes some of the open
issues, and may be usable as a touchstone to see if this whole IDN thing
is going to work or not.

What I mean by that is, if systems are going to work with IDNs in their
raw form (not encoded form, but raw form) these are the rules they will
have to work with. If these rules are too complex, the whole approach has
to be reconsidered. That will affect a lot of other things, including ACE.

The basic idea here is to declare formal data-types for labels, and to
incorporate the data-types into syntaxes for applications and protocols to
use when they need to interact with domain names.


1.	Summary

This memo describes two sets of definitions which are necessary for the
consistent and reliable use of internationalized domain names across the
Internet. First and foremost, this memo specifies the rules which govern
the structure and syntax of internationalized domain names in various
scenarios, and also describes their legitimate characters and any
normalization which may be required. Secondarily, this memo also clarifies
and extends usage rules of common resource records so that
internationalized domain names can be stored and exchanged (either as
resource record owner domain names or as resource record data) in a form
which is consistent across all usage environments.


2.	Introduction

There are many issues which affect the characters that are desirable for
use in DNS domain names. Among these considerations are obvious aspects
such as breadth, as well as less-obvious aspects such as normalized forms
of particular character sequences, comparison efficiencies, and more.

The general consensus of the IDN working group is that domain names should
use a mildly-restricted subset of the character codes and arrangement
sequences which are documented in the UCS for use with languages, as this
subset excludes non-verbal symbols and spurious punctuation which are
likely to be problematic, while still allowing international domain names
to be created. Furthermore, the consensus is that these character
sequences should be normalized and converted to lowercase [in that order?]
wherever this is possible, since this will provide the tightest
syntactical representation of the supported characters with the least
amount of ambiguity.
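
The open question about ordering matters in practice. As a rough sketch
(assuming Unicode NFKC as a stand-in for the normalization step and
Python's built-in lowercasing as a stand-in for the case conversion,
neither of which this memo has settled on), the two orderings can give
different results for the same input:

```python
import unicodedata


def normalize_then_lower(text: str) -> str:
    # Assumption: NFKC stands in for the normalization step.
    return unicodedata.normalize("NFKC", text).lower()


def lower_then_normalize(text: str) -> str:
    # Same two steps, applied in the opposite order.
    return unicodedata.normalize("NFKC", text.lower())


# U+2102 DOUBLE-STRUCK CAPITAL C has a compatibility mapping to "C"
# but no lowercase mapping of its own.
sample = "\u2102"
print(repr(normalize_then_lower(sample)))
print(repr(lower_then_normalize(sample)))
```

With this input the first function returns "c" and the second returns
"C", so whichever order is chosen has to be applied consistently (or the
case-conversion table has to cover such characters explicitly).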

While both of those objectives are highly desirable (and are met in most
of the scenarios), there are many instances where these objectives are
incompatible with existing practice. For example, existing
(STD13-compliant) DNS implementations are allowed to use domain names
which contain any eight-bit character code (0x00 through 0xFF). Some
protocol models specifically require the use of punctuation (SRV requires
underscore, for example). And some resource records can contain domain
names that combine both of these elements (SOA and RP both provide email
addresses as domain name labels, and those can use punctuation or
case-specific US-ASCII letters).

In order to facilitate these divergent requirements, this memo describes
multiple types of domain name labels, including their valid characters,
any case-conversions and/or normalizations which may be required, and so
forth.

Furthermore, in order to ensure that these rules are consistently
implemented (and to minimize damage when they are not), this memo also
states which label data-types are valid for use with many of the common
resource records.

Cumulatively, this means that a system which attempts to use an
internationalized domain name for a specific purpose will have to be aware
of the rules which govern the resource record which provides that service,
and will have to be aware of the rules which govern the domain name
data-types which are valid for that resource record. For example, if an
application knows that an internationalized domain name will be used for a
forward lookup, it will have to be aware of the label data-types that are
usable with A (or AAAA) resource records, and must ensure that the domain
name is processed (normalized and lower-cased, in this example) before it
is used.
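
To make the forward-lookup example concrete, here is a minimal sketch of
what an application might do with a written-out name before building an
A or AAAA query. The function names are hypothetical, and NFKC plus
Python's lower() are only stand-ins for whatever [nameprep] ends up
requiring:

```python
import unicodedata
from typing import List


def prepare_host_label(label: str) -> str:
    """Normalize, then lowercase, one label (stand-in for [nameprep])."""
    return unicodedata.normalize("NFKC", label).lower()


def prepare_forward_lookup(name: str) -> List[str]:
    """Split a dot-separated name and prepare each label for an A/AAAA query."""
    # A single trailing dot denotes the root and carries no label of its own.
    if name.endswith("."):
        name = name[:-1]
    labels = name.split(".")
    if any(label == "" for label in labels):
        raise ValueError("empty label in domain name")
    return [prepare_host_label(label) for label in labels]


print(prepare_forward_lookup("WWW.Example.COM."))
```

The sketch deliberately stops short of encoding the labels into a DNS
message; it only shows the processing step that has to happen first.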

NOTE: Legacy systems which use a backwards-compatible encoding scheme for
access to resources with internationalized domain names will not be
required to perform any of these tests. However, systems which embrace
internationalized domain names as specific data (e.g., any system which
encodes or decodes an internationalized domain name as explicit data) will
need to be aware of these issues and will likely be required to enforce
their usage.


3.	Domain Names and Label Data-Types

An internationalized domain name is a sequence of labels which are
encapsulated in a message. The message may provide the labels as separate
units of data (as is the case with DNS), or may provide them as a series
of dot-separated textual strings (as is the case when domain names are
"written-out" in protocol or application data streams).

In global terms, an internationalized domain name has the following
characteristics:

*	Series of labels (1*label)

*	Maximum cumulative length of 255 UCS character codes (not necessarily
codes with matching characters, and most definitely not octets or any
encoded representation). This limit includes any separators which may be
provided (such as the full-stop character commonly used as a separator
when the domain name is written), and also includes one character for the
root domain (the trailing dot).

The labels that make up a domain name will vary according to the
contextual use of the domain name.
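
As an illustration of the length rule, the following sketch counts UCS
character codes rather than octets, adds one separator per label, and
treats the separator after the last label as the trailing dot for the
root. The function names are made up for this example:

```python
from typing import List

MAX_NAME_LENGTH = 255  # UCS character codes, separators and root included


def name_length(labels: List[str]) -> int:
    """Code points in all labels plus one separator per label.

    The separator counted for the last label is the trailing dot
    that stands for the root domain.
    """
    # len() on a Python str counts code points, not encoded octets.
    return sum(len(label) for label in labels) + len(labels)


def is_valid_name_length(labels: List[str]) -> bool:
    """True if there is at least one label and the total fits the limit."""
    return len(labels) >= 1 and name_length(labels) <= MAX_NAME_LENGTH


print(name_length(["www", "example", "com"]))
```

For "www.example.com." this counts 13 label characters plus three dots.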


3.1.	Opaque Labels

Some functions can use domain names which consist of unstructured or
unknown labels. For example, a TXT resource record can describe anything,
and as such, it can use any sequence of UCS characters for its owner
domain name.

Opaque labels require no processing on the part of the application which
is using the domain name. It is the responsibility of the user to provide
the domain name to the application in its correct case and/or
normalization form.

Opaque labels have the following characteristics:

*	Any valid UCS character code (not necessarily a valid UCS character).

*	Minimum length of one UCS character code.

*	Maximum length of 63 UCS character codes.
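
Because the limits are stated in UCS character codes, a label can be
valid here even when its encoded form is much longer than 63 octets. A
small sketch (the function name is hypothetical, and UTF-8 is used only
to show the difference, not because this memo selects it):

```python
def is_opaque_label(label: str) -> bool:
    """Opaque labels only constrain length: 1 to 63 character codes."""
    return 1 <= len(label) <= 63


label = "\u00e9" * 63                  # 63 copies of e-acute
print(is_opaque_label(label))          # length is measured in code points
print(len(label.encode("utf-8")))      # octet count of one possible encoding
```

Here the label passes the check with 63 character codes, while the UTF-8
form occupies 126 octets.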

NOTE: Even though a domain name may sometimes consist of a variable number
of opaque labels, most domain names will also contain at least some host
labels. In those cases, the entire domain name should be provided as a
series of opaque labels, and the host labels should be determined
beforehand. For example, a CNAME resource record can reference anything,
including an A RR that consists entirely of host labels, or a TXT RR that
consists of a mixture of opaque and host labels. As such, it will depend
on the formats in use by the alias target, and will inherit those
attributes.


3.2.	Host Identifier Labels

Most functions will use domain names to identify a host, either directly
or indirectly. For example, a host may be identified by a relative domain
name which consists of only a local label, or by an FQDN which contains a
series of host labels. Since all forms must be supportable, all namespace
delegation functions also use the host label syntax.

The UCS characters provided in host labels are required to be converted to
lowercase and normalized according to the rules in [nameprep] before they
are processed. Servers are likely to treat such labels as exact matches of
the encoded data, so it is imperative that applications perform this work
before they encode the label into a DNS query.

Host labels are used for any lookups, protocol actions, or message formats
which specifically make use of internationalized domain names for host
identification purposes.

Host labels have the following characteristics:

*	UCS characters from the following ranges:

"letters" [need a property]

characters with number property [?]

characters with diacritical mark property [?]

hyphen-minus (U+002D)

*	MUST be converted to lowercase according to [nameprep].

*	MUST be normalized according to [nameprep].

*	First and last characters in the label MUST NOT be a diacritical mark or
hyphen-minus.

*	Minimum length of two characters.

*	Maximum length of 63 characters.
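
The characteristics above could be checked roughly as follows. This is
only a sketch under several assumptions: the Unicode general categories
L*, N* and M* stand in for the "letters", number and diacritical mark
properties that are still undecided, and NFKC followed by lower() stands
in for [nameprep]. All function names are hypothetical.

```python
import unicodedata


def prepare(label: str) -> str:
    """Stand-in for [nameprep]: normalize, then lowercase."""
    return unicodedata.normalize("NFKC", label).lower()


def is_mark_or_hyphen(ch: str) -> bool:
    """True for hyphen-minus or any character in a mark category (M*)."""
    return ch == "-" or unicodedata.category(ch).startswith("M")


def is_allowed(ch: str) -> bool:
    """Letters (L*), numbers (N*), marks (M*) and hyphen-minus."""
    return ch == "-" or unicodedata.category(ch)[0] in ("L", "N", "M")


def is_host_label(label: str) -> bool:
    """Check an already-prepared label against the host label rules."""
    if not 2 <= len(label) <= 63:
        return False
    # The label must already be in its normalized, lowercased form.
    if label != prepare(label):
        return False
    if not all(is_allowed(ch) for ch in label):
        return False
    # First and last characters must not be a mark or hyphen-minus.
    return not (is_mark_or_hyphen(label[0]) or is_mark_or_hyphen(label[-1]))


for candidate in ("example", "Example", "-abc", "my_host", "b\u00fccher"):
    print(candidate, is_host_label(candidate))
```

The check rejects unprepared input instead of silently fixing it, since
servers are expected to match the encoded data exactly.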


3.3.	ASCII Labels

Some functions require labels that contain extended punctuation, but which
also require case-neutral comparisons. The most readily apparent of these
usages is the SRV resource record, which makes use of the underscore
character (U+005F) and case-neutral US-ASCII in the owner labels.

ASCII labels have the following characteristics:

*	Any printable character from US-ASCII (0x21 through 0x7E, inclusive).

*	SHOULD be converted to lowercase as specified in [nameprep] (note that
servers are required to perform case-neutral comparisons, but certain
tools will likely prefer to generate and use lower-case wherever possible,
so lowercase is the preferred form). All comparison operations on these
domain names MUST be performed in a case-neutral form.

*	Minimum length of one character.

*	Maximum length of 63 characters.

NOTE: some resource records may define tighter restrictions.

NOTE: Even though a domain name may sometimes consist of a variable number
of ASCII labels, most domain names will also contain at least some host
labels. In those cases, the entire domain name should be provided as a
series of opaque labels, and the ASCII and host labels should be
determined beforehand.
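
A minimal sketch of the ASCII label rules, with hypothetical function
names. It checks the printable range and length, and compares two labels
in a case-neutral way by lowercasing both sides:

```python
def is_ascii_label(label: str) -> bool:
    """1 to 63 printable US-ASCII characters (0x21 through 0x7E)."""
    return 1 <= len(label) <= 63 and all(0x21 <= ord(ch) <= 0x7E for ch in label)


def ascii_labels_equal(a: str, b: str) -> bool:
    """Case-neutral comparison; both inputs must be valid ASCII labels."""
    if not (is_ascii_label(a) and is_ascii_label(b)):
        raise ValueError("not an ASCII label")
    # For pure US-ASCII input, lower() only affects A through Z.
    return a.lower() == b.lower()


print(is_ascii_label("_sip"))
print(ascii_labels_equal("_SIP", "_sip"))
```

Space (0x20) falls outside the range, so a label such as "a b" is not an
ASCII label under these rules.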


3.4.	Mailbox Labels

Some functions provide SMTP mailboxes as labels within domain names. For
example, the SOA and RP resource records both provide email addresses,
with the first label providing a mailbox (local-part) of the address, and
with the remainder of the labels providing the delivery domain of the
address.

In order for these resources to be accessible, applications must process
labels which are known to contain email addresses through these rules.
This means that data must be provided in a non-normalized, non-lowercased
form, and must be restricted to the range of characters which are valid,
as specified in section XX of RFC 2822. Until RFC 2822 is deprecated or
until such a time as UCS characters can be stored in the mailbox portion
of Internet standard email addresses, the mailbox label is to be processed
according to the rules set forth in RFC 2822.

There are two additional rules which govern this data-type:

*	Minimum length of one character.

*	Maximum length of 63 characters.

NOTE: mailbox labels can contain a large number of special characters such
as spaces or full-stop. These characters may require escaping as described
in section XX of this document.

NOTE: Mailbox labels are NOT a subset of the ASCII labels. Mailbox labels
are case-sensitive, while ASCII labels are case-neutral.
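
To show how the first label is treated differently from the rest, here
is a sketch that takes the labels of such a domain name as separate units
(as DNS provides them) and rebuilds the email address. The function names
are hypothetical, and the RFC 2822 character checks are left out because
the relevant section is not yet identified above:

```python
from typing import List, Tuple


def split_mailbox_name(labels: List[str]) -> Tuple[str, List[str]]:
    """Split into (mailbox label, remaining delivery-domain labels)."""
    if not labels:
        raise ValueError("no labels")
    mailbox, domain = labels[0], labels[1:]
    if not 1 <= len(mailbox) <= 63:
        raise ValueError("mailbox label must be 1 to 63 characters")
    return mailbox, domain


def to_email_address(labels: List[str]) -> str:
    """Rebuild local-part@domain; the mailbox label is left untouched."""
    mailbox, domain = split_mailbox_name(labels)
    if not domain:
        raise ValueError("no delivery domain")
    # No lowercasing or normalization: mailbox labels are case-sensitive.
    return mailbox + "@" + ".".join(domain)


print(to_email_address(["John.Doe", "example", "com"]))
```

Working from separate labels avoids having to escape the full-stop
inside the mailbox label, which would be needed in written-out form.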


4.	Resource Records

The following structure is used to describe resource records and their
usage of internationalized domain names and labels.

<owner domain name labels> <mnemonic> <[data] [data] [...]>

A, always provides a host identifier

<1*host> <A> <[IPv4 address]>


AAAA, always provides a host identifier

<1*host> <AAAA> <[IPv6 address]>


CNAME, can reference anything, can target anything

<1*opaque> <CNAME> <[1*opaque]>


NS, references a host, provides a host identifier

<1*host> <NS> <[1*host]>


SOA, references a host (delegation), provides host identifier, email
address, and custom data

<1*host> <SOA> <[1*host] [1mailbox (*host)] [serial] [refresh] [retry]
[expire] [ttl]>


WKS, always provides a host identifier

<1*host> <WKS> <[XX] [XX]>


PTR, can reference anything, must inherit target attributes

<1*opaque> <PTR> <[1*opaque]>


HINFO, references a host, provides RR-specific data

<1*host> <HINFO> <[hardware] [opsys]>


MX, references a host, provides a preference and a host identifier

<1*host> <MX> <[preference] [1*host]>


TXT, can reference anything, provides free-text data

<1*opaque> <TXT> <[text]>


RP, can reference anything, provides email address and a pointer to a TXT
RR

<1*opaque> <RP> <[1mailbox (*host)] [1*opaque]>


SRV, references a protocol (which is specified using the ASCII data-type),
provides preference values and a host identifier

<1*ASCII> <SRV> <[priority] [weight] [port] [1*host]>

[NOTE: cannot define <2ASCII *HOST> because not all SRV protocol labels
are just _service._transport]
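
An application could carry these rules around as a small lookup table.
The sketch below covers only a subset of the records listed above, and
the dictionary layout and function name are invented for illustration.
The "data" entry lists only the label data-types of domain names in the
record data; addresses, preferences and similar fields are omitted.

```python
# Label data-type of the owner name, and of any domain names in the data.
RR_LABEL_TYPES = {
    "A":     {"owner": "host",   "data": []},
    "AAAA":  {"owner": "host",   "data": []},
    "CNAME": {"owner": "opaque", "data": ["opaque"]},
    "NS":    {"owner": "host",   "data": ["host"]},
    "PTR":   {"owner": "opaque", "data": ["opaque"]},
    "MX":    {"owner": "host",   "data": ["host"]},
    "SRV":   {"owner": "ascii",  "data": ["host"]},
}


def owner_label_type(rrtype: str) -> str:
    """Return the label data-type required for the owner name of rrtype."""
    try:
        return RR_LABEL_TYPES[rrtype.upper()]["owner"]
    except KeyError:
        raise ValueError("no rule recorded for " + rrtype) from None


print(owner_label_type("srv"))
print(RR_LABEL_TYPES["MX"]["data"])
```

Looking up the record type first tells the application which of the
label checks in section 3 it has to apply before using the name.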