No Subject
> * Minimum length of one UCS character code.
Note: we avoid the term "character code" since it is so easily
misinterpreted. For clarity, we define and use the term "code point" for
what is meant here, and we distinguish that from the term "code unit".
- A code unit is 8, 16, or 32 bits in length.
- A code point may be represented by one or more code units.
- A "character" (in the sense meant by end-users) may be represented by
  one or more code points.
Examples:
1. 'a' is represented by one 8-bit code unit in UTF-8 or in ASCII, and
   one 16-bit code unit in UTF-16.
2. a-grave is represented by two 8-bit code units in UTF-8, and one
   16-bit code unit in UTF-16.
3. katakana ka is represented by three 8-bit code units in UTF-8, and
   one 16-bit code unit in UTF-16.
4. deseret dee is represented by four 8-bit code units in UTF-8, and two
   16-bit code units in UTF-16.
#1-#4 are each represented by one 32-bit code unit in UTF-32.
5. q-grave is represented by a sequence of two code points <U+0071,
   U+0300>, which in turn is represented by three 8-bit code units in
   UTF-8, two 16-bit code units in UTF-16, and two 32-bit code units in
   UTF-32.
See (http://www.unicode.org/glossary/)
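
For anyone who wants to check counts like these, here is a minimal
sketch in Python. The function name and the choice of sample code
points are mine, not from any specification; in particular I am assuming
U+10414 for deseret dee and U+0300 (combining grave accent) for the
fifth example.

```python
def code_unit_counts(text: str) -> dict:
    """Count code points and code units for a string.

    A Python 3 str is a sequence of code points, so len() counts code
    points. Encoding without a byte-order mark and dividing by the unit
    size in bytes gives the number of code units.
    """
    return {
        "code points": len(text),
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")) // 2,
        "UTF-32": len(text.encode("utf-32-le")) // 4,
    }


# Sample strings (assumed code points, see the note above).
SAMPLES = {
    "a": "\u0061",
    "a-grave": "\u00e0",
    "katakana ka": "\u30ab",
    "deseret dee": "\U00010414",
    "q-grave": "q\u0300",  # base letter plus a combining mark
}

if __name__ == "__main__":
    for name, text in SAMPLES.items():
        print(name, code_unit_counts(text))
```

The last sample is the interesting one: a single end-user character
that is two code points, and therefore more than one code unit in every
encoding form.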
—————
Ὀλίγοι ἔμφρονες πολλῶν ἀφρόνων
φοβερώτεροι -- Πλάτωνος
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]
http://www.macchiato.com
----- Original Message -----
From: "James Seng/Personal" <jseng@pobox.org.sg>
To: "Eric A. Hall" <ehall@ehsco.com>; "IDN" <idn@ops.ietf.org>
Sent: Wednesday, December 05, 2001 06:32
Subject: Re: [idn] naming syntax rules
> Thanks Eric. This is definitely more comprehensive than what I
> expected!
> I was only thinking of a simple summary of the whole series of
> discussions of internationalized domain names and hostnames, so we
> know how to capture that properly in Nameprep, ACE, IDNA, DM, or any
> other relevant document.
>
> Anyway, this is great stuff, perhaps sufficient to be a draft on its
> own. I am not sure how this fits into the IDN WG yet, but it is
> something we should discuss further.
>
> -James Seng
>
> ----- Original Message -----
> From: "Eric A. Hall" <ehall@ehsco.com>
> To: "IDN" <idn@ops.ietf.org>
> Sent: Wednesday, December 05, 2001 4:08 PM
> Subject: [idn] naming syntax rules
>
>
> >
> > James has been needling me to put together a second summary of the
> > naming rules in time for the IETF meeting. I have been extremely
> > busy lately (it is 3 am in a strange hotel room), but I wanted to at
> > least scratch together enough material for the concepts to be
> > tangible.
> >
> > The following text is by no means complete. It's hardly just begun.
> > However, it illustrates what the scope will be, exposes some of the
> > open issues, and may be usable as a touchstone to see if this whole
> > IDN thing is going to work or not.
> >
> > What I mean by that is, if systems are going to work with IDNs in
> > their raw form (not encoded form, but raw form), these are the rules
> > they will have to work with. If these rules are too complex, the
> > whole approach has to be reconsidered. That will affect a lot of
> > other things, including ACE.
> >
> > The basic idea here is to declare formal data-types for labels, and
> > to incorporate the data-types into syntaxes for applications and
> > protocols to use when they need to interact with domain names.
> >
> >
> > 1. Summary
> >
> > This memo describes two sets of definitions which are necessary for
> > the consistent and reliable use of internationalized domain names
> > across the Internet. First and foremost, this memo specifies the
> > rules which govern the structure and syntax of internationalized
> > domain names in various scenarios, and also describes their
> > legitimate characters and any normalization which may be required.
> > Secondarily, this memo clarifies and extends the usage rules of
> > common resource records so that internationalized domain names can
> > be stored and exchanged (either as resource record owner domain
> > names or as resource record data) in a form which is consistent
> > across all usage environments.
> >
> >
> > 2. Introduction
> >
> > There are many issues which affect the characters that are desirable
> > for use in DNS domain names. Among these considerations are obvious
> > aspects such as breadth, as well as less-obvious aspects such as
> > normalized forms of particular character sequences, comparison
> > efficiencies, and more.
> >
> > The general consensus of the IDN working group is that domain names
> > should use a mildly-restricted subset of the character codes and
> > arrangement sequences which are documented in the UCS for use with
> > languages, as this subset excludes non-verbal symbols and spurious
> > punctuation which are likely to be problematic, while still allowing
> > international domain names to be created. Furthermore, the consensus
> > is that these character sequences should be normalized and converted
> > to lowercase [in that order?] wherever this is possible, since this
> > will provide the tightest syntactical representation of the
> > supported characters with the least amount of ambiguity.
> >
> > While both of those objectives are highly desirable (and are met in
> > most of the scenarios), there are many instances where these
> > objectives are incompatible with existing practice. For example,
> > existing (STD13-compliant) DNS implementations are allowed to use
> > domain names which contain any eight-bit character code (0x00
> > through 0xFF). Some protocol models specifically require the use of
> > punctuation (SRV requires underscore, for example). Some resource
> > records can contain domain names that combine both of these elements
> > (SOA and RP both provide email addresses as domain names, and the
> > mailbox labels in those can contain punctuation or case-specific
> > US-ASCII letters).
> >
> > In order to accommodate these divergent requirements, this memo
> > describes multiple types of domain name labels, including their
> > valid characters, any case-conversions and/or normalizations which
> > may be required, and so forth.
> >
> > Furthermore, in order to ensure that these rules are consistently
> > implemented (and to minimize damage when they are not), this memo
> > also states which label data-types are valid for use with many of
> > the common resource records.
> >
> > Cumulatively, this means that a system which attempts to use an
> > internationalized domain name for a specific purpose will have to be
> > aware of the rules which govern the resource record which provides
> > that service, and of the rules which govern the domain name
> > data-types which are valid for that resource record. For example, if
> > an application knows that an internationalized domain name will be
> > used for a forward lookup, it will have to be aware of the label
> > data-types that are usable with A (or AAAA) resource records, and
> > must ensure that the domain name is processed (normalized and
> > lower-cased, in this example) before it is used.
> >
> > NOTE: Legacy systems which use a backwards-compatible encoding
> > scheme for access to resources with internationalized domain names
> > will not be required to perform any of these tests. However, systems
> > which embrace internationalized domain names as specific data (e.g.,
> > any system which encodes or decodes an internationalized domain name
> > as explicit data) will need to be aware of these issues and will
> > likely be required to enforce their usage.
> >
> >
> > 3. Domain Names and Label Data-Types
> >
> > An internationalized domain name is a sequence of labels which are
> > encapsulated in a message. The message may provide the labels as
> > separate units of data (as is the case with DNS), or may provide
> > them as a series of dot-separated textual strings (as is the case
> > when domain names are "written-out" in protocol or application data
> > streams).
> >
> > In global terms, an internationalized domain name has the following
> > characteristics:
> >
> > * Series of labels (1*label)
> >
> > * Maximum cumulative length of 255 UCS character codes (not
> > necessarily codes with matching characters, and most definitely not
> > octets or any encoded representation). This limit includes any
> > separators which may be provided (such as the full-stop character
> > commonly used as a separator when the domain name is written), and
> > also includes one character for the root domain (the trailing dot).
> >
> > The labels that make up a domain name will vary according to the
> > contextual use of the domain name.
> >
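
To make the length rule above concrete, here is a minimal sketch of how
the 255 limit could be checked on a written-out name. The function names
are illustrative only, and I am assuming the name arrives as a Python
str, where len() counts code points rather than octets.

```python
MAX_NAME_LENGTH = 255  # UCS character codes, per the rule quoted above


def name_length(name: str) -> int:
    """Length of a written domain name in code points.

    Separators are counted because they are part of the string. One
    position is added for the root domain when the trailing dot has
    been left off.
    """
    if not name.endswith("."):
        name += "."
    return len(name)


def fits_name_limit(name: str) -> bool:
    """True when the name is within the cumulative length limit."""
    return name_length(name) <= MAX_NAME_LENGTH
```

Note that the same name can be well under the limit in code points and
much longer in an encoded form: a name such as "日本語.jp" counts as 7
here, although its UTF-8 form with the trailing dot is 13 octets.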
> >
> > 3.1. Opaque Labels
> >
> > Some functions can use domain names which consist of unstructured or
> > unknown labels. For example, a TXT resource record can describe
> > anything, and as such, it can use any sequence of UCS characters for
> > its owner domain name.
> >
> > Opaque labels require no processing on the part of the application
> > which is using the domain name. It is the responsibility of the user
> > to provide the domain name to the application in its correct case
> > and/or normalization form.
> >
> > Opaque labels have the following characteristics:
> >
> > * Any valid UCS character code (not necessarily a valid UCS
> > character).
> >
> > * Minimum length of one UCS character code.
> >
> > * Maximum length of 63 UCS character codes.
> >
> > NOTE: Even though a domain name may sometimes consist of a variable
> > number of opaque labels, most domain names will also contain at
> > least some host labels. In those cases, the entire domain name
> > should be provided as a series of opaque labels, and the host labels
> > should be determined beforehand. For example, a CNAME resource
> > record can reference anything, including an A RR that consists
> > entirely of host labels, or a TXT RR that consists of a mixture of
> > opaque and host labels. As such, the CNAME will depend on the
> > formats in use by the alias target, and will inherit those
> > attributes.
> >
> >
> > 3.2. Host Identifier Labels
> >
> > Most functions will use domain names to identify a host, either
> > directly or indirectly. For example, a host may be identified by a
> > relative domain name which consists of only a local label, or by an
> > FQDN which contains a series of host labels. Since all forms must be
> > supportable, all namespace delegation functions also use the host
> > label syntax.
> >
> > The UCS characters provided in host labels are required to be
> > converted to lowercase and normalized according to the rules in
> > [nameprep] before they are processed. Servers are likely to treat
> > such labels as exact matches of the encoded data, so it is
> > imperative that applications perform this work before they encode
> > the label into a DNS query.
> >
> > Host labels are used for any lookups, protocol actions, or message
> > formats which specifically make use of internationalized domain
> > names for host identification purposes.
> >
> > Host labels have the following characteristics:
> >
> > * UCS characters from the following ranges:
> >
> > "letters" [need a property]
> >
> > characters with number property [?]
> >
> > characters with diacritical mark property [?]
> >
> > hyphen-minus (U+002D)
> >
> > * MUST be converted to lowercase according to [nameprep].
> >
> > * MUST be normalized according to [nameprep].
> >
> > * First and last characters in the label MUST NOT be a diacritical
> > mark or hyphen-minus.
> >
> > * Minimum length of two characters.
> >
> > * Maximum length of 63 characters.
> >
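
The host label rules above could be sketched roughly as follows. This is
only an illustration under loud assumptions: the draft has not yet fixed
which properties define "letters", numbers, and diacritical marks, so I
use the Unicode general categories L*, N*, and M* as stand-ins, and I
use lowercasing followed by NFKC as a stand-in for [nameprep], which
does more than that. All function names are hypothetical.

```python
import unicodedata


def prepare_host_label(label: str) -> str:
    """Stand-in for [nameprep]: lowercase, then normalize (NFKC)."""
    return unicodedata.normalize("NFKC", label.lower())


def _is_mark(ch: str) -> bool:
    # Assumption: "diacritical mark" means general category M*.
    return unicodedata.category(ch).startswith("M")


def _is_allowed(ch: str) -> bool:
    # Assumption: letters are L*, numbers are N*, marks are M*.
    return ch == "-" or unicodedata.category(ch)[0] in ("L", "N", "M")


def is_valid_host_label(label: str) -> bool:
    """Check an already-prepared label against the listed rules."""
    if not 2 <= len(label) <= 63:  # lengths in code points
        return False
    if not all(_is_allowed(ch) for ch in label):
        return False
    # First and last characters must not be a mark or hyphen-minus.
    return not any(ch == "-" or _is_mark(ch) for ch in (label[0], label[-1]))
```

An application would call prepare_host_label() first and validate the
result, since normalization can change both the length and the final
character of a label.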
> >
> > 3.3. ASCII Labels
> >
> > Some functions require labels that contain extended punctuation, but
> > which also require case-neutral comparisons. The most readily
> > apparent of these usages is the SRV resource record, which makes use
> > of the underscore character (U+005F) and case-neutral US-ASCII in
> > the owner labels.
> >
> > ASCII labels have the following characteristics:
> >
> > * Any printable character from US-ASCII (0x21 through 0x7E,
> > inclusive).
> >
> > * SHOULD be converted to lowercase as specified in [nameprep] (note
> > that servers are required to perform case-neutral comparisons, but
> > certain tools will likely prefer to generate and use lower-case
> > wherever possible, so lowercase is the preferred form). All
> > comparison operations on these domain names MUST be performed in a
> > case-neutral form.
> >
> > * Minimum length of one character.
> >
> > * Maximum length of 63 characters.
> >
> > NOTE: some resource records may define tighter restrictions.
> >
> > NOTE: Even though a domain name may sometimes consist of a variable
> > number of ASCII labels, most domain names will also contain at least
> > some host labels. In those cases, the entire domain name should be
> > provided as a series of opaque labels, and the ASCII and host labels
> > should be determined beforehand.
> >
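
The ASCII label rules above are simple enough to sketch directly. The
function names here are hypothetical, and plain ASCII lowercasing is my
stand-in for the lowercase step in [nameprep], which should give the
same result for labels restricted to this range.

```python
def is_valid_ascii_label(label: str) -> bool:
    """Printable US-ASCII only (0x21 through 0x7E), 1 to 63 characters."""
    return 1 <= len(label) <= 63 and all(0x21 <= ord(ch) <= 0x7E for ch in label)


def preferred_form(label: str) -> str:
    """Lowercase is the preferred form; only A-Z change in this range."""
    return label.lower()


def ascii_labels_equal(a: str, b: str) -> bool:
    """Case-neutral comparison, as required for ASCII labels."""
    return preferred_form(a) == preferred_form(b)
```

With this, "_LDAP" and "_ldap" compare equal, while a label containing
a space (0x20) is rejected as outside the printable range.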
> >
> > 3.4. Mailbox Labels
> >
> > Some functions provide SMTP mailboxes as labels within domain names.
> > For example, the SOA and RP resource records both provide email
> > addresses, with the first label providing the mailbox (local-part)
> > of the address, and with the remainder of the labels providing the
> > delivery domain of the address.
> >
> > In order for these resources to be accessible, applications must
> > process labels which are known to contain email addresses through
> > these rules. This means that data must be provided in a
> > non-normalized, non-lowercased form, and must be restricted to the
> > range of characters which are valid, as specified in section XX of
> > RFC 2822. Until RFC 2822 is deprecated or until such a time as UCS
> > characters can be stored in the mailbox portion of Internet standard
> > email addresses, the mailbox label is to be processed according to
> > the rules set forth in RFC 2822.
> >
> > There are two additional rules which govern this data-type:
> >
> > * Minimum length of one character.
> >
> > * Maximum length of 63 characters.
> >
> > NOTE: Mailbox labels can contain a large number of special
> > characters such as spaces or full-stop. These characters may require
> > escaping as described in section XX of this document.
> >
> > NOTE: Mailbox labels are NOT a subset of the ASCII labels. Mailbox
> > labels are case-sensitive, while ASCII labels are case-neutral.
> >
> >
> > 4. Resource Records
> >
> > The following structure is used to describe resource records and
> > their usage of internationalized domain names and labels.
> >
> > <owner domain name labels> <mnemonic> <[data] [data] [...]>
> >
> > A, always provides a host identifier
> >
> > <1*host> <A> <[IPv4 address]>
> >
> >
> > AAAA, always provides a host identifier
> >
> > <1*host> <AAAA> <[IPv6 address]>
> >
> >
> > CNAME, can reference anything, can target anything
> >
> > <1*opaque> <CNAME> <[1*opaque]>
> >
> >
> > NS, references a host, provides a host identifier
> >
> > <1*host> <NS> <[1*host]>
> >
> >
> > SOA, references a host (delegation), provides host identifier, email
> > address, and custom data
> >
> > <1*host> <SOA> <[1*host] [1mailbox (*host)] [serial] [refresh]
> > [retry] [expire] [ttl]>
> >
> >
> > WKS, always provides a host identifier
> >
> > <1*host> <WKS> <[XX] [XX]>
> >
> >
> > PTR, can reference anything, must inherit target attributes
> >
> > <1*opaque> <PTR> <[1*opaque]>
> >
> >
> > HINFO, references a host, provides RR-specific data
> >
> > <1*host> <HINFO> <[hardware] [opsys]>
> >
> >
> > MX, references a host, provides a preference and a host identifier
> >
> > <1*host> <MX> <[preference] [1*host]>
> >
> >
> > TXT, can reference anything, provides free-text data
> >
> > <1*opaque> <TXT> <[text]>
> >
> >
> > RP, can reference anything, provides email address and a pointer to
> > a TXT RR
> >
> > <1*opaque> <RP> <[1mailbox (*host)] [1*opaque]>
> >
> >
> > SRV, references a protocol (which is specified using the ASCII
> > data-type), provides preference values and a host identifier
> >
> > <1*ASCII> <SRV> <[priority] [weight] [port] [1*host]>
> >
> > [NOTE: cannot define <2ASCII *HOST> because not all SRV protocol
> > labels are just _service._transport]
> >
> >
>
>
>