Re: Draft 0.4
- To: <keka@im.se>,idn@ops.ietf.org
- Subject: Re: Draft 0.4
- From: bill@mail.nic.nu (J. William Semich)
- Date: Fri, 28 Jan 2000 17:52:58 -0500
- Delivery-date: Fri, 28 Jan 2000 14:59:51 -0800
- Envelope-to: idn-data@psg.com
I can agree with the majority of this proposal. Thanks, Kent - this is
much-needed work!
Bill Semich
.NU Domain
At 08:53 AM 1/28/00 -0800, keka@im.se wrote:
>Hi!
>
> Despite some initial trouble getting on this list,
>I've now been able to subscribe to it. Thanks Martin!
>I've tried to catch up on the e-mails so far, but I've
>only browsed quickly through them.
>
> I've been thinking a bit about how domain names should
>be internationalised, and the text below reflects my current
>thinking about this. Most of this text was written before
>browsing through the e-mail archive for this list. The
>formulations are sometimes as if it is a standards document,
>which it of course isn't.
>
> Note that I'm not a DNS expert, but I have some
>knowledge about Unicode.
>
> Kind regards
> /Kent Karlsson
>
>
>=========================================================
>(Converted from a proprietary document format to plain text.
>Not much touchup has been done after that.)
>
>
>Domain name internationalisation
>
>Draft 0.4
>
>2000-01-26
>
>Kent Karlsson, IMI—Industri-Matematik International
>keka@im.se
>
>
>1 Introduction
>
>This note is about how Internet domain names should be internationalised.
>It deals with the encoding and restrictions of domain names as sent to a DNS
>(Domain Name Server). Domain names can of course be stored differently
>inside of documents (e.g. in XHTML documents, or e-mail messages).
>
>At present Internet domain names are still restricted to 7-bit ASCII
>(ISO/IEC 646) as sent to a DNS, with some additional rules on which such
>characters are allowed. HTML, XML, IMAP, FTP, and many other text based
>items on the Internet have already been internationalised in the sense that
>a much wider range of characters are allowed, in particular using the UTF-8
>encoding of Unicode or ISO/IEC 10646-1. It is high time for domain names to
>be similarly internationalised.
>
>That the domain name internationalisation effort should be based on
>Unicode/UTF-8 is taken as a given, as no other contender offers both global
>viability and backwards compatibility with the existing DNS system.
>
>
>2 Unicode vs. ISO/IEC 10646
>
>Unicode 3.0 and ISO/IEC 10646-1:2000 allocate the same characters at the
>same (abstract) code positions. They both define a UTF-8 encoding format,
>with a slight difference (see below). They also both define a UTF-16
>format, but that format is not suitable for domain names as sent to a DNS
>server, taking backwards compatibility into account.
>
>Unicode (but not ISO/IEC 10646) assigns property codes to characters. For
>the purposes of this version of domain name internationalisation, both the
>normative and informative general category property assignments of Unicode
>3.0.0 are considered normative.
>
>
>3 Unicode versioning
>
>This version of domain name internationalisation is made with Unicode 3.0 as
>a basis. When new versions of Unicode are issued, one may need to
>re-examine the domain name internationalisation. Most likely, Unicode 3.0
>will be sufficient for domain name use.
>
>
>4 UTF-8 encoding
>
>The Unicode UTF-8 format is limited to the first 17 planes, while the
>ISO/IEC 10646 UTF-8 covers 32 768 planes. For the purposes of this version
>of domain name internationalisation, UTF-8 is limited to plane 0 (the Basic
>Multilingual Plane) only.
>
>The details of the UTF-8 encoding are not described here. Please see
>ISO/IEC 10646-1:2000, Annex D, or The Unicode Standard, version 3.0, annex
>?, or RFC 2044.
>
>UTF-8 is compatible with 7-bit ASCII, i.e. a 7-bit ASCII string where each
>octet has the 8th bit set to 0 is in UTF-8 already.
>
>4.1 Malformed UTF-8 encodings
>
>Looked-up potential domain names that contain malformed UTF-8 sequences
>shall be rejected by a DNS as unregistered or, optionally, as being in
>error.
>· An octet with the value FE or FF is a malformed UTF-8 sequence.
>· An isolated continuation octet is a malformed UTF-8 sequence.
>· A prematurely terminated UTF-8 sequence is a malformed UTF-8
>sequence.
>· An unnecessarily long (for the abstract code point encoded) UTF-8
>sequence is a malformed UTF-8 sequence.
>· UTF-8 sequences for the (abstract) code points FFFE and FFFF are
>malformed UTF-8 sequences.
>· A UTF-8 sequence longer than three octets is considered malformed
>for the purposes of this version of domain name internationalisation.
>
>4.2 Surrogates
>
>Surrogate character codes are reserved for use with UTF-16. These are the
>code points D800 – DFFF. A UTF-8 sequence for a surrogate character code is
>a malformed UTF-8 sequence.
>
>4.3 Private use characters
>
>Unicode reserves some code points for private use characters. In plane 0
>(BMP) these are U+E000 – U+F8FF. These are intended for use only by user
>agreement of some kind.
>
>Private use characters are inappropriate for use in domain names. A UTF-8
>sequence for a private use character code is considered a malformed UTF-8
>sequence for the purposes of this version of domain name
>internationalisation.
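
As an illustration of the rules in 4.1 to 4.3 (not part of the draft itself), the checks could be sketched in Python roughly as follows. The function names are hypothetical, and the surrogate range used is U+D800 to U+DFFF, the full surrogate block in Unicode.

```python
def decode_restricted_utf8(octets: bytes) -> str:
    """Decode UTF-8 under the draft's restrictions; raise ValueError if malformed."""
    chars = []
    i = 0
    while i < len(octets):
        lead = octets[i]
        if lead < 0x80:                       # 7-bit ASCII, one octet
            cp, length = lead, 1
        elif lead < 0xC0:                     # 80..BF cannot start a sequence
            raise ValueError("isolated continuation octet")
        elif lead < 0xE0:                     # C0..DF start a two-octet sequence
            cp, length = lead & 0x1F, 2
        elif lead < 0xF0:                     # E0..EF start a three-octet sequence
            cp, length = lead & 0x0F, 3
        elif lead < 0xFE:                     # F0..FD start 4 to 6 octet sequences
            raise ValueError("sequence longer than three octets")
        else:
            raise ValueError("octet FE or FF")
        tail = octets[i + 1:i + length]
        if len(tail) != length - 1 or any((o & 0xC0) != 0x80 for o in tail):
            raise ValueError("prematurely terminated sequence")
        for o in tail:
            cp = (cp << 6) | (o & 0x3F)
        # Smallest code point that genuinely needs 1, 2 or 3 octets.
        if cp < (0x00, 0x80, 0x800)[length - 1]:
            raise ValueError("unnecessarily long sequence")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code point")
        if 0xE000 <= cp <= 0xF8FF:
            raise ValueError("private use code point")
        if cp in (0xFFFE, 0xFFFF):
            raise ValueError("code point FFFE or FFFF")
        chars.append(chr(cp))
        i += length
    return "".join(chars)


def is_well_formed(octets: bytes) -> bool:
    """True if the octets pass every check above."""
    try:
        decode_restricted_utf8(octets)
    except ValueError:
        return False
    return True
```

A resolver following the draft would treat any name for which is_well_formed() returns False as unregistered or, optionally, as an error.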
>
>
>5 Unicode general categories
>
>Unicode assigns general categories (as well as other character properties)
>to characters. The Unicode 3.0 general categories and their interpretation
>for domain names are discussed in the following sections.
>
>Unicode regards some of these properties as normative, some as informative.
>For this version of internationalised domain names, all of them are
>considered normative.
>
>5.1 Letters, ideographs, and syllable characters
>
>Lu  Letter, Uppercase    Ok for domain names
>Ll  Letter, Lowercase    Ok for domain names
>Lt  Letter, Titlecase    Ok for domain names
>Lm  Letter, Modifier     Ok for domain names
>Lo  Letter, Other        Ok for domain names
>
>All of the letters, ideographs, and syllable characters of Unicode 3.0 are
>appropriate for use in domain names. Note however that a difference in
>letter characters need not imply a difference in domain name. Canonical,
>compatibility, and case distinctions are to be ignored. Case distinctions
>have been ignored in domain names since the beginning. Since case is ignored,
>so should the less important compatibility distinctions. See also clause 6
>below about normalisation.
>
>5.2 Combining marks
>
>Mn  Mark, Non-Spacing          Must not be first, nor after a FULL STOP
>                               (not the LEFT/RIGHT half ones)
>Mc  Mark, Spacing Combining    Must not be first, nor after a FULL STOP
>Me  Mark, Enclosing            Probably inappropriate for domain names
>
>Used with reason and in moderation, combining marks are ok for use with
>domain names. Note however that character sequence distinctions that are
>equivalenced by Unicode canonical equivalence do not imply a difference in
>domain name. See also the clause about normalisation below.
>
>There are a number of script specific rules on how combining characters
>should be applied. For the purposes of domain names, we note that they are
>not to come first in any (FULL STOP separated) part of a domain name. See
>also clause 6 below about normalisation, and clause 7 below about scripts.
>
>5.3 Numbers
>
>Nd  Number, Decimal Digit    Ok for domain names
>Nl  Number, Letter           Ok for domain names
>No  Number, Other            Inappropriate for domain names? (comp. decomp.)
>
>Many “number” characters are ok for use with domain names. Note however
>that many number characters have compatibility decompositions into
>letters, ideographs, or other number characters, and so are equivalent in a
>domain name. [The “No” characters that do not have a decomposition??]
>
>5.4 Punctuation
>
>Pc  Punctuation, Connector        Inappropriate for domain names (possibly
>                                  with some exceptions, like KATAKANA
>                                  MIDDLE DOT)
>Pd  Punctuation, Dash             Inappropriate for domain names, except
>                                  for a few characters (see below).
>Ps  Punctuation, Open             Inappropriate for domain names
>Pe  Punctuation, Close            Inappropriate for domain names
>Pi  Punctuation, Initial quote    Inappropriate for domain names
>Pf  Punctuation, Final quote      Inappropriate for domain names
>Po  Punctuation, Other            Inappropriate for domain names, except
>                                  for a few characters (see below).
>
>Domain name rules have always excluded punctuation characters, except for
>FULL STOP, which is given special significance within domain names. MIDDLE
>DOT and HYPHEN (or HYPHEN-MINUS) may need to be considered to be allowed.
>
>Punctuation has been excluded from domain names proper, since some (not all)
>punctuation characters in 7-bit ASCII have been used for other purposes near
>domain names. E.g. @, !, /, :, and % have special meanings near domain
>names in many contexts. Other punctuation is reserved for present or
>possible future use near domain names.
>
>BiDi and FULL STOPs (and @s)??
>
>5.5 Symbols
>
>Sm  Symbol, Math        Inappropriate for domain names
>Sc  Symbol, Currency    Inappropriate for domain names
>Sk  Symbol, Modifier    Inappropriate for domain names?
>So  Symbol, Other       Inappropriate for domain names (comp. decomp.?)
>
>As is the case for punctuation, symbols are inappropriate for use with domain
>names.
>
>5.6 Separators
>
>Zs  Separator, Space        Inappropriate for domain names
>Zl  Separator, Line         Inappropriate for domain names
>Zp  Separator, Paragraph    Inappropriate for domain names
>
>Spaces and similar separators (like LINE FEED) have always been considered
>inappropriate for use in domain names. Unicode has many more different
>space characters than ASCII, and it also has new line/paragraph separation
>characters.
>
>5.7 Other characters
>
>Cc  Other, Control         Inappropriate for domain names
>Cf  Other, Format          Inappropriate for domain names (mostly??)
>Cs  Other, Surrogate       Inappropriate for domain names
>Co  Other, Private Use     Inappropriate for domain names
>Cn  Other, Not Assigned    Inappropriate for domain names in this version
>
>Control, format, surrogate, and private use characters are inappropriate for
>use in domain names. For this version of internationalised domain names,
>(abstract) code points that were unassigned in Unicode 3.0 are
>inappropriate.
>
>Note that the class Cf includes ZERO WIDTH NO-BREAK SPACE, which can be used
>as a “signature” when at the beginning of a string. This use is also
>inappropriate for domain names.
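
As an illustration of the category rules in clause 5 (not part of the draft itself), a registration-time check could be sketched in Python roughly as follows. The function names and the exact allowed set are assumptions: the categories the draft marks as doubtful (Me, No) are treated as disallowed, HYPHEN-MINUS is let through as one of the possible exceptions, and Python's unicodedata module reflects a later Unicode version than 3.0.

```python
import unicodedata

# Categories the draft rates "Ok", plus the two mark categories that are
# allowed anywhere except first in a FULL STOP separated part.
ALLOWED = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Nd", "Nl"}
MARKS = {"Mn", "Mc"}


def label_problems(label: str) -> list:
    """Return a list of human-readable problems found in one name part."""
    problems = []
    for index, ch in enumerate(label):
        if ch == "-":
            # HYPHEN-MINUS is category Pd; assumed here to be an allowed exception.
            continue
        category = unicodedata.category(ch)
        if category not in ALLOWED:
            problems.append(f"U+{ord(ch):04X} has category {category}")
        elif category in MARKS and index == 0:
            problems.append(f"U+{ord(ch):04X} is a combining mark in first position")
    return problems


def domain_name_problems(name: str) -> list:
    """Check every FULL STOP separated part of a candidate domain name."""
    problems = []
    for label in name.split("."):
        problems.extend(label_problems(label))
    return problems
```

Per 6.2, such a check would be applied to the string after KC normalisation, not to the raw input.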
>
>5.8 The Plane 14 suggestion
>
>The “language tag” characters that are suggested to be allocated in plane
>14 (see Unicode technical report number 7) are inappropriate for use in
>domain names.
>
>5.9 ISO/IEC TR 10176 AMD 1
>
>The technical report ISO/IEC TR 10176 (Guidelines for the preparation of
>programming language standards) in its revised (soon to be AMD 1) annex
>lists characters that at a minimum should be accepted in programming
>language identifiers. It does so for a “level 2 implementation” of ISO/IEC
>10646. A domain name is similar to an “identifier” in a programming
>language, so what 10176 lists in its (revised!) Annex A should at least be
>considered.
>
>See PDAM text at http://std.dkuug.dk/jtc1/sc22/wg20/docs/n699.pdf.
>Note that this TR (as amended in what will be AMD 1) is based on Unicode
>2.1, not Unicode 3.0. An AMD 2, etc., is promised to only extend what is in
>AMD 1. Note also that compatibility forms are excluded from the lists in
>AMD 1, but programming languages may of course allow both compatibility
>forms and “level 2” combining marks. Nothing is said in AMD 1 about
>normalisation.
>
>ISO/IEC TR 10176 PDAM 1 is supported by the Unicode consortium, and is their
>(and SC22/WG20's) correction to the original list. The original list should
>be considered defective.
>
>
>6 Normalisation for domain names
>
>6.1 Case normalisation
>
>Internet domain names have been case insensitive from the start. When
>extending the allowed characters in domain names, it would be unwise to
>either abandon case insensitiveness or restrict it to just the ASCII part.
>Instead, this principle should be extended to the new characters allowed in
>domain names. However, there are some problems with this. First, the case
>mappings documented by the Unicode consortium are only informative, not
>normative. Second, there are some known exceptions, such as that for Turkish i
>and dotless i. Third, in several more cases the case mapping is not 1 to
>1, e.g. sharp s (ß; U+00DF) maps to uppercase SS, mapping that back to
>lowercase gives ss. There are several other such cases. [not sure exactly
>what to do with these]
>
>Unicode Technical Report number 21 [UTR21] describes one way of doing this
>[is that appropriate? Any better way of doing this?] SHARP S, YPOGEGRAMMENI,
>PROSGEGRAMMENI? Map to lowercase? Map to uppercase? tolower(toupper(x))?
>UTR 21 (with the associated data file CaseFolding.txt) essentially
>(exactly?) implies tolower(toupper(x)) (see also below); dotless i might not
>be handled the way desired (in Turkey), nor are sigma and other letters with
>final forms.
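
As an illustration of the tolower(toupper(x)) idea and its problems (not part of the draft itself), here is a minimal Python sketch. The function name is hypothetical, and Python's str.upper() and str.lower() follow the case mapping data of a later Unicode version than 3.0.

```python
def lower_of_upper(text: str) -> str:
    """Apply the tolower(toupper(x)) mapping using Python's built-in case mappings."""
    return text.upper().lower()


# Sharp s (U+00DF) uppercases to "SS", so the round trip does not return to it.
# Dotless i (U+0131) uppercases to plain "I", which lowercases to dotted "i".
for sample in ("stra\u00dfe", "\u0131"):
    print(repr(sample), "->", repr(lower_of_upper(sample)))
```

Both samples come back changed, so after such a mapping "straße" and "strasse" would name the same domain, and Turkish dotless i would be indistinguishable from i.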
>
>6.2 Unicode normalisation
>
>Canonical distinctions, in the Unicode sense, shall be ignored.
>Since case distinctions should be ignored, compatibility distinctions should
>most certainly be ignored too. Compatibility distinctions can be normalised
>away with the same algorithm as canonical distinctions are normalised away.
>Normalisation form KC (compatibility decomposition, logically followed by
>canonical composition), see Unicode Technical Report number 15 [UTR15],
>should be used for domain names, at least at registration time, if not at
>lookup time. Among a few other things, this maps WIDE, NARROW, and
>PRESENTATION FORM characters to their nominal corresponding character.
>
>It is the character string resulting from KC normalisation to which the
>category tests above refer.
>
>Normalisation KC by itself does not imply any case normalisation.
>Note that normalise(KC, casefold(x)) is not the same as
>casefold(normalise(KC, x)), if casefold follows CaseFolding.txt.
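
As an illustration of why the order of KC normalisation and case folding matters (not part of the draft itself), here is a minimal Python sketch. The function names are hypothetical; unicodedata.normalize("NFKC", ...) and str.casefold() stand in for normalise(KC, x) and casefold(x), and both follow a later Unicode version than 3.0.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Normalisation form KC: compatibility decomposition, then canonical composition."""
    return unicodedata.normalize("NFKC", text)


def fold_then_normalise(text: str) -> str:
    return nfkc(text.casefold())


def normalise_then_fold(text: str) -> str:
    return nfkc(text).casefold()


# U+2102 DOUBLE-STRUCK CAPITAL C has no case folding of its own, but its
# compatibility decomposition is the uppercase letter "C".
sample = "\u2102"
print(repr(fold_then_normalise(sample)), repr(normalise_then_fold(sample)))
```

Folding first leaves an uppercase "C" behind, while normalising first yields "c", so a specification would have to fix one order (or apply folding both before and after normalisation).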
>
>
>6.3 Further normalisation
>
>FINAL SIGMA, FINAL KAF, FINAL MEM, FINAL NUN, FINAL PE, FINAL TSADI, FINAL
>SEMKATH, BOPOMOFO FINAL *? Suggestion: ignore ‘finality’, i.e., consider
>them to be equivalent to their corresponding ‘ordinary’ version.
>
>[Funny, CaseFolding.txt maps all sigmas to final(!) sigmas; but does nothing
>for other ‘final’ characters.]
>
>Map HYPHEN, NO-BREAK HYPHEN, and * DASHes to HYPHEN-MINUS? Remove * SOFT
>HYPHEN and ZWSP?
>
>“New line function” ‘normalisation’ (see UTR 13) does not apply to domain
>names, since no domain name is to have any such character in it.
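
As an illustration of the suggestions in 6.3 (not part of the draft itself, which leaves most of them as open questions), the mappings could be sketched in Python as a translation table. The table is deliberately partial and the names are hypothetical: it covers final sigma, the five Hebrew final letters, a few hyphen and dash characters, SOFT HYPHEN and ZWSP only.

```python
# Final forms mapped to their 'ordinary' counterparts (partial list).
FINAL_TO_ORDINARY = {
    "\u03c2": "\u03c3",  # GREEK SMALL LETTER FINAL SIGMA -> SIGMA
    "\u05da": "\u05db",  # HEBREW LETTER FINAL KAF -> KAF
    "\u05dd": "\u05de",  # HEBREW LETTER FINAL MEM -> MEM
    "\u05df": "\u05e0",  # HEBREW LETTER FINAL NUN -> NUN
    "\u05e3": "\u05e4",  # HEBREW LETTER FINAL PE -> PE
    "\u05e5": "\u05e6",  # HEBREW LETTER FINAL TSADI -> TSADI
}

# HYPHEN, NO-BREAK HYPHEN, FIGURE DASH, EN DASH, EM DASH -> HYPHEN-MINUS.
HYPHEN_LIKE = "\u2010\u2011\u2012\u2013\u2014"

# SOFT HYPHEN and ZERO WIDTH SPACE are removed outright.
REMOVED = "\u00ad\u200b"

_TABLE = str.maketrans({
    **FINAL_TO_ORDINARY,
    **{ch: "-" for ch in HYPHEN_LIKE},
    **{ch: None for ch in REMOVED},
})


def further_normalise(label: str) -> str:
    """Apply the partial 6.3 mappings to one name part."""
    return label.translate(_TABLE)
```

Such a step would run in addition to KC and case normalisation; whether it belongs before or after them is not settled by the draft.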
>
>
>6.4 A possible alternative to normalisation: collation weighting
>
>A possible alternative to doing KC and case normalisation is to use the ISO/IEC
>14651 CTT (common template table), or the UTR 10 associated tables, with
>some tailoring suitable for the DNS (no, NOT local ones). In particular,
>punctuation and symbols must be significant at level 1. Then determine
>equality up to and including level 2 (accents; similar), but not level 3
>(case; hira/kata, various compatibility distinctions).
>
>This is also based on Unicode 2.1, not yet Unicode 3.0. Also, there is at
>present NO promise not to make changes that may affect, to some degree, use of
>the weightings that result. In particular, for 14651 no particular weight
>VALUES are assigned. That is up to each implementation. For the UTR 10
>tables, the actual weight values may change at any update (or in any
>suitable way by tailoring, or other implementation decisions), so different
>versions cannot be used in a mix. Finally, there is no resulting “normal
>form” character string from these weight tables.
>
>
>7 One should not mix scripts between FULL STOPs
>
>It is not a good idea to mix scripts freely in a single “part” of a domain
>name. E.g., it would be very confusing if an initial A is a Greek A, while
>the rest of the name part is in the Latin script.
>
>However, what constitutes a script is not clearly defined, and some
>orthographies (like the Japanese) normally do mix “scripts” in a single
>“word”. Therefore this must be left for human judgement. For an automated
>service one may apply some heuristic on suggested names that may need human
>scrutiny, or reject doubtful cases for registration. Note also that ASCII
>digits can be used with any other script, and many of the combining
>non-spacing marks are script generic, i.e. can be used with several
>different scripts.
>
>No rigid scheme should be applied for this. It should only be a
>registration time heuristic, overrideable by human intervention.
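
As an illustration of such a registration-time heuristic (not part of the draft itself), one very rough approach in Python is to take the first word of each letter's Unicode character name as its "script". This is an assumption made for the sketch, not a real script property, and the function names are hypothetical.

```python
import unicodedata


def scripts_in_label(label: str) -> set:
    """Approximate the set of scripts used by the letters in one name part."""
    scripts = set()
    for ch in label:
        # Only letters are considered; digits and combining marks are
        # treated as script generic, as the draft notes.
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])  # e.g. "LATIN", "GREEK", "CJK"
    return scripts


def needs_human_review(label: str) -> bool:
    """Flag name parts whose letters appear to come from more than one script."""
    return len(scripts_in_label(label)) > 1
```

Ordinary Japanese names mixing ideographs and kana are flagged too, which is why the result is only a prompt for human scrutiny and never an automatic rejection.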
>
>
>8 &-encoding (XML), %-encoding (URL), and =-encoding (QP)
>
>Any &-encoding used in XML (or HTML) documents in a string that contains a
>domain name shall be decoded before sending the domain name to a DNS system.
>Note that XML &-codes are character oriented and independent of the
>character encoding used for the XML document itself.
>
>Any %-encoding in a URL shall not be decoded in the domain name part, and %
>as such is not legal in a domain name. Such a domain name is thus
>malformed. The % character may mean something else though, so no attempt at
>URL %-decoding shall be done at that point. In addition, the octet oriented
>(not character oriented) %-encoding is for an unknown character encoding,
>and any attempt at decoding it by the client is likely to be in error.
>
>Any =-encoding in an e-mail in Quoted-Printable shall be decoded according
>to the charset declaration of the message. Hopefully, Quoted-Printable will
>go out of use, so this should be less of a problem...
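
As an illustration of the three rules in clause 8 (not part of the draft itself), here is a minimal Python sketch. The function names are hypothetical, and html.unescape() is used as a stand-in for &-decoding even though it also accepts HTML named references that plain XML does not define.

```python
import html
import quopri


def host_from_markup(raw: str) -> str:
    """Decode &-codes, then refuse any name that still contains '%'."""
    decoded = html.unescape(raw)
    if "%" in decoded:
        # No %-decoding is attempted: the octets are in an unknown encoding.
        raise ValueError("'%' is not legal in a domain name")
    return decoded


def try_host_from_markup(raw: str):
    """Return the decoded name, or None if it is malformed."""
    try:
        return host_from_markup(raw)
    except ValueError:
        return None


def host_from_quoted_printable(raw: bytes, charset: str) -> str:
    """Decode =-codes, then interpret the octets per the message's charset."""
    return quopri.decodestring(raw).decode(charset)
```

The charset argument has to come from the charset declaration of the message; guessing it would reintroduce the problem the draft describes for %-encoding.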
>
>
>9 E-mail address internationalisation
>
>The pre-@ part of e-mail addresses should be internationalised in the same
>way as domain names are internationalised.
>
>====================================================================
>
>
>
Bill Semich
President and Founder
.NU Domain Ltd
http://whats.nu
bill@mail.nic.nu