Re: Draft 0.4

To: <keka@im.se>,idn@ops.ietf.org
Subject: Re: Draft 0.4
From: bill@mail.nic.nu (J. William Semich)
Date: Fri, 28 Jan 2000 17:52:58 -0500
Delivery-date: Fri, 28 Jan 2000 14:59:51 -0800
Envelope-to: idn-data@psg.com
I can agree with the majority of this proposal. Thanks, Kent - this is
much-needed work!

Bill Semich
.NU Domain


At 08:53 AM 1/28/00 -0800, keka@im.se wrote:
>Hi!
>

>	Despite some initial trouble getting on this list,
>I've now been able to subscribe to it.  Thanks Martin!
>I've tried to catch up on the e-mails so far, but I've
>only browsed quickly through them.
>
>	I've been thinking a bit about how domain names should
>be internationalised, and the text below reflects my current
>thinking about this.  Most of this text was written before
>browsing through the e-mail archive for this list.  The
>formulations are sometimes as if it is a standards document,
>which it of course isn't.
>
>	Note that I'm not a DNS expert, but I have some
>knowledge about Unicode.
>
>		Kind regards
>		/Kent Karlsson
>
>
>=========================================================
>(Converted from a proprietary document format to plain text.
>Not much touchup has been done after that.)
>
>
>Domain name internationalisation
>
>Draft 0.4
>
>2000-01-26
>
>Kent Karlsson, IMI—Industri-Matematik International
>keka@im.se
>
>
>1	Introduction
>
>This note is about how Internet domain names should be internationalised.
>It deals with the encoding and restrictions of domain names as sent to a DNS
>(Domain Name Server).  Domain names can of course be stored differently
>inside of documents (e.g. in XHTML documents, or e-mail messages).
>
>At present Internet domain names are still restricted to 7-bit ASCII
>(ISO/IEC 646) as sent to a DNS, with some additional rules on which such
>characters are allowed.  HTML, XML, IMAP, FTP, and many other text based
>items on the Internet have already been internationalised in the sense that
>a much wider range of characters is allowed, in particular using the UTF-8
>encoding of Unicode or ISO/IEC 10646-1.  It is high time for domain names to
>be similarly internationalised.
>
>That the domain name internationalisation effort should be based on
>Unicode/UTF-8 is taken as a given, as no other candidate offers both global
>viability and backwards compatibility with the existing DNS system.
>
>
>2	Unicode vs. ISO/IEC 10646
>
>Unicode 3.0 and ISO/IEC 10646-1:2000 allocate the same characters at the
>same (abstract) code positions.  They both define a UTF-8 encoding format,
>with a slight difference (see below).  They also both define a UTF-16
>format, but that format is not suitable for domain names as sent to a DNS
>server, taking backwards compatibility into account.
>
>Unicode (but not ISO/IEC 10646) assigns property codes to characters.  For
>the purposes of this version of domain name internationalisation, both the
>normative and informative general category property assignments of Unicode
>3.0.0 are considered normative.
>
>
>3	Unicode versioning
>
>This version of domain name internationalisation is made with Unicode 3.0 as
>a basis.  When new versions of Unicode are issued, one may need to
>re-examine the domain name internationalisation.  Most likely, Unicode 3.0
>will be sufficient for domain name use.
>
>
>4	UTF-8 encoding
>
>The Unicode UTF-8 format is limited to the first 17 planes, while the
>ISO/IEC 10646 UTF-8 covers 32 768 planes.  For the purposes of this version
>of domain name internationalisation, UTF-8 is limited to plane 0 (the Basic
>Multilingual Plane) only.
>
>The details of the UTF-8 encoding are not described here.  Please see
>ISO/IEC 10646-1:2000, Annex D, or The Unicode Standard, version 3.0, annex
>?, or RFC 2279 (which obsoletes RFC 2044).
>
>UTF-8 is compatible with 7-bit ASCII, i.e. a 7-bit ASCII string where each
>octet has the 8th bit set to 0 is in UTF-8 already.
>
>4.1	Malformed UTF-8 encodings
>
>Looked-up potential domain names that contain malformed UTF-8 sequences
>shall be rejected by a DNS as unregistered or, optionally, as being in
>error.
>·	An octet with the value FE or FF is a malformed UTF-8 sequence.
>·	An isolated continuation octet is a malformed UTF-8 sequence.
>·	A prematurely terminated UTF-8 sequence is a malformed UTF-8
>sequence.
>·	An unnecessarily long (for the abstract code point encoded) UTF-8
>sequence is a malformed UTF-8 sequence.
>·	UTF-8 sequences for the (abstract) code points FFFE and FFFF are
>malformed UTF-8 sequences.
>·	A UTF-8 sequence longer than three octets is considered malformed
>for the purposes of this version of domain name internationalisation.
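The malformed cases listed above can be checked with a small hand-written decoder. The following is only a sketch under stated assumptions: the function name decode_bmp_utf8 is made up for illustration, and the surrogate and private use exclusions of 4.2 and 4.3 are deliberately not covered here.

```python
def decode_bmp_utf8(data: bytes) -> list:
    """Decode UTF-8 octets to code points, raising ValueError on the
    malformed cases listed in clause 4.1 (hypothetical helper)."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                       # one octet, plain 7-bit ASCII
            cp, need, minimum = b, 0, 0
        elif b <= 0xBF:                    # 80..BF with no lead octet before it
            raise ValueError("isolated continuation octet")
        elif b <= 0xDF:                    # C0..DF: lead of a two-octet sequence
            cp, need, minimum = b & 0x1F, 1, 0x80
        elif b <= 0xEF:                    # E0..EF: lead of a three-octet sequence
            cp, need, minimum = b & 0x0F, 2, 0x800
        elif b >= 0xFE:
            raise ValueError("octet FE or FF")
        else:                              # F0..FD: four octets or more
            raise ValueError("sequence longer than three octets")
        tail = data[i + 1:i + 1 + need]
        if len(tail) < need or any(t & 0xC0 != 0x80 for t in tail):
            raise ValueError("prematurely terminated sequence")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        if cp < minimum:                   # shorter encoding exists
            raise ValueError("unnecessarily long sequence")
        if cp in (0xFFFE, 0xFFFF):
            raise ValueError("code point FFFE or FFFF")
        out.append(cp)
        i += 1 + need
    return out
```

A lookup routine could treat any ValueError from such a decoder as "unregistered" or as an error, whichever of the two options in 4.1 is chosen.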
>
>4.2	Surrogates
>
>Surrogate character codes are reserved for use with UTF-16.  These are the
>code points D800 – DFFF. A UTF-8 sequence for a surrogate character code is
>a malformed UTF-8 sequence.
>
>4.3	Private use characters
>
>Unicode reserves some code points for private use characters.  In plane 0
>(BMP) these are U+E000 – U+F8FF. These are intended for use only by user
>agreement of some kind.
>
>Private use characters are inappropriate for use in domain names.  A UTF-8
>sequence for a private use character code is considered a malformed UTF-8
>sequence for the purposes of this version of domain name
>internationalisation.
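As a rough illustration of 4.2 and 4.3, the two exclusions can be expressed as range tests on an already decoded code point. The names below are assumptions for this sketch; the surrogate range used is the full Unicode block U+D800 to U+DFFF (high and low halves together).

```python
# Ranges are half-open in Python, so the upper bounds are one past the last code point.
SURROGATES = range(0xD800, 0xE000)        # U+D800 .. U+DFFF
PRIVATE_USE_BMP = range(0xE000, 0xF900)   # U+E000 .. U+F8FF

def is_excluded_code_point(cp: int) -> bool:
    """True if cp is a surrogate or a BMP private use code point (hypothetical helper)."""
    return cp in SURROGATES or cp in PRIVATE_USE_BMP
```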
>
>
>5	Unicode general categories
>
>Unicode assigns general categories (as well as other character properties)
>to characters.  The Unicode 3.0 general categories and their interpretation
>for domain names are discussed in the following sections.
>
>Unicode regards some of these properties as normative, some as informative.
>For this version of internationalised domain names, all of them are
>considered normative.
>
>5.1	Letters, ideographs, and syllable characters
>
>Lu	Letter, Uppercase 	Ok for domain names
>Ll	Letter, Lowercase 	Ok for domain names
>Lt	Letter, Titlecase 	Ok for domain names
>Lm	Letter, Modifier 	Ok for domain names
>Lo	Letter, Other 	Ok for domain names
>
>All of the letters, ideographs, and syllable characters of Unicode 3.0 are
>appropriate for use in domain names.  Note however that a difference in
>letter characters need not imply a difference in domain name.  Canonical,
>compatibility, and case distinctions are to be ignored.  Case distinctions
>have been ignored in domain names since the beginning.  Since case is ignored, so
>should the less important compatibility distinctions.  See also clause 6
>below about normalisation.
>
>5.2	Combining marks
>
>Mn 	Mark, Non-Spacing 	Must not be first, nor after a FULL STOP
>(not the LEFT/RIGHT half ones)
>Mc	Mark, Spacing Combining	Must not be first, nor after a FULL STOP
>Me	Mark, Enclosing 	Probably inappropriate for domain names
>
>Used with reason and in moderation, combining marks are ok for use with
>domain names.  Note however that character sequence distinctions that are
>equivalenced by Unicode canonical equivalence do not imply a difference in
>domain name.  See also the clause about normalisation below.
>
>There are a number of script specific rules on how combining characters
>should be applied.  For the purposes of domain names, we note that they are
>not to come first in any (FULL STOP separated) part of a domain name.   See
>also clause 6 below about normalisation, and clause 7 below about scripts.
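The "not first in any part" rule for combining marks could be checked roughly as follows. This is a sketch only: labels_start_ok is a made-up name, and Python's unicodedata module reports categories for whatever Unicode version ships with the interpreter, not necessarily Unicode 3.0.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")

def labels_start_ok(name: str) -> bool:
    """True if no FULL STOP separated part begins with a combining mark."""
    for label in name.split("."):
        if label and unicodedata.category(label[0]) in MARK_CATEGORIES:
            return False
    return True
```

For example, "e" followed by U+0301 COMBINING ACUTE ACCENT is fine at the start of a part, while a part that begins with U+0301 itself is not.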
>
>5.3	Numbers
>
>Nd	Number, Decimal Digit 	Ok for domain names
>Nl	Number, Letter 	Ok for domain names
>No	Number, Other 	Inappropriate for domain names? (comp. decomp.)
>
>Many “number” characters are ok for use with domain names.  Note however
>that many number characters have compatibility decompositions into
>letters, ideographs, or other number characters, and so are equivalent in a
>domain name.  [The “No” characters that do not have a decomposition??]
>
>5.4	Punctuation
>
>Pc	Punctuation, Connector 	Inappropriate for domain names (possibly
>with some exceptions, like KATAKANA MIDDLE DOT)
>Pd	Punctuation, Dash 	Inappropriate for domain names, except for a
>few characters (see below).
>Ps	Punctuation, Open 	Inappropriate for domain names
>Pe	Punctuation, Close 	Inappropriate for domain names
>Pi	Punctuation, Initial quote	Inappropriate for domain names
>Pf	Punctuation, Final quote	Inappropriate for domain names
>Po	Punctuation, Other	Inappropriate for domain names, except for a
>few characters (see below).
>
>Domain name rules have always excluded punctuation characters, except for
>FULL STOP, which is given special significance within domain names.  MIDDLE
>DOT and HYPHEN (or HYPHEN-MINUS) may need to be considered for inclusion.
>
>Punctuation has been excluded from domain names proper, since some (not all)
>punctuation characters in 7-bit ASCII have been used for other purposes near
>domain names.  E.g. @, !, /, :, and % have special meanings near domain
>names in many contexts.  Other punctuation is reserved for present or
>possible future use near domain names.
>
>BiDi and FULL STOPs (and @s)??
>
>5.5	Symbols
>
>Sm	Symbol, Math 	Inappropriate for domain names
>Sc	Symbol, Currency 	Inappropriate for domain names
>Sk	Symbol, Modifier 	Inappropriate for domain names?
>So	Symbol, Other 	Inappropriate for domain names (comp. decomp.?)
>
>As is the case for punctuation, symbols are inappropriate for use with domain
>names.
>
>5.6	Separators
>
>Zs	Separator, Space 	Inappropriate for domain names
>Zl	Separator, Line 	Inappropriate for domain names
>Zp	Separator, Paragraph 	Inappropriate for domain names
>
>Spaces and similar separators (like LINE FEED) have always been considered
>inappropriate for use in domain names.  Unicode has many more different
>space characters than ASCII, and it also has new line/paragraph separation
>characters.
>
>5.7	Other characters
>
>Cc	Other, Control 	Inappropriate for domain names
>Cf	Other, Format 	Inappropriate for domain names (mostly??)
>Cs	Other, Surrogate 	Inappropriate for domain names
>Co	Other, Private Use 	Inappropriate for domain names
>Cn	Other, Not Assigned	Inappropriate for domain names in this
>version
>
>Control, format, surrogate, and private use characters are inappropriate for
>use in domain names.  For this version of internationalised domain names,
>(abstract) code points that were unassigned in Unicode 3.0 are
>inappropriate.
>
>Note that the class Cf includes ZERO WIDTH NO-BREAK SPACE, which can be used
>as a “signature” when at the beginning of a string.  This use is also
>inappropriate for domain names.
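The category tables in 5.1 to 5.7 can be read as an allow-list over general categories. The sketch below makes several assumptions that are not settled by the draft: it allows HYPHEN-MINUS (as existing ASCII domain names do), it excludes Me and No (marked "probably" and "?" above), and it relies on Python's unicodedata, which follows the interpreter's Unicode version rather than 3.0. The names are hypothetical.

```python
import unicodedata

# Categories marked "Ok" (or allowed with positional limits) in clauses 5.1 to 5.3.
ALLOWED_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Nd", "Nl"}

def disallowed_chars(label: str) -> list:
    """Return the characters of one name part that fall outside the allow-list."""
    return [
        ch for ch in label
        if ch != "-" and unicodedata.category(ch) not in ALLOWED_CATEGORIES
    ]
```

An empty result means the part passes the category test; FULL STOP is assumed to have been used already to split the name into parts.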
>
>5.8	The Plane 14 suggestion
>
>The “language tag” characters, that are suggested to be allocated in plane
>14, see Unicode technical report number 7, are inappropriate for use in
>domain names.
>
>5.9	ISO/IEC TR 10176 AMD 1
>
>The technical report ISO/IEC TR 10176 (Guidelines for the preparation of
>programming language standards) in its revised (soon to be AMD 1) annex
>lists characters that at a minimum should be accepted in programming
>language identifiers.  It does so for a “level 2 implementation” of ISO/IEC
>10646.  A domain name is similar to an “identifier” in a programming
>language, so what 10176 lists in its (revised!) Annex A should at least be
>considered. 
>
>See PDAM text at http://std.dkuug.dk/jtc1/sc22/wg20/docs/n699.pdf.
>Note that this TR (as amended in what will be AMD 1) is based on Unicode
>2.1, not Unicode 3.0.  An AMD 2, etc., is promised to only extend what is in
>AMD 1.  Note also that compatibility forms are excluded from the lists in
>AMD 1, but programming languages may of course allow both compatibility
>forms and “level 2” combining marks.  Nothing is said in AMD 1 about
>normalisation.
>
>ISO/IEC TR 10176 PDAM 1 is supported by the Unicode consortium, and is their
>(and SC22/WG20's) correction to the original list.  The original list should
>be considered defective.
>
>
>6	Normalisation for domain names
>
>6.1	Case normalisation
>
>Internet domain names have been case insensitive from the start.  When
>extending the allowed characters in domain names, it would be unwise to
>either abandon case insensitiveness or restrict it to just the ASCII part.
>Instead, this principle should be extended to the new characters allowed in
>domain names.  However, there are some problems with this.  First, the case
>mappings documented by the Unicode consortium are only informative, not
>normative.  Second, there are some known exceptions, such as that for Turkish i
>and dotless i.  Third, in several more cases the case mapping is not 1 to
>1, e.g. sharp s (ß; U+00DF) maps to uppercase SS, and mapping that back to
>lowercase gives ss.  There are several other such cases. [not sure exactly
>what to do with these]
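The sharp s and dotless i problems are easy to reproduce with Python's built-in, locale-independent case mappings. This is only an illustration; the helper name is made up, and Python uses the case data of its bundled Unicode version.

```python
def lower_of_upper(text: str) -> str:
    """Uppercase, then lowercase again, i.e. tolower(toupper(x))."""
    return text.upper().lower()

print(lower_of_upper("stra\u00dfe"))  # sharp s does not survive the round trip
print("\u0131".upper())               # dotless i uppercases to plain I
print(lower_of_upper("\u0131"))       # and comes back as dotted i
```

The round trip therefore identifies "ß" with "ss" and dotless i with dotted i, which may or may not be what a Turkish registrant expects.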
>
>Unicode Technical Report number 21 [UTR21] describes one way of doing this
>[is that appropriate? Any better way of doing this?] SHARP S, YPOGEGRAMMENI,
>PROSGEGRAMMENI?  Map to lowercase? Map to uppercase? tolower(toupper(x))?
>UTR 21 (with the associated data file CaseFolding.txt) essentially
>(exactly?) implies tolower(toupper(x)) (see also below); dotless i might not
>be handled the way desired (in Turkey), nor are sigma and other letters with
>final forms.
>
>6.2	Unicode normalisation
>
>Canonical distinctions, in the Unicode sense, shall be ignored.
>Since case distinctions should be ignored, compatibility distinctions should
>most certainly be ignored too.  Compatibility distinctions can be normalised
>away with the same algorithm as canonical distinctions are normalised away.
>Normalisation form KC (compatibility decomposition, logically followed by
>canonical composition), see Unicode Technical Report number 15 [UTR15],
>should be used for domain names, at least at registration time, if not at
>lookup time.  Among a few other things, this maps WIDE, NARROW, and
>PRESENTATION FORM characters to their nominal corresponding character.
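A few examples of what form KC does, using Python's unicodedata.normalize. This is a sketch for illustration; the results reflect the Unicode version bundled with the interpreter, and the helper name nfkc is an assumption.

```python
import unicodedata

def nfkc(text: str) -> str:
    """Normalisation form KC as implemented by the standard library."""
    return unicodedata.normalize("NFKC", text)

wide = nfkc("\uff45\uff58")        # FULLWIDTH LATIN SMALL LETTER E, X
narrow = nfkc("\uff76")            # HALFWIDTH KATAKANA LETTER KA
presentation = nfkc("\ufe8d")      # ARABIC LETTER ALEF ISOLATED FORM
composed = nfkc("e\u0301")         # e + COMBINING ACUTE ACCENT
```

Here wide becomes "ex", narrow becomes U+30AB, presentation becomes U+0627, and composed becomes the single character U+00E9.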
>
>It is the character string resulting from KC normalisation to which the
>category tests above refer.
>
>Normalisation KC by itself does not imply any case normalisation.
>Note that normalise(KC, casefold(x)) is not the same as
>casefold(normalise(KC, x)), if casefold follows CaseFolding.txt.
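One character that shows the order dependence is U+210B SCRIPT CAPITAL H, which has a compatibility decomposition to "H" but no case folding of its own. The sketch below uses Python's str.casefold as a stand-in for the case folding data file; that substitution is an assumption, since Python follows its bundled Unicode version.

```python
import unicodedata

def nfkc(text: str) -> str:
    return unicodedata.normalize("NFKC", text)

x = "\u210b"                               # SCRIPT CAPITAL H

fold_then_normalise = nfkc(x.casefold())   # folding leaves x alone, KC then gives "H"
normalise_then_fold = nfkc(x).casefold()   # KC gives "H", folding then gives "h"
```

Whichever order is chosen, it has to be the same at registration time and at lookup time, or the two results will not compare equal.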
>
>
>6.3	Further normalisation
>
>FINAL SIGMA, FINAL KAF, FINAL MEM, FINAL NUN, FINAL PE, FINAL TSADI, FINAL
>SEMKATH, BOPOMOFO FINAL *? Suggestion: ignore ‘finality’, i.e., consider
>them to be equivalent to their corresponding ‘ordinary’ version.
>
>[Funny, CaseFolding.txt maps all sigmas to final(!) sigmas; but does nothing
>for other ‘final’ characters.]
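Ignoring "finality" could be done with a simple translation table applied after the other normalisation steps. The table below is a partial sketch that covers only the Greek and Hebrew letters; the Syriac and Bopomofo cases raised above are left out, and the names are made up for illustration.

```python
FINAL_TO_ORDINARY = {
    "\u03c2": "\u03c3",  # GREEK SMALL LETTER FINAL SIGMA -> SIGMA
    "\u05da": "\u05db",  # HEBREW LETTER FINAL KAF -> KAF
    "\u05dd": "\u05de",  # HEBREW LETTER FINAL MEM -> MEM
    "\u05df": "\u05e0",  # HEBREW LETTER FINAL NUN -> NUN
    "\u05e3": "\u05e4",  # HEBREW LETTER FINAL PE -> PE
    "\u05e5": "\u05e6",  # HEBREW LETTER FINAL TSADI -> TSADI
}
_TABLE = str.maketrans(FINAL_TO_ORDINARY)

def fold_finals(text: str) -> str:
    """Replace final-form letters with their ordinary counterparts."""
    return text.translate(_TABLE)
```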
>
>Map HYPHEN, NO-BREAK HYPHEN, and * DASHes to HYPHEN-MINUS? Remove * SOFT
>HYPHEN and ZWSP?
>
>“New line function” ‘normalisation’ (see UTR 13) does not apply to domain
>names, since no domain name is to have any such character in it.
>
>
>6.4	A possible alternative to normalisation: collation weighting 
>
>A possible alternative to doing KC and case normalisation is to use the ISO/IEC
>14651 CTT (common template table), or the UTR 10 associated tables, with
>some tailoring suitable for the DNS (no, NOT local ones).  In particular,
>punctuation and symbols must be significant at level 1.  Then determine
>equality up to and including level 2 (accents; similar), but not level 3
>(case; hira/kata, various compatibility distinctions).
>
>This is also based on Unicode 2.1, not yet Unicode 3.0.  Also, there is at
>present NO promise not to make changes that may affect, to some degree, use of
>the weightings that result.  In particular, for 14651 no particular weight
>VALUES are assigned.  That is up to each implementation.  For the UTR 10
>tables, the actual weight values may change at any update (or in any
>suitable way by tailoring, or other implementation decisions), so different
>versions cannot be used in a mix.  Finally, there is no resulting “normal
>form” character string from these weight tables.
>
>
>7	One should not mix scripts between FULL STOPs
>
>It is not a good idea to mix scripts freely in a single “part” of a domain
>name.  E.g., it would be very confusing if an initial A is a Greek A, while
>the rest of the name part is in the Latin script.
>
>However, what constitutes a script is not clearly defined, and some
>orthographies (like the Japanese) normally do mix “scripts” in a single
>“word”.  Therefore this must be left for human judgement.  For an automated
>service one may apply some heuristic on suggested names that may need human
>scrutiny, or reject doubtful cases for registration.  Note also that ASCII
>digits can be used with any other script, and many of the combining
>non-spacing marks are script generic, i.e. can be used with several
>different scripts.
>
>No rigid scheme should be applied for this.  It should only be a
>registration time heuristic, overrideable by human intervention.
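A registration-time heuristic of this kind might look like the sketch below. It is deliberately crude: the Python standard library has no script property, so the first word of the Unicode character name is used as a stand-in for the script, which is an assumption and not a definition of "script". ASCII digits, HYPHEN-MINUS and combining marks are skipped as script generic. All names are hypothetical.

```python
import unicodedata

def scripts_in_label(label: str) -> set:
    """Collect rough script tags for one FULL STOP separated part."""
    scripts = set()
    for ch in label:
        if ch in "0123456789-":
            continue                      # usable with any script
        if unicodedata.category(ch).startswith("M"):
            continue                      # combining marks are often script generic
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts

def needs_human_review(label: str) -> bool:
    """Flag a part that appears to mix scripts; a human can still override."""
    return len(scripts_in_label(label)) > 1
```

A part with a Greek capital alpha followed by Latin letters is flagged, but so is an ordinary Japanese name mixing kanji and katakana, which is why the result is only a prompt for human scrutiny and never an automatic rejection.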
>
>
>8	&-encoding (XML), %-encoding (URL), and =-encoding (QP)
>
>Any &-encoding used in XML (or HTML) documents in a string that contains a
>domain name shall be decoded before sending the domain name to a DNS system.
>Note that XML &-codes are character oriented and independent of the
>character encoding used for the XML document itself.
>Any %-encoding in a URL shall not be decoded in the domain name part, and %
>as such is not legal in a domain name.  Such a domain name is thus
>malformed.  The % character may mean something else though, so no attempt at
>URL %-decoding shall be done at that point.  In addition, the octet oriented
>(not character oriented) %-encoding is for an unknown character encoding,
>and any attempt at decoding it by the client is likely to be in error.
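The two rules for &-encoding and %-encoding might be combined as in the following sketch. html.unescape is the Python standard library function for decoding character references; the function name host_from_markup and the example host are made up for illustration.

```python
import html

def host_from_markup(raw: str) -> str:
    """Decode &-codes in a host name taken from XML/HTML, but never %-codes."""
    host = html.unescape(raw)             # character oriented, charset independent
    if "%" in host:
        # '%' is not legal in a domain name and is not to be URL-decoded here.
        raise ValueError("malformed domain name: contains '%'")
    return host
```

For instance, "b&#252;cher.example" decodes to a name containing U+00FC, while "b%C3%BCcher.example" is rejected rather than guessed at.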
>
>Any =-encoding in an e-mail in Quoted-Printable shall be decoded according
>to the charset declaration of the message.  Hopefully, Quoted-Printable will
>go out of use, so this should be less of a problem...
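For Quoted-Printable, decoding is a two-step affair: first undo the =-encoding to get octets, then interpret the octets in the declared charset. A minimal sketch using the standard quopri module follows; the function name and example strings are assumptions.

```python
import quopri

def decode_qp_host(raw: bytes, charset: str) -> str:
    """Undo =-encoding, then decode the octets using the message's charset."""
    octets = quopri.decodestring(raw)
    return octets.decode(charset)

from_utf8 = decode_qp_host(b"b=C3=BCcher.example", "utf-8")
from_latin1 = decode_qp_host(b"b=FCcher.example", "iso-8859-1")
```

Both calls give the same character string, which shows why the charset declaration, and not the =-codes alone, determines the resulting domain name.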
>
>
>9	E-mail address internationalisation
>
>The pre-@ part of e-mail addresses should be internationalised in the same
>way as domain names are internationalised.
>
>====================================================================
>
>
>
Bill Semich
President and Founder
.NU Domain Ltd
http://whats.nu
bill@mail.nic.nu