Re: Unicode categories, normalisation, for IDN
- To: Karlsson Kent - keka <keka@im.se>
- Subject: Re: Unicode categories, normalisation, for IDN
- From: James Seng <jseng@pobox.org.sg>
- Date: Sat, 29 Jan 2000 08:48:34 +0800
- CC: idn@ops.ietf.org
- Delivery-date: Fri, 28 Jan 2000 16:49:50 -0800
- Envelope-to: idn-data@psg.com
Good issues on using Unicode for IDN.
Most of the issues raised, such as numbers, symbols etc., are important, and I
will incorporate them into the next doc. However, let's focus on the requirement
doc and not diverge into the implementation yet. Hopefully, we will have at least
version 0 of the requirement draft before the next IETF meeting in March.
Kent, please refer to ftp://ops.ietf.org/pub/lists/idn* for the archives of the
discussions.
Thanks!
-James Seng
Karlsson Kent - keka wrote:
>
> Hi!
>
> Despite some initial trouble getting on this list,
> I've now been able to subscribe to it. Thanks Martin!
> I've tried to catch up on the e-mails so far, but I've
> only browsed quickly through them.
>
> I've been thinking a bit about how domain names should
> be internationalised, and the text below reflects my current
> thinking about this. Most of this text was written before
> browsing through the e-mail archive for this list. The
> formulations are sometimes as if it is a standards document,
> which it of course isn't.
>
> Note that I'm not a DNS expert, but I have some
> knowledge about Unicode.
>
> Kind regards
> /Kent Karlsson
>
> =========================================================
> (Converted from a proprietary document format to plain text.
> Not much touchup has been done after that.)
>
> Domain name internationalisation
>
> Draft 0.4
>
> 2000-01-26
>
> Kent Karlsson, IMI—Industri-Matematik International
> keka@im.se
>
> 1 Introduction
>
> This note is about how Internet domain names should be internationalised.
> It deals with the encoding and restrictions of domain names as sent to a DNS
> (Domain Name System) server. Domain names can of course be stored differently
> inside of documents (e.g. in XHTML documents, or e-mail messages).
>
> At present Internet domain names are still restricted to 7-bit ASCII
> (ISO/IEC 646) as sent to a DNS, with some additional rules on which such
> characters are allowed. HTML, XML, IMAP, FTP, and many other text based
> items on the Internet have already been internationalised in the sense that
> a much wider range of characters is allowed, in particular using the UTF-8
> encoding of Unicode or ISO/IEC 10646-1. It is high time for domain names to
> be similarly internationalised.
>
> That the domain name internationalisation effort should be based on
> Unicode/UTF-8 is taken as a given, as no other contender offers both global
> viability and backwards compatibility with the existing DNS.
>
> 2 Unicode vs. ISO/IEC 10646
>
> Unicode 3.0 and ISO/IEC 10646-1:2000 allocate the same characters at the
> same (abstract) code positions. They both define a UTF-8 encoding format,
> with a slight difference (see below). They also both define a UTF-16
> format, but that format is not suitable for domain names as sent to a DNS
> server, taking backwards compatibility into account.
>
> Unicode (but not ISO/IEC 10646) assigns property codes to characters. For
> the purposes of this version of domain name internationalisation, both the
> normative and informative general category property assignments of Unicode
> 3.0.0 are considered normative.
>
> 3 Unicode versioning
>
> This version of domain name internationalisation is made with Unicode 3.0 as
> a basis. When new versions of Unicode are issued, one may need to
> re-examine the domain name internationalisation. Most likely, Unicode 3.0
> will be sufficient for domain name use.
>
> 4 UTF-8 encoding
>
> The Unicode UTF-8 format is limited to the first 17 planes, while the
> ISO/IEC 10646 UTF-8 covers 32 768 planes. For the purposes of this version
> of domain name internationalisation, UTF-8 is limited to plane 0 (the Basic
> Multilingual Plane) only.
>
> The details of the UTF-8 encoding are not described here. Please see
> ISO/IEC 10646-1:2000, Annex D, or The Unicode Standard, version 3.0, annex
> ?, or RFC 2279 (which obsoletes RFC 2044).
>
> UTF-8 is compatible with 7-bit ASCII, i.e. a 7-bit ASCII string where each
> octet has the 8th bit set to 0 is in UTF-8 already.
>
> 4.1 Malformed UTF-8 encodings
>
> Looked-up potential domain names that contain malformed UTF-8 sequences
> shall be rejected by a DNS as unregistered or, optionally, as being in
> error.
> · An octet with the value FE or FF is a malformed UTF-8 sequence.
> · An isolated continuation octet is a malformed UTF-8 sequence.
> · A prematurely terminated UTF-8 sequence is a malformed UTF-8
> sequence.
> · An unnecessarily long (for the abstract code point encoded) UTF-8
> sequence is a malformed UTF-8 sequence.
> · UTF-8 sequences for the (abstract) code points FFFE and FFFF are
> malformed UTF-8 sequences.
> · A UTF-8 sequence longer than three octets is considered malformed
> for the purposes of this version of domain name internationalisation.
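
The checks listed in 4.1 can be sketched as a small hand-written decoder. This is only an illustration under stated assumptions: the function names `decode_bmp_utf8` and `is_well_formed` are hypothetical, the error messages are invented for the sketch, and the surrogate and private use exclusions of 4.2 and 4.3 are deliberately not covered here.

```python
def decode_bmp_utf8(octets):
    """Decode UTF-8 octets to a list of code points.

    Raises ValueError for the malformed cases listed in 4.1. Sequences are
    limited to three octets, i.e. to plane 0 (hypothetical helper).
    """
    code_points = []
    i = 0
    while i < len(octets):
        lead = octets[i]
        if lead < 0x80:                       # single octet, 7-bit ASCII
            length, cp, minimum = 1, lead, 0x0000
        elif lead <= 0xBF:                    # 80..BF outside a sequence
            raise ValueError("isolated continuation octet")
        elif lead <= 0xDF:                    # C0..DF: two-octet sequence
            length, cp, minimum = 2, lead & 0x1F, 0x0080
        elif lead <= 0xEF:                    # E0..EF: three-octet sequence
            length, cp, minimum = 3, lead & 0x0F, 0x0800
        elif lead in (0xFE, 0xFF):
            raise ValueError("octet FE or FF")
        else:                                 # F0..FD: four or more octets
            raise ValueError("sequence longer than three octets")
        tail = octets[i + 1:i + length]
        if len(tail) != length - 1 or any((b & 0xC0) != 0x80 for b in tail):
            raise ValueError("prematurely terminated sequence")
        for b in tail:
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:                      # e.g. C0 80 for code point 0
            raise ValueError("unnecessarily long sequence")
        if cp in (0xFFFE, 0xFFFF):
            raise ValueError("code point FFFE or FFFF")
        code_points.append(cp)
        i += length
    return code_points


def is_well_formed(octets):
    """Return True if the octets pass all of the checks above."""
    try:
        decode_bmp_utf8(octets)
    except ValueError:
        return False
    return True
```

A server following the text above would treat a name for which `is_well_formed` returns False as unregistered or, optionally, as an error.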
>
> 4.2 Surrogates
>
> Surrogate character codes are reserved for use with UTF-16. These are the
> code points D800 – DFFF. A UTF-8 sequence for a surrogate character code is
> a malformed UTF-8 sequence.
>
> 4.3 Private use characters
>
> Unicode reserves some code points for private use characters. In plane 0
> (BMP) these are U+E000 – U+F8FF. These are intended for use only by user
> agreement of some kind.
>
> Private use characters are inappropriate for use in domain names. A UTF-8
> sequence for a private use character code is considered a malformed UTF-8
> sequence for the purposes of this version of domain name
> internationalisation.
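
A minimal sketch of the two range exclusions in 4.2 and 4.3, assuming the full surrogate range U+D800 to U+DFFF (high and low halves) and the plane 0 private use range U+E000 to U+F8FF. The name `has_excluded_code_point` is hypothetical. Python's `surrogatepass` error handler is used only so that an encoded surrogate can be inspected instead of raising during decoding.

```python
SURROGATES = range(0xD800, 0xE000)        # U+D800 .. U+DFFF
PRIVATE_USE_BMP = range(0xE000, 0xF900)   # U+E000 .. U+F8FF


def has_excluded_code_point(octets):
    """Return True if the UTF-8 octets encode a surrogate or a plane 0
    private use code point (hypothetical helper)."""
    text = octets.decode("utf-8", "surrogatepass")
    return any(ord(ch) in SURROGATES or ord(ch) in PRIVATE_USE_BMP
               for ch in text)
```

For example, the octets ED A0 80 (an encoded U+D800) and EE 80 80 (U+E000) are both flagged, while plain ASCII names are not.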
>
> 5 Unicode general categories
>
> Unicode assigns general categories (as well as other character properties)
> to characters. The Unicode 3.0 general categories and their interpretation
> for domain names are discussed in the following sections.
>
> Unicode regards some of these properties as normative, some as informative.
> For this version of internationalised domain names, all of them are
> considered normative.
>
> 5.1 Letters, ideographs, and syllable characters
>
> Lu  Letter, Uppercase    Ok for domain names
> Ll  Letter, Lowercase    Ok for domain names
> Lt  Letter, Titlecase    Ok for domain names
> Lm  Letter, Modifier     Ok for domain names
> Lo  Letter, Other        Ok for domain names
>
> All of the letters, ideographs, and syllable characters of Unicode 3.0 are
> appropriate for use in domain names. Note however that a difference in
> letter characters need not imply a difference in domain name. Canonical,
> compatibility, and case distinctions are to be ignored. Case distinctions
> have been ignored in domain names since the beginning. Since case is ignored,
> so should the less important compatibility distinctions be. See also clause 6
> below about normalisation.
>
> 5.2 Combining marks
>
> Mn  Mark, Non-Spacing        Must not be first, nor after a FULL STOP
>                              (not the LEFT/RIGHT half ones)
> Mc  Mark, Spacing Combining  Must not be first, nor after a FULL STOP
> Me  Mark, Enclosing          Probably inappropriate for domain names
>
> Used with reason and in moderation, combining marks are ok for use with
> domain names. Note however that character sequence distinctions that are
> equivalenced by Unicode canonical equivalence do not imply a difference in
> domain name. See also the clause about normalisation below.
>
> There are a number of script specific rules on how combining characters
> should be applied. For the purposes of domain names, we note that they are
> not to come first in any (FULL STOP separated) part of a domain name. See
> also clause 6 below about normalisation, and clause 7 below about scripts.
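
The positional rule for combining marks could be checked roughly as follows. This is a sketch, not a definitive test: `labels_with_leading_mark` is a hypothetical name, enclosing marks (Me) are treated the same as Mn and Mc for simplicity, and Python's `unicodedata` module reflects a newer Unicode version than 3.0.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")


def labels_with_leading_mark(name):
    """Return the FULL STOP separated parts of a name that start with a
    combining mark (hypothetical helper)."""
    return [label for label in name.split(".")
            if label and unicodedata.category(label[0]) in MARK_CATEGORIES]
```

An empty result means no part of the name begins with a mark; any returned label would be rejected.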
>
> 5.3 Numbers
>
> Nd  Number, Decimal Digit  Ok for domain names
> Nl  Number, Letter         Ok for domain names
> No  Number, Other          Inappropriate for domain names? (comp. decomp.)
>
> Many “number” characters are ok for use with domain names. Note however
> that many number characters have compatibility decompositions into
> letters, ideographs, or other number characters, and so are equivalent to
> those in a domain name. [The “No” characters that do not have a
> decomposition??]
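
To make the compatibility decomposition point concrete, the following sketch reports the general category of a number character and what normalisation form KC turns it into. The helper name `describe_number` is hypothetical, and the data comes from the Unicode version bundled with Python, not Unicode 3.0.

```python
import unicodedata


def describe_number(ch):
    """Return (general category, NFKC form) for one character."""
    return unicodedata.category(ch), unicodedata.normalize("NFKC", ch)


# SUPERSCRIPT TWO, VULGAR FRACTION ONE HALF, CIRCLED DIGIT ONE,
# ROMAN NUMERAL EIGHT, ARABIC-INDIC DIGIT THREE
for ch in ["\u00b2", "\u00bd", "\u2460", "\u2167", "\u0663"]:
    category, folded = describe_number(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {category} -> {folded!r}")
```

SUPERSCRIPT TWO, for instance, is category No and becomes the ordinary digit 2, so it cannot be distinguished from 2 once names are normalised.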
>
> 5.4 Punctuation
>
> Pc  Punctuation, Connector      Inappropriate for domain names (possibly
>                                 with some exceptions, like KATAKANA MIDDLE DOT)
> Pd  Punctuation, Dash           Inappropriate for domain names, except for a
>                                 few characters (see below).
> Ps  Punctuation, Open           Inappropriate for domain names
> Pe  Punctuation, Close          Inappropriate for domain names
> Pi  Punctuation, Initial quote  Inappropriate for domain names
> Pf  Punctuation, Final quote    Inappropriate for domain names
> Po  Punctuation, Other          Inappropriate for domain names, except for a
>                                 few characters (see below).
>
> Domain name rules have always excluded punctuation characters, except for
> FULL STOP, which is given special significance within domain names. MIDDLE
> DOT and HYPHEN (or HYPHEN-MINUS) may need to be considered to be allowed.
>
> Punctuation has been excluded from domain names proper, since some (not all)
> punctuation characters in 7-bit ASCII have been used for other purposes near
> domain names. E.g. @, !, /, :, and % have special meanings near domain
> names in many contexts. Other punctuation is reserved for present or
> possible future use near domain names.
>
> BiDi and FULL STOPs (and @s)??
>
> 5.5 Symbols
>
> Sm  Symbol, Math      Inappropriate for domain names
> Sc  Symbol, Currency  Inappropriate for domain names
> Sk  Symbol, Modifier  Inappropriate for domain names?
> So  Symbol, Other     Inappropriate for domain names (comp. decomp.?)
>
> As is the case for punctuation, symbols are inappropriate for use with domain
> names.
>
> 5.6 Separators
>
> Zs  Separator, Space      Inappropriate for domain names
> Zl  Separator, Line       Inappropriate for domain names
> Zp  Separator, Paragraph  Inappropriate for domain names
>
> Spaces and similar separators (like LINE FEED) have always been considered
> inappropriate for use in domain names. Unicode has many more different
> space characters than ASCII, and it also has new line/paragraph separation
> characters.
>
> 5.7 Other characters
>
> Cc  Other, Control       Inappropriate for domain names
> Cf  Other, Format        Inappropriate for domain names (mostly??)
> Cs  Other, Surrogate     Inappropriate for domain names
> Co  Other, Private Use   Inappropriate for domain names
> Cn  Other, Not Assigned  Inappropriate for domain names in this
>                          version
>
> Control, format, surrogate, and private use characters are inappropriate for
> use in domain names. For this version of internationalised domain names,
> (abstract) code points that were unassigned in Unicode 3.0 are
> inappropriate.
>
> Note that the class Cf includes ZERO WIDTH NO-BREAK SPACE, which can be used
> as a “signature” when at the beginning of a string. This use is also
> inappropriate for domain names.
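
The category tables in clause 5 could be turned into a simple filter along these lines. This is a sketch under assumptions: the allowed set takes the rows marked "Ok" plus Mn and Mc, HYPHEN-MINUS is admitted as the one punctuation exception, the open questions (No, Sk, Cf, MIDDLE DOT) are resolved as "not allowed", and `label_problems` is a hypothetical name. Python's `unicodedata` uses a newer Unicode version than 3.0, so some assignments may differ from those the draft had in mind.

```python
import unicodedata

ALLOWED_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Nd", "Nl"}
EXTRA_ALLOWED = {"-"}  # HYPHEN-MINUS, assumed to stay legal


def label_problems(label):
    """Return (index, code point, category) for each character of one
    name part that falls outside the allowed categories."""
    problems = []
    for index, ch in enumerate(label):
        if ch in EXTRA_ALLOWED:
            continue
        category = unicodedata.category(ch)
        if category not in ALLOWED_CATEGORIES:
            problems.append((index, f"U+{ord(ch):04X}", category))
    return problems
```

An empty list means every character passed. The sketch checks one FULL STOP separated part at a time and does not normalise its input first.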
>
> 5.8 The Plane 14 suggestion
>
> The “language tag” characters, that are suggested to be allocated in plane
> 14, see Unicode technical report number 7, are inappropriate for use in
> domain names.
>
> 5.9 ISO/IEC TR 10176 AMD 1
>
> The technical report ISO/IEC TR 10176 (Guidelines for the preparation of
> programming language standards) in its revised (soon to be AMD 1) annex
> lists characters that at a minimum should be accepted in programming
> language identifiers. It does so for a “level 2 implementation” of ISO/IEC
> 10646. A domain name is similar to an “identifier” in a programming
> language, so what 10176 lists in its (revised!) Annex A should at least be
> considered.
>
> See PDAM text at http://std.dkuug.dk/jtc1/sc22/wg20/docs/n699.pdf.
> Note that this TR (as amended in what will be AMD 1) is based on Unicode
> 2.1, not Unicode 3.0. An AMD 2, etc., is promised to only extend what is in
> AMD 1. Note also that compatibility forms are excluded from the lists in
> AMD 1, but programming languages may of course allow both compatibility
> forms and “level 2” combining marks. Nothing is said in AMD 1 about
> normalisation.
>
> ISO/IEC TR 10176 PDAM 1 is supported by the Unicode consortium, and is their
> (and SC22/WG20's) correction to the original list. The original list should
> be considered defective.
>
> 6 Normalisation for domain names
>
> 6.1 Case normalisation
>
> Internet domain names have been case insensitive from the start. When
> extending the allowed characters in domain names, it would be unwise to
> either abandon case insensitivity or restrict it to just the ASCII part.
> Instead, this principle should be extended to the new characters allowed in
> domain names. However, there are some problems with this. First, the case
> mappings documented by the Unicode consortium are only informative, not
> normative. Second, there are some known exceptions, such as that for Turkish
> i and dotless i. Third, in several cases the case mapping is not 1 to 1,
> e.g. sharp s (ß; U+00DF) maps to uppercase SS, and mapping that back to
> lowercase gives ss. There are several other such cases. [not sure exactly
> what to do with these]
>
> Unicode Technical Report number 21 [UTR21] describes one way of doing this
> [is that appropriate? Any better way of doing this?] SHARP S, YPOGEGRAMMENI,
> PROSGEGRAMMENI? Map to lowercase? Map to uppercase? tolower(toupper(x))?
> UTR 21 (with the associated data file CaseFolding.txt) essentially
> (exactly?) implies tolower(toupper(x)) (see also below); dotless i might not
> be handled the way desired (in Turkey), nor might sigma and other letters
> with final forms.
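
The problem cases named above can be observed with Python's built-in case mappings. This is an illustration only: Python applies the full, locale-independent mappings of a recent Unicode version, which are not necessarily those of UTR 21 as it stood for Unicode 3.0, and `lower_of_upper` is a hypothetical helper name.

```python
def lower_of_upper(text):
    """tolower(toupper(x)) using Python's full case mappings."""
    return text.upper().lower()


samples = {
    "sharp s": "\u00df",
    "dotless i": "\u0131",
    "final sigma": "\u03c2",
}
for label, ch in samples.items():
    print(label, repr(ch), "->", repr(lower_of_upper(ch)),
          "| casefold:", repr(ch.casefold()))
```

Sharp s comes back as ss, dotless i comes back as an ordinary dotted i (the Turkish distinction is lost), and a lone final sigma comes back as an ordinary sigma.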
>
> 6.2 Unicode normalisation
>
> Canonical distinctions, in the Unicode sense, shall be ignored.
> Since case distinctions should be ignored, compatibility distinctions should
> most certainly be ignored too. Compatibility distinctions can be normalised
> away with the same algorithm as canonical distinctions are normalised away.
> Normalisation form KC (compatibility decomposition, logically followed by
> canonical composition), see Unicode Technical Report number 15 [UTR15],
> should be used for domain names, at least at registration time, if not at
> lookup time. Among a few other things, this maps WIDE, NARROW, and
> PRESENTATION FORM characters to their nominal corresponding character.
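
A short sketch of what normalisation form KC does to the kinds of characters mentioned above, using Python's `unicodedata.normalize`. The wrapper name `nfkc` is only for illustration, and the data is from the Unicode version shipped with Python.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode normalisation form KC."""
    return unicodedata.normalize("NFKC", text)


examples = [
    "\uff21\uff22\uff23",  # FULLWIDTH LATIN CAPITAL LETTERS A, B, C
    "\uff76",              # HALFWIDTH KATAKANA LETTER KA
    "\ufb01",              # LATIN SMALL LIGATURE FI (a presentation form)
    "\ufe8d",              # ARABIC LETTER ALEF ISOLATED FORM
    "e\u0301",             # e followed by COMBINING ACUTE ACCENT
]
for text in examples:
    print(repr(text), "->", repr(nfkc(text)))
```

The last example shows the canonical part of the algorithm: the two-character sequence is composed into the single character U+00E9.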
>
> It is the character string resulting from KC normalisation that the
> category tests above refer to.
>
> Normalisation KC by itself does not imply any case normalisation.
> Note that normalise(KC, casefold(x)) is not the same as
> casefold(normalise(KC, x)), if casefold follows CaseFolding.txt.
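
The order dependence can be demonstrated with a character that has a compatibility decomposition to an uppercase letter but no case folding of its own. The sketch assumes Python's `str.casefold` as a stand-in for CaseFolding.txt (it follows a recent Unicode version); both function names are hypothetical.

```python
import unicodedata


def nfkc_then_fold(x):
    """casefold(normalise(KC, x))"""
    return unicodedata.normalize("NFKC", x).casefold()


def fold_then_nfkc(x):
    """normalise(KC, casefold(x))"""
    return unicodedata.normalize("NFKC", x.casefold())


# DOUBLE-STRUCK CAPITAL C and SQUARE MHZ
for ch in ["\u2102", "\u3392"]:
    print(f"U+{ord(ch):04X}", repr(nfkc_then_fold(ch)), repr(fold_then_nfkc(ch)))
```

For U+2102 the first order gives c while the second gives C, because case folding leaves the original character untouched and KC normalisation then introduces an uppercase letter that is never folded.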
>
> 6.3 Further normalisation
>
> FINAL SIGMA, FINAL KAF, FINAL MEM, FINAL NUN, FINAL PE, FINAL TSADI, FINAL
> SEMKATH, BOPOMOFO FINAL *? Suggestion: ignore ‘finality’, i.e., consider
> them to be equivalent to their corresponding ‘ordinary’ version.
>
> [Funny, CaseFolding.txt maps all sigmas to final(!) sigmas; but does nothing
> for other ‘final’ characters.]
>
> Map HYPHEN, NO-BREAK HYPHEN, and * DASHes to HYPHEN-MINUS? Remove * SOFT
> HYPHEN and ZWSP?
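
One way the suggestions above might look in practice, as a sketch only. The assumptions are considerable: only the Greek and Hebrew final forms are covered (Syriac FINAL SEMKATH and the Bopomofo finals are left out), "dashes" is read as every character of general category Pd, SOFT HYPHEN and ZERO WIDTH SPACE are simply deleted, and `fold_further` is a hypothetical name. The open questions in the text remain open.

```python
import unicodedata

# (final form, ordinary form) pairs, looked up by character name
_FINAL_PAIRS = [
    ("GREEK SMALL LETTER FINAL SIGMA", "GREEK SMALL LETTER SIGMA"),
    ("HEBREW LETTER FINAL KAF", "HEBREW LETTER KAF"),
    ("HEBREW LETTER FINAL MEM", "HEBREW LETTER MEM"),
    ("HEBREW LETTER FINAL NUN", "HEBREW LETTER NUN"),
    ("HEBREW LETTER FINAL PE", "HEBREW LETTER PE"),
    ("HEBREW LETTER FINAL TSADI", "HEBREW LETTER TSADI"),
]
_TABLE = {ord(unicodedata.lookup(final)): unicodedata.lookup(ordinary)
          for final, ordinary in _FINAL_PAIRS}
_TABLE[0x00AD] = None  # SOFT HYPHEN: removed
_TABLE[0x200B] = None  # ZERO WIDTH SPACE: removed


def fold_further(text):
    """Ignore 'finality', drop SOFT HYPHEN and ZWSP, and map every
    dash punctuation character (category Pd) to HYPHEN-MINUS."""
    folded = text.translate(_TABLE)
    return "".join("-" if unicodedata.category(ch) == "Pd" else ch
                   for ch in folded)
```

With this mapping a Greek word ending in final sigma and the same word spelled with an ordinary sigma compare equal, as do names that differ only in which hyphen or dash character was typed.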
>
> “New line function” ‘normalisation’ (see UTR 13) does not apply to domain
> names, since no domain name is to have any such character in it.
>
> 6.4 A possible alternative to normalisation: collation weighting
>
> A possible alternative to doing KC and case normalisation is to use the ISO/IEC
> 14651 CTT (common template table), or the UTR 10 associated tables, with
> some tailoring suitable for the DNS (no, NOT local ones). In particular,
> punctuation and symbols must be significant at level 1. Then determine
> equality up to and including level 2 (accents; similar), but not level 3
> (case; hira/kata, various compatibility distinctions).
>
> This is also based on Unicode 2.1, not yet Unicode 3.0. Also, there is at
> present NO promise not to make changes that may affect, to some degree, use
> of the weightings that result. In particular, for 14651 no particular weight
> VALUES are assigned; that is up to each implementation. For the UTR 10
> tables, the actual weight values may change at any update (or in any
> suitable way by tailoring, or other implementation decisions), so different
> versions cannot be used in a mix. Finally, there is no resulting “normal
> form” character string from these weight tables.
>
> 7 One should not mix scripts between FULL STOPs
>
> It is not a good idea to mix scripts freely in a single “part” of a domain
> name. E.g., it would be very confusing if an initial A is a Greek A, while
> the rest of the name part is in the Latin script.
>
> However, what constitutes a script is not clearly defined, and some
> orthographies (like the Japanese) normally do mix “scripts” in a single
> “word”. Therefore this must be left for human judgement. For an automated
> service one may apply some heuristic on suggested names that may need human
> scrutiny, or reject doubtful cases for registration. Note also that ASCII
> digits can be used with any other script, and many of the combining
> non-spacing marks are script generic, i.e. can be used with several
> different scripts.
>
> No rigid scheme should be applied for this. It should only be a
> registration time heuristic, overrideable by human intervention.
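
A registration time heuristic of the kind described could start from something as crude as the following. It is a rough sketch, not a script detector: Python's standard library does not expose the Unicode script property, so the first word of each character name is used as a stand-in, and both function names are hypothetical. Digits, combining marks, and HYPHEN-MINUS are ignored, as the text suggests.

```python
import unicodedata

IGNORED_HINTS = {"DIGIT", "COMBINING", "HYPHEN-MINUS"}


def script_hints(label):
    """Return the set of leading name words (e.g. LATIN, GREEK, CJK)
    found in one FULL STOP separated part of a name."""
    hints = set()
    for ch in label:
        first_word = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
        if first_word not in IGNORED_HINTS:
            hints.add(first_word)
    return hints


def parts_needing_review(name):
    """Return the name parts that mix more than one script hint."""
    return [label for label in name.split(".")
            if len(script_hints(label)) > 1]
```

A part that starts with a Greek capital alpha followed by Latin letters is flagged, but so is ordinary Japanese text mixing kanji and hiragana, which is why the result should only queue a name for human scrutiny and never reject it outright.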
>
> 8 &-encoding (XML), %-encoding (URL), and =-encoding (QP)
>
> Any &-encoding used in XML (or HTML) documents in a string that contains a
> domain name shall be decoded before sending the domain name to a DNS system.
> Note that XML &-codes are character oriented and independent of the
> character encoding used for the XML document itself.
>
> Any %-encoding in a URL shall not be decoded in the domain name part, and %
> as such is not legal in a domain name. Such a domain name is thus
> malformed. The % character may mean something else though, so no attempt at
> URL %-decoding shall be done at that point. In addition, the octet oriented
> (not character oriented) %-encoding is for an unknown character encoding,
> and any attempt at decoding it by the client is likely to be in error.
>
> Any =-encoding in an e-mail in Quoted-Printable shall be decoded according
> to the charset declaration of the message. Hopefully, Quoted-Printable will
> go out of use, so this should be less of a problem...
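
The three cases in this clause can be sketched with Python's standard library. The helper names are hypothetical, `html.unescape` is used as a stand-in for an XML or HTML parser's own reference decoding, and the charset passed to the Quoted-Printable helper is assumed to come from the message's declaration.

```python
import html
import quopri


def host_from_markup(text):
    """Decode &-encoded character references before DNS lookup."""
    return html.unescape(text)


def host_is_malformed_by_percent(host):
    """A % in the domain name part is not decoded; the name is malformed."""
    return "%" in host


def host_from_quoted_printable(raw, charset):
    """Decode =-encoding, then interpret the octets in the declared charset."""
    return quopri.decodestring(raw).decode(charset)
```

Note the asymmetry the text points out: character references name characters directly, while =-encoded octets mean nothing until the declared charset is applied.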
>
> 9 E-mail address internationalisation
>
> The pre-@ part of e-mail addresses should be internationalised in the same
> way as domain names are internationalised.
>
> ====================================================================