[idn] WG last call summary
Hi,
The authors of the idna/nameprep/stringprep/punycode documents that were
subject to the wg last call, have produced the following summary of
comments that were raised during the wg last call period. New versions of
drafts have been sent that should follow this summary. The new versions of
the documents are available in the internet-drafts directory. These
documents are:
- draft-ietf-idn-idna-07.txt
- draft-ietf-idn-nameprep-08.txt
- draft-ietf-idn-punycode-01.txt
- draft-hoffman-stringprep-01.txt
We will be sending the documents for IESG consideration for Proposed
Standard on March 11th 2002.
James and Marc, co-chairs.
====================================================================
Summary of the idn WG last call on idna/nameprep/stringprep/punycode
Collected by the authors.
Note from the chairs:
- "editorial changes" are in the first part of this summary.
- The last part of the summary discusses the major technical changes.
The chairs have concluded that there is working group consensus
not to incorporate those major technical changes in the new versions of
documents.
- new versions of documents should include the changes in this summary.
==========
IDNA
==========
Throughout: Check carefully for host->domain. Also change
"hostname" to "host name".
Abstract: Add "IDNA is only meant for processing domain names, not free
text."
1: Remove last paragraph about mailing list.
1.1: Add:
1.1 Interaction of protocol parts
IDNA requires that implementations process input strings with Nameprep
[NAMEPREP], which is a profile of Stringprep [STRINGPREP], and then with
Punycode [PUNYCODE]. Implementations of IDNA MUST fully implement
Nameprep and Punycode; neither Nameprep nor Punycode are optional.
2: Replace the three paragraphs that start with "A label is an
individual...", "An "internationalized domain name" (IDN) is...", and
"An internationalized label contains..." with the following four
paragraphs.
[STD13] talks about "domain names" and "host names", but many people use
the terms interchangeably. Further, because [STD13] was not terribly
clear, many people who are sure they know the exact definitions of each
of these terms disagree on the definitions.
A label is an individual part of a domain name. Labels are usually shown
separated by dots; for example, the domain name "www.example.com" is
composed of three labels: "www", "example", and "com". (The zero-length
root label that is implied in domain names, as described in [STD13], is
not considered a label in this specification.) Throughout this document
the term "label" is shorthand for "text label", and "every label" means
"every text label". In IDNA, not all text strings can be labels.
An "internationalized domain name" (IDN) is a domain name for which the
ToASCII operation (see section 4) can be applied to each label without
failing. This document does not attempt to define an "internationalized
host name". It is expected that protocols and name-handling bodies will
want to limit the characters allowed in IDNs further than what is
specified in this document, such as to prohibit additional characters
that they feel are unneeded or harmful in registered domain names.
An "internationalized label" is a label composed of characters from the
Unicode character set; note, however, that not every string of Unicode
characters can be an internationalized label. To allow internationalized
labels to be handled by existing applications, IDNA uses an "ACE label"
(ACE stands for ASCII Compatible Encoding), which can be represented
using only ASCII characters but is equivalent to a label containing
non-ASCII characters. More rigorously, an ACE label is defined to be any
label that the ToUnicode operation would alter (see section 4.2). For
every internationalized label that cannot be directly represented in
ASCII, there is an equivalent ACE label. The conversion of labels to and
from the ACE form is specified in section 4.
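As an illustration of the label terminology above, here is a minimal sketch (not taken from the drafts) that splits a domain name into its text labels and leaves out the zero-length root label. The function name split_labels is hypothetical, and the sketch assumes that U+002E (full stop) is the only separator.

```python
from typing import List


def split_labels(domain_name: str) -> List[str]:
    """Split a dotted domain name into text labels.

    Assumption: only "." (U+002E) separates labels. A trailing dot
    stands for the zero-length root label, which is not counted.
    """
    labels = domain_name.split(".")
    if labels and labels[-1] == "":
        labels.pop()  # drop the implied root label
    return labels


print(split_labels("www.example.com"))   # three text labels
print(split_labels("www.example.com."))  # same labels, root label dropped
```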
3: At the beginning of section 3, change "rules" to "requirements" for
consistency.
3: In requirement 1, insert "(see section 2)" after "generic domain name
slot".
3: Change (2) to:
2) ACE labels obtained from domain name slots SHOULD be hidden from
users except when the use of the non-ASCII form would cause problems or
when the ACE form is explicitly requested. Given an internationalized
domain name, an equivalent domain name containing no ACE labels can be
obtained by applying the ToUnicode operation (see section 4) to each
label. When requirements 1 and 2 both apply, requirement 1 takes
precedence.
4.1: In the first paragraph, change the second sentence to:
The original sequence and the resulting sequence are equivalent labels.
(If the original is an internationalized label that cannot be directly
represented in ASCII, the result will be the equivalent ACE label.)
4.1: In the second paragraph, change the second sentence from
"Failure means that the original sequence cannot be used as a label in
an IDN." to "If any step fails, the original sequence MUST NOT be used
as a label in an IDN."
4.1: After the second paragraph, add the following paragraph: The inputs
to ToASCII are a sequence of code points; a flag indicating whether to
prohibit unassigned code points (see [STRINGPREP]); and a flag
indicating whether to apply the host name syntax rules. The output of
ToASCII is either a sequence of ASCII code points or a failure
condition.
4.1: In step 2, add "and fail if there is an error".
4.1: In step 3, change the "*" at the beginning of each substep to "(a)"
and "(b)" to make them easier to refer to.
4.2: After the second paragraph, add the following paragraph: The inputs
to ToUnicode are a sequence of code points; a flag indicating whether to
prohibit unassigned code points (see [STRINGPREP]); and a flag
indicating whether to apply the host name syntax rules. The output of
ToUnicode is always a sequence of Unicode code points.
4.2: In step 2, add "and fail if there is an error".
5: Replace the section with:
[[ Note to the IESG and Internet Draft readers: The two uses of the
string "IESG--" below are to be changed at time of publication to a
prefix which fulfills the requirements in the first paragraph. ]]
The ACE prefix, used in the conversion operations (section 4), is two
alphanumeric ASCII characters followed by two hyphen-minuses. It cannot
be any of the prefixes already used in earlier documents, which includes
the following: "bl--", "bq--", "dq--", "lq--", "mq--", "ra--", "wq--"
and "zq--". The ToASCII and ToUnicode operations MUST recognize the ACE
prefix in a case-insensitive manner.
The ACE prefix for IDNA is "IESG--".
This means that an ACE label might be "IESG--de-jg4avhby1noc0d", where
"de-jg4avhby1noc0d" is the part of the ACE label that is generated by
the encoding steps in [PUNYCODE].
5: Change the example in the second paragraph to "de-jg4avhby1noc0d"
because the current example isn't valid Punycode.
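The case-insensitive prefix requirement in the new section 5 text can be sketched as follows. This is only an illustration: the helper names are hypothetical, "IESG--" is the placeholder prefix from the draft (to be replaced at publication), and only the ASCII letters A-Z are folded so that non-ASCII characters can never match the prefix by accident.

```python
from typing import Optional

ACE_PREFIX = "IESG--"  # placeholder from the draft, not the final prefix


def _ascii_lower(text: str) -> str:
    """Lowercase A-Z only; every other character is left untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def has_ace_prefix(label: str, prefix: str = ACE_PREFIX) -> bool:
    """Return True if the label starts with the prefix, ignoring ASCII case."""
    return _ascii_lower(label[:len(prefix)]) == _ascii_lower(prefix)


def strip_ace_prefix(label: str, prefix: str = ACE_PREFIX) -> Optional[str]:
    """Return the part after the prefix, or None if the prefix is absent."""
    if not has_ace_prefix(label, prefix):
        return None
    return label[len(prefix):]


print(has_ace_prefix("iesg--de-jg4avhby1noc0d"))
print(strip_ace_prefix("IESG--de-jg4avhby1noc0d"))
```

The remainder returned by strip_ace_prefix is what would be handed to the Punycode decoding step.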
6.1: After the first sentence of the second paragraph, add "ACE labels
that are displayed or input MUST always include the ACE prefix."
6.2: At the end of the second paragraph, add "ACE labels always include
the ACE prefix."
6.4: In the first sentence of the second paragraph, change "in ACE format"
to "in ACE format (which always includes the ACE prefix)".
6.7: Add new section.
6.7 Limitations of IDNA
The IDNA protocol does not solve all linguistic issues with users
inputting names in different scripts. Many important language-based and
script-based mappings are not covered in IDNA and must be handled
outside the protocol. For example, names that are entered in a mix of
traditional and simplified Chinese characters will not be mapped to a
single canonical name. Another example is Scandinavian names that are
entered with U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS) will not be
mapped to U+00F8 (LATIN SMALL LETTER O WITH STROKE).
7: Change reference in last paragraph to [STRINGPREP].
8: Change this section to:
IDNs are likely to be somewhat longer than current host names, so the
bandwidth needed by the root servers should go up by a small amount.
Also, queries and responses for IDNs will probably be somewhat longer
than typical queries today, so more queries and responses may be forced
to go to TCP instead of UDP.
9: Change the first sentence from "Much of the security of the Internet
relies" to "Security on the Internet partly relies".
A: Add reference for [STRINGPREP]. Make URL references parallel.
B: Remove this appendix.
==========
Stringprep
==========
Throughout: Clean up capitalization of MUST and SHOULD.
1: Change last sentence in the last paragraph to read:
Language-specific equivalences such as "Aepfel" vs. "<U+00C4>pfel",
which are sometimes considered equivalent in German, may not be
considered equivalent in other languages.
1: Remove "Inputs to stringprep are always specified in network byte
order (big-endian)."
1.2 and 1.3: Renumber to 1.1 and 1.2.
1.2: Clarified the second paragraph for profiles that want a wide range
of characters.
1.4: Remove this section (about the mailing list).
2: Fix off-by-one errors in section references.
3: In the second paragraph, change "database should used to" to
"database should be used to".
3: Added a new second paragraph:
Mapped characters are not re-scanned during the mapping step. That is,
if character A at position X is mapped to character B, character B which
is now at position X is not checked against the mapping table.
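The "no re-scanning" rule in the paragraph above can be shown with a small sketch. The mapping table here is a toy table invented for the example (A maps to B, B maps to C); it is not one of the Stringprep tables, and map_once is a hypothetical name.

```python
from typing import Dict, List


def map_once(code_points: List[int], table: Dict[int, List[int]]) -> List[int]:
    """Apply the mapping table in a single pass.

    Each input code point is looked up exactly once. The replacement
    code points are copied to the output without being looked up again.
    """
    out: List[int] = []
    for cp in code_points:
        out.extend(table.get(cp, [cp]))
    return out


# Toy table (assumption for illustration): A -> B and B -> C.
TOY_TABLE = {ord("A"): [ord("B")], ord("B"): [ord("C")]}

mapped = map_once([ord("A"), ord("B")], TOY_TABLE)
print("".join(chr(cp) for cp in mapped))  # A becomes B (not C); B becomes C
```

A mapping loop that repeated until nothing changed would instead turn both characters into C, which is the behavior the new paragraph rules out.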
3: Add a paragraph at the end:
The IETF is relying on Unicode not to change the case-mapping of
currently-assigned characters in future versions of the CaseFolding.txt
file. If a future version of the CaseFolding.txt file changes the mapped
value of an existing character, authors of profiles of this document
have to look at the changes very carefully before they update their case
mapping tables. Such a change could change the behavior that
users see in both updated and unupdated systems.
3: Add a paragraph to the end, after the above:
Authors of profiles of this document need to consider the effects of
changing the mapping of any currently-assigned character when updating
their profiles. Adding a new mapping for a currently-assigned character,
or changing an existing mapping, could change the behavior that users
see in both updated and unupdated systems.
4: In the paragraph starting 'If a profile', change the sentence "Many
user interface systems enter compatibility characters instead of the
base equivalents." to "Some user interface systems make it possible to
enter compatibility characters instead of the base equivalents."
4: In the second-to-last paragraph, change "normalization for KC"
to "normalization form KC".
4: Replace Patrik's question at the end with:
The composition process described in [UAX15] requires a fixed
composition version of Unicode to ensure that strings normalized under
one version of Unicode remain normalized under all future versions of
Unicode.
4: Add a paragraph at the end:
The IETF is relying on Unicode not to change the normalization of
currently-assigned characters in future versions of normalization. If a
future version of the normalization tables changes the normalized value
of an existing character, authors of profiles of this document have to
look at the changes very carefully before they update their
normalization tables. Such a change could change the behavior that users
see in both updated and unupdated systems.
5: Removed the third paragraph that said profiles SHOULD allow as many
characters as possible.
6: In the first sentence of the second paragraph, change "named in the"
to "named in a".
6: Before the last paragraph, add:
The goal of the requirements in this section is to prevent
comparisons between two strings that were both permitted to contain
unassigned code points. When two strings X and Y are compared and
string X was prepared in a way that permits unassigned code points, a
negative result to the comparison is not definitive; it's possible that
the strings don't match even though they would match if a more recent
version of the profile were used for Y. However, if both X and Y were
prepared in a way that permits unassigned code points, something worse
can happen: even a positive result for the comparison is not definitive.
It is possible that the strings do match even though they would not
match if a more recent version of the profile were used (one that
prohibits a code point appearing in both X and Y).
6.1: MN: change "because they are never appear" to "because they never
appear".
7: Change the paragraph to:
The Unicode and ISO/IEC 10646 repertoires have many characters that look
similar. In many cases, users of security protocols might do visual
matching, such as when comparing the names of trusted third parties.
Stringprep does nothing to map similar-looking characters together nor
to prohibit some characters because they look like others.
B: Remove the comment at the beginning of the appendix.
==========
Nameprep
==========
Throughout: Check carefully for host->domain.
Throughout: Change "name part" to "name label" to match IDNA.
Title: Change to "Nameprep: A Stringprep Profile for Internationalized
Domain Names"
1: At the end of the first paragraph, add "These processing rules are
only intended for internationalized domain names, not for arbitrary
text."
1: Removed the second paragraph about the history of the name.
1.1: Add new section.
1.1 Interaction of protocol parts
Nameprep is used by the IDNA [IDNA] protocol for preparing domain names;
it is not designed for any other purpose. It is explicitly not designed
for processing arbitrary free text and SHOULD NOT be used for that
purpose. Nameprep is a profile of Stringprep [STRINGPREP].
Implementations of Nameprep MUST fully implement Stringprep.
1.2: Change "[SYMBOLS]" to "[CONTROL CHARACTERS]".
3: Change "there" to "here" in the first paragraph.
3: In the second paragraph, change "this section describe" to
"this section describes".
3.2: Add to the beginning:
This profile folds case in domain names where possible
because the current DNS has case-insensitive matching for domain names.
If this profile did not do that for the additional characters being
added, it would lead to even greater user confusion. For example, "Abc"
matches "abc", but "<Uppercase-A-with-accent>bc" would not match
"<lowercase-a-with-accent>bc".
3.2: In the last paragraph, change "This mapping was" to "These mappings
were".
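The motivation given for section 3.2 can be illustrated with a short sketch. It compares ASCII-only case folding (the existing DNS behavior described in the text) with a Unicode-aware fold. Python's str.casefold() is used here only as a stand-in for the profile's case-mapping table; it is an assumption of this example that the two agree for the characters shown, and they are not claimed to be identical in general.

```python
def ascii_only_fold(text: str) -> str:
    """Fold only A-Z, as case-insensitive matching of ASCII names does."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in text)


def unicode_fold(text: str) -> str:
    """Stand-in for the profile's case mapping (assumption, see above)."""
    return text.casefold()


upper = "\u00c1bc"  # LATIN CAPITAL LETTER A WITH ACUTE, then "bc"
lower = "\u00e1bc"  # LATIN SMALL LETTER A WITH ACUTE, then "bc"

print(ascii_only_fold("Abc") == ascii_only_fold("abc"))  # plain ASCII matches
print(ascii_only_fold(upper) == ascii_only_fold(lower))  # accented pair differs
print(unicode_fold(upper) == unicode_fold(lower))        # folded pair matches
```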
5: Change "there" to "here" in the first paragraph.
5: In the third paragraph, change "The collected lists of prohibited
code points can be found in Appendix E of this document." to "The
collected list of code points prohibited by this profile can be found in
Appendix E of this document; note that IDNA prohibits additional
characters.".
5: Add paragraph just before 5.1:
IMPORTANT NOTE: This profile MUST be used with the IDNA protocol. The
IDNA protocol has additional prohibitions that are checked outside of
this profile.
5.1: Change "URLs" to "domain names".
5.1: Add the following sentence to the text: "Note that an additional
space character (U+0020) is prohibited in IDNA." Remove the first line
from the table.
5.2: Add the following sentence to the text: "Note that additional
control characters (U+0000 through U+001F, and U+007F) are prohibited in
IDNA." Remove the first two lines from the table.
6: Add "in the range 0 to 10FFFF" in the first sentence.
6: Change "the list Appendix F" to "the list in Appendix F".
7: Change the first sentence from "Much of the security of the Internet
relies" to "Security on the Internet partly relies".
7: Change the first paragraph to:
The Unicode and ISO/IEC 10646 repertoires have many characters that look
similar. In many cases, users of security protocols might do visual
matching, such as when comparing the names of trusted third parties.
This profile does nothing to map similar-looking characters together nor
to prohibit some characters because they look like others.
8: Remove [CharModel] and [URI]. Add reference to [IDNA].
9: Removed this section.
D: Change "character" to "code point" throughout the description at the
beginning of the section.
E: Remove the first two rows (0000-0020 and 007F).
E: Merge rows where appropriate.
E: At the end, change "pints" to "points".
==========
Punycode
==========
Changes from draft-ietf-idn-punycode-00.txt to
draft-ietf-idn-punycode-01.txt.
Throughout
"hostname" --> "host name"
Capitalize must/should/etc when used in the RFC 2119 sense, which happens
only in section 5:
"A decoder MUST recognize the letters in both uppercase and
lowercase forms..."
"An encoder SHOULD output only uppercase forms or only lowercase
forms..."
Reword to avoid must/should/etc when not used in the RFC 2119
sense:
Section 3.2:
"because basic code points must be segregated" -->
"because basic code points were supposed to be segregated"
"should not perform" --> "need not perform"
"should instead use division" --> "can instead use division"
Section 3.3:
"inconvenient when unique encodings are required" -->
"inconvenient when unique encodings are needed"
"4 must be the last digit" --> "4 is the last digit"
Section 4:
- Given a set of basic code points, one must be designated as
- the delimiter. The base can be no greater than the number of
- distinguishable basic code points remaining. The digit-values
- in the range 0 through base-1 must be associated with distinct
- non-delimiter basic code points. In some cases multiple code
- points must have the same digit-value; for example, uppercase
- and lowercase versions of the same letter must be equivalent if
- basic strings are case-insensitive.
+ Given a set of basic code points, one needs to be designated as
+ the delimiter. The base cannot be greater than the number of
+ distinguishable basic code points remaining. The digit-values
+ in the range 0 through base-1 need to be associated with
+ distinct non-delimiter basic code points. In some cases
+ multiple code points need to have the same digit-value; for
+ example, uppercase and lowercase versions of the same letter
+ need to be equivalent if basic strings are case-insensitive.
"The initial value of n must be no greater than" -->
"The initial value of n cannot be greater than"
"must satisfy the following constraints" -->
"need to satisfy the following constraints"
"They should be chosen empirically." -->
"They are best chosen empirically."
Section 5:
"RFC 952 recommendation" --> "RFC 952 rule"
Section 6:
"may be omitted" --> "can be omitted"
- Some actual programming languages might require explicit
- conversion between code points and integers.
+ In some programming languages, explicit conversion between code
+ points and integers might be necessary.
Section 6.2:
"may be omitted" --> "can be omitted"
Section 6.3:
"any string that required a 27-bit delta" -->
"any string that needed a 27-bit delta"
Section 6.4:
"would probably require integers wider than 32 bits" -->
"would probably need integers wider than 32 bits"
Appendix B:
- Punycode encoders and decoders are not required to support these
- annotations, and higher layers need not use them.
+ Punycode encoders and decoders need not support these
+ annotations, and higher layers need not use them.
Appendix D:
"requires wider integers" --> "needs wider integers"
"input must be represented" --> "input is represented"
"caller must pass" --> "caller passes"
"may be output" --> "it can receive"
"array must hold" --> "array holds"
"means the corresponding Unicode character should be forced" -->
"suggests that the corresponding Unicode character be forced"
"means it should be forced" --> "suggests that it be forced"
"return value may be" --> "return value can be"
"may contain garbage" --> "might contain garbage"
"array must have room" --> "array needs room"
"may be a null pointer" --> "can be a null pointer"
"indicates that the corresponding Unicode character should be
forced" -->
"suggests that the corresponding Unicode character be forced"
"must be in the range" --> "needs to be in the range"
"requires the Punycode string to be followed" -->
"needs the Punycode string to be followed"
Title
"Punycode version 0.3.3" -->
"Punycode: An encoding of Unicode for use with IDNA"
Boilerplate
Removed:
- Please send comments to the author at amc@cs.berkeley.edu, or to
- the idn working group at idn@ops.ietf.org. A non-paginated (and
- possibly newer) version of this specification may be available at
- http://www.cs.berkeley.edu/~amc/idn/
Abstract
"Internationalized Domain Names [IDN] [IDNA]" -->
"Internationalized Domain Names in Applications [IDNA]"
"encoding" --> "transfer encoding syntax"
"instance Bootstring" --> "instance of Bootstring"
1. Introduction
"The IDNA draft" --> "[IDNA]"
Pushed features list down to section 1.1 Features.
"ratio of extended string length to basic string length" -->
"ratio of basic string length to extended string length"
". This comes for free because it makes the encoding more efficient
on average." -->
"(although the main purpose is to improve efficiency, not
readability)."
Added:
+ 1.2 Interaction of protocol parts:
+
+ Punycode is used by the IDNA protocol [IDNA] for converting domain
+ labels into ASCII; it is not designed for any other purpose. It is
+ explicitly not designed for processing arbitrary free text.
2. Terminology
Changed key-words paragraph to quote RFC 2119 exactly:
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [RFC2119].
3. Bootstring description
Before section 3.1 added:
+ Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the
+ algorithms for sample inputs.
+
+ The following sections describe the four techniques used in
+ Bootstring. "Basic code point segregation" is a very simple
+ and efficient encoding for basic code points occurring in the
+ extended string: they are simply copied all at once. "Insertion
+ unsort coding" encodes the non-basic code points as deltas, and
+ processes the code points in numerical order rather than in order of
+ appearance, which typically results in smaller deltas. The deltas
+ are represented as "generalized variable-length integers", which use
+ basic code points to represent nonnegative integers. The parameters
+ of this integer representation are dynamically adjusted using "bias
+ adaptation", to improve efficiency when consecutive deltas have
+ similar magnitudes.
Clarified section 3.2:
> The remainder of the basic string (after the last delimiter if there
> is one) represents a sequence of nonnegative integral deltas as
> generalized variable-length integers, described in section 3.3. The
> meaning of the deltas is best understood in terms of the decoder.
>
> The decoder builds the extended string incrementally. Initially,
> the extended string is a copy of the literal portion of the basic
> string (excluding the last delimiter).
- Each delta causes the decoder to insert a code point into the
- extended string according to the following procedure.
+ The decoder inserts non-basic code points, one for each delta,
+ into the extended string, ultimately arriving at the final decoded
+ string.
- There are two state variables: a code point n, and an index i that
- ranges from zero (which is the first position of the extended
- string) to the current length of the extended string (which refers
- to a potential position beyond the current end). The decoder
- advances the state monotonically (never returning to an earlier
- state) by taking steps only upward. Each step increments i, except
- when i already equals the length of the extended string, in which
- case a step resets i to zero and increments n. For each delta (in
- order), the decoder takes delta steps upward, then inserts the
- value n into the extended string at position i, then increments i
- (to skip over the code point just inserted). (An implementation
- should not take each step individually, but should insead use
- division and remainder calculations to advance by delta steps all at
- once.) It is an error if the inserted code point is a basic code
- point (because basic code points must be segregated as described in
- section 3.1).
+ At the heart of this process is a state machine with two state
+ variables: an index i and a counter n. The index i refers to
+ a position in the extended string; it ranges from 0 (the first
+ position) to the current length of the extended string (which refers
+ to a potential position beyond the current end). If the current
+ state is <n,i>, the next state is <n,i+1> if i is less than the
+ length of the extended string, or <n+1,0> if i equals the length of
+ the extended string. In other words, each state change causes i to
+ increment, wrapping around to zero if necessary, and n counts the
+ number of wrap-arounds.
+
+ Notice that the state always advances monotonically (there is no
+ way for the decoder to return to an earlier state). At each state,
+ an insertion is either performed or not performed. At most one
+ insertion is performed in a given state. An insertion inserts the
+ value of n at position i in the extended string. The deltas are
+ a run-length encoding of this sequence of events: they are the
+ lengths of the runs of non-insertion states preceeding the insertion
+ states. Hence, for each delta, the decoder performs delta state
+ changes, then an insertion, and then one more state change. (An
+ implementation need not perform each state change individually, but
+ can instead use division and remainder calculations to compute the
+ next insertion state directly.) It is an error if the inserted code
+ point is a basic code point (because basic code points were supposed
+ to be segregated as described in section 3.1).
> The encoder's main task is to derive the sequence of deltas that
> will cause the decoder to construct the desired string. It can do
> this by repeatedly scanning the extended string for the next code
> point that the decoder would need to insert, and counting the number
- of steps the decoder would need to take, mindful of the fact that
- the decoder will be stepping over only those code points that have
- already been inserted.
+ of state changes the decoder would need to perform, mindful of the
+ fact that the decoder's extended string will include only those
+ code points that have already been inserted.
> Section 6.3 "Encoding procedure" gives a precise algorithm.
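The state machine in the clarified section 3.2 text can be sketched as below. The function names are made up for this example, and initial_n = 128 is the Punycode value from section 5. advance_slow performs the state changes one at a time, while advance_fast uses the division and remainder shortcut the text mentions. The sketch takes already-decoded deltas as input; decoding the variable-length integers is a separate step.

```python
from typing import List, Tuple


def step(n: int, i: int, length: int) -> Tuple[int, int]:
    """One state change: <n,i> -> <n,i+1>, or <n+1,0> when i == length."""
    return (n, i + 1) if i < length else (n + 1, 0)


def advance_slow(n: int, i: int, length: int, delta: int) -> Tuple[int, int]:
    """Perform delta state changes individually."""
    for _ in range(delta):
        n, i = step(n, i, length)
    return n, i


def advance_fast(n: int, i: int, length: int, delta: int) -> Tuple[int, int]:
    """Same result via divmod: each value of n has length + 1 positions."""
    wraps, i = divmod(i + delta, length + 1)
    return n + wraps, i


def insert_from_deltas(deltas: List[int], initial_n: int = 128) -> List[int]:
    """Build the extended string (empty literal portion) from the deltas."""
    out: List[int] = []
    n, i = initial_n, 0
    for delta in deltas:
        n, i = advance_fast(n, i, len(out), delta)
        out.insert(i, n)  # the insertion for this delta
        i += 1            # one more state change, past the inserted value
    return out


# First three deltas of the example B decoding trace in section 7.2.
print([format(cp, "04X") for cp in insert_from_deltas([19853, 64, 37])])
```

With those three deltas the sketch yields 4E3A 4E0D 4E2D, which matches the third state shown in the decoding trace of example B.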
5. Parameter values for Punycode
"initial_n = 0x80" --> "initial_n = 128 = 0x80"
Clarified input restrictions:
- In Punycode, code points are Unicode code points [UNICODE], that
- is, integers in the range 0..10FFFF, but not D800..DFFF, which are
- reserved for use by UTF-16.
+ Although the only restriction Punycode imposes on the input integers
+ is that they be nonnegative, these parameters are especially
+ designed to work well with Unicode [UNICODE] code points, which
+ are integers in the range 0..10FFFF (but not D800..DFFF, which are
+ reserved for use by the UTF-16 encoding of Unicode).
6.2 Decoding procedure
Made the clamping code not look buggy anymore:
- let t = tmin if k <= bias, tmax if k >= bias + tmax, or
- k - bias otherwise
+ let t = tmin if k <= bias {+ tmin}, or
+ tmax if k >= bias + tmax, or k - bias otherwise
- The assignment of t, where t is clamped to the range tmin through
- tmax, does not handle the case where bias < k < bias + tmin, but
- that is impossible because of the way bias is computed and because
- of the constraints on the parameters.
+ In the assignment of t, where t is clamped to the range tmin through
+ tmax, "+ tmin" can always be omitted. This makes the clamping
+ calculation incorrect when bias < k < bias + tmin, but that cannot
+ happen because of the way bias is computed and because of the
+ constraints on the parameters.
"The statement enclosed in braces" -->
"The full statement enclosed in braces"
6.3 Encoding procedure
Made the clamping code not look buggy anymore (same fix as for
section 6.2).
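To make the clamping change concrete, here is a small sketch comparing the full form (with "+ tmin") against the shortened form. The function names are hypothetical, and the default values tmin = 1 and tmax = 26 are assumed to be the Punycode parameters. With tmin = 1 no integer k lies strictly between bias and bias + tmin, so the two forms always agree; the last lines show how they would diverge for a larger, non-Punycode tmin.

```python
TMIN, TMAX = 1, 26  # assumed Punycode parameter values


def threshold_full(k: int, bias: int, tmin: int = TMIN, tmax: int = TMAX) -> int:
    """Clamp k - bias to the range tmin..tmax (form with "+ tmin")."""
    if k <= bias + tmin:
        return tmin
    if k >= bias + tmax:
        return tmax
    return k - bias


def threshold_short(k: int, bias: int, tmin: int = TMIN, tmax: int = TMAX) -> int:
    """Same calculation with "+ tmin" omitted."""
    if k <= bias:
        return tmin
    if k >= bias + tmax:
        return tmax
    return k - bias


# With the assumed Punycode parameters the two forms agree.
agree = all(
    threshold_full(k, bias) == threshold_short(k, bias)
    for bias in range(0, 200)
    for k in range(36, 36 * 12, 36)
)
print(agree)

# With a hypothetical tmin of 5 and bias < k < bias + tmin they differ.
print(threshold_full(12, 10, tmin=5), threshold_short(12, 10, tmin=5))
```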
7. Punycode example strings --> Punycode examples
Pushed strings down into 7.1 Sample strings. Added sections 7.2 and
7.3:
+ 7.2 Decoding traces
+
+ In the following traces, the evolving state of the decoder is
+ shown as a sequence of hexadecimal values, representing the code
+ points in the extended string. An asterisk appears just after the
+ most recently inserted code point, indicating both n (the value
+ preceeding the asterisk) and i (the position of the value just after
+ the asterisk). Other numerical values are decimal.
+
+ Decoding trace of example B from section 7.1:
+
+ n is 128, i is 0, bias is 72
+ input is "ihqwcrb4cv8a8dqg056pqjye"
+ there is no delimiter, so extended string starts empty
+ delta "ihq" decodes to 19853
+ bias becomes 21
+ 4E0D *
+ delta "wc" decodes to 64
+ bias becomes 20
+ 4E0D 4E2D *
+ delta "rb" decodes to 37
+ bias becomes 13
+ 4E3A * 4E0D 4E2D
+ delta "4c" decodes to 56
+ bias becomes 17
+ 4E3A 4E48 * 4E0D 4E2D
+ delta "v8a" decodes to 599
+ bias becomes 32
+ 4E3A 4EC0 * 4E48 4E0D 4E2D
+ delta "8d" decodes to 130
+ bias becomes 23
+ 4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D
+ delta "qg" decodes to 154
+ bias becomes 25
+ 4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D
+ delta "056p" decodes to 46301
+ bias becomes 84
+ 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *
+ delta "qjye" decodes to 88531
+ bias becomes 90
+ 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587
+
+ Decoding trace of example L from section 7.1:
+
+ n is 128, i is 0, bias is 72
+ input is "3B-ww4c5e180e575a65lsy2b"
+ literal portion is "3B-", so extended string starts as:
+ 0033 0042
+ delta "ww4c" decodes to 62042
+ bias becomes 27
+ 0033 0042 5148 *
+ delta "5e" decodes to 139
+ bias becomes 24
+ 0033 0042 516B * 5148
+ delta "180e" decodes to 16683
+ bias becomes 67
+ 0033 5E74 * 0042 516B 5148
+ delta "575a" decodes to 34821
+ bias becomes 82
+ 0033 5E74 0042 516B 5148 751F *
+ delta "65l" decodes to 14592
+ bias becomes 67
+ 0033 5E74 0042 7D44 * 516B 5148 751F
+ delta "sy2b" decodes to 42088
+ bias becomes 84
+ 0033 5E74 0042 7D44 91D1 * 516B 5148 751F
+ 7.3 Encoding traces
+
+ In the following traces, code point values are hexadecimal, while
+ other numerical values are decimal.
+
+ Encoding trace of example B from section 7.1:
+
+ bias is 72
+ input is:
+ 4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587
+ there are no basic code points, so no literal portion
+ next code point to insert is 4E0D
+ needed delta is 19853, encodes as "ihq"
+ bias becomes 21
+ next code point to insert is 4E2D
+ needed delta is 64, encodes as "wc"
+ bias becomes 20
+ next code point to insert is 4E3A
+ needed delta is 37, encodes as "rb"
+ bias becomes 13
+ next code point to insert is 4E48
+ needed delta is 56, encodes as "4c"
+ bias becomes 17
+ next code point to insert is 4EC0
+ needed delta is 599, encodes as "v8a"
+ bias becomes 32
+ next code point to insert is 4ED6
+ needed delta is 130, encodes as "8d"
+ bias becomes 23
+ next code point to insert is 4EEC
+ needed delta is 154, encodes as "qg"
+ bias becomes 25
+ next code point to insert is 6587
+ needed delta is 46301, encodes as "056p"
+ bias becomes 84
+ next code point to insert is 8BF4
+ needed delta is 88531, encodes as "qjye"
+ bias becomes 90
+ output is "ihqwcrb4cv8a8dqg056pqjye"
+
+ Encoding trace of example L from section 7.1:
+
+ bias is 72
+ input is:
+ 0033 5E74 0042 7D44 91D1 516B 5148 751F
+ basic code points (0033, 0042) are copied to literal portion: "3B-"
+ next code point to insert is 5148
+ needed delta is 62042, encodes as "ww4c"
+ bias becomes 27
+ next code point to insert is 516B
+ needed delta is 139, encodes as "5e"
+ bias becomes 24
+ next code point to insert is 5E74
+ needed delta is 16683, encodes as "180e"
+ bias becomes 67
+ next code point to insert is 751F
+ needed delta is 34821, encodes as "575a"
+ bias becomes 82
+ next code point to insert is 7D44
+ needed delta is 14592, encodes as "65l"
+ bias becomes 67
+ next code point to insert is 91D1
+ needed delta is 42088, encodes as "sy2b"
+ bias becomes 84
+ output is "3B-ww4c5e180e575a65lsy2b"
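
The encoding traces above can be reproduced with a short sketch of the Punycode encoding procedure. This is an illustration in Python, not the draft's sample implementation: the function names (adapt, encode_delta, encode_with_trace) and the returned trace tuples are assumptions made for this example, and the overflow checks that a fixed-width implementation needs are omitted because Python integers do not overflow.

```python
BASE, TMIN, TMAX, SKEW, DAMP = 36, 1, 26, 38, 700
INITIAL_BIAS, INITIAL_N = 72, 0x80
DIGITS = "abcdefghijklmnopqrstuvwxyz0123456789"


def adapt(delta, numpoints, first):
    """Bias adaptation: scale delta, then fold it into a new bias."""
    delta = delta // DAMP if first else delta // 2
    delta += delta // numpoints
    k = 0
    while delta > ((BASE - TMIN) * TMAX) // 2:
        delta //= BASE - TMIN
        k += BASE
    return k + ((BASE - TMIN + 1) * delta) // (delta + SKEW)


def encode_delta(delta, bias):
    """Write delta as a generalized variable-length integer."""
    digits, q, k = "", delta, BASE
    while True:
        # Same clamping as the pseudocode: tmin, tmax, or k - bias.
        t = TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias
        if q < t:
            return digits + DIGITS[q]
        digits += DIGITS[t + (q - t) % (BASE - t)]
        q = (q - t) // (BASE - t)
        k += BASE


def encode_with_trace(text):
    """Return (encoded string, [(code point, delta, new bias), ...])."""
    cps = [ord(ch) for ch in text]
    basic = [cp for cp in cps if cp < INITIAL_N]
    # Basic code points are copied first, followed by a delimiter if any exist.
    out = "".join(chr(cp) for cp in basic) + ("-" if basic else "")
    h = b = len(basic)
    n, delta, bias, trace = INITIAL_N, 0, INITIAL_BIAS, []
    while h < len(cps):
        m = min(cp for cp in cps if cp >= n)  # next code point to insert
        delta += (m - n) * (h + 1)
        n = m
        for cp in cps:
            if cp < n:
                delta += 1
            elif cp == n:
                out += encode_delta(delta, bias)
                bias = adapt(delta, h + 1, h == b)
                trace.append((n, delta, bias))
                delta = 0
                h += 1
        delta += 1
        n += 1
    return out, trace
```

Calling encode_with_trace on the code points of example B or example L returns the output string together with one (code point, delta, bias) tuple per "next code point to insert" step, which can be compared line by line with the traces.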
8. Security considerations
'"nameprep"' --> 'Nameprep'
9. References
Updated references:
- [IDNA] Patrik Faltstrom, Paul Hoffman, Adam M. Costello,
- "Internationalizing Host Names In Applications (IDNA)", 2001-Nov-19,
- draft-ietf-idn-idna-05.
+ [IDNA] Patrik Faltstrom, Paul Hoffman, Adam M. Costello,
+ "Internationalizing Domain Names In Applications (IDNA)",
+ 2002-###-##, draft-ietf-idn-idna-07.
- [NAMEPREP] Paul Hoffman, Marc Blanchet, "Stringprep
- Profile for Internationalized Host Names", 2001-Sep-27,
- draft-ietf-idn-nameprep-06.
+ [NAMEPREP] Paul Hoffman, Marc Blanchet, "Nameprep: A Stringprep
+ Profile for Internationalized Domain Names", 2002-###-##,
+ draft-ietf-idn-nameprep-08.
Removed reference [IDN] (working group), which is no longer cited.
B. Mixed-case annotation
Clarified applicability to IDNA:
+ Note, however, that mixed-case annotation is not used by the
+ ToASCII and ToUnicode operations specified in [IDNA], and therefore
+ implementors of IDNA can disregard this appendix.
Reworded the preceding sentence to make it flow better:
- The encoded string can, however, use mixed case as an annotation
- telling how to convert the original folded string into a mixed-case
- string for display purposes.
+ The encoded string can use mixed case as an annotation telling how
+ to convert the folded string into a mixed-case string for display
+ purposes.
Clarified relationship with the rest of the spec with a new
second-to-last paragraph:
+ These annotations do not alter the code points returned by decoders;
+ the annotations are returned separately, for the caller to use or
+ ignore. Encoders can accept annotations in addition to code points,
+ but the annotations do not alter the output, except to influence the
+ uppercase/lowercase form of ASCII letters.
D. Punycode sample implementation
Removed the version number and date; the code now cites the draft name instead.
Changed clamping code to look like the pseudocode:
- t = k <= bias ? tmin : k - bias >= tmax ? tmax : k - bias;
+ t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
+ k >= bias + tmax ? tmax : k - bias;
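
The rewritten clamping expression is intended to compute the same threshold as the old one; only its shape changes to mirror the pseudocode. A minimal check of that equivalence, written in Python rather than the draft's C, and assuming the Punycode parameter values base = 36, tmin = 1, tmax = 26 (the function names here are hypothetical):

```python
BASE, TMIN, TMAX = 36, 1, 26


def t_old(k, bias):
    # Old form: compares k - bias against tmax.
    return TMIN if k <= bias else TMAX if k - bias >= TMAX else k - bias


def t_new(k, bias):
    # New form: compares k against bias + tmax, as in the pseudocode.
    return TMIN if k <= bias else TMAX if k >= bias + TMAX else k - bias


def forms_agree(max_bias=200, steps=10):
    """Compare both forms for every bias up to max_bias and k = base, 2*base, ..."""
    return all(
        t_old(k, bias) == t_new(k, bias)
        for bias in range(max_bias + 1)
        for k in range(BASE, BASE * (steps + 1), BASE)
    )
```

The "+ tmin" in the first comparison can be dropped because k is always a multiple of base and, for the remaining case k = bias + tmin, the fall-through value k - bias already equals tmin.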
==========
Major technical changes
==========
Bidirectional display:
Martin Duerst suggested an addition to nameprep to solve the problem of
differently ordered strings that display identically. The change was
only partially specified, and no examples were given. The proposed
solution would involve tables derived from the Unicode standard.
Hebrew points:
Jonathan Rosenne asked that Hebrew points be mapped out in nameprep
because they are optional in Hebrew. Off-list comments pointed out that
points are not optional in Yiddish, and that some people would want to
have points in their names to look more traditional. The proposed
solution would involve tables derived from the Unicode standard.
Korean syllables:
Soobok Lee and Kent Karlsson asked that we not use NFKC due to its
incompleteness and incorrectness with respect to some combining
syllables. The proposed changes either change NFKC in a way that does
not conform to the Unicode standard or use a different normalization
scheme altogether.
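
For readers unfamiliar with how NFKC treats Hangul, the following minimal sketch uses Python's standard unicodedata module to show the general behavior: conjoining jamo are composed into precomposed syllables, and compatibility jamo are first mapped to conjoining jamo. It does not reproduce the specific cases raised in the comments, and the result depends on the Unicode version shipped with the Python interpreter, not the version referenced by the drafts.

```python
import unicodedata


def nfkc(text):
    """Apply Unicode Normalization Form KC to text."""
    return unicodedata.normalize("NFKC", text)


# HANGUL CHOSEONG KIYEOK + HANGUL JUNGSEONG A (conjoining jamo)
conjoining = "\u1100\u1161"
# HANGUL LETTER KIYEOK + HANGUL LETTER A (compatibility jamo)
compatibility = "\u3131\u314f"

# Both sequences normalize to the single syllable U+AC00.
same_syllable = nfkc(conjoining) == nfkc(compatibility) == "\uac00"
```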
Near-instant updating:
Eric Hall suggested that Nameprep be augmented to support near-instant
updates of all systems. This would involve a static format for the
tables in the documents, a method for end entities to know where to get
the updates from, and a transfer method.
Traditional-Simplified Chinese:
Lee Ming Tseng, Jan Ming Ho, and Kenny Huang published a draft
(draft-tseng-idn-piidna-00.txt) which proposes to prohibit all Chinese,
Japanese, and Korean Han characters from plane 0 after nameprep. The
purpose is to allow time for someone to come up with an IETF solution to
the Traditional-Simplified translation problem; the prohibition on the
specified Han characters would be lifted after that time.
"Unneeded" characters:
A few people suggested adding unneeded characters to the prohibited
list in nameprep. None gave specific lists of characters, although
"punctuation" and "symbols" were mentioned many times.
Whole domain names:
Kent asked that all the documents be changed to deal with full domain
names instead of name parts. This would allow additional "full stop"
characters to be handled and would change the bidirectional properties
of names.