stringprep problems



The Unicode Consortium has found several more "bugs" in the normalization rules. The problem I have is that they have also redefined what "stable" means.

In short, they previously said they would not change the normalization form of any assigned codepoint.

Now they say "if a codepoint was normalized to a stable codepoint, that stable codepoint will still be stable in a future version of Unicode", while acknowledging that the normalization forms themselves might change.

This is exactly why we wrote "Version 3.2 of Unicode" and created one specific table which is included in Stringprep.

We (IETF) now have a few choices:

- Stay with 3.2 forever
- Use 4.0 together with the correction list, but
without the corrections
- Use 4.0 and say the correction list indicates that some
codepoints are dangerous

Here is a first cut of a draft which talks about this problem (but it doesn't talk explicitly about the possible paths IETF can take).

paf



Internet Architecture Board                                 P. Faltstrom
Internet-Draft                                                       IAB
Expires: January 13, 2004                                  July 15, 2003


     Synchronization of Stringprep with Unicode Normalization rules
             draft-faltstrom-unicode-synchronisation-00.txt

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that other
   groups may also distribute working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on January 13, 2004.

Copyright Notice

   Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

   This memo provides information about potential problems for
   applications using stringprep [RFC3454]. It looks especially at how
   to handle differences between the normalization rules in different
   versions of Unicode.

1. The problem

   The Unicode Standard Annex #15 (Unicode Normalization Forms)
   specifies how the normalization rules are to be applied to strings.
   In Annex 12 (Corrigenda), differences in the normalization rules
   between versions of Unicode are discussed.

   The Unicode Consortium has well-defined policies in place to govern



Faltstrom               Expires January 13, 2004                [Page 1]

Internet-Draft            Normalization Rules                  July 2003


   changes that affect backwards compatibility. Once a character is
   encoded, its canonical combining class and decomposition mapping will
   not be changed in a way that will destabilize normalization.

   What this means is: If a string contains only characters from a given
   version of the Unicode Standard (e.g., Unicode 3.1.1), and it is put
   into a normalized form in accordance with that version of Unicode,
   then it will be in normalized form according to any past or future
   versions of Unicode.

   This guarantee has been in place for Unicode 3.1 and after. It has
   been necessary to correct the decompositions of a small number of
   characters since Unicode 3.1, as listed in the Normalization
   Corrections data file, but such corrections are in accordance with
   the above principles: all text normalized on old systems will test as
   normalized in future systems. All text normalized in future systems
   will test as normalized on past systems. What may change, for those
   few characters, is that unnormalized text may normalize differently
   on past and future systems.
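
   The guarantee above can be probed with a short sketch. It assumes
   CPython's unicodedata module, whose ucd_3_2_0 object exposes the
   Unicode 3.2.0 tables, while the module-level functions use the
   (later) tables shipped with the interpreter. The variable names are
   illustrative only, and U+2F874 is taken from Appendix A.

```python
import unicodedata

old = unicodedata.ucd_3_2_0   # frozen Unicode 3.2.0 tables
new = unicodedata             # tables shipped with the running interpreter

raw = "\U0002F874"            # decomposition corrected in 4.0.0 (Appendix A)

old_form = old.normalize("NFC", raw)
new_form = new.normalize("NFC", raw)

# Text normalized under one version still tests as normalized under the other.
old_stays_stable = new.normalize("NFC", old_form) == old_form
new_stays_stable = old.normalize("NFC", new_form) == new_form

# Unnormalized text, however, may normalize differently.
forms_differ = old_form != new_form
```

   Both stability checks are expected to hold while forms_differ is
   true, which is exactly the gap the next section is concerned with.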

2. Conclusion

   Assume a client receives a non-normalized string, and then applies
   normalization according to normalization rules in one version of
   Unicode. If the client passes the normalized string to a server that
   also has normalized a non-normalized copy of the string, but is using
   a different version of the Unicode normalization rules, the two
   strings might not match.

   Example: In version 3.2 of Unicode, codepoint U+2F874 is normalized
   to U+5F33. In version 4.0, U+2F874 is normalized to U+5F53 (see
   Appendix A). Assume that nodes A and B are on the Internet, that A
   is using version 3.2 of Unicode, and that B is using version 4.0.
   U+2F874 is passed to both A and B. After normalization they will
   store the strings U+5F33 and U+5F53 respectively. The end result is
   that even though the same codepoint, U+2F874, is passed to both
   nodes, they hold different strings after normalization. If A
   normalizes U+2F874 and sends the result (U+5F33) to B as a search
   string, there will be no match at B.
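
   The scenario with nodes A and B can be simulated in a few lines. This
   is a toy sketch, not a real normalizer: the two tables contain only
   the U+2F874 entry from Appendix A (original and corrected
   decomposition), and the Node class and its methods are hypothetical
   names introduced for illustration.

```python
# Decomposition of U+2F874 before and after the correction (Appendix A).
TABLE_BEFORE = {0x2F874: 0x5F33}
TABLE_AFTER = {0x2F874: 0x5F53}


def normalize(text, table):
    """Toy normalizer: map each codepoint through `table`, else keep it."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)


class Node:
    """Hypothetical node that stores and searches normalized strings."""

    def __init__(self, table):
        self.table = table
        self.stored = set()

    def store(self, text):
        self.stored.add(normalize(text, self.table))

    def search(self, normalized_key):
        # The key is assumed to have been normalized by the sender.
        return normalized_key in self.stored


a = Node(TABLE_BEFORE)
b = Node(TABLE_AFTER)
a.store("\U0002F874")
b.store("\U0002F874")

key_from_a = normalize("\U0002F874", a.table)  # what A would send
found_at_a = a.search(key_from_a)
found_at_b = b.search(key_from_a)
```

   With these tables the key produced by A is found at A but not at B,
   although both nodes were given the same input.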

   To create a problem, the string (consisting only of the codepoint
   U+2F874 in the example above) needs to include at least one of the
   codepoints in the correction list (see Appendix A). As of version
   4.0.0 of Unicode, the list of corrections (since Unicode 3.1)
   consists of exactly six codepoints. Over time, as errors in the
   normalization rules are found, this list will grow. The list is
   controlled by the Unicode Consortium.

3. Recommendation

   Applications implementing stringprep must be aware of the existence
   of the corrections table
   (http://www.unicode.org/Public/UNIDATA/NormalizationCorrections.txt).
   Version 4.0.0 of this correction list can be found in Appendix A. If
   a string which is to be used for matching includes any of these
   codepoints, unexpected results (no match where a match should occur)
   might happen. Because of this, special care is recommended in
   sensitive applications and deployments.
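
   One way to take such care is to flag input that contains a corrected
   codepoint before it is normalized. The sketch below hard-codes the
   six codepoints from Appendix A; the function names are hypothetical,
   and a deployment would have to refresh the set whenever the
   correction list grows.

```python
# Codepoints listed in NormalizationCorrections-4.0.0.txt (Appendix A).
CORRECTED = frozenset({0xF951, 0x2F868, 0x2F874, 0x2F91F, 0x2F95F, 0x2F9BF})


def risky_codepoints(text):
    """Return the corrected codepoints found in `text`, in input order."""
    return [f"U+{ord(ch):04X}" for ch in text if ord(ch) in CORRECTED]


def needs_special_care(text):
    """True if `text` contains at least one codepoint from the list."""
    return bool(risky_codepoints(text))


flagged = risky_codepoints("ex\U0002F868ample")
plain = needs_special_care("example")
```

   The check has to run on the string as received, since the listed
   codepoints no longer appear in the string once it is normalized.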

   Examples of problems include (but are not limited to) problems in
   protocols which use stringprep and pass on a normalized version of
   strings received from a human. Such protocols include the DNS
   (dispute resolution at the time of domain name registration) and
   protocols using domain names (HTTP, SMTP, etc.), LDAP (elements in
   the DN as well as searches on attribute values), Kerberos (Realms),
   and iSCSI (names of volumes).

   As codepoints can be added to the list at any time, the addition of
   codepoints can affect already normalized strings. Say a registry
   accepts registrations of domain names. If the domain name U+2F868 is
   to be registered, then according to the nameprep profile with
   Unicode 3.2 the string U+2136A is registered. If the registry later
   switches to version 4.0 of Unicode, the question is whether the
   registered string U+2136A is to stay, or whether it should be
   changed to U+36FC. It might even be the case that U+36FC is already
   registered, and by a different domain name holder. The change in
   normalization rules in this case creates a potential dispute that
   has to be resolved.
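
   A registry could look for such cases before switching tables. The
   sketch below is hypothetical throughout: it assumes the registry
   kept the string originally submitted next to the stored form (without
   that, the affected names cannot be found), the holder names are made
   up, and the tables contain only the U+2F868 entry from Appendix A.

```python
OLD_TABLE = {0x2F868: 0x2136A}   # decomposition used with Unicode 3.2
NEW_TABLE = {0x2F868: 0x36FC}    # corrected decomposition in Unicode 4.0


def normalize(text, table):
    """Toy normalizer: map each codepoint through `table`, else keep it."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)


def migration_conflicts(registry, new_table):
    """Find names whose stored form changes under `new_table`.

    `registry` maps stored (normalized) name -> (holder, original input).
    Each result is (stored, new_form, holder, other_holder), where
    other_holder is None when the new form is not yet registered.
    """
    conflicts = []
    for stored, (holder, original) in registry.items():
        new_form = normalize(original, new_table)
        if new_form != stored:
            other = registry.get(new_form)
            conflicts.append(
                (stored, new_form, holder, other[0] if other else None)
            )
    return conflicts


registry = {
    normalize("\U0002F868", OLD_TABLE): ("alice", "\U0002F868"),
    "\u36FC": ("bob", "\u36FC"),
}
conflicts = migration_conflicts(registry, NEW_TABLE)
```

   Here the name registered by "alice" as U+2136A would become U+36FC,
   which "bob" already holds, so the switch cannot be made silently.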

   The IETF strongly encourages the Unicode Consortium to keep the size
   of the correction list to an absolute minimum, as it will be
   impossible for implementations (applications) to know which version
   of the normalization tables is in use. This is because the tables
   in many cases are part of the operating system, as the end user
   expects the same normalization rules to be used in all applications
   in her environment.


Author's Address

   Patrik Faltstrom
   Internet Architecture Board

   EMail: paf@cisco.com


Appendix A. NormalizationCorrections-4.0.0.txt


   # NormalizationCorrections-4.0.0.txt
   #
   # This file is a normative contributory data file in the
   # Unicode Character Database.
   #
   # The normalization stabilization policy of the Unicode
   # Consortium ordinarily precludes any change to the decomposition
   # for any character, once established in a relevant version
   # of the UnicodeData.txt data file. However, under certain
   # exceptional (and rare) conditions, an error in a decomposition
   # mapping may be discovered that is truly just an unintended
   # typo in the data, and not a matter of dubious interpretation.
   #
   # Whenever such an error may be found, and if it meets the
   # requirements for possible exceptions to normalization
   # stability, the correction is entered in this data file,
   # so that any implementation depending on absolute stability
   # of normalization, *including* any errors in the data, can
   # safely reconstruct the exact state of the data tables at
   # any given version of Unicode.
   #
   # Currently this list has exactly six entries in it, one for the
   # typo found and corrected in Corrigendum #3, and five for
   # the typos and misidentifications found and corrected in
   # Corrigendum #4. All efforts will be made to keep the entries
   # limited to just those fixes.
   #
   # Interpretation of the fields:
   #   Field 1: Unicode code point
   #   Field 2: Original (erroneous) decomposition
   #   Field 3: Corrected decomposition
   #   Field 4: Version of Unicode for which the correction was
   #            entered into UnicodeData.txt, in n.n.n format.
   #   Comment: Indicates the Unicode Corrigendum which documents
   #            the correction
   #
   #
   F951;96FB;964B;3.2.0 # Corrigendum 3
   2F868;2136A;36FC;4.0.0 # Corrigendum 4
   2F874;5F33;5F53;4.0.0 # Corrigendum 4
   2F91F;43AB;243AB;4.0.0 # Corrigendum 4
   2F95F;7AAE;7AEE;4.0.0 # Corrigendum 4
   2F9BF;4D57;45D7;4.0.0 # Corrigendum 4
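
   The field layout above is enough to reconstruct which decomposition
   was in force at a given version. The following sketch parses the six
   entries and does so; the function names are hypothetical, and it
   assumes each decomposition field holds a single codepoint, as is the
   case for every entry in this version of the file.

```python
DATA = """\
F951;96FB;964B;3.2.0 # Corrigendum 3
2F868;2136A;36FC;4.0.0 # Corrigendum 4
2F874;5F33;5F53;4.0.0 # Corrigendum 4
2F91F;43AB;243AB;4.0.0 # Corrigendum 4
2F95F;7AAE;7AEE;4.0.0 # Corrigendum 4
2F9BF;4D57;45D7;4.0.0 # Corrigendum 4
"""


def parse_version(text):
    """Turn 'n.n.n' into a tuple of ints so versions compare correctly."""
    return tuple(int(part) for part in text.split("."))


def parse_corrections(data):
    """Return (codepoint, original, corrected, version) for each entry."""
    entries = []
    for line in data.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        code, original, corrected, version = line.split(";")
        entries.append((int(code, 16), int(original, 16),
                        int(corrected, 16), parse_version(version)))
    return entries


def decompositions_at(entries, version):
    """Map each listed codepoint to its decomposition at `version`."""
    target = parse_version(version)
    return {code: (corrected if target >= since else original)
            for code, original, corrected, since in entries}


entries = parse_corrections(DATA)
at_3_2 = decompositions_at(entries, "3.2.0")
at_4_0 = decompositions_at(entries, "4.0.0")
```

   With this data, at_3_2 maps U+2F874 to U+5F33 and at_4_0 maps it to
   U+5F53, while U+F951 maps to U+964B in both.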

Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   intellectual property or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; neither does it represent that it
   has made any effort to identify any such rights. Information on the
   IETF's procedures with respect to rights in standards-track and
   standards-related documentation can be found in BCP-11. Copies of
   claims of rights made available for publication and any assurances of
   licenses to be made available, or the result of an attempt made to
   obtain a general license or permission for the use of such
   proprietary rights by implementors or users of this specification can
   be obtained from the IETF Secretariat.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights which may cover technology that may be required to practice
   this standard. Please address the information to the IETF Executive
   Director.


Full Copyright Statement

   Copyright (C) The Internet Society (2003). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assignees.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.