[idn] Unicode security issues (fwd)
- To: idn@ops.ietf.org
- Subject: [idn] Unicode security issues (fwd)
- From: Bill Manning <bmanning@ISI.EDU>
- Date: Fri, 25 Aug 2000 13:10:21 -0700 (PDT)
- Delivery-date: Fri, 25 Aug 2000 13:10:47 -0700
- Envelope-to: idn-data@psg.com
A break from our discussions of registrars' experimentation with
one form or another of multilingual support.
It seems that the use of Unicode itself will introduce new
integrity/security concerns if/as it is deployed in the DNS.
From the Intrusion Detection wg...
% Borrowed (with permission) from Bruce Schneier's Crypto-Gram newsletter,
% and relevant to our proposed use of UTF-8.
%
% Stuart.
%
% From: Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk>
% Subject: Re: Security Risks of Unicode
%
% > I don't know if anyone has considered the security implications of this.
% [...]
% > - Somebody uses UTF-8 or UTF-16 to encode a conventional character in a
% > novel way to bypass validation checks?
%
% Thanks for reminding your readers about the security issues surrounding the
% UTF-8 encoding of Unicode and ISO 10646 (UCS).
%
% For some time, this and related issues have been of considerable concern to
% us folks on the linux-utf8 at nl.linux.org mailing list, who try to guide
% and accelerate the eventually inevitable migration of the Unix world from
% ASCII and ISO 8859 to UTF-8 (a migration that the Plan 9 operating system
% demonstrated successfully almost a decade ago). New UTF-8 decoders
% deployed in, for instance, GNU glibc 2.2, XFree86 4.0 xterm, and various
% other standard tools have been carefully designed to reject so-called
% overlong UTF-8 sequences as malformed, in order to prevent attackers from
% abusing these decoders to bypass critical ASCII substring tests that are
% applied earlier in the processing pipeline.
%
% It is still very unfortunate that even the latest Unicode 3.0 standard
% (ISBN 0-201-61633-5) contains at the end of section 3.8 on page 47 the
% following paragraph: "When converting from UTF-8 to a Unicode scalar value,
% implementations do not need to check that the shortest encoding is being
% used. This simplifies the conversion algorithm."
%
% This paragraph encourages the fielding of sloppy and dangerous UTF-8
% decoders that will for example convert all of the following five UTF-8
% sequences into a U+000A line-feed control character:
%
% 0xc0 0x8A
% 0xe0 0x80 0x8A
% 0xf0 0x80 0x80 0x8A
% 0xf8 0x80 0x80 0x80 0x8A
% 0xfc 0x80 0x80 0x80 0x80 0x8A
%
% A "safe UTF-8 decoder" should reject them just like malformed sequences for
% two reasons: (1) It helps to debug applications if overlong sequences are
% not treated as valid representations of characters, because this helps to
% spot problems more quickly. (2) Overlong sequences provide alternative
% representations of characters, that could maliciously be used to bypass
% prior ASCII filters. For instance, a 2-byte encoded line feed (LF) would
% not be caught by a line counter that counts only 0x0A bytes, but it would
% still be processed as a line feed by an unsafe UTF-8 decoder later in the
% pipeline.
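The line-feed example above can be reproduced with a short sketch. The
decoder below is deliberately unsafe and is not taken from any real library;
the name naive_decode and the sample data are assumptions made for
illustration only.

```python
def naive_decode(data: bytes) -> str:
    """Decode UTF-8 WITHOUT a shortest-form check (unsafe on purpose)."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        # Pick the payload bits of the lead byte and the number of
        # continuation bytes, following the original 1..6 byte UTF-8 scheme.
        if lead < 0x80:
            cp, extra = lead, 0
        elif lead >> 5 == 0b110:
            cp, extra = lead & 0x1F, 1
        elif lead >> 4 == 0b1110:
            cp, extra = lead & 0x0F, 2
        elif lead >> 3 == 0b11110:
            cp, extra = lead & 0x07, 3
        elif lead >> 2 == 0b111110:
            cp, extra = lead & 0x03, 4
        elif lead >> 1 == 0b1111110:
            cp, extra = lead & 0x01, 5
        else:
            raise ValueError("invalid lead byte")
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) != extra:
            raise ValueError("truncated sequence")
        for byte in tail:
            if byte >> 6 != 0b10:
                raise ValueError("invalid continuation byte")
            cp = (cp << 6) | (byte & 0x3F)
        # The missing step: nothing here checks that cp needed this many bytes.
        out.append(chr(cp))
        i += 1 + extra
    return "".join(out)


OVERLONG_LINE_FEEDS = [
    bytes([0xC0, 0x8A]),
    bytes([0xE0, 0x80, 0x8A]),
    bytes([0xF0, 0x80, 0x80, 0x8A]),
    bytes([0xF8, 0x80, 0x80, 0x80, 0x8A]),
    bytes([0xFC, 0x80, 0x80, 0x80, 0x80, 0x8A]),
]

# A filter that counts 0x0A bytes sees no line feed in this input,
# yet the unsafe decoder later produces one.
smuggled = b"line one" + OVERLONG_LINE_FEEDS[0] + b"line two"
lines_seen_by_filter = smuggled.count(b"\n")
lines_seen_by_decoder = naive_decode(smuggled).count("\n")
```

The byte-level filter and the decoder disagree about the same input, which
is exactly the gap an attacker would use.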
%
% UTF-8 is known to be ASCII compatible, because every existing ASCII file is
% already a correct UTF-8 file and non-ASCII characters do not introduce
% additional occurrences of ASCII bytes. But from a security point of view,
% ASCII compatibility of UTF-8 sequences must also mean that ASCII characters
% are *only* allowed to be represented by ASCII bytes in the range 0x00-0x7F
% and not by any other byte combination. To ensure this often neglected
% aspect of ASCII compatibility, use only "safe UTF-8 decoders" that reject
% overlong UTF-8 sequences for which a shorter encoding exists, for example
% by substituting it with the U+FFFD replacement character.
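As one illustration of the "safe decoder" behavior described above, Python's
built-in UTF-8 codec rejects overlong forms, and its "replace" error handler
substitutes U+FFFD. The wrapper name safe_decode is an assumption made for
this sketch; it is not the xterm or glibc implementation.

```python
def safe_decode(data: bytes) -> str:
    """Decode UTF-8, turning malformed input (including overlong forms)
    into U+FFFD replacement characters instead of real characters."""
    return data.decode("utf-8", errors="replace")


def strict_decode(data: bytes) -> str:
    """Decode UTF-8 and raise UnicodeDecodeError on any malformed input."""
    return data.decode("utf-8", errors="strict")


# The 2-byte overlong line feed never becomes a real U+000A.
decoded = safe_decode(b"a\xc0\x8ab")
contains_line_feed = "\n" in decoded
contains_replacement = "\ufffd" in decoded
```

Whether to replace or to reject outright is a policy choice; the strict
variant is the safer default when the input feeds a security decision.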
%
% It is not true that the check for overlong UTF-8 sequences would add any
% significant speed penalty or complexity to the UTF-8 decoder, as for
% example my implementation of the decoder found in the XFree86 4.0 xterm
% version illustrates. The key to understanding how to implement a safe UTF-8
% decoder both simply and efficiently lies in realizing that a UTF-8
% sequence is overlong if and only if it contains one of the following
% one- or two-byte bit patterns:
%
% 1100000x (10xxxxxx)
% 11100000 100xxxxx (10xxxxxx)
% 11110000 1000xxxx (10xxxxxx 10xxxxxx)
% 11111000 10000xxx (10xxxxxx 10xxxxxx 10xxxxxx)
% 11111100 100000xx (10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx)
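The five bit patterns above translate almost directly into a mask test on
the first one or two bytes of a sequence. This is a minimal sketch, not the
xterm code; the function name starts_overlong is an assumption, and the
function looks only at the start of the sequence it is given.

```python
# (lead byte value, mask applied to the second byte); the masked second
# byte must equal 0x80 for the sequence to be overlong.
TWO_BYTE_PATTERNS = [
    (0xE0, 0xE0),  # 11100000 100xxxxx
    (0xF0, 0xF0),  # 11110000 1000xxxx
    (0xF8, 0xF8),  # 11111000 10000xxx
    (0xFC, 0xFC),  # 11111100 100000xx
]


def starts_overlong(seq: bytes) -> bool:
    """Return True if seq begins with one of the overlong bit patterns."""
    if not seq:
        return False
    lead = seq[0]
    if lead & 0xFE == 0xC0:  # 1100000x: any 2-byte form of an ASCII value
        return True
    if len(seq) < 2:
        return False
    second = seq[1]
    for lead_value, second_mask in TWO_BYTE_PATTERNS:
        if lead == lead_value and second & second_mask == 0x80:
            return True
    return False
```

A decoder can run this check once per multi-byte sequence, so the cost is a
handful of comparisons and no extra pass over the data.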
%
% A UTF-8 decoder robustness test file that allows developers to quickly
% check a UTF-8 decoder for safety is available at
%
% <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>
%
% For instance, major Web browsers still fail the test in section 4.1.1.
%
% More information on UTF-8 under Unix is available at
%
% <http://www.cl.cam.ac.uk/~mgk25/unicode.html>
%
%
% From: Curt Sampson <cjs@cynic.net>
% Subject: Re: Security Risks of Unicode
%
% I have to say I'm rather appalled by your "Security Risks of Unicode"
% article. You have identified a type of security vulnerability in some
% systems, and pointed out that Unicode may increase the incidence of this
% type of vulnerability, but completely missed the source of the
% vulnerability.
%
% As we've seen from your examples of non-Unicode systems that have
% experienced security failures, these problems do not stem from using any
% particular character set or character set interpretation. They stem from
% doing what I like to call "validity guessing," rather than true validity
% checking.
%
% The key factor in all of these cases is that we have two separate programs
% (the validity checker and the application itself) using two separate
% algorithms to interpret data. This is what introduces the potential for a
% security breach: if ever the two programs do not interpret a data stream in
% exactly the same way (and this can easily happen if the two programs are
% not maintained by the same person or group), it may become possible to
% convince the application to do something the validator does not want to
% allow.
%
% When it comes to security, guessing just isn't good enough. This is why,
% when we have parameters from external sources, we use the exec() system
% call to run programs under Unix rather than the system() library
% function. We don't pass random data to the shell for interpretation
% because we can never be sure how a particular implementation of a
% particular shell on a particular system will interpret it. (We can't even
% be sure of what shell we're using -- /bin/sh may be any of a number of
% different programs.)
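The exec()-versus-system() point above has a close analogue in Python's
subprocess module: an argument list is handed to the child process as-is,
with no shell in between. The helper name echo_argument and the hostile
sample string are assumptions made for this sketch.

```python
import subprocess
import sys


def echo_argument(untrusted: str) -> str:
    """Pass untrusted text to a child process as a single argv entry.

    No shell is involved, so characters such as ';' or '$(' have no
    special meaning; the child simply writes the argument back out.
    """
    child = [
        sys.executable,
        "-c",
        "import sys; sys.stdout.write(sys.argv[1])",
        untrusted,
    ]
    result = subprocess.run(child, capture_output=True, text=True, check=True)
    return result.stdout


hostile = "report.txt; echo INJECTED $(id)"
echoed = echo_argument(hostile)
```

Passing the same string through a shell command line would leave its
meaning up to whichever shell happens to be installed, which is the
uncertainty the paragraph above warns about.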
%
% As long as we shift the blame for badly designed security systems to
% external standards that are not the source of the problem, we will have
% insecure systems. Security is something that needs to be built in to
% systems from the beginning, not tacked on with separate programs at the
% end.
%
%
% From: Henry Spencer <henry@spsystems.net>
% Subject: Re: Security Risks of Unicode
%
% You have a point about potential input-validation attacks in Unicode, given
% the much greater complexity of the character set... but I think you have
% missed a couple of more important points.
%
% Trying to analyze the input string for metacharacters, odd delimiters, etc.
% is basically a mistake. I speak as someone who's written code to do this,
% by the way -- it always smelled like a kludge to me, and now I understand
% why.
%
% First, prepending an input validator to a complex interpreter is a
% fundamentally insecure approach. Unless you are prepared to impose truly
% severe restrictions on which features of the interpreter are available --
% in which case, why bother with the interpreter at all? -- the validator
% becomes an attempt to reinvent the interpreter's parser and some of its
% semantic analysis. This is an inherently error-prone approach, as shown by
% various successful input-validation attacks. The validator is a complex
% piece of software which must achieve and maintain an exact relationship
% with the interpreter, which is all the more difficult if the interpreter is
% ill-documented (as most complex interpreters are) and constantly changing
% (ditto).
%
% The right way -- the *only* right way -- to deal with this problem is to
% insist that such interpreters include a show-only mode ("process this input
% and tell me what it would make you do BUT DON'T DO IT"). This can be
% awkward for interpreters with complex programmability and interactions with
% their environment; it may amount to actually running the interpreter, but
% in a controlled and monitored environment with dummy resources. There can
% still be bugs -- unintended differences between the show-only mode and the
% real mode -- but if the interpreter is well organized, almost all of the
% show-only work is being done by the real code rather than a cheap
% independently-maintained fake, and there is at least a fighting chance that
% the behaviors will match.
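A toy version of such a show-only mode might look like the following. The
two-verb command language, the function name interpret, and the perform
callback are all assumptions invented for this sketch; the point is only
that the same parsing code serves both modes.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, str]  # (verb, argument)
KNOWN_VERBS = {"print", "delete"}


def interpret(
    script: str,
    perform: Optional[Callable[[str, str], None]] = None,
) -> List[Action]:
    """Parse a script and return the actions it asks for.

    With perform=None this is the show-only mode: nothing is executed.
    With a callback, each action is also carried out. Both modes share
    the parser, so they cannot disagree about what the input means.
    """
    actions: List[Action] = []
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue
        verb, _, argument = line.partition(" ")
        if verb not in KNOWN_VERBS:
            raise ValueError(f"unknown verb: {verb!r}")
        actions.append((verb, argument))
        if perform is not None:
            perform(verb, argument)
    return actions


script = "print hello\ndelete /tmp/example"
planned = interpret(script)  # show-only: inspect before running

performed: List[Action] = []
executed = interpret(script, lambda verb, arg: performed.append((verb, arg)))
```

A caller can inspect the planned list, refuse any action it dislikes, and
only then run the script for real.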
%
% (A do-only-safe-things mode is also of interest, but not as satisfactory.
% Definitions of safety may not match, and interpreter bugs are arguably more
% likely to affect the outcome.)
%
% Second, less confidently, I have to wonder whether elaborate parsing isn't
% a mistake anyway. When the context is program talking to program, it would
% be better to define the simplest format possible, so that parsing becomes
% trivial and there is no room for misunderstandings. This need not imply
% either binary data formats or simple semantics; for example, one can send a
% complex tree structure in prefix or postfix notation, one node per (text)
% line. Of course, all too often the option isn't available because the
% format is predefined by a 700-page standard, but the possibility is worth
% bearing in mind.
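The "one node per (text) line" idea can be sketched as follows. The line
layout (child count, a space, then the label) is an assumption chosen for
this example, not a format named in the message above; labels must not
contain newlines.

```python
from typing import Iterator, List, Tuple

Tree = Tuple[str, List["Tree"]]  # (label, children)


def dump(tree: Tree) -> List[str]:
    """Serialize a tree in prefix order, one node per text line."""
    label, children = tree
    lines = [f"{len(children)} {label}"]
    for child in children:
        lines.extend(dump(child))
    return lines


def load(lines: Iterator[str]) -> Tree:
    """Rebuild a tree from the lines produced by dump()."""
    count, _, label = next(lines).partition(" ")
    return (label, [load(lines) for _ in range(int(count))])


# (2 * 3) + 4 as a tree
expression: Tree = ("add", [("mul", [("2", []), ("3", [])]), ("4", [])])
wire_format = dump(expression)
round_tripped = load(iter(wire_format))
```

The parser is three lines long, which leaves very little room for two
programs to disagree about what a message means.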
%
%
% From: Michael Smith <smithmb@usa.net>
% Subject: Re: Security Risks of Unicode
%
% Speak of the devil...
%
% Apparently, the dangers of Unicode you discussed in the latest Crypto-Gram
% are not far off. It's already going into use for domain names:
% "Asian-language domain names now available," at
% <http://www.cnn.com/2000/TECH/computing/07/17/asian.domains.idg/index.html>.
%
%
% --
% Stuart Staniford --- President --- Silicon Defense
% stuart@silicondefense.com
% (707) 445-4355 (707) 445-4222 (FAX)
%
%
--
--bill