
RE: RFC 3530



Patrik Fältström wrote: 
> As a liaison to the Unicode Consortium, I received the following 
> comments on RFC 3530.

Lucky you :-)

> Personally, I probably should have found at least the first of these 
> issues, so I have mostly myself to blame for not finding these errors 
> earlier.

Many, many people looked at this spec.  And the basic responsibility for
this is the authors'.

> That said, I think a revision/addendum of the document ("Notes for use 
> of Unicode with NFS version 4" or something like that) is needed given 
> the comments below. Else there is a big risk we will end up with 
> non-interoperable implementations.

Only the first of these has *any* potential in that regard.  I don't know
how any of the others could lead to non-interoperable implementations.
If we are going to do some sort of addendum, then we might as well address
these other wording issues, but they don't seem to me, on their own, to
motivate any sort of addendum.

As for interoperability testing, my guess is that almost all of it so
far has been with servers that implement case-sensitive comparisons.
The exception would probably be the Hummingbird server, but I don't
believe that most clients would be affected if a server's use of
case-insensitive comparisons were somewhat non-standard.

  
> Mark, if such an addendum is created, I presume you have time to help 
> the editors/authors find the correct wording, or that you can find a 
> person who can help?
>
>    Regards, Patrik
>
> > Begin forwarded message:
> > 
> > From: "Mark Davis" <mark.davis@jtcsv.com>
> > Date: fre jun 20, 2003  21:19:00 Europe/Stockholm
> > To: Patrik Fältström <paf@cisco.com>
> > Cc: "Paul Hoffman / IMC" <phoffman@imc.org>, François Yergeau 
> > <francois@yergeau.com>, "Martin Duerst" <duerst@w3.org>
> > Subject: RFC 3530
> >
> > Patrik,
> >
> > I was recently pointed to RFC 3530. The incorporation of UTF-8 into
> > the standard is very welcome, but I found a few problems in the text.
> > It was very unclear from the document who to forward the comments to,
> > so as liaison could you forward them?
> >
> > Here are the problematic passages:
> >
> > 1   With respect to the case_insensitive and case_preserving attributes,
> >    each UCS-4 character (which UTF-8 encodes) has a "long descriptive
> >    name" [RFC1345] which may or may not included the word "CAPITAL" or
> >    "SMALL".  The presence of SMALL or CAPITAL allows an NFS server to
> >    implement unambiguous and efficient table driven mappings for case
> >    insensitive comparisons, and non-case-preserving storage.  For
> >    general character handling and internationalization issues, see the
> >    section "Internationalization".
> >
> > This is *not* a reliable guide to the case of letters. A case variant
> > *cannot* be found by simply replacing SMALL by CAPITAL or vice versa.

So, just to make all this a little more definite, can we have an
example or two where the procedure suggested above would give the wrong 
answer?
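
To make the question concrete, here is a minimal sketch that compares a
name-based swap against the Unicode case mappings.  It assumes Python's
standard unicodedata module as the source of character names; RFC 3530
does not spell out an exact procedure, so the helper name_swap_upper is
hypothetical and only one reading of "replace SMALL by CAPITAL".

```python
import unicodedata


def name_swap_upper(ch):
    """Hypothetical helper: uppercase `ch` by swapping SMALL for CAPITAL
    in its character name. Returns None when the name contains no SMALL
    or when no character carries the swapped name."""
    name = unicodedata.name(ch, "")
    if "SMALL" not in name:
        return None
    try:
        return unicodedata.lookup(name.replace("SMALL", "CAPITAL"))
    except KeyError:
        return None


# Compare the name-based result with the Unicode uppercase mapping.
for ch in ["a", "\u0131", "\u03c2", "\u00df", "\u00b5"]:
    swapped = name_swap_upper(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"name swap -> {swapped!r}, upper() -> {ch.upper()!r}")
```

In this sketch the swap works for "a" but finds nothing for U+0131
(LATIN SMALL LETTER DOTLESS I, whose uppercase is plain "I"), for U+03C2
(GREEK SMALL LETTER FINAL SIGMA, whose uppercase is U+03A3) or for
U+00B5 (MICRO SIGN, which has an uppercase but no SMALL in its name).
For U+00DF the full Unicode uppercase mapping is the two-letter string
"SS", which a one-to-one name swap cannot produce.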

> >
> > Suggested revision:
> >
> > An NFS server can implement unambiguous and efficient table driven
> > mappings for case insensitive comparisons, and non-case-preserving
> > storage, either by using the Unicode Consortium case-mapping tables,
> > or using the Stringprep tables derived from the Unicode sources.  For
> > general character handling and internationalization issues, see the
> > section "Internationalization".

So here's what I'm worried about.  If there is a case where the RFC3530
procedure and the one suggested give different answers, as I suppose
there is, then don't we have the issue of a change in the protocol?

I think that what is being said is that RFC3530's handling of this is a
mistake and undesirable on various grounds.  However, it could be
implemented, and interoperable implementations would result.  It is just
that the case mappings might be "wrong" according to certain external
criteria.

If we have an addendum/revision that specifies this new mapping then
that clearly could be implemented in an interoperable fashion, but
there would be the possibility of mis-interoperability with someone
who implemented the RFC3530 approach.

To evaluate what to do about this situation, we need to know a little more
about the problem than simply that the mapping specified in RFC3530 is wrong.
We need to know how big the problem is.

What I would hope is that we could defer this issue to the next minor
version of the protocol, although that might not be possible.  It would
depend on the scope of the problem.

> >
> > 2   Stringprep discusses Unicode characters, whereas NFS version 4
> >    renders UTF-8 characters.  Since there is a one to one mapping from
> >    UTF-8 to Unicode, where ever the remainder of this document refers to
> >    to Unicode, the reader should assume UTF-8.
> >
> > These statements are misleading. Unicode characters have numeric
> > values in the range from 0 to 0x10FFFF. These numbers are encoded in
> > different ways:
> > UTF-8 uses 1 to 4 eight-bit bytes per character
> > UTF-16 uses 1 to 2 sixteen-bit words per character
> > UTF-32 uses 1 thirty-two-bit word per character.
> > All of these are valid Unicode.

Right, but the reader is in no doubt which of those is being
specified.
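
For reference, the code-unit counts in the quoted list can be checked
with a short sketch.  It assumes Python's built-in codecs; the helper
name code_units is made up for illustration.

```python
def code_units(ch):
    """Return how many UTF-8 bytes, UTF-16 words and UTF-32 words
    are needed to encode the single character `ch`."""
    utf8 = len(ch.encode("utf-8"))             # 8-bit units
    utf16 = len(ch.encode("utf-16-le")) // 2   # 16-bit units, no BOM
    utf32 = len(ch.encode("utf-32-le")) // 4   # 32-bit units, no BOM
    return utf8, utf16, utf32


# One sample from each UTF-8 length class, ending at the top of the range.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600", "\U0010FFFF"]:
    print(f"U+{ord(ch):04X}: {code_units(ch)}")
```

Every character above U+FFFF takes four bytes in UTF-8 and two words
in UTF-16, which is also the boundary that matters for item 3 below.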

> >
> > Suggested revision:
> >
> > Wherever the remainder of this document refers to Unicode, the
> > reader should assume the UTF-8 encoding of Unicode.

Clearly more correct, but I don't see how anybody would be misled
by what is there now.

> >
> > 3   Where the client supplied string is valid UTF-8 but contains
> >    characters that are not supported by the server as a value for that
> >    string (e.g., names containing characters that have more than two
> >    octets on a filesystem that supports Unicode characters only), the
> >    server should return an NFS4ERR_BADCHAR error.
> >
> > The example doesn't make sense. All Unicode characters are expressible
> > in UTF-8, and all characters expressible in UTF-8 are Unicode
> > characters (as above).

I think this one is my fault.  I had assumed that Unicode was a
two-byte code (i.e. that it covered only characters up to 0xffff).
Maybe that was true at one time, but in any case, it isn't now.
Sorry about that.

> >
> > Suggested revision:
> >
> > Where the client supplied string is valid UTF-8 but contains
> > characters that are not supported by the server as a value for that
> > string (e.g., if the server doesn't support names containing
> > characters greater than U+FFFF), the server should return an
> > NFS4ERR_BADCHAR error.

Obviously better.  However, the fact that you were able to figure
out what was intended, plus the fact that this is within an example,
indicates to me that this kind of thing can wait unless we
have some other need to re-issue.
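
As a rough sketch of what the revised example describes, a server that
supports only characters up to U+FFFF might check a name as below.  The
function check_name and its max_code_point parameter are hypothetical,
the error names are plain strings rather than the protocol's numeric
codes, and returning NFS4ERR_INVAL for invalid UTF-8 is my reading of
the spec rather than something stated in this thread.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_INVAL = "NFS4ERR_INVAL"      # assumed result for invalid UTF-8
NFS4ERR_BADCHAR = "NFS4ERR_BADCHAR"  # valid UTF-8, unsupported character


def check_name(raw, max_code_point=0xFFFF):
    """Hypothetical server-side check of a client-supplied name.

    `raw` is the name as received (bytes); `max_code_point` is the
    highest character value this imaginary server can store."""
    try:
        name = raw.decode("utf-8")  # strict decoding rejects bad UTF-8
    except UnicodeDecodeError:
        return NFS4ERR_INVAL
    if any(ord(c) > max_code_point for c in name):
        return NFS4ERR_BADCHAR
    return NFS4_OK


print(check_name("caf\u00e9".encode("utf-8")))        # within U+FFFF
print(check_name("\U0001F600.txt".encode("utf-8")))   # above U+FFFF
print(check_name(b"\xff\xfe"))                        # not valid UTF-8
```

The point of the sketch is only that the limit is stated in terms of
character values, not in terms of how many octets the UTF-8 form takes.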
 
>
> 4   ... The UTF-8 encoding of the UCS as
>    defined by [ISO10646] allows for this type of access and follows the
>    policy described in "IETF Policy on Character Sets and Languages",
>    [RFC2277].
>
> This is not an error, but people may think that the UTF-8 definition
> in 10646 is not Unicode. I think François Yergeau can suggest better
> wording based on the new version of RFC2279.

I think we can always improve the wording, but given that the spec has 
been issued, that kind of thing is something to note for the future.

>
> Note: the proposal should be checked for grammar, e.g. "if it's post
> processed form collides"

I think that's in the category of should-have-been's at this point.