[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[idn] Common-case optimization of nameprep (was Debunking the ACE myth)



-----BEGIN PGP SIGNED MESSAGE-----

David Hopwood wrote:
> "Adam M. Costello" wrote:
> > I doubt that checking whether a name is nameprep'd is much cheaper than
> > performing nameprep on it.
> 
> It is, actually: see UAX #15 Annexes 3 and 8, at
> http://www.unicode.org/unicode/reports/tr15/
> 
> (The code in Annex 8 would need to be modified to also check for lowercase.

Correction: it is slightly more complicated than just checking for lowercase
(but this doesn't change the efficiency).

Let getCanonicalClassX(ch) be the Canonical Combining Class of ch, except that
it returns a negative value if ch \in X. The set X is defined as the union of:

 - characters that are mapped to something else (or mapped out) by the nameprep
   mapping step,
 - characters for which the NFC flag in DerivedNormalizationProperties.txt
   is either NFC_NO or NFC_MAYBE,
 - characters that are prohibited in the output of nameprep (i.e. category D).

[X is a superset of MN union D. It would be interesting to see what kind
of characters, if any, are in X \ (MN union D); I'll try to work that out.]

Then the Java-like pseudocode would look like this:

public String nameprep(String source) {
    short lastCanonicalClass = 0;

    for (int i = 0; i < source.length(); ++i) {
        char ch = source.charAt(i);
        short canonicalClass = getCanonicalClassX(ch);
        if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
            return fullNameprep(source, i);
        }
        lastCanonicalClass = canonicalClass;
    }
    // The string was already nameprepped, and has no disallowed characters.
    return string;
}

public String fullNameprep(String source, int i) {
    // Full nameprep algorithm (including checking for disallowed characters),
    // where the first character that may need mapping is at index i.
}

> It also has to be checked that the name is a valid UTF-8 encoding.)

This can be done in the same loop.

- -- 
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5  0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip


-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBO2C+eDkCAxeYt5gVAQEU/QgAj47s6KYVxnzTI2DPrfK6dyv3pGh93o9o
y5y7OJIrSyJMq2bsAZZ0vGw+NF+ysQilY+glcR3ZsZ3xEQuh9lW3Rc0P8AvPEHLu
6OB2OwsWPm8LIxg1zuKokOwf11nSGZnniYOwntkNUVgSEC0PeQ0Ry8B4vAL4dgGl
Xu5Emk5DUw0mOLZ1BTJ/5XuyTv55xTpnCNY4/BMTWW9hD+X9BX/XNCe9H2zWLBTw
/m4eGhyulPPQwO2eF80Xu3HDSLaslSHO/upnAcfSz4jzpC8tOeeN19+frcZ4PTYF
S8YBodfWhZYOiD3W80yxXZteRcCFCkj4I/OXaAqyBQdLFn9M4JWpSg==
=rlhv
-----END PGP SIGNATURE-----