[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[idn] Common-case optimization of nameprep (was Debunking the ACE myth)
-----BEGIN PGP SIGNED MESSAGE-----
David Hopwood wrote:
> "Adam M. Costello" wrote:
> > I doubt that checking whether a name is nameprep'd is much cheaper than
> > performing nameprep on it.
>
> It is, actually: see UAX #15 Annexes 3 and 8, at
> http://www.unicode.org/unicode/reports/tr15/
>
> (The code in Annex 8 would need to be modified to also check for lowercase.
Correction: it is slightly more complicated than just checking for lowercase
(but this doesn't change the efficiency).
Let getCanonicalClassX(ch) be the Canonical Combining Class of ch, except that
it returns a negative value if ch \in X. The set X is defined as the union of:
- characters that are mapped to something else (or mapped out) by the nameprep
mapping step,
- characters for which the NFC flag in DerivedNormalizationProperties.txt
is either NFC_NO or NFC_MAYBE,
- characters that are prohibited in the output of nameprep (i.e. category D).
[X is a superset of MN union D. It would be interesting to see what kind
of characters, if any, are in X \ (MN union D); I'll try to work that out.]
Then the Java-like pseudocode would look like this:
public String nameprep(String source) {
short lastCanonicalClass = 0;
for (int i = 0; i < source.length(); ++i) {
char ch = source.charAt(i);
short canonicalClass = getCanonicalClassX(ch);
if (lastCanonicalClass > canonicalClass && canonicalClass != 0) {
return fullNameprep(source, i);
}
lastCanonicalClass = canonicalClass;
}
// The string was already nameprepped, and has no disallowed characters.
return string;
}
public String fullNameprep(String source, int i) {
// Full nameprep algorithm (including checking for disallowed characters),
// where the first character that may need mapping is at index i.
}
> It also has to be checked that the name is a valid UTF-8 encoding.)
This can be done in the same loop.
- --
David Hopwood <david.hopwood@zetnet.co.uk>
Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip
-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv
iQEVAwUBO2C+eDkCAxeYt5gVAQEU/QgAj47s6KYVxnzTI2DPrfK6dyv3pGh93o9o
y5y7OJIrSyJMq2bsAZZ0vGw+NF+ysQilY+glcR3ZsZ3xEQuh9lW3Rc0P8AvPEHLu
6OB2OwsWPm8LIxg1zuKokOwf11nSGZnniYOwntkNUVgSEC0PeQ0Ry8B4vAL4dgGl
Xu5Emk5DUw0mOLZ1BTJ/5XuyTv55xTpnCNY4/BMTWW9hD+X9BX/XNCe9H2zWLBTw
/m4eGhyulPPQwO2eF80Xu3HDSLaslSHO/upnAcfSz4jzpC8tOeeN19+frcZ4PTYF
S8YBodfWhZYOiD3W80yxXZteRcCFCkj4I/OXaAqyBQdLFn9M4JWpSg==
=rlhv
-----END PGP SIGNATURE-----