Cocoa’s NSString doesn’t have great support for search and replace at the best of times. But the absence of the NSPredicate class from the iPhone SDK means that there is no quick way to strip a whole range of characters out of an NSString in one go on the iPhone. This article shows one way to do so.

Control Characters

Let’s say I have an input HTML string (which I do), which potentially contains some control characters (which it does). By “control characters” I mean anything in the range 0x00 to 0x1F, except for tabs, newlines, form feeds and carriage returns. I need to strip out these characters (if they exist) before passing the string to HTML tidy to be converted into XHTML. If I don’t, then when I pass the output XHTML string to libxml, it will complain that my file is not valid XML 1.0. (XML 1.0 doesn’t allow certain control characters, although XML 1.1 is slightly more relaxed about them.)

I can easily replace all instances of a single character in a string using NSString’s stringByReplacingOccurrencesOfString:withString: instance method. But I don’t have an NSString method to replace all instances of multiple characters in one go, and I don’t want to call the replace method 28 times.

In my case, I actually want to replace each character with an empty string, to strip them out completely. The Mac OS X SDK uses the NSPredicate class for this kind of thing, but it’s not available in the iPhone SDK. However, the iPhone SDK does contain the NSScanner class, and we can use this to perform a similar task.

Here’s the code I use to do this (where sourceHTMLString is an NSString containing the entire HTML source, including any control characters):


// get a scanner, initialised with our input string
NSScanner *sourceHTMLScanner = [NSScanner scannerWithString:sourceHTMLString];
// create a mutable output string (empty for now)
NSMutableString *cleanedSourceHTMLString = [[NSMutableString alloc] init];

// create an array of all control characters between 0x00 and 0x1F, apart from \t, \n, \f and \r (which are at code points 0x09, 0x0A, 0x0C and 0x0D respectively)
unichar thisCharCode[28];
int i;
// 0x00 to 0x08 fill indexes 0 to 8
for (i = 0x00; i <= 0x08; i++) {
    thisCharCode[i] = i;
}
// 0x0B fills index 9
thisCharCode[9] = 0x0B;
// 0x0E to 0x1F fill indexes 10 to 27
for (i = 0x0E; i <= 0x1F; i++) {
    thisCharCode[i - 4] = i;
}

// convert this array into an NSCharacterSet
// (stringWithCharacters:length: takes an explicit length, so the 0x00 at index 0 does not end the string early)
NSString *controlCharString = [NSString stringWithCharacters:thisCharCode length:28];
NSCharacterSet *controlCharSet = [NSCharacterSet characterSetWithCharactersInString:controlCharString];

// request that the scanner ignores these characters
[sourceHTMLScanner setCharactersToBeSkipped:controlCharSet];

// run through the string to remove control characters
while ([sourceHTMLScanner isAtEnd] == NO) {
    NSString *outString;
    // scan up to the next instance of one of the control characters
    if ([sourceHTMLScanner scanUpToCharactersFromSet:controlCharSet intoString:&outString]) {
        // add the string chunk to our output string
        [cleanedSourceHTMLString appendString:outString];
    }
}

Note that this code alloc’s cleanedSourceHTMLString, so you’ll need to release it later on when you are done with it.
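
For comparison, here is a minimal sketch of the same clean-up wrapped in a reusable function. It swaps the scanner loop for a different technique: splitting the string at every unwanted character with componentsSeparatedByCharactersInSet: and joining the pieces back together. It also builds the character set from a range rather than a char array. The function name StringByStrippingControlCharacters is hypothetical, and I am assuming these Foundation methods are available in the SDK version you are targeting, so check that before relying on it.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the original code): returns a copy of
// `input` with the control characters 0x00-0x1F removed, keeping
// \t, \n, \f and \r, the same 28 characters stripped by the scanner version.
static NSString *StringByStrippingControlCharacters(NSString *input) {
    // Start with the whole range 0x00-0x1F (32 code points)...
    NSMutableCharacterSet *controlCharSet =
        [NSMutableCharacterSet characterSetWithRange:NSMakeRange(0x00, 0x20)];
    // ...then take out the four characters we want to keep.
    [controlCharSet removeCharactersInString:@"\t\n\f\r"];

    // Split the input at every unwanted character, then rejoin the chunks
    // with nothing in between.
    NSArray *chunks = [input componentsSeparatedByCharactersInSet:controlCharSet];
    return [chunks componentsJoinedByString:@""];
}
```

Passing @"Hello\x01, wor\x1Fld\n" to this function should give back @"Hello, world\n". The returned string is autoreleased, so unlike cleanedSourceHTMLString above it does not need a matching release. The trade-off is that it builds an intermediate array of chunks, which may matter for very large HTML sources.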