XML and Cocoa

So there I am, happily hacking away at a new iPhone app, when suddenly I have cause to execute a web request and parse some response XHTML. (It happens more often than you might think.) And since I was interested in keeping this code clean and well-architected, I was presented with a dilemma: what’s the “right” way to do this?

The first instinct of a lot of coders (including a couple of people I asked) would be to use regular expressions. However, as any Stack Overflow regular knows, applying a regex to XHTML summons the dark god Cthulhu, so that wasn’t the best solution. (While the link is humorous, the answer is real: regexes generally aren’t powerful enough to parse XHTML reliably, because a regular expression can’t keep track of arbitrarily nested tags.)
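
For a concrete taste of the problem, here is a minimal sketch using NSRegularExpression. The function name FirstDivContentsByRegex is made up for this example. It asks for the contents of the first div with an innocent-looking pattern, and nesting trips it up right away.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns whatever a naive "contents of the first div"
// regex captures, or nil if nothing matches.
static NSString *FirstDivContentsByRegex(NSString *markup) {
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"<div>(.*?)</div>"
                                                  options:NSRegularExpressionDotMatchesLineSeparators
                                                    error:NULL];
    NSTextCheckingResult *match =
        [regex firstMatchInString:markup
                          options:0
                            range:NSMakeRange(0, markup.length)];
    // Capture group 1 is the text between the opening and closing tags.
    return match ? [markup substringWithRange:[match rangeAtIndex:1]] : nil;
}
```

Given `<div>a<div>b</div>c</div>`, the capture comes back as `a<div>b`: the lazy match stops at the first closing tag it sees, which belongs to the inner div. Making the match greedy fixes this input but swallows sibling divs instead.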

The platform constraints came into play a bit here, as the next possible solution was to use an NSXMLDocument, which contains nodes that can be traversed to find relevant information in a well-structured XHTML document. Unfortunately, NSXMLDocument isn’t available on iOS, so I turned to its cousin NSXMLParser, an event-driven parser that reports each element to a delegate as it streams through the document rather than building a tree.

This seemed like it would work - it’s certainly the Apple-sanctioned way of parsing through an XML document. Before we go further, however, let’s take a look at the relevant HTML snippet being parsed, with no alterations whatsoever (including whitespace):

<pre>
<div class="mainContainer">
<h2>Network Usage Summary</h2><br />
<table class="ms-rteTable-1" cellpadding="4px">
<tr class="ms-rteTableHeaderRow-1">
<td>Bandwidth Class</td>
<td>Policy Bytes Received</td>
<td>Policy Bytes Sent</td>
<td>Actual Bytes Received</td>
<td>Actual Bytes Sent</td></tr>

<tr class="ms-rteTableOddRow-1">
<td>Unrestricted</td>
<td>1,009.81 MB</td>
<td>35.00 MB</td>
<td>1,346.41 MB</td>
<td>46.66 MB</td></tr><br />
</table>

<h2>Network Usage Details</h2><br />
<table class="ms-rteTable-1" cellpadding="4px">
<tr class="ms-rteTableHeaderRow-1">
<td>Network Address</td>
<td>Host</td>
<td>comment</td>
<td>Policy Bytes Received</td>
<td>Policy Bytes Sent</td>
<td>Actual Bytes Received</td>
<td>Actual Bytes Sent</td></tr>

<tr class="ms-rteTableOddRow-1">
<td>00:1E:90:A0:AD:80</td>
<td>wozniak</td>
<td>Created by DHCP authentication service</td>
<td>906.41 MB</td>
<td>22.75 MB</td>
<td>1,208.54 MB</td>
<td>30.33 MB</td></tr>

<tr class="ms-rteTableEvenRow-1">
<td>00:1F:3B:93:1E:2B</td>
<td>ekltl-1</td>
<td>Created by DHCP authentication service</td>
<td>0.03 MB</td>
<td>0.01 MB</td>
<td>0.04 MB</td>
<td>0.02 MB</td></tr>

<tr class="ms-rteTableOddRow-1">
<td>60:33:4B:29:14:7D</td>
<td>mossberg</td>
<td>Created by DHCP authentication service</td>
<td>90.09 MB</td>
<td>10.66 MB</td>
<td>120.12 MB</td>
<td>14.21 MB</td></tr>

<tr class="ms-rteTableEvenRow-1">
<td>D8:A2:5E:2A:EC:BB</td>
<td>Tim-Ekls-iPad</td>
<td>Created by DHCP authentication service</td>
<td>8.43 MB</td>
<td>0.89 MB</td>
<td>11.24 MB</td>
<td>1.18 MB</td></tr>

<tr class="ms-rteTableOddRow-1">
<td>F8:1E:DF:8F:21:59</td>
<td>Tim-Ekls-iPhone</td>
<td>Created by DHCP authentication service</td>
<td>4.85 MB</td>
<td>0.69 MB</td>
<td>6.46 MB</td>
<td>0.91 MB</td></tr><br />
</table><br />
</div></pre>

This block was nested inside a master layout table and adjacent to several other tables, so any parser would have to be very specific as to which table it’s parsing through. I actually started coding up this solution and spent about an hour trying to figure out exactly how to specify “in the master layout table, inside the first table with class ms-rteTable-1, in the second tr tag, the textual contents of any td element.”

That got ugly fast.
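
For a sense of why, here is a rough sketch of the state tracking an NSXMLParser delegate needs just to answer that one question. It is an illustration rather than the original code: the names UsageRowCollector and CollectFirstOddRowCells are hypothetical, and it assumes ARC and a response that is well-formed XML.

```objc
#import <Foundation/Foundation.h>

// Hypothetical delegate: collects the text of each td in the odd data row of
// the first ms-rteTable-1 table inside <div class="mainContainer">.
@interface UsageRowCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, readonly) NSMutableArray *cells;  // td text, in document order
@end

@implementation UsageRowCollector {
    NSInteger _containerDepth;   // div nesting inside mainContainer; 0 means outside
    NSInteger _tablesSeen;       // matching tables seen inside the container so far
    BOOL _inTargetTable;
    BOOL _inTargetRow;
    NSMutableString *_cellText;  // non-nil only while inside a wanted td
}

- (instancetype)init {
    if ((self = [super init])) {
        _cells = [NSMutableArray array];
    }
    return self;
}

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)name
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary *)attrs {
    NSString *cls = [attrs objectForKey:@"class"];
    if ([name isEqualToString:@"div"]) {
        if (_containerDepth > 0) {
            _containerDepth++;                       // nested div inside the container
        } else if ([cls isEqualToString:@"mainContainer"]) {
            _containerDepth = 1;
        }
    } else if ([name isEqualToString:@"table"]) {
        if (_containerDepth == 1 && [cls isEqualToString:@"ms-rteTable-1"]) {
            _inTargetTable = (++_tablesSeen == 1);   // only the first matching table
        }
    } else if ([name isEqualToString:@"tr"]) {
        _inTargetRow = _inTargetTable && [cls isEqualToString:@"ms-rteTableOddRow-1"];
    } else if ([name isEqualToString:@"td"] && _inTargetRow) {
        _cellText = [NSMutableString string];
    }
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [_cellText appendString:string];   // a no-op when _cellText is nil
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)name
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName {
    // Note: nested tables are not handled; that would take yet more state.
    if ([name isEqualToString:@"td"] && _cellText) {
        [_cells addObject:_cellText];
        _cellText = nil;
    } else if ([name isEqualToString:@"tr"]) {
        _inTargetRow = NO;
    } else if ([name isEqualToString:@"table"]) {
        _inTargetTable = NO;
    } else if ([name isEqualToString:@"div"] && _containerDepth > 0) {
        _containerDepth--;
    }
}
@end

// Convenience wrapper: returns the collected td strings, or nil if parsing fails.
static NSArray *CollectFirstOddRowCells(NSData *xml) {
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:xml];
    UsageRowCollector *collector = [[UsageRowCollector alloc] init];
    parser.delegate = collector;
    return [parser parse] ? collector.cells : nil;
}
```

Five pieces of state and three callbacks for a single hard-coded query, and that is before accounting for the master layout table or any nested tables.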

So back to the drawing board. The key in this situation was noticing that the exact query I wanted looked almost like a jQuery selector - I wanted to walk a tree of XHTML nodes, filter on certain conditions at each node, and provide an array of results for matching nodes. After a little research, I found an equivalent technology that’s standardized outside the JavaScript world: XPath. Even better, Matt Gallagher had a phenomenal post about using libxml2 and some custom functions to run quick XPath queries on iOS platforms (obligatory plug for the excellent blog Cocoa With Love, which has helped me out during app development countless times).

After linking libxml2 into the project (hint: add it from Xcode’s built-in list of libraries rather than locating the file on disk, so that the project builds for both Simulator and device) and writing a single XPath query, I was up and running with my XHTML parsing in less than a fifth of the time I had spent trying to work out the parser intricacies on my own. The query itself was a single line of code (broken into several here for readability):

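// Every td in the odd (data) row of the first ms-rteTable-1 table
// that sits directly inside the mainContainer div.
NSString * query = @"//div[@class='mainContainer']"
                    "/table[@class='ms-rteTable-1'][1]"
                    "/tr[@class='ms-rteTableOddRow-1']/td";
NSArray * results = PerformHTMLXPathQuery(_data, query);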
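
For the curious, a helper along those lines boils down to a handful of libxml2 calls. The sketch below is not the Cocoa With Love code. TextForHTMLXPathQuery is a hypothetical, stripped-down stand-in that returns only the text of each matched node, and it assumes ARC plus libxml2 being linked and on the header search path.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/xpath.h>

// Hypothetical helper (an assumption for illustration): returns the text
// content of every node matched by an XPath query, or nil on failure.
static NSArray *TextForHTMLXPathQuery(NSData *html, NSString *query) {
    // libxml2's HTML parser tolerates markup a strict XML parser would reject.
    htmlDocPtr doc = htmlReadMemory(html.bytes, (int)html.length, NULL, NULL,
                                    HTML_PARSE_NOWARNING | HTML_PARSE_NOERROR);
    if (doc == NULL) return nil;

    NSMutableArray *texts = nil;
    xmlXPathContextPtr context = xmlXPathNewContext(doc);
    xmlXPathObjectPtr found = context
        ? xmlXPathEvalExpression((const xmlChar *)query.UTF8String, context)
        : NULL;

    if (found != NULL) {
        texts = [NSMutableArray array];
        xmlNodeSetPtr nodes = found->nodesetval;  // NULL if the result is not a node set
        for (int i = 0; nodes != NULL && i < nodes->nodeNr; i++) {
            // xmlNodeGetContent returns a copy that the caller must free.
            xmlChar *content = xmlNodeGetContent(nodes->nodeTab[i]);
            if (content != NULL) {
                NSString *text = [NSString stringWithUTF8String:(const char *)content];
                if (text) [texts addObject:text];
                xmlFree(content);
            }
        }
        xmlXPathFreeObject(found);
    }

    // libxml2 objects are plain C allocations, so ARC does not release them.
    if (context) xmlXPathFreeContext(context);
    xmlFreeDoc(doc);
    return texts;
}
```

In Xcode this typically means linking libxml2 and adding the SDK's libxml2 include directory (usually /usr/include/libxml2) to the header search paths. Called with the query above, a function like this would hand back one string per td in the matched row.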

Moral of the story: use the right tool for the right job. And don’t ever write a regex to parse HTML.