Empty XHTML tags and Internet Explorer DOM traversal

Here’s the problem: HTML and XHTML pages containing empty elements with no end tag, such as <span/>, break JavaScript DOM traversal methods in Internet Explorer 6, 7 and 8, resulting in nodes after such an element showing up in more than one node’s childNodes collection.

Consider the following XHTML document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <title>Test</title>
    <script type="text/javascript">
        function show() {
            var span = document.getElementById("span");
            alert(span.innerHTML);
        }
    </script>
</head>
<body onload="show();">
<p id="p1">Paragraph containing some text followed by an empty span<span id="span"/></p>
<p id="p2">Second paragraph just containing text</p>
</body>
</html>

The idea is that when the page loads, the JavaScript will get a reference to the empty span and display its HTML contents. That will be an empty string, right? Not in IE it won’t.

In IE, you get the following:

 Second paragraph just containing text

Now, while <span/> may be valid in documents with an XHTML Strict doctype, this makes not a jot of difference to IE, which always parses XHTML as HTML, regardless of doctype. What seems to happen is that IE finds the offending span tag (henceforth known as ‘Bad Span’) and ignores the closing slash. Whether it does this because it knows that such a tag is invalid in HTML or because it always ignores closing slashes, I’m not sure. Whatever the reason, the result is that IE scans for a corresponding </span> and, since none is forthcoming, skips over the </p> (which it presumably also considers invalid without a matching <p> inside the span) and carries on to the end of the document, adding subsequent nodes to Bad Span’s childNodes collection as it goes.

This is not good news. Consider the second <p> element. IE somewhat contrarily knows enough about the proper structure of the document to place it in the childNodes collection of the body element, but as shown above, it also shows up in Bad Span’s childNodes, meaning a node can effectively have multiple parents. The DOM is no longer a hierarchy but a map. You may be curious to know which of its parents the second paragraph considers its real parent. The answer (from its parentNode property) is Bad Span.
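To see which nodes are affected on a given page, something like the following rough sketch could help. The function name findSharedNodes is my own invention, not a DOM API; it simply walks childNodes collections and reports any node that turns up in more than one of them. It sticks to plain loops because IE 6 to 8 lack Array.prototype.indexOf.

```javascript
// Hypothetical diagnostic helper (not a DOM API): returns every node that
// appears in more than one childNodes collection beneath root.
function findSharedNodes(root) {
    var seen = [];    // nodes already met in some childNodes collection
    var shared = [];  // nodes met a second time via another collection

    // Linear search by identity; slow, but works in old IE.
    function contains(list, node) {
        for (var i = 0; i < list.length; i++) {
            if (list[i] === node) {
                return true;
            }
        }
        return false;
    }

    function walk(node) {
        var children = node.childNodes || [];
        for (var i = 0; i < children.length; i++) {
            var child = children[i];
            if (contains(seen, child)) {
                // Second sighting: some other node also lists this child.
                if (!contains(shared, child)) {
                    shared.push(child);
                }
            } else {
                seen.push(child);
                walk(child);
            }
        }
    }

    walk(root);
    return shared;
}

// In a browser: var suspects = findSharedNodes(document.body);
```

On the test page above, I would expect this to report the second paragraph when run in IE and nothing at all in other browsers.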

Now, an unsuspecting piece of JavaScript could easily run into problems. Imagine you had a script that acted on every span in the page. Let’s say for the sake of an example you wanted to set every span’s contents to just be the text “[SPAN]”. You might do something like the following:

function changeSpans(node) {
    if (node.nodeType == 1 && node.nodeName == "SPAN") {
        node.innerHTML = "[SPAN]";
    } else {
        // Traverse the node's children to find more spans
        for (var i = 0; i < node.childNodes.length; i++) {
            changeSpans(node.childNodes[i]);
        }
    }
}

changeSpans(document.body);

Run this in every other browser and you get what you expect, but run this in IE and your second paragraph (in fact, anything after Bad Span) gets wiped out. This is a contrived example, but real-life scripts that recursively traverse and act on the DOM could easily fall foul of this problem in IE. At best, you could end up traversing some nodes multiple times, which could be significant if your code is not expecting it.

It’s not easy to work around, either. I have found no simple way to detect empty elements like Bad Span, since the closing slash is not present in the span’s parent element’s innerHTML property or its own outerHTML property.

The only strategies I can see are:

- Traverse the document from start to finish, keeping a collection of the first parent node you see for each node and ignoring any node whose childNodes collection contains a node you’ve already seen a different parent for, or:
- Icky regular expression-based nastiness to check if each node could possibly have the children it claims to have based on its parent’s innerHTML.

Neither is very appealing. Suggestions welcome.
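For what it’s worth, here is a minimal sketch of the first strategy. It is an illustration, not a tested fix: safeTraverse and its visit callback are names I have made up. It assumes that a genuine parent is always reached before any Bad Span that falsely claims its children, which holds for a top-down walk because each node records all of its children before descending into any of them.

```javascript
// Hypothetical helper: calls visit(node) once per node, in document order,
// ignoring the children of any node that claims somebody else's child.
function safeTraverse(root, visit) {
    var claimedNodes = [];  // nodes whose first parent has been recorded
    var claimedBy = [];     // claimedBy[i] is the first parent of claimedNodes[i]

    // Linear search by identity; IE 6 to 8 have no Map and no indexOf.
    function firstParentOf(node) {
        for (var i = 0; i < claimedNodes.length; i++) {
            if (claimedNodes[i] === node) {
                return claimedBy[i];
            }
        }
        return null;
    }

    function walk(node) {
        var children = node.childNodes || [];
        var i, parent;

        visit(node);

        // If any child already has a different first parent, treat this
        // node as a Bad Span and ignore everything it claims to contain.
        for (i = 0; i < children.length; i++) {
            parent = firstParentOf(children[i]);
            if (parent !== null && parent !== node) {
                return;
            }
        }

        // Record this node as the first parent of all its children
        // before descending, so deeper impostors are caught later.
        for (i = 0; i < children.length; i++) {
            claimedNodes.push(children[i]);
            claimedBy.push(node);
        }

        for (i = 0; i < children.length; i++) {
            walk(children[i]);
        }
    }

    walk(root);
}

// In a browser, the earlier example might become:
// safeTraverse(document.body, function (node) {
//     if (node.nodeType == 1 && node.nodeName == "SPAN") { /* act on span */ }
// });
```

The linear search makes this quadratic in the number of nodes, so it is only reasonable for small documents. Note also that it visits Bad Span itself but never descends into it, so setting its innerHTML from the callback would presumably still wipe out the nodes it claims; the callback has to be careful about what it does to such a node.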