September 29, 2007
admin
keywords: Code PHP DOMDocument
PHP DOMDocument()
I've had to do some weird things parsing HTML lately. I usually just lean on a lot of regular expressions (amazing, magical things), but sometimes the target HTML is so grungy that this is not easy, or even possible. Again, PHP to the rescue: it has some incredible built-in tools for treating HTML like XML data. This is some concept-testing code from when I first needed to do this. It's not the specific code I ended up with, but it is where I started.
Goal: Find the link text and the sub text that sits inside the a href tag as a span (with no id/class of its own), and separate them out for parsing and reporting elsewhere.
<?php
$content = <<<EOF
This is a bogus HTML document example.
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
<span id="info">id info with no a href</span>
<span id="info"><a href="#">link text<span>inner span</span></a></span><p>
<span>this</span> More body text. and more.
EOF
;
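$doc = new DOMDocument();
$doc->preserveWhiteSpace = false;
// The sample is a fragment with a duplicate id, so collect libxml's
// parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($content);
libxml_clear_errors();
$doc->normalizeDocument();
$spans = $doc->getElementsByTagName('span');
foreach ($spans as $span) {
    // Only the outer spans carry id="info"; skip everything else.
    if ($span->getAttribute('id') !== 'info') {
        continue;
    }
    // item(0) is null when the span has no <a> or nested <span>.
    $link  = $span->getElementsByTagName('a')->item(0);
    $inner = $span->getElementsByTagName('span')->item(0);
    print "v: " . $span->firstChild->nodeValue . "<br> ";
    print "a: " . ($link ? $link->firstChild->nodeValue : '') . "<br> ";
    print "s: " . ($inner ? $inner->firstChild->nodeValue : '') . "<br> ";
}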
?>
The output looked like:
v: id info with no a href
a:
s:
v: link textinner span
a: link text
s: inner span
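The same lookup can also be written as an XPath query, which skips the spans that have no link instead of printing empty rows for them. This is only a rough sketch, not the code I ended up with: the function name extractInfoLinks and the returned array keys are made up for illustration, and it assumes the link always sits directly inside the span with id="info".

```php
<?php
// Hypothetical helper (name and return shape are illustrative only):
// returns one ['a' => ..., 's' => ...] pair per <span id="info"> holding a link.
function extractInfoLinks(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally; the input is only a fragment.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    // Select only <a> elements that are direct children of a span with id="info".
    foreach ($xpath->query('//span[@id="info"]/a') as $link) {
        // The link's own text: its direct text-node children, concatenated.
        $text = '';
        foreach ($link->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $text .= $child->nodeValue;
            }
        }
        // The sub text: the first <span> directly inside the link, if any.
        $inner = $xpath->query('./span', $link)->item(0);
        $rows[] = [
            'a' => trim($text),
            's' => $inner ? trim($inner->textContent) : '',
        ];
    }
    return $rows;
}

$html = '<span id="info">id info with no a href</span>'
      . '<span id="info"><a href="#">link text<span>inner span</span></a></span>';
print_r(extractInfoLinks($html));
```

For the two sample spans this returns a single row, with "link text" under a and "inner span" under s. Reading only the direct text nodes of the link keeps the inner span's text out of the a value, which is the same separation the firstChild trick gives in the loop version.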