Parsing HTML tables with SimpleXML
Sunday, July 25th, 2010

Sometimes you still have to parse entire HTML tables from other websites, and often that is the only way to get at the data. Here is a little tip on how to make it simple with PHP's SimpleXML.

First, fetch the page with file_get_contents($url) and cut the table out of it (with preg_match, or with strpos and substr).
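As a rough sketch of the preg_match variant: the snippet below pulls the first table out of a page. The $content string is a made-up sample standing in for the result of file_get_contents($url), and the pattern is just one plausible choice. It assumes the page has no nested tables, because the non-greedy match stops at the first closing tag.

```php
<?php
// Hypothetical sample page; in practice $content would come from file_get_contents($url).
$content = '<html><body><p>Intro</p>'
    . '<table><tr><td>a</td><td>b</td></tr></table>'
    . '<p>Footer</p></body></html>';

// Non-greedy match of the first <table ...> ... </table> block.
// "s" lets "." match newlines, "i" ignores the case of the tag names.
if (preg_match('#<table\b[^>]*>.*?</table>#si', $content, $matches)) {
    $table = $matches[0];
} else {
    $table = null; // no table found on the page
}

echo $table, "\n";
```

If the page holds several tables, preg_match_all with the same pattern would return all of them, and you can pick the one you need by index.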

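Once you have the entire table in the $table variable, put <?xml version="1.0"?> in front of it and pass the result to simplexml_load_string($table). The parser accepts the string without the declaration too, but adding it makes explicit that the table is being treated as XML. The table must be well-formed XHTML; otherwise SimpleXML raises errors and simplexml_load_string returns false.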
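To see what "raises errors" means in practice, here is a small sketch that parses one well-formed table and one that is fine as HTML but not as XML. The load_table helper and both sample strings are hypothetical, made up for this illustration. libxml_use_internal_errors(true) collects the parser errors so you can inspect them instead of getting PHP warnings.

```php
<?php
// Collect libxml errors instead of emitting PHP warnings.
libxml_use_internal_errors(true);

// Hypothetical helper: prepend the XML declaration and parse the fragment.
function load_table(string $table)
{
    libxml_clear_errors(); // forget errors from earlier parses
    return simplexml_load_string('<?xml version="1.0"?>' . $table);
}

$good = load_table('<table><tr><td>1</td><td>2</td></tr></table>');
// An unclosed <br> and a bare "nowrap" attribute are fine in HTML but not in XML.
$bad = load_table('<table><tr><td nowrap>1<br></td></tr></table>');

var_dump($good instanceof SimpleXMLElement); // bool(true)
var_dump($bad);                              // bool(false)

// Show why the second table was rejected.
foreach (libxml_get_errors() as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}
```

The error list usually tells you which attribute or tag to clean up before parsing.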

The last step is a foreach($xml->children() as $tr){} loop, in which you go through the table row by row and read any cell you need. Because SimpleXML has already parsed the markup, the cell text is available directly, with no HTML tags to strip.
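A minimal sketch of that loop, using a made-up two-row table instead of a real page, collects every cell into a plain PHP array. It assumes the tr elements are direct children of the table. If the table wraps its rows in a tbody, you would loop over $xml->tbody->tr instead.

```php
<?php
// Hypothetical two-row table, already well-formed XML.
$table = '<table>'
    . '<tr><td>alpha</td><td>10</td></tr>'
    . '<tr><td>beta</td><td>20</td></tr>'
    . '</table>';
$xml = simplexml_load_string('<?xml version="1.0"?>' . $table);

$rows = [];
foreach ($xml->children() as $tr) {
    $cells = [];
    foreach ($tr->td as $td) {
        // Casting a SimpleXMLElement to string yields the text directly inside it.
        $cells[] = trim((string) $td);
    }
    $rows[] = $cells;
}

print_r($rows);
```

Note that the string cast only returns text that sits directly in the cell. Text wrapped in a nested element such as b or a has to be read from that child element.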

Example:

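//get page
$url = 'http://www.apache.org/server-status';
$content = file_get_contents($url);
if ($content === false) die('Could not fetch ' . $url);
//get the first table; search for the closing tag only after the opening one
$start = strpos($content, '<table');
$end = ($start === false) ? false : strpos($content, '</table>', $start);
if ($end === false) die('No table found');
$end += strlen('</table>');
$table = substr($content, $start, $end - $start);
//make it usable: a bare nowrap attribute is not valid XML
$table = '<?xml version="1.0"?>' . str_replace('nowrap', '', $table);
$xml = simplexml_load_string($table);
if ($xml === false) die('Table is not well-formed XML');
//go through the rows, I need just the 13th cell
foreach($xml->children() as $tr){
     if(isset($tr->td[12])) echo $tr->td[12].'<br/>';
}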