Is there a lax, permissive XML parser for PHP?

I'm looking for a parser that will let me successfully parse broken XML, taking a "best guess" approach. For instance:

    <thingy>
        <description>
            something <b>with</b> bogus<br>
            markup not wrapped in CDATA
        </description>
    </thingy>

Ideally, it will yield a thingy object, with a description property and whatever tag soup inside.

Other suggestions on how to attack the problem (other than having valid markup to start with) welcome.

Non-PHP solutions (Beautiful Soup in Python, for instance) are not beyond the pale, but I'd prefer to stick to the prevailing skill set in the company.

Thanks!

Source: Tips4all, CCNA FINAL EXAM

Comments

User, May 15, 2012 at 9:29 AM
You could use DOMDocument::loadHTML() (or DOMDocument::loadHTMLFile()) to convert your broken XML to proper XML. If you don't like dealing with DOMDocument objects, then use saveXML() and load the resulting XML string with SimpleXML.

$dom = new DOMDocument();
// loadHTMLFile() is an instance method and returns false on failure.
if (!$dom->loadHTMLFile($filepath))
{
    throw new Exception("Could not load the lax XML file");
}
// Now you can work with your XML file using the $dom object.

// If you'd rather use SimpleXML, do the following steps.
$xml = new SimpleXMLElement($dom->saveXML());
unset($dom);

I've tried this script:

<?php
$dom = new DOMDocument();
if (!$dom->loadHTMLFile('badformatted.xml'))
{
    // The constructor always returns an object, so test the load result instead.
    die('error');
}
$nodes = $dom->getElementsByTagName('description');
for ($i = 0; $i < $nodes->length; $i++)
{
    echo "Node content: ".$nodes->item($i)->textContent."\n";
}

The output when executing this from the CLI:

carlos@marmolada:~/xml$ php test.php

Warning: DOMDocument::loadHTMLFile(): Tag thingy invalid in badformatted.xml, line: 1 in /home/carlos/xml/test.php on line 3

Warning: DOMDocument::loadHTMLFile(): Tag description invalid in badformatted.xml, line: 2 in /home/carlos/xml/test.php on line 3
Node content:
something with bogus
markup not wrapped in CDATA

carlos@marmolada:~/xml$

edit: some minor corrections and error handling.

edit2: Change to non-static call to avoid E_STRICT error, added test case.
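
Building on the test script above, here is a minimal sketch of how the warnings could be silenced and the tag soup inside each description kept as markup rather than flattened to text. The function name extractDescriptions() is made up for illustration, and the sketch assumes the input is ASCII or otherwise safe for the HTML parser's default encoding handling.

```php
<?php
// Hypothetical helper (not part of any library): returns the inner markup
// of every <description> element, serialized as well-formed XML.
function extractDescriptions(string $brokenXml): array
{
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($brokenXml); // lenient HTML parser

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        throw new RuntimeException('Could not parse the input');
    }

    $result = [];
    foreach ($dom->getElementsByTagName('description') as $node) {
        $inner = '';
        // Serialize each child node so nested tags survive.
        foreach ($node->childNodes as $child) {
            $inner .= $dom->saveXML($child);
        }
        $result[] = trim($inner);
    }
    return $result;
}

$broken = <<<XML
<thingy>
    <description>
        something <b>with</b> bogus<br>
        markup not wrapped in CDATA
    </description>
</thingy>
XML;

print_r(extractDescriptions($broken));
```

Each returned string should be the repaired content of one description, with the nested tags intact, so it can be stored or wrapped in CDATA as needed.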
User, May 15, 2012 at 9:29 AM
One alternative is to use the Tidy HTML library (PHP has a binding for it, the tidy extension) to clean the HTML first. It survives quite a lot of fairly hideous input, and I've seen people use it for scraping rather ropey HTML before.
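
A rough sketch of the Tidy route, assuming the tidy extension is installed. The helper name repairFragment() is invented for this example. It cleans only the HTML-ish fragment, not the whole custom XML document, since Tidy does not know tags such as thingy.

```php
<?php
// Hypothetical helper: repairs an HTML fragment with ext-tidy.
// Returns null when the extension is missing or the repair fails.
function repairFragment(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not available
    }

    $config = [
        'output-xhtml'   => true, // emit well-formed XHTML
        'show-body-only' => true, // skip the html/head/body wrapper
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');

    return $repaired === false ? null : trim($repaired);
}

var_dump(repairFragment('something <b>with</b> bogus<br> markup not wrapped in CDATA'));
```

The cleaned fragment could then be placed back inside the description element, or wrapped in CDATA, before handing the document to a strict XML parser.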