
Is there a lax, permissive XML parser for PHP?


I'm looking for a parser that will let me successfully parse broken XML, taking a "best guess" approach. For instance:




<thingy>
<description>
something <b>with</b> bogus<br>
markup not wrapped in CDATA
</description>
</thingy>



Ideally, it will yield a thingy object, with a description property and whatever tag soup inside.



Other suggestions on how to attack the problem (other than having valid markup to start with) are welcome.



Non-PHP solutions (Beautiful Soup in Python, for instance) are not beyond the pale, but I'd prefer to stick to the prevailing skill set in the company.



Thanks!


Source: Tips4all

Comments

  1. You could use DOMDocument::loadHTML() (or DOMDocument::loadHTMLFile()) to convert your broken XML to proper XML. If you don't like dealing with DOMDocument objects, then use saveXML() and load the resulting XML string with SimpleXML.

    $dom = new DOMDocument();
    // loadHTMLFile() returns false on failure; it is called non-statically to avoid an E_STRICT error.
    if (!$dom->loadHTMLFile($filepath))
    {
        throw new Exception("Could not load the lax XML file");
    }
    // Now you can work with your XML file using the $dom object.


    // If you'd rather use SimpleXML, do the following steps.
    $xml = new SimpleXMLElement($dom->saveXML());
    unset($dom);


    I've tried this script:

    <?php
    $dom = new DOMDocument();
    if (!$dom->loadHTMLFile('badformatted.xml'))
    {
        die('error');
    }
    $nodes = $dom->getElementsByTagName('description');
    for ($i = 0; $i < $nodes->length; $i++)
    {
        echo "Node content: ".$nodes->item($i)->textContent."\n";
    }


    The output when executing this from the CLI:

    carlos@marmolada:~/xml$ php test.php

    Warning: DOMDocument::loadHTMLFile(): Tag thingy invalid in badformatted.xml, line: 1 in /home/carlos/xml/test.php on line 3

    Warning: DOMDocument::loadHTMLFile(): Tag description invalid in badformatted.xml, line: 2 in /home/carlos/xml/test.php on line 3
    Node content:
    something with bogus
    markup not wrapped in CDATA

    carlos@marmolada:~/xml$


    Edit: some minor corrections and error handling.

    Edit 2: changed to a non-static call to avoid an E_STRICT error, added a test case.
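
    To silence the warnings in the test run and keep the inner markup rather than just the text, something along these lines may work. It is a minimal sketch, not part of the original answer: the input is the sample from the question, and the variable names ($broken, $warnings, $innerMarkup) are made up for illustration. It assumes libxml's HTML parser keeps the unknown tags as elements, as the test output suggests.

    ```php
    <?php
    // Sample input copied from the question (illustrative only).
    $broken = implode("\n", [
        '<thingy>',
        '<description>',
        'something <b>with</b> bogus<br>',
        'markup not wrapped in CDATA',
        '</description>',
        '</thingy>',
    ]);

    // Collect libxml warnings ("Tag thingy invalid") instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($broken);

    $warnings = libxml_get_errors();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        throw new Exception('Could not load the lax XML string');
    }

    // The HTML parser wraps the input in <html><body>, so look the element up by name.
    $description = $dom->getElementsByTagName('description')->item(0);
    if ($description === null) {
        throw new Exception('No <description> element found');
    }

    // Serialize each child node to keep the inner markup ("tag soup") as a string.
    $innerMarkup = '';
    foreach ($description->childNodes as $child) {
        $innerMarkup .= $dom->saveXML($child);
    }

    echo trim($innerMarkup), "\n";
    ```

    $warnings holds the LibXMLError objects in case you want to log them, and $innerMarkup can be stored as the description property of whatever object you build.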

  2. One alternative is to use the HTML Tidy library (PHP has a binding for it, the tidy extension) to clean the markup first. It survives quite a lot of fairly hideous input, and I've seen people use it for scraping rather ropey HTML before.
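
    A rough sketch of the Tidy approach, assuming the tidy extension is installed. The keys in $config are HTML Tidy configuration options. Declaring thingy and description as new block-level tags is my assumption about how to stop Tidy from rejecting the unknown elements, so treat the exact configuration as a starting point rather than a recipe.

    ```php
    <?php
    // Sample input copied from the question (illustrative only).
    $broken = implode("\n", [
        '<thingy>',
        '<description>',
        'something <b>with</b> bogus<br>',
        'markup not wrapped in CDATA',
        '</description>',
        '</thingy>',
    ]);

    $clean = null;

    if (!extension_loaded('tidy')) {
        echo "tidy extension not available\n";
    } else {
        $config = [
            // Assumption: tell Tidy about the custom tags so it does not reject them.
            'new-blocklevel-tags' => 'thingy, description',
            // Ask for well-formed XML output, e.g. self-closed <br />.
            'output-xml'          => true,
            // Drop the <html>/<head>/<body> wrapper Tidy would otherwise add.
            'show-body-only'      => true,
            // Disable line wrapping.
            'wrap'                => 0,
        ];

        // Returns the repaired markup as a string, or false on failure.
        $clean = tidy_repair_string($broken, $config, 'utf8');
        echo $clean, "\n";
    }
    ```

    If the repaired string is well-formed, it can then be handed to SimpleXML or DOMDocument::loadXML() as usual.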
