
Is there a lax, permissive XML parser for PHP?


I'm looking for a parser that will let me successfully parse broken XML, taking a "best guess" approach. For instance:




<thingy>
<description>
something <b>with</b> bogus<br>
markup not wrapped in CDATA
</description>
</thingy>



Ideally, it will yield a thingy object, with a description property and whatever tag soup inside.



Other suggestions on how to attack the problem (other than having valid markup to start with) welcome.



Non-PHP solutions (Beautiful Soup in Python, for instance) are not beyond the pale, but I'd prefer to stick to the prevailing skill set in the company.



Thanks!


Source: Tips4all

Comments

  1. You could use DOMDocument::loadHTML() (or DOMDocument::loadHTMLFile()) to convert your broken XML to proper XML. If you don't like dealing with DOMDocument objects, then use saveXML() and load the resulting XML string with SimpleXML.

    $dom = new DOMDocument();
    // loadHTMLFile() returns false on failure; call it on an instance, not statically.
    if (!$dom->loadHTMLFile($filepath))
    {
        throw new Exception("Could not load the lax XML file");
    }
    // Now you can work with your XML file using the $dom object.


    // If you'd rather use SimpleXML, do the following steps.
    $xml = new SimpleXMLElement($dom->saveXML());
    unset($dom);


    I've tried this script:

    <?php
    $dom = new DOMDocument();
    if (!$dom->loadHTMLFile('badformatted.xml'))
    {
        // $dom itself is always an object, so test the return value of the load call.
        die('error');
    }
    $nodes = $dom->getElementsByTagName('description');
    for ($i = 0; $i < $nodes->length; $i++)
    {
        echo "Node content: ".$nodes->item($i)->textContent."\n";
    }


    The output when executing this from the CLI:

    carlos@marmolada:~/xml$ php test.php

    Warning: DOMDocument::loadHTMLFile(): Tag thingy invalid in badformatted.xml, line: 1 in /home/carlos/xml/test.php on line 3

    Warning: DOMDocument::loadHTMLFile(): Tag description invalid in badformatted.xml, line: 2 in /home/carlos/xml/test.php on line 3
    Node content:
    something with bogus
    markup not wrapped in CDATA

    carlos@marmolada:~/xml$


    edit: some minor corrections and error treatment.

    edit2: Change to non-static call to avoid E_STRICT error, added test case.
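
The test above prints only the text of the description node and lets the libxml warnings reach the console. A minimal sketch of one way to keep the inner tag soup and silence those warnings follows. It assumes the sample input is inlined as a string (so loadHTML() is used instead of loadHTMLFile()); the variable names are illustrative only.

```php
<?php
// Sample input from the question, inlined so the sketch is self-contained.
$broken = <<<XML
<thingy>
<description>
something <b>with</b> bogus<br>
markup not wrapped in CDATA
</description>
</thingy>
XML;

// Collect libxml warnings ("Tag thingy invalid", ...) internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($broken);

// Discard the collected warnings and restore the previous error handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);

if (!$loaded) {
    throw new Exception('Could not load the lax XML string');
}

$description = $dom->getElementsByTagName('description')->item(0);
if ($description === null) {
    throw new Exception('No description element found');
}

// Serialize each child node so the inner markup survives,
// rather than only the text that textContent would return.
$inner = '';
foreach ($description->childNodes as $child) {
    $inner .= $dom->saveXML($child);
}

echo trim($inner), "\n";
```

Serializing the child nodes with saveXML() gives well-formed markup (the stray br comes back self-closed), so the result can be stored as the description's content or handed to SimpleXML.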

  2. One alternative is to use the Tidy HTML library (PHP has a binding for it, the tidy extension) to clean the HTML first. It survives quite a lot of fairly hideous input, and I've seen people use it for scraping rather ropey HTML before.
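
A rough sketch of how the Tidy route might look, assuming the tidy extension is installed. Tidy in HTML mode does not know custom tags such as thingy, so this sketch only passes the contents of description through it and then parses the whole document with SimpleXML. The helper name repairFragment and the regular expression are assumptions made for illustration, not part of any library.

```php
<?php
// Assumption: the tidy extension (ext-tidy) is available.
if (!extension_loaded('tidy')) {
    throw new RuntimeException('The tidy extension is required for this sketch');
}

// Sample input from the question.
$broken = <<<XML
<thingy>
<description>
something <b>with</b> bogus<br>
markup not wrapped in CDATA
</description>
</thingy>
XML;

// Hypothetical helper: turn an HTML fragment into well-formed XHTML.
function repairFragment(string $html): string
{
    $clean = tidy_repair_string($html, [
        'output-xhtml'     => true, // emit XHTML, so <br> becomes <br />
        'show-body-only'   => true, // no <html>/<body> wrapper around the fragment
        'numeric-entities' => true, // avoid named entities that XML parsers reject
        'wrap'             => 0,    // do not re-wrap long lines
    ], 'utf8');

    if ($clean === false) {
        throw new RuntimeException('Tidy could not repair the fragment');
    }
    return $clean;
}

// Only the contents of <description> are treated as tag soup.
$fixed = preg_replace_callback(
    '~(<description>)(.*?)(</description>)~s',
    fn (array $m): string => $m[1] . repairFragment($m[2]) . $m[3],
    $broken
);

$xml = simplexml_load_string($fixed);
if ($xml === false) {
    throw new RuntimeException('The repaired document is still not well-formed');
}

// The description, including its inner markup.
echo $xml->description->asXML(), "\n";
```

This keeps the outer structure under your control and limits the guessing to the one element known to contain HTML, at the cost of relying on a regular expression to find it.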

