
Is there a lax, permissive XML parser for PHP?


I'm looking for a parser that can successfully parse broken XML by taking a "best guess" approach. For instance:




<thingy>
<description>
something <b>with</b> bogus<br>
markup not wrapped in CDATA
</description>
</thingy>



Ideally, it will yield a thingy object, with a description property and whatever tag soup inside.



Other suggestions on how to attack the problem (other than having valid markup to start with) are welcome.



Non-PHP solutions (Beautiful Soup in Python, for instance) are not beyond the pale, but I'd prefer to stick to the prevailing skill set in the company.



Thanks!


Source: Tips4all

Comments

  1. You could use DOMDocument::loadHTML() (or DOMDocument::loadHTMLFile()) to convert your broken XML to proper XML. If you don't like dealing with DOMDocument objects, use saveXML() and load the resulting XML string with SimpleXML.

    $dom = new DOMDocument();
    // loadHTMLFile() returns false on failure; $dom itself is always an object.
    if (!$dom->loadHTMLFile($filepath))
    {
        throw new Exception("Could not load the lax XML file");
    }
    // Now you can work with your XML file using the $dom object.


    // If you'd rather use SimpleXML, do the following steps.
    $xml = new SimpleXMLElement($dom->saveXML());
    unset($dom);


    I've tried this script:

    <?php
    $dom = new DOMDocument();
    if (!$dom->loadHTMLFile('badformatted.xml'))
    {
        die('error');
    }
    // Print the text content of every <description> element.
    $nodes = $dom->getElementsByTagName('description');
    for ($i = 0; $i < $nodes->length; $i++)
    {
        echo "Node content: ".$nodes->item($i)->textContent."\n";
    }


    The output when executing this from the CLI:

    carlos@marmolada:~/xml$ php test.php

    Warning: DOMDocument::loadHTMLFile(): Tag thingy invalid in badformatted.xml, line: 1 in /home/carlos/xml/test.php on line 3

    Warning: DOMDocument::loadHTMLFile(): Tag description invalid in badformatted.xml, line: 2 in /home/carlos/xml/test.php on line 3
    Node content:
    something with bogus
    markup not wrapped in CDATA

    carlos@marmolada:~/xml$


    edit: some minor corrections and error treatment.

    edit2: Change to non-static call to avoid E_STRICT error, added test case.
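
The test run above prints libxml warnings for the unknown tags, and textContent drops the inner markup. A minimal sketch of one way to handle both, assuming the input is available as a string: libxml_use_internal_errors() collects the warnings instead of printing them, and the child nodes of description are serialized one by one to keep the tag soup. The $broken sample and the $thingy object are illustrative, not part of the original answer.

```php
<?php
// Sample input copied from the question (illustrative).
$broken = <<<XML
<thingy>
<description>
something <b>with</b> bogus<br>
markup not wrapped in CDATA
</description>
</thingy>
XML;

// Collect libxml warnings internally instead of printing them.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($broken);

// Keep the collected warnings for inspection, then restore the old setting.
$warnings = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

if (!$loaded) {
    throw new Exception('Could not load the lax XML string');
}

// Build a plain object whose description property holds the inner markup.
$thingy = new stdClass();
$thingy->description = '';
$node = $dom->getElementsByTagName('description')->item(0);
if ($node !== null) {
    foreach ($node->childNodes as $child) {
        // saveXML($child) serializes a single node, tags included.
        $thingy->description .= $dom->saveXML($child);
    }
}

echo trim($thingy->description), "\n";
```

Whether any warnings are collected depends on the libxml2 version in use, so treat $warnings as diagnostic information rather than a failure signal.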

  2. One alternative is to use the Tidy HTML library (a PHP binding is available) to clean the markup first. It survives a lot of fairly hideous input, and I've seen people use it for scraping rather ropey HTML before.
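
A rough sketch of the Tidy approach, assuming PHP's tidy extension is installed. The helper name repairWithTidy() is hypothetical, and the option values (declaring thingy and description as custom block-level tags, asking for XML output without the html/body wrapper) are one plausible configuration rather than a recommended one. Check the repaired output against your own input before relying on it.

```php
<?php
// Hypothetical helper: returns the repaired markup, or null if Tidy is
// unavailable or the repair fails.
function repairWithTidy(string $broken): ?string
{
    if (!extension_loaded('tidy')) {
        return null; // tidy binding not installed
    }
    $config = [
        // Declare the custom elements so Tidy accepts them
        // (names taken from the question's sample).
        'new-blocklevel-tags' => 'thingy, description',
        'output-xml'          => true, // ask for well-formed XML output
        'show-body-only'      => true, // omit the html/head/body wrapper
    ];
    $repaired = tidy_repair_string($broken, $config, 'utf8');
    return $repaired === false ? null : $repaired;
}

$broken = "<thingy>\n<description>\nsomething <b>with</b> bogus<br>\n"
        . "markup not wrapped in CDATA\n</description>\n</thingy>";

$repaired = repairWithTidy($broken);
if ($repaired === null) {
    echo "Tidy is not available or could not repair the input\n";
} else {
    echo $repaired, "\n";
}
```

The repaired string can then be handed to DOMDocument or SimpleXML in the same way as the saveXML() output in the first answer.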
