
How to scrape specific data from a page with Simple HTML DOM Parser


I am trying to scrape data from a webpage, and I need to get all the data at this link.




include 'simple_html_dom.php';
$html1 = file_get_html('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

$info1 = $html1->find('b[class=[what to enter here?]]', 0);



I need to get all the data out of this site, for example:




Bürgerstiftung Lebensraum Aachen
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/ 71
52062 Aachen
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: info@buergerstiftung-aachen.de
www.buergerstiftung-aachen.de
>> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: info@buergerstiftung-achim.de
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung



I also need the data that is "behind" the link. Is there any way to do this with an easy, understandable parser, one that a newbie can understand and write?


Source: Tips4all

Comments

  1. This seems to be covered in the documentation:

    $html1->find('b[class=info]',0)->innertext;

  2. The links you provided are down.
    I suggest using PHP's native DOM extension instead of simple_html_dom; it should be much faster and easier ;)
    I had a look at the page using the Google cache; you can use something like:

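    $doc = new DOMDocument;
    libxml_use_internal_errors(true); // collect parse warnings instead of printing them
    $doc->loadHTMLFile('...URL....');
    $content = $doc->getElementById('content'); // null if the page has no #content
    $contents = $content ? $content->nodeValue : ''; // Text contents of #content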

  3. From a quick glance, you need to loop through the <dl> tags in #content, then through the dt and dd elements inside each one.

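    // $html is the object returned by file_get_html()
    foreach ($html->find('#content dl') as $item) {
        $info = $item->find('dd');
        foreach ($info as $info_item) {
            echo $info_item->plaintext, "\n"; // text of each <dd>
        }
    }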


    This uses the simple_html_dom library.

  4. XPath makes scraping ridiculously easy, and it keeps your scraper working through some changes in the HTML document. For example, to pull out the names, you'd use a query that looks like:

    //div[@id='content']/dl/dt


    A simple Google search will turn up plenty of tutorials.

  5. @zero: there is a good site for trying out scraping with both PHP and Python; it has been pretty helpful, at least to me:
    http://scraperwiki.com/

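
The suggestions in comments 2 to 4 (PHP's DOM extension, looping over the dl/dt/dd elements, and an XPath query) can be combined roughly as follows. This is only a sketch under stated assumptions: the original page is down, so the sample markup, the /details/... link targets, and the function name extractEntries are all hypothetical, and the real page may be structured differently.

```php
<?php
// Hypothetical sample markup. The #content > dl > dt/dd layout is an
// assumption taken from the comments; the href values are invented.
$sampleHtml = <<<'EOT'
<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head><body><div id="content">
<dl>
  <dt>Bürgerstiftung Lebensraum Aachen</dt>
  <dd>Ansprechpartner: Hubert Schramm</dd>
  <dd>52062 Aachen</dd>
  <dd><a href="/details/aachen">Weitere Details zu dieser Stiftung</a></dd>
</dl>
<dl>
  <dt>Bürgerstiftung Achim</dt>
  <dd>Ansprechpartner: Helga Kühn</dd>
  <dd>28832 Achim</dd>
  <dd><a href="/details/achim">Weitere Details zu dieser Stiftung</a></dd>
</dl>
</div></body></html>
EOT;

// Returns one array per <dl>: the name, the plain text fields, and the
// href of the first link found inside the entry (or null if there is none).
function extractEntries(string $html): array
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parse warnings quiet
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $entries = [];
    foreach ($xpath->query("//div[@id='content']/dl") as $dl) {
        $nameNode = $xpath->query('dt', $dl)->item(0);
        $linkNode = $xpath->query('.//a[@href]', $dl)->item(0);

        $fields = [];
        // Only <dd> elements that do not wrap a link count as data fields.
        foreach ($xpath->query('dd[not(a)]', $dl) as $dd) {
            $fields[] = trim($dd->nodeValue);
        }

        $entries[] = [
            'name'   => $nameNode ? trim($nameNode->nodeValue) : '',
            'fields' => $fields,
            'detail' => $linkNode ? $linkNode->getAttribute('href') : null,
        ];
    }
    return $entries;
}

foreach (extractEntries($sampleHtml) as $entry) {
    echo $entry['name'], ' -> ', $entry['detail'], "\n";
}
```

To get at the data "behind" each link, the same idea would presumably be repeated per entry: resolve the detail href against the site's base URL, load that page with loadHTMLFile(), and run another query against it. What that query should look like depends on the markup of the detail pages, which is not shown here.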
