
How to scrape specific data from a webpage with Simple HTML DOM Parser


I am trying to scrape data from a webpage, and I need to get all the data at this link.




include 'simple_html_dom.php';
$html1 = file_get_html('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');

$info1 = $html1->find('b[class=[what to enter here]]', 0);



I need to get all the data out of this site.




Bürgerstiftung Lebensraum Aachen
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/ 71
52062 Aachen
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: info@buergerstiftung-aachen.de
www.buergerstiftung-aachen.de
>> Weitere Details zu dieser Stiftung

Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: info@buergerstiftung-achim.de
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung



I need the data that is "behind" the link. Is there a way to do this with an easy, understandable parser, one that a newbie can understand and write?


Source: Tips4allCCNA FINAL EXAM

Comments

  1. This seems to be covered in the documentation:

    $html1->find('b[class=info]',0)->innertext;

  2. The links you provided are down.
    I suggest using PHP's native DOM extension instead of Simple HTML DOM Parser; it will be much faster and easier ;)
    I had a look at the page using Google's cache. You can use something like:

    $doc = new DOMDocument;
    @$doc->loadHTMLFile('...URL....'); // Using the @ operator to hide parse errors
    $contents = $doc->getElementById('content')->nodeValue; // Text contents of #content

  3. From a quick glance, you need to loop through the <dl> tags in #content, then the <dt> and <dd> elements inside each one.

    foreach ($html1->find('#content dl') as $item) {
        $info = $item->find('dd');
        foreach ($info as $info_item) {..}
    }


    Using the simple_html_dom library

  4. XPath makes scraping ridiculously easy, and it lets your scraper tolerate some changes in the HTML document. For example, to pull out the names, you'd use a query that looks like:

    //div[@id='content']/dl/dt


    A simple Google search will give you plenty of tutorials

  5. @zero: there is a good site for trying out scraping with both PHP and Python. It has been pretty helpful, at least to me:
    http://scraperwiki.com/

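The suggestions in the comments (the native DOM extension, looping over the <dl> entries, and an XPath query) can be combined into one rough sketch. It assumes the page keeps each foundation in a <dl> inside a div with id "content", as the comments describe; the real markup may differ. The sample HTML, the /detail/... paths, and the extractFoundations() helper are hypothetical and exist only for illustration. The sketch parses an inline string so it runs without network access.

```php
<?php
// Hypothetical sample markup. The real page structure is an assumption
// based on the comments above (a #content div holding <dl> entries).
$sampleHtml = <<<'HTML'
<html><body>
<div id="content">
  <dl>
    <dt>Bürgerstiftung Lebensraum Aachen</dt>
    <dd>52062 Aachen</dd>
    <dd><a href="/detail/aachen">Weitere Details zu dieser Stiftung</a></dd>
  </dl>
  <dl>
    <dt>Bürgerstiftung Achim</dt>
    <dd>28832 Achim</dd>
    <dd><a href="/detail/achim">Weitere Details zu dieser Stiftung</a></dd>
  </dl>
</div>
</body></html>
HTML;

// Hypothetical helper: returns one array per <dl> entry with its name,
// the text of each <dd>, and the href of the first link inside it.
function extractFoundations(string $html): array
{
    $doc = new DOMDocument();
    // Collect parse warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    // The XML declaration prefix is a common workaround so that
    // loadHTML() treats the input as UTF-8 (needed for umlauts).
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $results = [];
    foreach ($xpath->query("//div[@id='content']/dl") as $dl) {
        $details = [];
        foreach ($xpath->query('dd', $dl) as $dd) {
            $details[] = trim($dd->textContent);
        }
        $results[] = [
            'name'    => trim($xpath->evaluate('string(dt)', $dl)),
            'details' => $details,
            // Empty string if the entry contains no link.
            'link'    => $xpath->evaluate('string(.//a/@href)', $dl),
        ];
    }
    return $results;
}

$foundations = extractFoundations($sampleHtml);
foreach ($foundations as $f) {
    echo $f['name'], ' -> ', $f['link'], PHP_EOL;
}
```

To reach the data "behind" each link, the collected link value could be resolved against the site's base URL and that page loaded and parsed the same way. That step is left out here because the layout of the detail pages is not known.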
