I am trying to scrape data from a web page, and I need to get all of the data behind this link.
include 'simple_html_dom.php';
$html1 = file_get_html('http://www.aktive-buergerschaft.de/buergerstiftungen/unsere_leistungen/buergerstiftungsfinder');
$info1 = $html1->find('b[class=...]', 0); // what should I enter here?
I need to get all the data out of this site:
Bürgerstiftung Lebensraum Aachen
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Hubert Schramm
Alexanderstr. 69/ 71
52062 Aachen
Telefon: 0241 - 4500130
Telefax: 0241 - 4500131
Email: info@buergerstiftung-aachen.de
www.buergerstiftung-aachen.de
>> Weitere Details zu dieser Stiftung
Bürgerstiftung Achim
rechtsfähige Stiftung des bürgerlichen Rechts
Ansprechpartner: Helga Kühn
Rotkehlchenstr. 72
28832 Achim
Telefon: 04202-84981
Telefax: 04202-955210
Email: info@buergerstiftung-achim.de
www.buergerstiftung-achim.de
>> Weitere Details zu dieser Stiftung
I need the data that is "behind" the link. Is there a way to do this with an easy, understandable parser, one that a newbie can understand and write?
Source: Tips4all, CCNA FINAL EXAM
It seems to be covered in the documentation:
$html1->find('b[class=info]',0)->innertext;
The links you provided are down.
I would suggest using the native PHP "DOM" extension instead of "simple html parser"; it will be much faster and easier ;)
I had a look at the page using the Google cache. You can use something like:
$doc = new DOMDocument;
@$doc->loadHTMLFile('...URL....'); // Using the @ operator to hide parse errors
$contents = $doc->getElementById('content')->nodeValue; // Text contents of #content
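As a rough sketch of the same idea without the @ operator, libxml_use_internal_errors() can collect the parse warnings instead. The markup in $sample is made up for illustration (the real page may look different), and the string is parsed with loadHTML() so the example runs offline.

```php
<?php
// Hypothetical markup standing in for the real page; its structure is an assumption.
$sample = '<html><body><div id="content"><p>Stiftung Achim, 28832 Achim</p></div></body></html>';

libxml_use_internal_errors(true);   // collect parse warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($sample);            // for a live page, loadHTMLFile($url) would be used instead
libxml_clear_errors();

// getElementById() returns null when no element has that id, so check before use.
$node = $doc->getElementById('content');
$contents = $node !== null ? trim($node->textContent) : '';

echo $contents, "\n";
```

The null check matters on a real site: if the page layout changes and #content disappears, calling ->nodeValue on null would fail.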
From a quick glance, you need to loop through the <dl> tags in #content, then through the dt and dd elements inside each one.
foreach ($html1->find('#content dl') as $item) {
    $info = $item->find('dd');
    foreach ($info as $info_item) {
        echo $info_item->plaintext, "\n"; // text of each <dd>
    }
}
This uses the simple_html_dom library.
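For comparison, the same dl / dt / dd loop could look roughly like this with PHP's built-in DOM extension rather than simple_html_dom. The sample markup and the entry names are invented here, on the assumption that each foundation sits in its own <dl> inside #content.

```php
<?php
// Hypothetical markup: one <dl> per foundation is an assumption about the page.
$sample = <<<HTML
<div id="content">
  <dl>
    <dt>Stiftung A</dt>
    <dd>Alexanderstr. 69/71</dd>
    <dd>52062 Aachen</dd>
  </dl>
  <dl>
    <dt>Stiftung B</dt>
    <dd>Rotkehlchenstr. 72</dd>
    <dd>28832 Achim</dd>
  </dl>
</div>
HTML;

libxml_use_internal_errors(true);   // keep parse warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($sample);
libxml_clear_errors();

$entries = [];
$content = $doc->getElementById('content');
if ($content !== null) {
    foreach ($content->getElementsByTagName('dl') as $dl) {
        $name = '';
        $details = [];
        // The <dt> holds the name, each <dd> one detail line.
        foreach ($dl->getElementsByTagName('dt') as $dt) {
            $name = trim($dt->textContent);
        }
        foreach ($dl->getElementsByTagName('dd') as $dd) {
            $details[] = trim($dd->textContent);
        }
        $entries[] = ['name' => $name, 'details' => $details];
    }
}

print_r($entries);
```

Collecting the rows into an array first makes it easy to write them to CSV or a database afterwards.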
XPath makes scraping ridiculously easy, and your query can keep working through some changes to the HTML document. For example, to pull out the names, you'd use a query that looks like:
//div[@id='content']/dl/dt
A simple Google search will give you plenty of tutorials.
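A minimal sketch of running such an XPath query from PHP with DOMXPath might look like this. The sample markup is invented, and it assumes the names sit in <dt> elements directly under a <dl> in div#content.

```php
<?php
// Hypothetical markup; the real page's structure is an assumption.
$sample = '<div id="content">'
        . '<dl><dt>Stiftung A</dt><dd>52062 Aachen</dd></dl>'
        . '<dl><dt>Stiftung B</dt><dd>28832 Achim</dd></dl>'
        . '</div>';

libxml_use_internal_errors(true);   // keep parse warnings quiet
$doc = new DOMDocument();
$doc->loadHTML($sample);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Attribute tests need the @ prefix: [@id='content'], not [id='content'].
$names = [];
foreach ($xpath->query("//div[@id='content']/dl/dt") as $dt) {
    $names[] = trim($dt->textContent);
}

print_r($names);
```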
@zero: there is a good site for trying out scraping with both PHP and Python. It was pretty helpful, at least to me:
http://scraperwiki.com/