How to extract textual contents from a web page?

I'm developing a Java application that takes textual information from different web pages and summarizes it into a single page. For example, suppose the same news story appears on several sites such as Hindu, Times of India, Statesman, etc. My application is supposed to extract the important points from each of these pages and combine them into a single news item.

The application is based on the concepts of web content mining. As a beginner in this field, I don't know where to start. I have gone through research papers that describe noise removal as the first step in building such an application. So, given a news web page, the very first step is to extract the main news text from the page while excluding hyperlinks, advertisements, irrelevant images, etc.

My question is: how can I do this? Please point me to some good tutorials that explain how to implement this kind of application using web content mining, or at least give me a hint on how to accomplish it.
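To make the noise-removal step concrete, here is a minimal sketch of one common heuristic: keep paragraphs that are long enough and contain little link text, and drop everything else. This is only an illustration under assumptions, not the method from any particular paper. The class name `MainTextExtractor`, the method `extract`, and the thresholds (`minChars`, `maxLinkDensity`) are all hypothetical, and the regex-based parsing is a simplification; a real HTML parser (jsoup is a common choice in Java) would be more robust on messy pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps <p> blocks that look like article text.
class MainTextExtractor {
    // Elements assumed never to hold the main article text.
    private static final Pattern NON_CONTENT = Pattern.compile(
            "(?is)<(script|style|nav|header|footer|aside)\\b[^>]*>.*?</\\1>");
    // Paragraph blocks; group 1 is the inner HTML.
    private static final Pattern PARAGRAPH = Pattern.compile("(?is)<p\\b[^>]*>(.*?)</p>");
    // Anchor elements; group 1 is the link text (possibly with nested tags).
    private static final Pattern LINK = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    // Removes tags and collapses whitespace. HTML entities are NOT decoded here.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    // Returns paragraphs with at least minChars of text and a link-text
    // share of at most maxLinkDensity (both thresholds are assumptions to tune).
    static List<String> extract(String html, int minChars, double maxLinkDensity) {
        String cleaned = NON_CONTENT.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        Matcher p = PARAGRAPH.matcher(cleaned);
        while (p.find()) {
            String inner = p.group(1);
            String text = stripTags(inner);
            if (text.length() < minChars) {
                continue; // too short: likely a caption, label, or menu item
            }
            int linkChars = 0;
            Matcher a = LINK.matcher(inner);
            while (a.find()) {
                linkChars += stripTags(a.group(1)).length();
            }
            double linkDensity = (double) linkChars / text.length();
            if (linkDensity <= maxLinkDensity) {
                kept.add(text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String html = "<html><body><nav><p><a href='/a'>Home</a></p></nav>"
                + "<p>The government announced a new policy on Monday that will affect commuters.</p>"
                + "<p><a href='/x'>Click here to read more stories like this one today</a></p>"
                + "</body></html>";
        for (String paragraph : extract(html, 40, 0.3)) {
            System.out.println(paragraph);
        }
    }
}
```

The idea is that navigation bars, "related stories" lists, and ads tend to be short or made up mostly of links, while the article body is long running text with few links. The paragraphs that survive this filter from each source could then be passed to the summarization stage.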