
How to extract textual contents from a web page?



I'm developing an application in Java that takes textual information from different web pages and summarizes it into one page. For example, suppose the same news story appears on different sites such as Hindu, Times of India, Statesman, etc. My application is supposed to extract the important points from each of these pages and put them together as a single news item. The application is based on concepts of web content mining. As a beginner in this field, I can't work out where to start. I have gone through research papers that describe noise removal as the first step in building this application.





So, if I'm given a news web page, the very first step is to extract the main news text from the page, excluding hyperlinks, advertisements, useless images, etc. My question is: how can I do this? Please point me to some good tutorials that explain how to implement this kind of application using web content mining, or at least give me a hint on how to accomplish it.


Comments

  1. You can use Readability or boilerpipe, two open-source tools for this task. For a tutorial, read the code and documentation for those two projects.
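
To give a feel for what the noise-removal step involves, here is a minimal, self-contained Java sketch. It is not how Readability or boilerpipe are implemented; it is only a crude heuristic in the same general spirit: split the page into text blocks, then keep blocks that have enough words and are not mostly link text. The class name `MainTextExtractor`, the `extract` method, and both thresholds are hypothetical and chosen purely for illustration. Regular expressions are also not a robust way to parse real-world HTML, so a real application would more likely use a proper HTML parser or one of the tools above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MainTextExtractor {

    // Hypothetical thresholds, picked for illustration only; tune them on real pages.
    private static final int MIN_WORDS = 10;
    private static final double MAX_LINK_DENSITY = 0.3;

    // Scripts, styles and comments never contain article text.
    private static final Pattern DROP = Pattern.compile(
        "(?is)<(script|style|noscript)\\b.*?</\\1\\s*>|<!--.*?-->");
    // Tags that usually start or end a visual block of text.
    private static final Pattern BLOCK_BREAK = Pattern.compile(
        "(?i)</?(p|div|br|li|ul|ol|td|tr|table|h[1-6]|section|article|header|footer|nav|aside)\\b[^>]*>");
    private static final Pattern ANCHOR = Pattern.compile("(?is)<a\\b[^>]*>(.*?)</a\\s*>");
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]+>");

    /** Returns the text blocks that look like main content rather than navigation or ads. */
    static List<String> extract(String html) {
        String cleaned = DROP.matcher(html).replaceAll(" ");
        List<String> kept = new ArrayList<>();
        for (String block : BLOCK_BREAK.split(cleaned)) {
            // Count the words that sit inside <a>...</a> within this block.
            int linkWords = 0;
            Matcher m = ANCHOR.matcher(block);
            while (m.find()) {
                linkWords += countWords(stripTags(m.group(1)));
            }
            String text = stripTags(block);
            int words = countWords(text);
            // Keep blocks that are long enough and are not dominated by link text.
            if (words >= MIN_WORDS && (double) linkWords / words <= MAX_LINK_DENSITY) {
                kept.add(text);
            }
        }
        return kept;
    }

    // Removes any remaining tags and collapses runs of whitespace.
    private static String stripTags(String s) {
        return TAG.matcher(s).replaceAll(" ").replaceAll("\\s+", " ").trim();
    }

    private static int countWords(String s) {
        return s.isEmpty() ? 0 : s.split(" ").length;
    }

    public static void main(String[] args) {
        String html = "<html><body>"
            + "<div><a href='/'>Home</a> <a href='/sports'>Sports</a></div>"
            + "<p>The city council approved the new transit budget on Monday after a long public debate.</p>"
            + "<div>Advertisement</div>"
            + "</body></html>";
        for (String paragraph : extract(html)) {
            System.out.println(paragraph);
        }
    }
}
```

On the small sample page in `main`, the navigation links and the one-word advertisement block are dropped because they are too short, and only the article sentence is kept. Real news pages are much messier, so treat the thresholds as a starting point rather than a rule.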

