Deep Neural Networks for Web Page Information Extraction
Tomas Gogar, Ondrej Hubacek, and Jan Sedivy
Code: github.com/gogartom/TextMaps
Abstract
Web wrappers are systems for extracting structured information from web pages. Currently, a wrapper must be adapted to a particular website template before it can start extracting. In this work we present a new method that uses convolutional neural networks to learn a wrapper capable of extracting information from previously unseen templates. Such a wrapper needs no site-specific initialization and can extract information from a single web page. We also propose a method for spatial text encoding, which allows us to feed both the visual and the textual content of a web page into a single neural network. Initial experiments with product information extraction showed very promising results and suggest that this approach can lead to a general, site-independent web wrapper.
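The abstract does not describe how the spatial text encoding is computed. As a rough illustration only, the sketch below shows one way text could be laid out spatially so that a convolutional network can consume it alongside a page screenshot: each text node's words are hashed into a fixed number of channels and written into the grid cells covered by the node's bounding box. The function name `encode_text_map`, the `cell` and `channels` parameters, the CRC32 hashing choice, and the sample nodes are all assumptions made for this example, not details taken from the paper or its repository.

```python
import zlib


def encode_text_map(text_nodes, page_w, page_h, cell=8, channels=16):
    """Rasterize text nodes into a channels x rows x cols grid of hashed word counts.

    text_nodes: iterable of (text, (x, y, w, h)) with pixel bounding boxes.
    This is a hypothetical encoding, not the authors' implementation.
    """
    cols = (page_w + cell - 1) // cell
    rows = (page_h + cell - 1) // cell
    grid = [[[0.0] * cols for _ in range(rows)] for _ in range(channels)]

    for text, (x, y, w, h) in text_nodes:
        # Hashing trick: map each word to a channel; crc32 is stable across runs.
        hashed = [zlib.crc32(word.lower().encode("utf-8")) % channels
                  for word in text.split()]
        # Grid cells covered by the bounding box, clipped to the page.
        c0, c1 = max(x // cell, 0), min((x + w - 1) // cell, cols - 1)
        r0, r1 = max(y // cell, 0), min((y + h - 1) // cell, rows - 1)
        for ch in hashed:
            for r in range(r0, r1 + 1):
                for c in range(c0, c1 + 1):
                    grid[ch][r][c] += 1.0
    return grid


# Two made-up text nodes on a 64 x 24 pixel page.
nodes = [("Price: $19.99", (16, 8, 32, 8)), ("Acme Kettle", (0, 0, 64, 8))]
text_map = encode_text_map(nodes, page_w=64, page_h=24)
```

The resulting grid has the same spatial layout as a downsampled screenshot, so in principle it could be concatenated with or processed next to image features. Hash collisions between words are tolerated here for simplicity.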