Hypertext Entity Extraction in Webpage

2024-03-04

Yifei Yang, Tianqiao Liu, Bo Shao, Hai Zhao, Linjun Shou, Ming Gong, Daxin Jiang

Abstract

Webpage entity extraction is a fundamental natural language processing task in both research and applications. Currently, most webpage entity extraction models are trained on structured datasets that strive to retain textual content and its structural information. However, existing datasets all overlook the rich hypertext features (e.g., font color, font size) that have proven effective in previous work. To this end, we first collect a Hypertext Entity Extraction Dataset (HEED) from e-commerce domains, scraping both the text and the corresponding explicit hypertext features, with high-quality manual entity annotations. Furthermore, we present the MoE-based Entity Extraction Framework (MoEEF), which efficiently integrates multiple features through a Mixture of Experts to enhance model performance and outperforms strong baselines, including state-of-the-art small-scale models and GPT-3.5-turbo. Moreover, we analyze the effectiveness of the hypertext features in HEED and of several model components in MoEEF.

Tasks

Mixture-of-Experts
