Web-Scale Language-Independent Cataloging of Noisy Product Listings for E-Commerce

2017-04-01 · EACL 2017

Pradipto Das, Yandi Xia, Aaron Levine, Giuseppe Di Fabbrizio, Ankur Datta


Abstract

The cataloging of product listings through taxonomy categorization is a fundamental problem for any e-commerce marketplace, with applications ranging from personalized search recommendations to query understanding. However, manual and rule-based approaches to categorization are not scalable. In this paper, we compare several classifiers for categorizing listings in both English and Japanese product catalogs. We show empirically that a combination of words from product titles, navigational breadcrumbs, and list prices, when available, improves results significantly. We outline a novel method using correspondence topic models and a lightweight manual process to reduce noise from mislabeled data in the training set. We contrast linear models, gradient boosted trees (GBTs), and convolutional neural networks (CNNs), and show that GBTs and CNNs yield the highest gains in error reduction. Finally, we show that GBTs applied in a language-agnostic way on a large-scale Japanese e-commerce dataset improve taxonomy categorization performance over the current state of the art, which is based on deep belief network models.
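
As a rough illustration of how the three signals named in the abstract (title words, navigational breadcrumbs, and list price) might be merged into one sparse feature set for a classifier, here is a minimal Python sketch. It is not the authors' pipeline: the function names `char_ngrams` and `listing_features`, the use of character bigrams as a language-independent stand-in for word tokens, the `>` breadcrumb separator, and the order-of-magnitude price bucket are all assumptions made for this example.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return overlapping character n-grams, ignoring whitespace.

    Character n-grams are an assumption here: they avoid needing a
    language-specific tokenizer for titles in English or Japanese.
    """
    compact = "".join(text.split())
    if len(compact) < n:
        return [compact] if compact else []
    return [compact[i:i + n] for i in range(len(compact) - n + 1)]


def listing_features(title, breadcrumb=None, price=None):
    """Build a sparse feature dict from title, breadcrumb, and price.

    Breadcrumb and price are optional, mirroring "when available".
    """
    feats = Counter()
    for gram in char_ngrams(title.lower()):
        feats["title:" + gram] += 1
    if breadcrumb:
        # Assumed format: levels separated by ">".
        for level in breadcrumb.split(">"):
            level = level.strip().lower()
            if level:
                feats["crumb:" + level] += 1
    if price is not None and price > 0:
        # Order-of-magnitude bucket: 1 to 9.99 is bucket 0, 10 to 99.99 is bucket 1.
        feats["price_bucket:%d" % math.floor(math.log10(price))] += 1
    return dict(feats)


features = listing_features("USB Cable 2m", "Electronics > Cables", 9.99)
```

A dictionary like `features` could then be vectorized and passed to any of the model families the abstract compares; the field prefixes keep a token seen in a title distinct from the same token seen in a breadcrumb.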
