A Little Bit Is Worse Than None: Ranking with Limited Training Data

2020-11-01 · EMNLP (SustaiNLP) 2020 · Code Available

Xinyu Zhang, Andrew Yates, Jimmy Lin

Abstract

Researchers have proposed simple yet effective techniques for the retrieval problem based on using BERT as a relevance classifier to rerank initial candidates from keyword search. In this work, we tackle the challenge of fine-tuning these models for specific domains in a data and computationally efficient manner. Typically, researchers fine-tune models using corpus-specific labeled data from sources such as TREC. We first answer the question: How much data of this type do we need? Recognizing that the most computationally efficient training is no training, we explore zero-shot ranking using BERT models that have already been fine-tuned with the large MS MARCO passage retrieval dataset. We arrive at the surprising and novel finding that “some” labeled in-domain data can be worse than none at all.
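The pipeline the abstract describes, first-stage keyword retrieval followed by a relevance model reranking the candidates, can be sketched as below. The scoring function here is a toy term-overlap stand-in for the BERT cross-encoder the paper uses (e.g. one fine-tuned on MS MARCO); the function names and example passages are illustrative assumptions, not taken from the paper or its code.

```python
def rerank(query, candidates, score_fn):
    """Reorder first-stage candidates by descending relevance score."""
    scored = [(score_fn(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored]

def toy_score(query, passage):
    # Placeholder relevance model: count of query terms appearing in the
    # passage. In the paper's setting this would be the relevance
    # probability from a BERT classifier over the (query, passage) pair.
    query_terms = set(query.lower().split())
    return sum(1 for term in passage.lower().split() if term in query_terms)

# Candidates as returned by a hypothetical keyword search (BM25, etc.).
candidates = [
    "MS MARCO is a large passage retrieval dataset",
    "BERT can serve as a relevance classifier for reranking",
    "keyword search produces the initial candidate list",
]
print(rerank("BERT relevance classifier", candidates, toy_score)[0])
```

In the zero-shot setting the paper studies, `toy_score` would be replaced by an MS MARCO-trained BERT model applied to the target domain with no further fine-tuning.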
