A Little Bit Is Worse Than None: Ranking with Limited Training Data
Xinyu Zhang, Andrew Yates, Jimmy Lin
Code (official): github.com/crystina-z/a-little-bit-is-worse-than-none
Abstract
Researchers have proposed simple yet effective techniques for the retrieval problem based on using BERT as a relevance classifier to rerank initial candidates from keyword search. In this work, we tackle the challenge of fine-tuning these models for specific domains in a data and computationally efficient manner. Typically, researchers fine-tune models using corpus-specific labeled data from sources such as TREC. We first answer the question: How much data of this type do we need? Recognizing that the most computationally efficient training is no training, we explore zero-shot ranking using BERT models that have already been fine-tuned with the large MS MARCO passage retrieval dataset. We arrive at the surprising and novel finding that “some” labeled in-domain data can be worse than none at all.
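The reranking setup the abstract describes — keyword search produces initial candidates, and a relevance classifier rescores them — can be sketched as below. This is a minimal illustration, not the paper's implementation: `bert_relevance_score` is a hypothetical stand-in (simple token overlap); in the paper's setting it would be a BERT cross-encoder scoring each query–passage pair, fine-tuned on MS MARCO for the zero-shot case.

```python
# Sketch of the rerank pipeline: a first-stage keyword retriever supplies
# candidates, and a relevance classifier reorders them by score.

def bert_relevance_score(query: str, passage: str) -> float:
    # Hypothetical stand-in for a BERT cross-encoder: fraction of query
    # terms that appear in the passage.
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)

def rerank(query: str, candidates: list[str], k: int = 10) -> list[str]:
    # Score every candidate with the classifier and keep the top k.
    scored = sorted(candidates,
                    key=lambda p: bert_relevance_score(query, p),
                    reverse=True)
    return scored[:k]

candidates = [
    "the weather today is sunny",
    "bert reranks passages for retrieval",
    "passage retrieval with bert rerankers",
]
print(rerank("bert passage retrieval", candidates, k=2))
# → ['passage retrieval with bert rerankers', 'bert reranks passages for retrieval']
```

In a real system the stand-in scorer would dominate the cost, which is why the paper's question of how much (if any) in-domain fine-tuning the classifier needs matters for efficiency.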