BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

2021-04-17

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, Iryna Gurevych

Code

Repository	Framework	Stars	Notes
github.com/UKPLab/beir	tf	2,117	Official, in paper
github.com/osu-nlp-group/hipporag	none	3,302	
github.com/beir-cellar/beir	pytorch	2,117	

Abstract

Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to help researchers broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems, including lexical, sparse, dense, late-interaction and re-ranking architectures, on the BEIR benchmark. Our results show that BM25 is a robust baseline and that re-ranking and late-interaction-based models achieve the best zero-shot performance on average, although at high computational cost. In contrast, dense and sparse-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards more robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.
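
To make the lexical baseline mentioned in the abstract concrete, the following is a minimal sketch of Okapi BM25 scoring. It is an illustration only, not the implementation used in the paper: the function name `bm25_scores`, the whitespace tokenizer, the toy documents, the smoothed IDF variant, and the parameter values `k1=1.2` and `b=0.75` are all assumptions chosen for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25.

    Hypothetical helper: tokenization is plain lowercase whitespace
    splitting, and k1/b are assumed values, not the paper's settings.
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" keeps the value non-negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for illustration).
docs = [
    "covid vaccine trial results",
    "dietary fiber and heart health",
    "vaccine side effects in adults",
]
scores = bm25_scores("vaccine trial", docs)
# Document indices ordered from most to least relevant.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Because the scoring relies only on term statistics of the target corpus, it needs no training data, which is one plausible reason a lexical method can transfer across domains in a zero-shot setting.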

Tasks

Argument Retrieval, Benchmarking, Biomedical Information Retrieval, Citation Prediction, Duplicate-Question Retrieval, Entity Retrieval, Fact Checking, Information Retrieval, News Retrieval, Passage Retrieval, Question Answering, Re-Ranking, Retrieval, Text Retrieval, Tweet Retrieval

Benchmark Results

Dataset	Model	Metric	Claimed	Verified	Status
BioASQ (BEIR)	BM25+CE	nDCG@10	0.52	—	Unverified
BioASQ (BEIR)	BM25	nDCG@10	0.51	—	Unverified
NFCorpus (BEIR)	BM25+CE	nDCG@10	0.35	—	Unverified
NFCorpus (BEIR)	ColBERT	nDCG@10	0.31	—	Unverified
TREC-COVID (BEIR)	BM25+CE	nDCG@10	0.76	—	Unverified
TREC-COVID (BEIR)	ColBERT	nDCG@10	0.68	—	Unverified
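
The metric in the table above, nDCG@10, can be sketched as follows. This is a minimal illustration assuming the linear-gain formulation (gain equal to the graded relevance, discounted by log2 of rank + 1); the helper names `dcg_at_k` and `ndcg_at_k`, the `qrels` dictionary, and the document IDs are hypothetical and not taken from the paper or its code.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (linear gain)."""
    # enumerate() is 0-based, so the rank-1 item is divided by log2(2) = 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document IDs in the order the system returned them.
    qrels: mapping of document ID to graded relevance; missing IDs count as 0.
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking lists all judged documents by decreasing relevance.
    ideal_gains = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy relevance judgments for a single query (made up for illustration).
qrels = {"d1": 2, "d2": 1, "d3": 0}
perfect = ndcg_at_k(["d1", "d2", "d3"], qrels)  # ideal ordering
swapped = ndcg_at_k(["d2", "d1", "d3"], qrels)  # top two results swapped
```

A benchmark-level score such as those in the table would then be the mean of this per-query value over all test queries of a dataset.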
