BM25S: Orders of magnitude faster lexical search via eager sparse scoring

2024-07-04

Xing Han Lù

Abstract

We introduce BM25S, an efficient Python-based implementation of BM25 that only depends on Numpy and Scipy. BM25S achieves up to a 500x speedup compared to the most popular Python-based framework by eagerly computing BM25 scores during indexing and storing them into sparse matrices. It also achieves considerable speedups compared to highly optimized Java-based implementations, which are used by popular commercial products. Finally, BM25S reproduces the exact implementation of five BM25 variants based on Kamphuis et al. (2020) by extending eager scoring to non-sparse variants using a novel score shifting method. The code can be found at https://github.com/xhluca/bm25s
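
The eager scoring idea in the abstract can be illustrated with a minimal sketch. This is not the BM25S code: it swaps the SciPy sparse matrices for plain Python dictionaries to stay dependency-free, and the function names (`build_eager_index`, `search`), the Lucene-style scoring formula, and the defaults `k1=1.5`, `b=0.75` are assumptions chosen for illustration. The point it shows is that all BM25 arithmetic happens once at indexing time, so a query reduces to looking up and summing stored scores.

```python
import math
from collections import Counter, defaultdict


def build_eager_index(corpus_tokens, k1=1.5, b=0.75):
    """Precompute a BM25 score for every (term, document) pair that co-occurs.

    Hypothetical helper for illustration; k1 and b are assumed defaults.
    """
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    # Sparse storage: term -> {doc_id: score}. Absent pairs implicitly score 0.
    index = defaultdict(dict)
    for doc_id, doc in enumerate(corpus_tokens):
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        for term, tf in Counter(doc).items():
            # Lucene-style IDF (an assumed variant), which is always positive.
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            index[term][doc_id] = idf * tf / (tf + length_norm)
    return index, n_docs


def search(index, n_docs, query_tokens, top_k=3):
    """Score a query by summing precomputed entries; no BM25 math at query time."""
    scores = [0.0] * n_docs
    for term in query_tokens:
        for doc_id, score in index.get(term, {}).items():
            scores[doc_id] += score
    ranked = sorted(range(n_docs), key=lambda d: scores[d], reverse=True)
    return [(d, scores[d]) for d in ranked[:top_k]]


corpus = [
    "a cat is a feline and likes to purr".split(),
    "a dog is the human's best friend".split(),
    "a bird is a beautiful animal that can fly".split(),
]
index, n_docs = build_eager_index(corpus)
results = search(index, n_docs, "does the cat purr".split())
```

Storing only the co-occurring pairs works here because this variant assigns a score of zero to any term that is absent from a document. Variants that assign a non-zero score to absent terms are the case the abstract's score shifting method addresses, and this sketch does not cover them.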

Benchmark Results

| Dataset | Model | Metric | Claimed | Verified | Status |
| HotpotQA | BM25S | Queries per second | 20.88 | | Unverified |
| HotpotQA | Rank-BM25 | Queries per second | 0.04 | | Unverified |
| HotpotQA | Elasticsearch | Queries per second | 7.11 | | Unverified |
| Natural Questions | Rank-BM25 | Queries per second | 0.1 | | Unverified |
| Natural Questions | BM25S | Queries per second | 41.85 | | Unverified |
| Natural Questions | Elasticsearch | Queries per second | 12.16 | | Unverified |
| Quora Question Pairs | Elasticsearch | Queries per second | 21.8 | | Unverified |
| Quora Question Pairs | BM25S | Queries per second | 183.53 | | Unverified |
| Quora Question Pairs | Rank-BM25 | Queries per second | 1.18 | | Unverified |
| Quora Question Pairs | BM25-PT | Queries per second | 6.49 | | Unverified |
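
To relate the claimed throughput figures to the speedup quoted in the abstract, the short sketch below divides the BM25S rate by each baseline's rate per dataset. The numbers are copied from the table above and are unverified claims; the variable names are illustrative, and because the listed rates are rounded (for example 0.04), the resulting ratios are only approximate.

```python
# Queries per second as claimed in the table above (unverified figures).
qps = {
    "HotpotQA": {"BM25S": 20.88, "Rank-BM25": 0.04, "Elasticsearch": 7.11},
    "Natural Questions": {"BM25S": 41.85, "Rank-BM25": 0.1, "Elasticsearch": 12.16},
    "Quora Question Pairs": {
        "BM25S": 183.53,
        "Rank-BM25": 1.18,
        "Elasticsearch": 21.8,
        "BM25-PT": 6.49,
    },
}

# Speedup of BM25S over each baseline, per dataset, rounded to one decimal.
speedups = {
    dataset: {
        model: round(rates["BM25S"] / rate, 1)
        for model, rate in rates.items()
        if model != "BM25S"
    }
    for dataset, rates in qps.items()
}

for dataset, ratios in speedups.items():
    print(dataset, ratios)
```

On these figures the largest ratio is against Rank-BM25 on HotpotQA, at roughly 500x, while the ratios against Elasticsearch stay in the single digits.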
