SGPT: GPT Sentence Embeddings for Semantic Search
Niklas Muennighoff
Code: https://github.com/Muennighoff/sgpt
Abstract
Decoder transformers have continued to grow in scale, reaching hundreds of billions of parameters. Thanks to this scale, the same decoder sets state-of-the-art results on various language tasks via prompting or fine-tuning. Yet these large foundation models remain unusable for the related fields of semantic search and sentence embeddings. This prevents possibly new state-of-the-art results and forces organizations to train and maintain separate models. To this end, we propose SGPT, which uses decoders for sentence embeddings and semantic search via prompting or fine-tuning. At 5.8 billion parameters, SGPT improves on the previously best sentence embeddings by a margin of 7% and outperforms a concurrent method with 175 billion parameters, as measured on the BEIR search benchmark. Code, models, and result files are freely available at https://github.com/Muennighoff/sgpt.
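The abstract describes turning a decoder into a sentence-embedding model. One plausible sketch of the bi-encoder side is to pool the decoder's last-layer hidden states into a single vector, weighting later positions more heavily since in a causal decoder they have attended to more of the sentence. The weighting scheme and function name below are illustrative assumptions, not the paper's exact recipe; see the repository for the official implementation.

```python
import numpy as np

def weighted_mean_pool(hidden_states, attention_mask):
    """Pool per-token decoder hidden states into one sentence embedding.

    hidden_states: (seq_len, dim) array of last-layer activations.
    attention_mask: (seq_len,) array, 1 for real tokens, 0 for padding.
    Position-proportional weighting is an illustrative choice for causal
    decoders, where later tokens see more context.
    """
    positions = np.arange(1, len(attention_mask) + 1) * attention_mask
    weights = positions / positions.sum()
    return (weights[:, None] * hidden_states).sum(axis=0)

# Toy example: 4 tokens (last one padding) with 3-dim hidden states.
h = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [9.0, 9.0, 9.0]])   # padding row, zeroed out by the mask
mask = np.array([1, 1, 1, 0])
emb = weighted_mean_pool(h, mask)
# Cosine similarity between two such vectors then ranks documents for search.
```

At query time, both the query and each document would be embedded this way, with retrieval scored by cosine similarity.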
Benchmark Results
| Dataset | Model | Metric | Claimed | Verified | Status |
|---|---|---|---|---|---|
| BioASQ (BEIR) | SGPT-BE-5.8B | nDCG@10 | 0.41 | — | Unverified |
| BioASQ (BEIR) | SGPT-CE-6.1B | nDCG@10 | 0.55 | — | Unverified |
| BioASQ (BEIR) | SGPT-CE-2.7B | nDCG@10 | 0.55 | — | Unverified |
| NFCorpus (BEIR) | SGPT-CE-2.7B | nDCG@10 | 0.33 | — | Unverified |
| NFCorpus (BEIR) | SGPT-BE-5.8B | nDCG@10 | 0.36 | — | Unverified |
| NFCorpus (BEIR) | OpenAI Search-Davinci | nDCG@10 | 0.36 | — | Unverified |
| NFCorpus (BEIR) | SGPT-CE-6.1B | nDCG@10 | 0.35 | — | Unverified |
| TREC-COVID (BEIR) | SGPT-CE-6.1B | nDCG@10 | 0.79 | — | Unverified |
| TREC-COVID (BEIR) | SGPT-CE-2.7B | nDCG@10 | 0.76 | — | Unverified |
| TREC-COVID (BEIR) | SGPT-BE-5.8B | nDCG@10 | 0.87 | — | Unverified |
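All results above are reported as nDCG@10, the standard BEIR metric: discounted cumulative gain over the top 10 ranked documents, normalized by the gain of an ideal ranking. A minimal sketch of the computation (the relevance grades in the example are made up for illustration):

```python
import math

def dcg_at_k(relevances, k):
    # Discounted cumulative gain: each grade is discounted by log2(rank + 1).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    # Normalize by the DCG of an ideal (descending-grade) ranking.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of the top retrieved documents for one query (toy data).
ranking = [3, 2, 3, 0, 1, 2]
score = ndcg_at_k(ranking, 10)
```

A perfect ranking scores 1.0; the benchmark numbers are nDCG@10 averaged over all queries in each dataset.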