Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study

2024-09-03Code Available0· sign in to hype

Shuo Yu, Mingyue Cheng, Jiqian Yang, Jie Ouyang, Yucong Luo, Chenyi Lei, Qi Liu, Enhong Chen

Code Available — Be the first to reproduce this paper.

Code

github.com/ustcagi/rag-x
OfficialIn papernone★ 7
github.com/ustcagi/pruningrag
OfficialIn papernone★ 7

Abstract

Retrieval-augmented generation (RAG) is increasingly recognized as an effective approach for mitigating the hallucination of large language models (LLMs) through the integration of external knowledge. While numerous efforts, most studies focus on a single type of externeal knowledge source. However, in real-world applications, most situations involve diverse knowledge from various sources, yet this area has been less explored. The main dilemma is the lack of a suitable dataset containing multiple knowledge sources and pre-exploration of the associated issues. To address these challenges, we standardize a benchmark dataset that combines structured and unstructured knowledge across diverse and complementary domains. Based on this dataset, we further develop a plug-and-play RAG framework, PruningRAG, whose main characteristic is to employ multi-granularity pruning strategies for optimizing the integration of relevant information and minimizing misleading context. Building upon the standardized dataset and PruningRAG, we also report a series of experimental results, as well as insightful findings. Our dataset and code are publicly availablehttps://github.com/USTCAGI/PruningRAG, with the aim of advancing future research in the RAG community.

Tasks

Benchmarking Hallucination RAG Retrieval Retrieval-augmented Generation

Multi-Source Knowledge Pruning for Retrieval-Augmented Generation: A Benchmark and Empirical Study

Code

Abstract

Tasks

Reproductions