Binary Document Filtering for Retrieval-Augmented Generation
Date
2025-06
Publisher
Indian Statistical Institute, Kolkata
Abstract
Retrieval-Augmented Generation (RAG) has become a popular technique to enhance
Large Language Models (LLMs) with access to external information sources. However,
the success of RAG systems critically depends on the relevance and quality of the retrieved
documents. In particular, supplying irrelevant or noisy context can lead to degraded
downstream generation quality. To address this, our project focuses on improving
the document filtering stage in a RAG pipeline through binary relevance classification:
deciding whether a retrieved document is suitable to include in the final
context window based on its usefulness in directly answering the user query. We explore
a wide range of approaches to this task, including rule-based retrieval methods
(TF-IDF, BM25), classical machine learning classifiers (logistic regression, SVM), deep
neural networks, and LLM-based methods, both in zero-shot and few-shot settings. Our
final pipeline leverages instruction-tuned LLMs to act as strict binary classifiers, with
a focus on maximizing precision over recall, thereby ensuring that only the most relevant
and high-quality documents are passed to the generation module. Experiments
are conducted on a Reddit-based query-document dataset tailored to subjective and
opinion-heavy queries. Our evaluations suggest that LLMs, even without fine-tuning,
can outperform traditional methods in this setting, offering a strong foundation for further
enhancement through supervised fine-tuning.
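
The precision-over-recall idea described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the relevance scores are placeholders that could come from any of the scorers mentioned (BM25, a trained classifier, or an LLM's confidence), and the function names precision_recall, pick_threshold, and filter_documents are hypothetical. The sketch assumes a small labelled validation set and chooses the lowest score threshold that still meets a target precision, so that as many relevant documents as possible are kept without admitting noisy ones.

```python
from typing import List, Sequence, Tuple


def precision_recall(labels: Sequence[bool], preds: Sequence[bool]) -> Tuple[float, float]:
    """Precision and recall of binary predictions against gold labels."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    # With no positive predictions there are no false positives, so treat precision as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(scores: Sequence[float], labels: Sequence[bool], min_precision: float) -> float:
    """Lowest threshold whose validation precision is at least min_precision.

    Returns infinity (keep nothing) if no threshold reaches the target.
    """
    for t in sorted(set(scores)):
        preds = [s >= t for s in scores]
        precision, _ = precision_recall(labels, preds)
        if precision >= min_precision:
            return t
    return float("inf")


def filter_documents(docs: Sequence[str], scores: Sequence[float], threshold: float) -> List[str]:
    """Keep only documents whose score reaches the threshold."""
    return [d for d, s in zip(docs, scores) if s >= threshold]


# Illustrative (made-up) validation data: one score and one gold label per document.
docs = ["d1", "d2", "d3", "d4", "d5"]
scores = [0.95, 0.80, 0.60, 0.40, 0.20]
labels = [True, True, False, True, False]

threshold = pick_threshold(scores, labels, min_precision=1.0)
kept = filter_documents(docs, scores, threshold)
print(threshold, kept)  # prints: 0.8 ['d1', 'd2']
```

In this toy example the strict target discards the relevant document d4 (recall falls to 2/3), which is the trade-off a precision-oriented filter accepts in exchange for a clean context window.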
Description
Dissertation under the supervision of Dr. Debapriyo Majumdar and Dr. Rajkiran Panuganti
Keywords
Retrieval-Augmented Generation, Binary Relevance Classification, Document Filtering, Large Language Models, Precision-Oriented Retrieval, Reddit Dataset, Zero-Shot Inference
Citation
24p.
