Binary Document Filtering for Retrieval-Augmented Generation

dc.contributor.author: Saha, Sreyan
dc.date.accessioned: 2025-07-21T10:51:08Z
dc.date.available: 2025-07-21T10:51:08Z
dc.date.issued: 2025-06
dc.description: Dissertation under the supervision of Dr. Debapriyo Majumdar and Dr. Rajkiran Panuganti [en_US]
dc.description.abstract: Retrieval-Augmented Generation (RAG) has become a popular technique to enhance Large Language Models (LLMs) with access to external information sources. However, the success of RAG systems critically depends on the relevance and quality of the retrieved documents. In particular, supplying irrelevant or noisy context can lead to degraded downstream generation quality. To address this, our project focuses on improving the document filtering stage in a RAG pipeline through binary relevance classification, deciding whether a retrieved document is suitable to include in the final context window based on its usefulness in directly answering the user query. We explore a wide range of approaches to this task, including rule-based retrieval methods (TF-IDF, BM25), classical machine learning classifiers (logistic regression, SVM), deep neural networks, and LLM-based methods, both in zero-shot and few-shot settings. Our final pipeline leverages instruction-tuned LLMs to act as strict binary classifiers, with a focus on maximizing precision over recall, thereby ensuring that only the most relevant and high-quality documents are passed to the generation module. Experiments are conducted on a Reddit-based query-document dataset tailored to subjective and opinion-heavy queries. Our evaluations suggest that LLMs, even without fine-tuning, can outperform traditional methods in this setting, offering a strong foundation for further enhancement through supervised fine-tuning. [en_US]
dc.identifier.citation: 24p. [en_US]
dc.identifier.uri: http://hdl.handle.net/10263/7587
dc.language.iso: en [en_US]
dc.publisher: Indian Statistical Institute, Kolkata [en_US]
dc.relation.ispartofseries: MTech(CS) Dissertation;23-25
dc.subject: Retrieval-Augmented Generation [en_US]
dc.subject: Binary Relevance Classification [en_US]
dc.subject: Document Filtering [en_US]
dc.subject: Large Language Models [en_US]
dc.subject: Precision-Oriented Retrieval [en_US]
dc.subject: Reddit Dataset [en_US]
dc.subject: Zero-Shot Inference [en_US]
dc.title: Binary Document Filtering for Retrieval-Augmented Generation [en_US]
dc.type: Other [en_US]

Files

Original bundle

Name: dissertation_report.pdf
Size: 337.94 KB
Format: Adobe Portable Document Format
Description: Dissertations - M Tech (CS)

Name: plagiarism_check_report.pdf
Size: 294.44 KB
Format: Adobe Portable Document Format
Description: Plagiarism_report

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission