From Vigilance to Veracity: Hallucination Detection, Mitigation, and Safety Enhancement in Large Language Models

dc.contributor.author: Dasgupta, Sharanya
dc.date.accessioned: 2025-07-15T09:47:03Z
dc.date.available: 2025-07-15T09:47:03Z
dc.date.issued: 2025-06
dc.description: Dissertation under the supervision of Dr. Swagatam Das
dc.description.abstract: Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, large language models (LLMs) have garnered widespread attention for their adeptness at generating novel responses to given prompts across a multitude of domains, yet they exhibit a critical limitation: a propensity to produce factually incorrect and potentially harmful content while preserving syntactic coherence and logical structure. In this work, we hypothesize that these deficiencies originate from the internal representational dynamics of LLMs. Our observations indicate that, during passage generation, LLMs subtly deviate from factual accuracy in a manner analogous to human cognition, maintaining logical coherence while embedding misinformation in minor segments. To address this challenge, we introduce HalluShift, a hallucination detection framework that analyzes distribution shifts within an LLM’s internal state space and token probability distributions. Effective mitigation, however, requires addressing both factual inaccuracies and content that violates societal standards. We argue that these seemingly disparate issues stem from a “concept misalignment” within the internal space of LLMs. Rather than treating them as distinct alignment challenges, we propose that selective intervention through an external regulatory network can simultaneously correct both falsehoods and unsafe outputs without fine-tuning the underlying model parameters. Building on this hypothesis, we present ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and rectifies misaligned features through context-sensitive soft refusals alongside factual corrections. Empirical evaluation across multiple benchmark datasets demonstrates the superior performance of HalluShift relative to existing detection baselines. Moreover, ARREST not only regulates misalignment effectively but also exhibits greater versatility than RLHF-aligned models, particularly in generating contextually nuanced soft refusals through adversarial training.
dc.identifier.citation: 59p.
dc.identifier.uri: http://hdl.handle.net/10263/7564
dc.language.iso: en
dc.publisher: Indian Statistical Institute, Kolkata
dc.relation.ispartofseries: MTech(CS) Dissertation;23-30
dc.subject: Large language models
dc.subject: Hallucination
dc.subject: Mitigation
dc.subject: Alignment
dc.subject: Distribution shift
dc.subject: Token probability
dc.subject: Safety
dc.title: From Vigilance to Veracity: Hallucination Detection, Mitigation, and Safety Enhancement in Large Language Models
dc.type: Other
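As an illustrative aid, the sketch below shows one way the signals described in the abstract could be computed: layer-to-layer distribution shifts in a model's hidden states combined with token-probability statistics. It is a minimal sketch under stated assumptions, not the dissertation's actual method; the stand-in model ("gpt2"), the cosine-based shift measure, the entropy and confidence features, and the function name hallucination_features are all hypothetical choices.

    # Hypothetical sketch of HalluShift-style signals: layer-wise representational
    # drift plus token-probability statistics. All modeling choices below are
    # illustrative assumptions, not the dissertation's method.
    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small stand-in LLM
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def hallucination_features(text: str) -> torch.Tensor:
        """Summarize internal-state drift and token confidence for one passage."""
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)

        # Distribution-shift proxy: 1 - cosine similarity between the
        # mean-pooled hidden states of consecutive layers.
        pooled = [h.mean(dim=1).squeeze(0) for h in out.hidden_states]
        shifts = torch.stack([
            1 - F.cosine_similarity(pooled[i], pooled[i + 1], dim=0)
            for i in range(len(pooled) - 1)
        ])

        # Token-probability statistics: entropy and top-token confidence of
        # the next-token distributions, averaged over the passage.
        probs = F.softmax(out.logits.squeeze(0), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        confidence = probs.max(dim=-1).values

        return torch.cat([shifts, entropy.mean().view(1), confidence.mean().view(1)])

In a detection pipeline, feature vectors like these would feed a lightweight classifier (for example, logistic regression) trained on passages labeled factual versus hallucinated.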

Files

Original bundle

Name: M.tech _Sharanya_Dasgupta_CS2320.pdf
Size: 12.79 MB
Format: Adobe Portable Document Format
Description: Dissertations - M Tech (CS)

License bundle

Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon to submission