From Vigilance to Veracity: Hallucination Detection, Mitigation, and Safety Enhancement in Large Language Models

dc.contributor.author: Dasgupta, Sharanya
dc.date.accessioned: 2025-07-15T09:47:03Z
dc.date.available: 2025-07-15T09:47:03Z
dc.date.issued: 2025-06
dc.description: Dissertation under the supervision of Dr. Swagatam Das
dc.description.abstract: Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, large language models (LLMs) have garnered widespread attention for their adeptness at generating novel responses to given prompts across a multitude of domains, yet they exhibit a critical limitation: a propensity to produce factually incorrect and potentially harmful content while preserving syntactic coherence and logical structure. In this work, we hypothesize that these deficiencies originate from the internal representational dynamics of LLMs. Our observations indicate that, during passage generation, LLMs subtly deviate from factual accuracy in a manner analogous to human cognition, maintaining logical coherence while embedding misinformation in minor segments. To address this challenge, we introduce HalluShift, a hallucination detection framework that analyzes distribution shifts within an LLM’s internal state space and token probability distributions. Effective mitigation, however, requires addressing both factual inaccuracies and content that violates societal standards. We argue that these seemingly disparate issues stem from a “concept misalignment” within the internal space of LLMs. Rather than treating them as distinct alignment challenges, we propose that selective intervention through an external regulatory network can simultaneously correct both falsehoods and unsafe outputs without fine-tuning the underlying model parameters. Building on this hypothesis, we present ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and rectifies misaligned features through context-sensitive soft refusals alongside factual corrections. Empirical evaluation across multiple benchmark datasets demonstrates the superior performance of HalluShift relative to existing detection baselines. Moreover, ARREST not only regulates misalignment effectively but also exhibits greater versatility than RLHF-aligned models, particularly in generating contextually nuanced soft refusals through adversarial training.
dc.identifier.citation: 59p.
dc.identifier.uri: http://hdl.handle.net/10263/7564
dc.language.iso: en
dc.publisher: Indian Statistical Institute, Kolkata
dc.relation.ispartofseries: MTech(CS) Dissertation;23-30
dc.subject: Large language models
dc.subject: Hallucination
dc.subject: Mitigation
dc.subject: Alignment
dc.subject: Distribution shift
dc.subject: Token probability
dc.subject: Safety
dc.title: From Vigilance to Veracity: Hallucination Detection, Mitigation, and Safety Enhancement in Large Language Models
dc.type: Other
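As an illustrative aid, the sketch below shows one way the signals described in the abstract could be computed: layer-to-layer distribution shifts in a model's hidden states combined with token-probability statistics. It is a minimal sketch under stated assumptions, not the dissertation's actual method; the stand-in model ("gpt2"), the cosine-based shift measure, the entropy and confidence features, and the function name hallucination_features are all hypothetical choices.

    # Hypothetical sketch of HalluShift-style signals: layer-wise representational
    # drift plus token-probability statistics. All modeling choices below are
    # illustrative assumptions, not the dissertation's method.
    import torch
    import torch.nn.functional as F
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # small stand-in LLM
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def hallucination_features(text: str) -> torch.Tensor:
        """Summarize internal-state drift and token confidence for one passage."""
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)

        # Distribution-shift proxy: 1 - cosine similarity between the
        # mean-pooled hidden states of consecutive layers.
        pooled = [h.mean(dim=1).squeeze(0) for h in out.hidden_states]
        shifts = torch.stack([
            1 - F.cosine_similarity(pooled[i], pooled[i + 1], dim=0)
            for i in range(len(pooled) - 1)
        ])

        # Token-probability statistics: entropy and top-token confidence of
        # the next-token distributions, averaged over the passage.
        probs = F.softmax(out.logits.squeeze(0), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        confidence = probs.max(dim=-1).values

        return torch.cat([shifts, entropy.mean().view(1), confidence.mean().view(1)])

In a detection pipeline, feature vectors like these would feed a lightweight classifier (for example, logistic regression) trained on passages labeled factual versus hallucinated.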

Files

Original bundle

Name: M.tech _Sharanya_Dasgupta_CS2320.pdf
Size: 12.79 MB
Format: Adobe Portable Document Format
Description: Dissertations - M Tech (CS)

License bundle

Name: license.txt
Size: 1.71 KB
Description: Item-specific license agreed upon to submission