
Browsing by Author "Dasgupta, Sharanya"

    Item
    From Vigilance to Veracity: Hallucination Detection, Mitigation, and Safety Enhancement in Large Language Models
    (Indian Statistical Institute, Kolkata, 2025-06) Dasgupta, Sharanya
    Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality, and it learns to self-correct when subtle drifts lead to hallucinations or unsafe associations. In recent years, large language models (LLMs) have garnered widespread attention for their adeptness at generating innovative responses to given prompts across a multitude of domains, yet they exhibit a critical limitation: a propensity to produce factually incorrect and potentially harmful content while preserving syntactic coherence and logical structure. In this work, we hypothesize that these deficiencies originate in the models’ internal representational dynamics. Our observations indicate that, during passage generation, LLMs subtly deviate from factual accuracy in a manner analogous to human cognition, maintaining logical coherence while embedding misinformation in minor segments. To address this challenge, we introduce HalluShift, a hallucination detection framework that analyzes distribution shifts within LLMs’ internal state spaces and token probability distributions. Effective mitigation, however, necessitates addressing both factual inaccuracies and content that violates societal standards. We argue that these seemingly disparate issues stem from a “concept misalignment” within the internal space of LLMs. Rather than treating them as distinct alignment challenges, we propose that selective intervention through an external regulatory network can simultaneously correct both falsehoods and unsafe outputs without fine-tuning the underlying model parameters. Building on this hypothesis, we present ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework designed to identify and rectify misaligned features through context-sensitive soft refusals alongside factual corrections. Empirical evaluation across multiple benchmark datasets demonstrates the superior performance of HalluShift relative to existing detection baselines. Moreover, ARREST not only effectively regulates misalignment but also exhibits greater versatility than RLHF-aligned models, particularly in generating contextually nuanced soft refusals through adversarial training.
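To make the detection idea concrete, the sketch below illustrates the general recipe the abstract describes: summarize how hidden-state distributions shift across an LLM's layers, combine those shift statistics with token-probability features, and score a passage with a lightweight classifier. Everything here (the per-dimension Wasserstein shift measure, the specific feature set, the synthetic stand-in data) is an illustrative assumption, not the actual HalluShift method.

```python
# Minimal sketch of distribution-shift-based hallucination scoring.
# NOTE: illustrative only; feature choices and the Wasserstein proxy
# are assumptions, not the thesis's implementation.
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.linear_model import LogisticRegression

def shift_features(hidden_states: np.ndarray, token_probs: np.ndarray) -> np.ndarray:
    """hidden_states: (layers, tokens, dim); token_probs: (tokens, vocab)."""
    # Layer-to-layer shift of hidden activations, measured per dimension
    # with a 1-D Wasserstein distance and averaged over dimensions.
    layer_shifts = [
        np.mean([wasserstein_distance(hidden_states[l, :, d],
                                      hidden_states[l + 1, :, d])
                 for d in range(hidden_states.shape[2])])
        for l in range(hidden_states.shape[0] - 1)
    ]
    # Token-level probability features: per-step entropy and max probability.
    entropy = -(token_probs * np.log(token_probs + 1e-12)).sum(axis=1)
    return np.concatenate([
        [np.mean(layer_shifts), np.max(layer_shifts)],
        [entropy.mean(), entropy.max(), token_probs.max(axis=1).mean()],
    ])

# Usage with synthetic data standing in for a real LLM's activations.
rng = np.random.default_rng(0)
X = np.stack([shift_features(rng.normal(size=(4, 16, 8)),          # 4 layers, 16 tokens
                             rng.dirichlet(np.ones(50), size=16))  # toy vocab of 50
              for _ in range(40)])
y = rng.integers(0, 2, size=40)           # 1 = hallucinated, 0 = faithful (toy labels)
clf = LogisticRegression().fit(X, y)      # lightweight passage-level detector
print(clf.predict_proba(X[:3])[:, 1])     # hallucination scores for 3 passages
```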

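Similarly, the external-regulation idea behind ARREST can be sketched as a small trainable module that selectively edits a frozen model's hidden states, so corrections require no fine-tuning of the base parameters. The Regulator class, its sigmoid gating, and the stand-in transformer layer below are hypothetical illustrations of that pattern, not the thesis's architecture.

```python
# Hedged sketch: an external regulator applied to a frozen base layer.
import torch
import torch.nn as nn

class Regulator(nn.Module):
    """Gated residual correction applied to one layer's hidden states."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.correction = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Intervene only where the gate fires; elsewhere h passes through.
        g = self.gate(h)                      # (batch, seq, 1) in [0, 1]
        return h + g * self.correction(h)     # selective, additive edit

# Frozen "base model" stand-in; a real LLM block would go here.
base = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
for p in base.parameters():
    p.requires_grad_(False)                   # base parameters stay untouched

reg = Regulator(64)                           # only these weights would be trained
h = torch.randn(2, 10, 64)                    # toy batch of hidden states
out = reg(base(h))                            # regulate the frozen layer's output
print(out.shape)                              # torch.Size([2, 10, 64])
```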