Theses
Permanent URI for this collection: http://164.52.219.250:4000/handle/10263/2744
Generalization under Sub-population Shift: Equitable Models for Imbalanced, Long-tailed, and Fair Representation Learning
(Indian Statistical Institute, Kolkata, 2026-02-09) Ansari, Faizanuddin

Machine learning systems often suffer performance degradation in real-world scenarios due to subpopulation shift, that is, mismatches in the distribution of classes or attributes within datasets. This thesis investigates generalization failures arising from class imbalance, long-tailed distributions, and attribute-level biases, in particular biases that originate from demographic imbalances in sensitive domains such as medical imaging. It proposes principled strategies to mitigate these effects in both classical and deep learning frameworks. Class imbalance and long-tailed distributions pose significant challenges, especially in real-world applications where minority classes are underrepresented yet critically important. To address these challenges, this work develops novel algorithms and frameworks that enhance model generalization on imbalanced and long-tailed datasets. The contributions span data-level, model-level, and loss-level innovations, each designed to mitigate bias and improve performance on minority classes while maintaining accuracy on majority classes.

First, we propose a data-level solution to classical class imbalance in tabular data: a novel oversampling technique that estimates minority-class statistics via neighborhood-based distributional calibration. Unlike existing methods that rely on synthetic interpolation without accounting for class-specific geometry, the proposed approach preserves the fidelity of the minority-class distribution, yielding significant gains in both binary and multi-label imbalanced settings.

Next, we introduce STTP-Net, a two-pronged framework for long-tailed learning in vision tasks.
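Before turning to STTP-Net, the data-level idea above can be illustrated. The abstract does not specify the calibration estimator, so the following is only a hypothetical sketch of neighborhood-based distributional calibration: each synthetic point is drawn from a Gaussian whose mean and per-feature spread are estimated from a minority sample's k nearest minority neighbors, rather than by SMOTE-style linear interpolation.

```python
import numpy as np

def calibrated_oversample(X_min, n_new, k=5, rng=None):
    """Oversample a minority class from per-point Gaussians calibrated
    on the k nearest minority neighbors.

    Hypothetical sketch only: the thesis's actual estimator is not given
    in the abstract; this merely contrasts distributional calibration
    with plain interpolation between sample pairs.
    """
    rng = np.random.default_rng(rng)
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))                     # pick a seed minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)     # distances within the minority class
        nbrs = X_min[np.argsort(d)[:k]]                  # k nearest neighbors (incl. the seed)
        mu = nbrs.mean(axis=0)                           # calibrated local mean
        sigma = nbrs.std(axis=0) + 1e-6                  # calibrated per-feature spread
        synth.append(rng.normal(mu, sigma))              # sample from the local distribution
    return np.array(synth)
```

Because each draw respects the local geometry of the minority class, the synthetic points stay inside the class's own distribution instead of lying on line segments between existing samples.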
STTP-Net integrates hybrid augmentation and sampling strategies with a newly proposed Effective Balanced Softmax (EBS) loss that corrects label distribution shift, enabling robust feature learning and improved accuracy across head, medium, and tail classes. Extensive evaluations on benchmark datasets such as CIFAR-LT, ImageNet-LT, and NIH-CXR-LT confirm its superiority over state-of-the-art methods.

We then address decision-boundary distortion under class imbalance by invoking the Goldilocks principle to achieve "just-right" boundary fidelity. Building on this idea, we design a training pipeline that produces smoother, more adaptive decision boundaries for tail classes. Specifically, we propose a Dual-Branch Sampler-Guided Mixup (DBSGM) strategy combined with an Adaptive Class-Aware Feature Regularization (ACFR) mechanism. These components jointly enhance intra-class compactness and inter-class separability, improving generalization especially under extreme imbalance. By dynamically adjusting boundaries and applying adaptive regularization, our method achieves optimal boundary fidelity for minority classes without compromising performance on majority classes. Extensive experiments validate its effectiveness across a range of imbalance ratios.

Furthermore, we extend these ideas to medical imaging, addressing both class imbalance and demographic fairness. This includes the Mixture of Two Experts (Mo2E) framework and fairness-aware lesion-classification strategies that ensure equitable performance across subgroups. Mo2E combines asymmetric sampling with adaptive mixup to improve the detection of rare disease classes, and is validated on tasks such as gastrointestinal (GI) tract classification of endoscopic images and diabetic retinopathy (DR) grading.
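The label-distribution correction behind the EBS loss can be sketched. The abstract does not give the EBS formulation, so the code below only shows the underlying Balanced Softmax mechanism it builds on: shifting logits by the log of the training-label priors before the softmax cross-entropy, so that head-class frequency is discounted at training time.

```python
import numpy as np

def balanced_softmax_loss(logits, labels, class_counts):
    """Cross-entropy with logits shifted by log class priors.

    Sketch of the general Balanced Softmax idea; the thesis's Effective
    Balanced Softmax (EBS) loss is not specified in the abstract, so
    this only illustrates the prior-correction mechanism.
    """
    priors = class_counts / class_counts.sum()        # training label distribution
    adjusted = logits + np.log(priors)                # discount frequent classes
    adjusted -= adjusted.max(axis=1, keepdims=True)   # numerical stability
    log_probs = adjusted - np.log(np.exp(adjusted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

With uniform class counts the prior term is constant and cancels, recovering the standard softmax cross-entropy; under a long-tailed count vector, tail-class samples incur a larger loss, pushing the model to compensate for the skewed prior.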
Additionally, we introduce a bias-aware training method that mitigates both class imbalance and skin-tone bias, achieving fair performance across demographic subgroups, as demonstrated on the ASAN and ISIC-2018 datasets. These results lay the groundwork for demographically fair model design in high-stakes medical applications.

Collectively, these contributions advance the field of imbalanced learning by offering scalable, practical solutions grounded in theoretical insight and empirical validation. This thesis provides a comprehensive toolkit for researchers and practitioners confronting subpopulation shift, integrating principled data synthesis, loss rebalancing, and fairness constraints. It pushes the frontier of robust, fair, and generalizable deep learning, particularly in domains where class rarity and demographic underrepresentation have tangible real-world consequences.
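The abstract does not state which equity criterion is used to judge "fair performance across demographic subgroups"; one common way to quantify such disparity, sketched here purely for illustration, is the worst-versus-best accuracy gap across subgroups (e.g., skin-tone groups):

```python
from collections import defaultdict

def subgroup_accuracy_gap(y_true, y_pred, groups):
    """Worst-vs-best accuracy gap across demographic subgroups.

    Illustrative only: the thesis's exact fairness metric is not given in
    the abstract. Returns the gap and the per-subgroup accuracies.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)
    accs = {g: correct[g] / total[g] for g in total}
    return max(accs.values()) - min(accs.values()), accs
```

A gap near zero indicates equitable performance; a large gap flags a subgroup (for instance, a particular skin-tone group) on which the model underperforms even when aggregate accuracy looks healthy.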
