Dissertation and Thesis

Permanent URI for this community: http://164.52.219.250:4000/handle/10263/2146

Search Results

Now showing 1 - 10 of 426
  • Item
    Flexible Modeling of non-Gaussian Longitudinal Data: Some Approaches using Copula
    (Indian Statistical Institute, Kolkata, 2026-03-16) Chattopadhyay, Subhajit
    Longitudinal data are common in medical and biological sciences, where measurements are gathered from subjects over time to explore relationships with explanatory variables (covariates) and to uncover the underlying mechanisms of dependence among these measurements. The responses observed at each instance can be either discrete or continuous. One of the primary challenges in longitudinal data analysis lies in the non-Gaussian nature of the response variables. As a result, there are relatively few multivariate models in the literature that effectively address the specific characteristics observed in such datasets. In this dissertation, we address four problems concerning longitudinal data analysis by developing new statistical models. These models specifically address the time-related relationships found in various types of non-Gaussian longitudinal data by employing suitable classes of parametric copulas. In the third chapter of this dissertation, we examine a motivating dataset from a recent HIV/AIDS study conducted in Livingstone district, Zambia. The histogram plots of the repeated measurements at each time point reveal asymmetry in the marginal distributions, and pairwise scatter plots uncover non-elliptical dependence patterns. Traditional linear mixed models, typically used for longitudinal data, struggle to capture these complexities effectively. We introduce skew-elliptical copula-based mixed models to analyze these continuous data, where we use generalized linear mixed models (GLMM) for the marginals (e.g., Gamma mixed model), and address the temporal dependence of repeated measurements by utilizing copulas associated with skew-elliptical distributions (such as skew-normal/skew-t). The proposed class of copula-based mixed models addresses asymmetry, between-subject variability, and non-standard temporal dependence simultaneously, thereby extending beyond the limitations of standard linear mixed models based on multivariate normality.
We estimate the model parameters using the inference functions for margins (IFM) method, and outline the procedure for obtaining standard errors of the parameter estimates. To evaluate the performance of this approach under finite sample conditions, rigorous simulation studies are conducted, encompassing skewed and symmetric marginal distributions along with various copula selections. Finally, we apply these models to the HIV dataset and present the insights gained from the analysis. In the fourth chapter of this dissertation, we introduce factor copula models tailored for unbalanced non-Gaussian longitudinal data. Modeling the joint distribution of such data, where subjects may have varying numbers of repeated measurements and responses can be continuous or discrete, poses practical challenges, especially with numerous measurements per subject. Factor copula models, which are canonical vine copulas, leverage latent variables to elucidate the underlying dependence structure of multivariate data. This approach aids in interpretation and implementation for unbalanced longitudinal datasets, enhancing our ability to model complex dependencies effectively. We develop regression models for continuous, binary, and ordinal longitudinal data, incorporating covariates, using factor copula constructions with subject-specific latent variables. With consideration for homogeneous within-subject dependence, the proposed models enable feasible parametric inference in moderate- to high-dimensional scenarios, employing a two-stage (IFM) estimation method. We also present a method for evaluating the residuals of factor copula models to visually assess the goodness of fit. The performance of the proposed models in finite samples is assessed through extensive simulation studies. In empirical analyses, we apply these models to analyze various longitudinal responses from two real-world datasets.
Furthermore, we compare the performance of these models with widely used random effects models using standard selection techniques, revealing significant improvements. Our findings suggest that factor copula models can serve as viable alternatives to random effects models, offering deeper insights into the temporal dependence of longitudinal data across diverse contexts. In the fifth chapter of this dissertation, we address the issue of modeling complex and hidden temporal dependence of count longitudinal data. Multivariate elliptical copulas are typically preferred in the statistical literature for analyzing dependence between repeated measurements of longitudinal data, since they allow for different choices of the correlation structure. However, these copulas lack flexibility in modeling dependence, and inference is only feasible under parametric restrictions. In this chapter, we propose the use of finite mixtures of elliptical copulas to enhance the modeling of temporal dependence in discrete longitudinal data. This approach enables the utilization of distinct correlation matrices within each component of the mixture copula. We theoretically explore the dependence properties of finite mixtures of copulas before employing them to construct regression models for count longitudinal data. Inference for this proposed class of models is based on a composite likelihood approach, and we evaluate the finite sample performance of parameter estimates through extensive simulation studies. To validate the fitting of the proposed models, we extend traditional techniques and introduce the t-plot method to accommodate finite mixtures of elliptical copulas. Finally, we apply the proposed models to analyze the temporal dependence within two real-world count longitudinal datasets and demonstrate their superiority over standard elliptical copulas.
In the final contributing chapter of this dissertation, we introduce a novel multivariate copula based on the multivariate geometric skew-normal (GSN) distribution. This asymmetric copula serves as an alternative to the skew-normal copula proposed by Azzalini. Unlike the standard skew-normal copula, the multivariate GSN copula retains closure properties under marginalization, which offers computational advantages for modeling multivariate discrete data. In this chapter, we outline the construction of the geometric skew-normal copula and its application in modeling the temporal dependence observed in non-Gaussian longitudinal data. We begin by exploring the theoretical properties of the proposed multivariate copula. Subsequently, we develop regression models tailored for both continuous and discrete longitudinal data using this framework. Notably, the quantile function of this copula remains independent of the correlation matrix of its respective multivariate distribution, offering computational advantages in likelihood inference compared to copulas derived from the skew-elliptical distributions proposed by Azzalini. Furthermore, composite likelihood inference becomes feasible for this multivariate copula, allowing for parameter estimation from ordered probit models with the same dependence structure as the geometric skew-normal distribution. We conduct extensive simulation studies to validate the geometric skew-normal copula-based models and apply them to analyze the longitudinal dependence of two real-world datasets. Finally, we present our findings in terms of the improvements over regression models based on multivariate Gaussian copulas.
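The two-stage IFM workflow described in this abstract can be sketched in Python. The Gaussian copula and common gamma margin below are illustrative stand-ins (the thesis uses skew-elliptical copulas and GLMM margins), and all parameter values are invented for the simulation:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulate balanced longitudinal data: n subjects, T repeated gamma-distributed
# measurements with exchangeable Gaussian-copula dependence.
n, T, rho = 500, 4, 0.6
R = np.full((T, T), rho)
np.fill_diagonal(R, 1.0)
z = rng.multivariate_normal(np.zeros(T), R, size=n)
y = stats.gamma.ppf(stats.norm.cdf(z), a=2.0, scale=1.5)

# Stage 1 of IFM: fit each (here, common) margin by maximum likelihood.
a_hat, _, scale_hat = stats.gamma.fit(y.ravel(), floc=0)

# Stage 2 of IFM: push the data through the fitted margins and estimate the
# copula's dependence parameters from the resulting normal scores.
u = np.clip(stats.gamma.cdf(y, a=a_hat, scale=scale_hat), 1e-10, 1 - 1e-10)
R_hat = np.corrcoef(stats.norm.ppf(u), rowvar=False)
```

Because the margins and the copula are fitted in separate stages, each stage stays low-dimensional, which is what makes IFM attractive relative to full joint maximum likelihood.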
  • Item
    Implementing a Health Recommendation System from Wearable Data
    (Indian Statistical Institute, Kolkata, 2025-07-22) Pramanick, Priti
    The rising interest in personalized health monitoring has created a demand for intelligent systems that not only evaluate an individual’s health status but also offer actionable recommendations. This dissertation presents a data-driven approach to assess overall health by calculating a weekly health score using multi-dimensional data sources such as sleep patterns, nutrition, cardiovascular activity, fitness levels, and metabolic parameters. The system integrates and processes data stored in MongoDB using Python, applies scoring logic tailored to each health domain, and aggregates them into a unified health score. Additionally, the system generates a detailed summary and leverages a language model to extract personalized recommendations aimed at improving user well-being. A comprehensive PDF health report is produced, featuring score visualizations and advice tailored to the individual. The implementation was tested across multiple profiles, and evaluation metrics indicate that the approach is both adaptive and insightful. This work not only demonstrates a scalable pipeline for health analysis but also opens up opportunities for future integration of machine learning and deeper behavioral insights.
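The aggregation step can be sketched as a weighted average of per-domain scores. The weights, domain names, and 0-100 scale below are hypothetical; the thesis's actual per-domain scoring logic is more detailed:

```python
# Hypothetical domain weights (summing to 1.0) -- illustrative only.
WEIGHTS = {"sleep": 0.25, "nutrition": 0.20, "cardio": 0.25,
           "fitness": 0.15, "metabolic": 0.15}

def weekly_health_score(domain_scores):
    """Aggregate per-domain scores (each on a 0-100 scale) into one weekly score."""
    return round(sum(WEIGHTS[d] * domain_scores[d] for d in WEIGHTS), 1)
```

A summary paragraph and language-model recommendations would then be keyed off the individual domain components rather than the single aggregate, so that weak domains can be targeted specifically.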
  • Item
    Dynamic Sparsification in Secure Gradient Aggregation for Federated Learning
    (Indian Statistical Institute, Kolkata, 2025-07-23) Samanta, Bikash
    Secure aggregation is a critical component of privacy-preserving federated learning. However, existing fixed-sparsity approaches often incur unnecessary communication overhead. We present DynamicSecAgg, a novel framework that introduces dynamic sparsity while preserving coordinate-level privacy. Our method achieves significant improvements in communication efficiency while maintaining — and in some cases improving — model accuracy across both IID and non-IID user distributions. The framework maintains information-theoretic privacy guarantees via adaptive gradient thresholding and polynomial-based aggregation, proving particularly effective under heterogeneous data settings. These results establish dynamic sparsity as a key optimization for efficient and privacy-preserving federated learning.
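The core idea of keeping only the largest-magnitude gradient coordinates, with a threshold adapted each round, can be sketched as follows. The keep ratio and masking scheme are illustrative; DynamicSecAgg additionally encodes the surviving coordinates for polynomial-based secure aggregation:

```python
import numpy as np

def sparsify(grad, keep_ratio):
    """Zero out all but the top-k coordinates of a gradient by magnitude."""
    k = max(1, int(keep_ratio * grad.size))
    thresh = np.partition(np.abs(grad), -k)[-k]   # adaptive magnitude threshold
    mask = np.abs(grad) >= thresh
    return grad * mask, mask
```

Only the masked gradient (and, privately, the mask) needs to travel, which is where the communication savings over fixed-sparsity schemes come from.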
  • Item
    Efficient SIMD based Implementation of Xoodyak
    (Indian Statistical Institute, Kolkata, 2025-07-11) Biswas, Soham
    Modern computing devices—particularly in the domains of the Internet of Things (IoT), mobile computing, and embedded systems—often operate under severe resource constraints in terms of processing power, memory (RAM/ROM), bandwidth, and battery life. Devices such as IoT sensors, smart cards, medical implants, RFID tags, and wearable systems typically rely on low-power hardware, including 8-bit microcontrollers with only a few kilobytes of memory. Conventional cryptographic algorithms are frequently unsuitable for such environments, as they may consume excessive power, introduce unacceptable latency, or fail to execute altogether. Lightweight cryptography addresses these challenges by providing cryptographic primitives specifically designed to operate efficiently on constrained hardware. With the rapid growth of IoT, billions of low-power devices are being deployed annually, all of which require fundamental security services such as encryption for data privacy, authentication for identity verification, and integrity protection to detect tampering. In response, international standardization bodies such as NIST and ISO have initiated efforts to define lightweight cryptographic standards. Notably, NIST’s Lightweight Cryptography Project aims to standardize algorithms that offer an effective balance between security and performance in resource-limited environments. Xoodyak is a modern lightweight cryptographic scheme developed for constrained platforms including IoT devices, embedded systems, and other resource-limited applications. It supports authenticated encryption, hashing, and pseudo-random number generation within a compact and efficient design, making it well suited for environments with strict limitations on memory, power, and computational capacity. Xoodyak was designed by Guido Bertoni, Joan Daemen, Michael Peeters, and Gilles Van Assche, who are also among the creators of Keccak (SHA-3). 
The scheme is built around the Xoodoo permutation, from which it derives its name, and was submitted to NIST’s Lightweight Cryptography Project, where it was recognized for its strong security properties and efficient performance across diverse platforms. Although Xoodyak is highly efficient on 8-bit, 16-bit, and 32-bit microcontrollers due to its compact code size and reliance on a single permutation for multiple cryptographic services, its design also enables a high degree of parallelism. This characteristic makes it suitable for deployment on powerful server-class processors that manage large numbers of constrained devices. In this work, we explore SIMD-based implementations of Xoodyak on modern Intel processors supporting AVX2 and AVX-512 instruction sets. While the eXtended Keccak Code Package (XKCP) provides up to 16-way parallelization, we investigate alternative SIMD parallelization paradigms capable of executing up to 512 parallel instances simultaneously.
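The kind of data-level parallelism involved can be illustrated with NumPy as a stand-in for AVX registers: one vectorized call applies the same 32-bit cyclic rotation (a building block of the Xoodoo round function, alongside XOR and AND) to many independent instances at once. This is a conceptual sketch, not the Xoodoo permutation itself, and the instance values are arbitrary:

```python
import numpy as np

def rotl32(x, n):
    # Cyclic left rotation of 32-bit words (0 < n < 32), applied to every lane
    # at once.  In an AVX2/AVX-512 implementation each lane would occupy one
    # slot of a vector register, and this would compile to two shifts and an OR.
    x = x.astype(np.uint32)
    return (x << np.uint32(n)) | (x >> np.uint32(32 - n))

# Eight independent instances contribute one 32-bit state word each; a single
# vectorized operation transforms all of them simultaneously.
lanes = np.array([0x80000000, 0x00000001, 0xDEADBEEF, 0x0BADF00D,
                  0x12345678, 0xFFFFFFFF, 0x00000000, 0xCAFEBABE],
                 dtype=np.uint32)
rotated = rotl32(lanes, 1)
```

Widening the array from 8 to 16 or 512 instances changes nothing in the code, which mirrors how the same kernel scales from AVX2 lanes to larger batched SIMD layouts.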
  • Item
    Zero Knowledge Proofs in Hybrid Environments
    (Indian Statistical Institute, Kolkata, 2025-07-11) Hajra, Rittwik
    The impending advent of quantum computing poses a significant threat to classical cryptographic primitives, necessitating a robust migration toward post-quantum cryptographic (PQC) systems. However, a complete transition remains impractical in the short term, giving rise to hybrid environments where classical and PQC schemes coexist. This thesis addresses a fundamental challenge in such settings: the need for efficient and secure zero-knowledge proofs (ZKPs) that establish plaintext consistency across cryptographic primitives defined over distinct algebraic domains. We present novel zero-knowledge protocols that bridge lattice-based schemes, specifically NTRU, with classical constructions like Pedersen vector commitments and ElGamal encryption. Our primary contributions include (1) a Σ-protocol for proving plaintext equality between an NTRU ciphertext and a Pedersen commitment, and (2) a ZKP of plaintext equality between NTRU and ElGamal ciphertexts. Both constructions ensure perfect honest-verifier zero-knowledge and computational soundness, while preserving efficiency and composability. A central innovation of our work lies in constructing a common linear language across domains, leveraging homomorphic properties and inner product arguments, allowing the prover to demonstrate equivalence of messages without revealing their content. Our protocols integrate rejection sampling techniques to preserve privacy in the lattice setting and achieve 2n-special soundness. We further extend our constructions to support batch proofs, enabling scalable and bandwidth-efficient verification of multiple plaintext equalities. These protocols are, to the best of our knowledge, the first concrete and fully specified ZKPs achieving plaintext equality across NTRU and widely used classical primitives. Our work lays foundational tools for secure interoperability in hybrid systems and facilitates verifiable migration paths toward post-quantum secure infrastructures.
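For flavour, a classical Schnorr-style Σ-protocol proving knowledge of a Pedersen commitment opening (one ingredient of the constructions above, without the NTRU side) can be sketched over a toy group. The tiny prime is for illustration only, and a real run would draw the challenge at random on the verifier's side:

```python
import secrets

# Prove knowledge of an opening (m, r) of a Pedersen commitment
# C = g^m * h^r (mod p), working in the order-q subgroup of squares
# modulo p = 2q + 1.  Here p = 23, q = 11; g and h both have order 11.
p, q = 23, 11
g, h = 4, 9

def commit(m, r):
    return pow(g, m, p) * pow(h, r, p) % p

def prove(m, r, e):
    # e plays the verifier's random challenge (fixed by the caller here
    # for determinism; honest-verifier ZK assumes it is uniform).
    a, b = secrets.randbelow(q), secrets.randbelow(q)
    t = pow(g, a, p) * pow(h, b, p) % p           # prover's first message
    return t, (a + e * m) % q, (b + e * r) % q    # responses

def verify(C, e, t, z1, z2):
    return pow(g, z1, p) * pow(h, z2, p) % p == t * pow(C, e, p) % p
```

The completeness check g^z1 · h^z2 = t · C^e holds by the homomorphic structure of the commitment, the same structure the thesis exploits to build a common linear language across domains.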
  • Item
    The Monodromy Leak for a Generalized Montgomery Ladder
    (Indian Statistical Institute, Kolkata, 2025-07-11) Raychaudhuri, Arani
    The Diffie-Hellman key exchange protocol using elliptic curves is the most widespread approach to the establishment of a secure internet connection. As an important subroutine, Alice and Bob need to perform multiplications of elliptic curve points by large scalars. The textbook method for scalar multiplication is the double-and-add algorithm. For the sake of efficiency, one usually performs x-coordinate-only arithmetic using projective coordinates, and doubling-and-adding is done using the Montgomery ladder. The advantage of using projective coordinates is that this avoids costly field inversions at each iteration. However, when Alice (say) uses the double-and-add algorithm for computing her public key Q = [a]P, it is a bad idea for her to publish the resulting projective coordinates of Q. Indeed, it was shown in 2003 by Naccache, Smart and Stern that these coordinates leak a few bits of the secret scalar a. Therefore, Alice must perform a final division deprojectivizing the coordinates of Q, and this division must be done in constant time so that side-channel analysis does not allow for a reconstruction of these projective coordinates. In 2019, Aldaya, Garcia and Brumley discovered that many real-life implementations violate this requirement. New work by Robert from 2024 shows that the leak is much more devastating than assumed by Naccache et al.: one can easily recover the entire secret. Thus, bad implementations of elliptic curve scalar multiplication using the Montgomery ladder are a recipe for disaster. The goal of this thesis is to study the new method by Robert, which he calls “the monodromy leak”. It stems from the deep fact that the set of all possible projective coordinates for points on an elliptic curve E (called “cubical points”) still comes equipped with a natural scalar-multiplication map, despite this set not being a group.
Robert shows that the cubical discrete logarithm problem reduces to a discrete logarithm problem in the underlying finite field, which is known to be easier (index calculus). He then also shows that the Montgomery ladder essentially implements cubical scalar multiplication: whence the devastating conclusion. Besides understanding how the attack works, the goal is also to study the relation between cubical arithmetic and other projective double-and-add algorithms (such as the standard double-and-add algorithm for Weierstrass curves, or Edwards curves). Our current conclusion is that the monodromy leak is specific to the Montgomery ladder, but not to Montgomery curves: we generalize the attack to Partially-Long Weierstrass curves (PLWC). For the standard double-and-add algorithm on Edwards curves (as used in EdDSA), we report on some first explorations. There are also other applications of cubical arithmetic, namely to the efficient computation of pairings, and to the efficient computation of isogenies. Isogeny-based cryptography is another booming branch of cryptography, which is supposed to remain hard even in the presence of quantum adversaries (unlike “classical” elliptic curve cryptography, which is based on the discrete logarithm problem and therefore broken by Shor’s algorithm). However, these applications are not touched upon in this thesis.
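The x-only Montgomery ladder at the heart of this discussion can be sketched for Curve25519 with the textbook doubling and differential-addition formulas. This is a pedagogical version (it branches on key bits instead of using constant-time swaps), and the final field division it performs is exactly the step whose projective inputs must never leak:

```python
# x-only Montgomery ladder on Curve25519 (p = 2^255 - 19, A = 486662).
p = 2**255 - 19
A24 = (486662 + 2) // 4   # (A + 2)/4, used in doubling

def xdbl(X, Z):
    # doubling in projective (X:Z) coordinates
    a, b = (X + Z) % p, (X - Z) % p
    aa, bb = a * a % p, b * b % p
    c = (aa - bb) % p
    return aa * bb % p, c * (bb + A24 * c) % p

def xadd(X1, Z1, X2, Z2, xd):
    # differential addition; xd is the affine x-coordinate of the difference
    u = (X1 - Z1) * (X2 + Z2) % p
    v = (X1 + Z1) * (X2 - Z2) % p
    return (u + v) * (u + v) % p, xd * (u - v) * (u - v) % p

def ladder(k, x):
    # Montgomery ladder: R0 = O, R1 = P, with R1 - R0 = P throughout.
    X0, Z0, X1, Z1 = 1, 0, x % p, 1
    for bit in bin(k)[2:]:
        if bit == '0':
            X1, Z1 = xadd(X0, Z0, X1, Z1, x)
            X0, Z0 = xdbl(X0, Z0)
        else:
            X0, Z0 = xadd(X0, Z0, X1, Z1, x)
            X1, Z1 = xdbl(X1, Z1)
    # the final deprojectivization; publishing (X0 : Z0) instead of this
    # quotient is what the Naccache-Smart-Stern/monodromy attacks exploit
    return X0 * pow(Z0, p - 2, p) % p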
  • Item
    Projective corepresentations and cohomology of compact quantum groups
    (Indian Statistical Institute, Kolkata, 2026-01-22) Maity, Kiran
    In this thesis, we briefly review various types of projective corepresentations of compact quantum groups and prove the existence of suitable envelopes for them. We also study the associated invariant (dual) 2-cohomology and calculate it in a few concrete examples.
  • Item
    Generalization under Sub-population Shift: Equitable Models for Imbalanced, Long-tailed, and Fair Representation Learning
    (Indian Statistical Institute, Kolkata, 2026-02-09) Ansari, Faizanuddin
    Machine learning systems often experience performance degradation in real-world scenarios due to subpopulation shift, defined as mismatches in the distribution of classes or attributes within datasets. This thesis investigates generalization failures arising from class imbalance, long-tailed distributions, and attribute-level biases (specifically, attribute-level biases that originate from demographic imbalances in sensitive domains, such as medical imaging). It proposes principled strategies to mitigate these effects in both classical and deep learning frameworks. Class imbalance and long-tailed distributions pose significant challenges, especially in real-world applications where minority classes are underrepresented yet critically important. To address these challenges, this work develops novel algorithms and frameworks that enhance model generalization on imbalanced and long-tailed datasets. The contributions encompass data-level, model-level, and loss-level innovations, each designed to mitigate bias and improve performance on minority classes while maintaining accuracy on majority classes. First, we propose a data-level solution for classical class imbalance in tabular data through a novel oversampling technique that estimates minority class statistics using neighborhood-based distributional calibration. Unlike existing methods that rely on synthetic interpolation without accounting for class-specific geometry, the proposed approach preserves the fidelity of minority class distributions, leading to significant gains in both binary and multi-label imbalanced settings. Next, we introduce STTP-Net, a two-pronged framework for long-tailed learning in vision tasks. It integrates hybrid augmentation and sampling strategies with a newly proposed Effective Balanced Softmax (EBS) loss to correct label distribution shifts, enabling robust feature learning and improved accuracy across head, medium, and tail classes.
Extensive evaluations on benchmark datasets such as CIFAR-LT, ImageNet-LT, and NIH-CXR-LT confirm its superiority over state-of-the-art methods. We address decision boundary distortion under class imbalance by introducing the Goldilocks principle to achieve "just-right" boundary fidelity. Our approach leverages this concept to design a training pipeline that produces smoother, more adaptive decision boundaries for tail classes. Specifically, we propose a Dual-Branch Sampler-Guided Mixup (DBSGM) strategy combined with an Adaptive Class-Aware Feature Regularization (ACFR) mechanism. These components jointly enhance intra-class compactness and inter-class separability, improving generalization, especially under extreme imbalance. By dynamically adjusting boundaries and applying adaptive regularization, our method achieves optimal fidelity for minority classes without compromising the performance of majority classes. Extensive experiments validate its effectiveness across a range of imbalance ratios. Furthermore, we extend these ideas to medical imaging, addressing both class imbalance and demographic fairness. This includes the Mixture of Two Experts (Mo2E) framework and fairness-aware lesion classification strategies that ensure equitable performance across subgroups. Mo2E combines asymmetric sampling with adaptive mixup to improve the detection of rare disease classes and is validated across tasks such as Gastrointestinal (GI) Tract Classification of Endoscopic Images and Diabetic Retinopathy (DR) grading. Additionally, we introduce a bias-aware training method to mitigate both class imbalance and skin tone bias, achieving fair performance across demographic subgroups, as demonstrated on the ASAN and ISIC-2018 datasets. These results lay the groundwork for demographically fair model design in high-stakes medical applications.
Collectively, these contributions advance the field of imbalanced learning by offering scalable, practical solutions grounded in theoretical insight and empirical validation. This thesis provides a comprehensive toolkit for researchers and practitioners confronting the challenges of subpopulation shift, integrating principled data synthesis, loss rebalancing, and fairness constraints. It pushes the frontiers of robust, fair, and generalizable deep learning, particularly in domains where class rarity and demographic underrepresentation have tangible real-world consequences.
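As a rough illustration of the data-level idea (distributional calibration rather than SMOTE-style line-segment interpolation), one might draw synthetic minority points from a Gaussian whose moments are borrowed from a local neighbourhood. Every detail below (the mean blend, neighbourhood size, ridge term) is a hypothetical sketch, not the thesis's algorithm:

```python
import numpy as np

def calibrated_oversample(X_min, X_maj, n_new, k=5, seed=0):
    """Sample synthetic minority points from locally calibrated Gaussians."""
    rng = np.random.default_rng(seed)
    X_all = np.vstack([X_min, X_maj])
    out = []
    for _ in range(n_new):
        c = X_min[rng.integers(len(X_min))]        # random minority seed point
        d = np.linalg.norm(X_all - c, axis=1)
        nbrs = X_all[np.argsort(d)[:k]]            # its k nearest neighbours
        mu = 0.5 * (c + nbrs.mean(axis=0))         # calibrated mean
        cov = np.cov(nbrs, rowvar=False) + 1e-6 * np.eye(X_min.shape[1])
        out.append(rng.multivariate_normal(mu, cov))
    return np.asarray(out)
```

Drawing from a fitted local distribution respects the minority class's geometry instead of restricting synthetic points to chords between existing samples.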
  • Item
    On Robust Estimation of Multivariate Location and Scale with Applications
    (Indian Statistical Institute, Kolkata, 2026-02-04) Chakraborty, Soumya
    The principal objective of this thesis is, in a nutshell, to provide robust estimators of multivariate location and scale which have reasonable to high model efficiency but avoid high computational complexity, so as to be practically useful in real problems. We utilize the minimum density power divergence (DPD) and the related philosophy to invoke robustness. Minimizing the DPD raises computational issues in different multivariate set-ups. We work on this problem rigorously and develop three types of estimation procedures which are explicitly or implicitly related to the minimum DPD methodology, keeping the computational issue in mind each time. In particular, we develop a robust clustering algorithm based on mixture normal models in the first work, where the component mean vectors and covariance matrices are estimated by minimizing the DPD with a suitable iteratively reweighted least squares (IRLS) algorithm. The second work proposes a sequential approach to minimize the DPD for location-scale estimation in the case of elliptically symmetric probability models. The third work studies the one-step minimization of the DPD with various highly robust initializations and iterative procedures. We derive the theoretical properties (asymptotic and robustness features) of these methods, empirically validate them with extensive simulation studies in various set-ups, and apply them in different problems in the domains of pattern recognition and machine learning.
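A minimal one-dimensional illustration of the estimation principle: for a normal model the DPD objective of Basu et al. has a closed-form integral term, so it can be handed to a generic optimizer. The multivariate and mixture settings of the thesis require the specialized IRLS, sequential, and one-step algorithms described above; the sample and tuning constant alpha here are invented for demonstration:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def dpd_objective(theta, x, alpha):
    # H_n(theta) = Int f^(1+alpha) - (1 + 1/alpha) * mean(f(x_i)^alpha);
    # for N(mu, sigma^2), Int f^(1+alpha) = (2*pi*sigma^2)^(-alpha/2)/sqrt(1+alpha).
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)            # log-parameterization keeps sigma > 0
    integral = (2 * np.pi * sigma**2) ** (-alpha / 2) / np.sqrt(1 + alpha)
    return integral - (1 + 1 / alpha) * np.mean(norm.pdf(x, mu, sigma) ** alpha)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 180), rng.normal(10, 1, 20)])  # 10% outliers
res = minimize(dpd_objective, x0=[np.median(x), 0.0], args=(x, 0.5))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
```

On this contaminated sample the MLE (the alpha -> 0 limit) would be dragged toward the outlying component, while the DPD fit with alpha = 0.5 stays near the uncontaminated location and scale, since the term f(x_i)^alpha is negligible at far-out points.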
  • Item
    Statistical Guarantees of Deep Generative Models Involving Diverse Spaces: Generation Consistency and Robustness
    (Indian Statistical Institute, Kolkata, 2026-02-04) Chakrabarty, Anish
    Generative modeling focuses on the task of producing new data samples that closely resemble those drawn from an original, unknown distribution. Despite being well-known in statistical estimation theory, the approach has gained substantial traction in recent years, driven by groundbreaking results in areas such as image synthesis, natural language generation, and network modeling. The complexity of modern-era data domains and the ensuing adaptations that suitable models must undergo have presented new challenges. These advances raise several fundamental questions, the first of which is: When do generative models accurately approximate the true data distribution? One may also ask: How well do these models perform under contaminated data? This work explores these questions through the lens of generative modeling frameworks that, by design, involve distinct data spaces. We focus on two major classes of such models that blend optimal transport and representation learning in their objectives: Wasserstein autoencoders (WAE) and Cycle-consistent cross-domain translators. WAE, on its way to regeneration, learns a latent code, which in turn aids the simulation of newer pseudo-random replicates. By providing statistical characterizations of the latent distribution and the transforms inducing a dimensionality reduction in the process, we present a detailed error analysis underlying WAEs. From a non-parametric density estimation perspective, we establish deterministic bounds on the latent and reconstruction errors that adapt to the intrinsic dimensions of input data. We also study the extent of distortion that WAE-generated samples suffer when learned using contaminated data. Key takeaways for practitioners from our analysis include specific architectural suggestions that foster near-perfect sampling. The framework developed thus far fittingly extends to unpaired cycle-consistent cross-domain models. 
We show that the sufficient conditions for successful data translation under Sobolev and Hölder-smooth distributions resemble those in the case of WAEs. Our analysis also suggests error upper bounds due to ill-posed transformations and validates the choice of divergences used in objectives. Finally, in search of a consolidated solution to the robustification problem, we present parallel formulations based on the Gromov-Wasserstein (GW) distance. Owing to the equivalence between Gromov-Monge samplers following GW and cross-domain translation models, including WAE and GWAE, this answers the second question. We study the robust recovery guarantees, concentration, and tractable computational properties of the newly introduced distance measures under diverse contamination scenarios. We substantiate all our findings based on real-world data in varying generative tasks.
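The latent regularizer underlying the WAE analysis can be made concrete with a small NumPy sketch: a biased V-statistic estimate of squared MMD with an RBF kernel, comparing samples of latent codes against a Gaussian prior. The kernel, bandwidth, and sample sizes are illustrative choices:

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    # biased V-statistic estimate of MMD^2 between samples x and y
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * d2)
    return gram(x, x).mean() + gram(y, y).mean() - 2 * gram(x, y).mean()

rng = np.random.default_rng(0)
prior = rng.normal(size=(200, 2))             # draws from the latent prior
matched = rng.normal(size=(200, 2))           # "encoder output" matching the prior
shifted = rng.normal(loc=3.0, size=(200, 2))  # a poorly matched latent law
```

During WAE training this penalty is added to the reconstruction loss, pushing the aggregate distribution of encoded codes toward the prior so that sampling from the prior yields realistic generations.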