On the Effect of Negative-Pair Variance in Contrastive Learning and a VRN-Based Solution
This paper addresses, via VRN (Variance Reduction of Negative Pairs), the problem that positive alignment breaks down when the variance of negative-pair similarities is high; since the original title did not highlight this core issue, the title was revised to emphasize it.
Contrastive learning aims to pull positive pairs together and push negative pairs apart in the embedding space. However, we find that high variance in negative-pair similarities—often caused by random mini-batching—can destabilize learning and prevent proper alignment of positive pairs. Through theoretical analysis, we show that as negative-pair similarity falls below a critical threshold, positive pairs can no longer align perfectly. To address this issue, we introduce VRN (Variance Reduction for Negative-pair similarity), a simple yet effective regularization technique that reduces variance in the negative-pair similarities. We prove its benefits analytically and validate them across standard contrastive learning benchmarks. Our results highlight the often-overlooked role of similarity variance and demonstrate how controlling it leads to more stable and better-performing contrastive models.
Although the core of this paper lies in solving the problem through VRN, the original abstract emphasized the limitations of contrastive learning and mini-batch training while saying little about VRN. The problem statement was therefore condensed and the abstract rewritten to center on VRN as the solution.
Contrastive learning has emerged as a powerful framework for learning data representations without supervision, achieving impressive performance across various domains such as vision and speech. At its core, contrastive learning encourages positive pairs (semantically similar samples) to be embedded closely, while pushing negative pairs (semantically dissimilar samples) apart.
Most prior work focuses on analyzing the expected similarities between embeddings, assuming that these averages capture the essential training dynamics. However, in practice, contrastive learning is conducted with small mini-batches, where the variance of similarity—especially for negative pairs—can be substantial. We observe that this variance is not merely noise, but a critical factor that can significantly impact both training stability and performance.
In particular, we find that when the variance of negative-pair similarities becomes too large, the alignment of positive pairs can break down. In other words, overly dispersed negative examples can interfere with the model’s ability to properly align even the correct positive pairs.
In this work, we provide a theoretical analysis of this phenomenon and propose a simple yet effective regularization method called VRN (Variance Reduction for Negative-pair similarity). By reducing the variance of negative-pair similarities, VRN improves the stability and effectiveness of contrastive learning. We validate our approach through both theoretical insights and empirical results across standard benchmarks.
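To make the idea concrete before the formal treatment, the sketch below (PyTorch) shows one way a variance penalty on negative-pair similarities could be attached to a base contrastive loss. The helper name `vrn_penalty` and the weight `lambda_vrn` are illustrative placeholders; the precise form of VRN used in this paper is specified later.

```python
import torch
import torch.nn.functional as F

def vrn_penalty(u, v):
    """Variance of cross-view negative-pair similarities in a mini-batch.

    u, v: (n, d) embeddings of the two views; rows with the same index form
    positive pairs, all other cross-view pairs are treated as negatives.
    Illustrative sketch only, not necessarily the exact VRN objective.
    """
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    sim = u @ v.t()                                    # (n, n) cosine similarities
    neg_mask = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return sim[neg_mask].var(unbiased=False)           # dispersion of negative similarities

# hypothetical usage with some base contrastive loss and weight lambda_vrn:
# loss = base_contrastive_loss(u, v) + lambda_vrn * vrn_penalty(u, v)
```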
The original introduction opened with the mathematical formulation of contrastive learning (CL) and mini-batch training, which hurt readability. The equations were deferred to later sections, and the introduction was restructured to first present the key ideas of prior work, then explain the limitations and problems that follow from them, and finally summarize this paper's solution and contributions.
Contrastive Loss. The InfoNCE loss [Oord et al., 2018] is a widely used contrastive objective, but it suffers from critical limitations: it creates inter-sample dependence through softmax normalization and jointly optimizes positive and negative pairs, often causing gradient conflicts. These issues can lead to unstable training and suboptimal representations.
SimCLR [Chen et al., 2020] extends InfoNCE by treating augmented views as positives and others in the batch as negatives, but retains the same coupled normalization. To address this, DCL [Yeh et al., 2022] decouples the optimization of positive and negative pairs, and DHEL [Koromilas et al., 2024] further improves learning by focusing on harder negatives from similar sample types. Meanwhile, Zhai et al. [2023] replace the softmax in InfoNCE with a sigmoid to eliminate inter-sample dependence and allow independent optimization per sample.
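To make the coupling explicit, here is a minimal PyTorch sketch of a cross-view InfoNCE loss: the softmax normalizer of each positive pair contains every negative similarity in the batch, so the per-sample terms cannot be optimized independently. SimCLR additionally uses within-view negatives, which are omitted here for brevity.

```python
import torch
import torch.nn.functional as F

def info_nce_cross_view(u, v, temperature=0.1):
    """u[i], v[i] are two views of sample i; every v[j], j != i, is a negative for u[i]."""
    u = F.normalize(u, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = u @ v.t() / temperature                   # (n, n) similarity matrix
    targets = torch.arange(u.size(0), device=u.device)
    # cross_entropy applies a row-wise softmax: each positive logit is normalized
    # against all negatives in the batch, creating the inter-sample dependence.
    return F.cross_entropy(logits, targets)
```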
The original paragraph simply listed InfoNCE, SimCLR, DCL, and DHEL, making it hard to see how the studies relate to one another. It was revised to first state the limitations of InfoNCE clearly and then explain, in a connected way, how SimCLR, DCL, and DHEL address those problems.
Several studies have explored how embedding pairs should be structured to minimize the contrastive loss. Lu & Steinerberger (2022) showed that minimizing softmax-based contrastive losses, such as InfoNCE, leads to embeddings that form a simplex Equiangular Tight Frame (ETF) [Papyan et al., 2020; Sustik et al., 2007; Fickus & Mixon, 2015]. A simplex ETF is a set of unit-length vectors that are equally separated in angle and maximally spread on the unit hypersphere, ensuring both uniformity and discriminative capacity—a desirable property for contrastive representation learning.
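For a concrete picture of this structure, a simplex ETF of $K$ unit vectors can be obtained by centering the standard basis of $\mathbb{R}^K$ and normalizing; every pair of distinct vectors then has cosine similarity exactly $-1/(K-1)$, the most negative value achievable when all pairwise angles are equal. The short NumPy check below verifies this.

```python
import numpy as np

K = 5                                          # number of ETF vectors
E = np.eye(K)                                  # standard basis of R^K
M = E - E.mean(axis=0, keepdims=True)          # remove the common mean direction
M /= np.linalg.norm(M, axis=1, keepdims=True)  # unit-normalize each vector

G = M @ M.T                                    # Gram matrix of the candidate ETF
off_diag = G[~np.eye(K, dtype=bool)]
print(np.allclose(off_diag, -1.0 / (K - 1)))   # True: all pairwise cosines are equal
```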
Building on this, Lee et al. (2024) demonstrated that even contrastive losses using sigmoids [Zhai et al., 2023] can produce simplex ETF structures. Moreover, Sreenivasan et al. (2023) showed that such configurations can still arise in mini-batch training when all possible batch combinations are considered.
However, these results describe ideal conditions. In practice, especially in mini-batch settings with limited negatives, high variance in negative-pair similarities can distort the embedding space, preventing such structured configurations from emerging. In this work, we study this phenomenon theoretically and propose a regularization method to mitigate it.
Revised to add an explanation of the simplex Equiangular Tight Frame (ETF), which was previously missing.
CL achieves outstanding performance, particularly when trained with large batch sizes (Chen et al., 2020; Radford et al., 2021; Pham et al., 2023; Tian et al., 2020b; Jia et al., 2021). However, large batch sizes demand substantial memory, which poses practical challenges and often forces the use of smaller batches. This compromise in batch size tends to degrade performance, prompting several theoretical studies to investigate the underlying causes of this degradation (Sreenivasan et al., 2023; Koromilas et al., 2024). For instance, Yuan et al. (2022) demonstrated that the optimization error in SimCLR (Chen et al., 2020) is bounded above by a function inversely proportional to the batch size, implying that smaller batch sizes result in larger optimization errors. Furthermore, Chen et al. (2022a) found that contrastive loss functions exhibit a discrepancy between the true gradients and those computed during mini-batch training, and that this discrepancy grows as the batch size decreases.
In addition to gradient distortion, we show that negative-pair similarity variance also increases as the batch size decreases, contributing to unstable training and poor positive alignment.
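As a rough numerical illustration (random unit embeddings rather than a trained encoder, so only the scaling trend is meaningful and this is not the paper's formal argument), the sketch below estimates how the variance of the per-batch mean negative-pair similarity grows as the batch size shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128                                        # embedding dimension

def batch_mean_neg_sim(n):
    """Mean cross-view negative similarity in one random mini-batch of size n."""
    u = rng.standard_normal((n, d)); u /= np.linalg.norm(u, axis=1, keepdims=True)
    v = rng.standard_normal((n, d)); v /= np.linalg.norm(v, axis=1, keepdims=True)
    sim = u @ v.T
    return sim[~np.eye(n, dtype=bool)].mean()

for n in (8, 32, 128, 512):
    means = np.array([batch_mean_neg_sim(n) for _ in range(200)])
    print(n, means.var())                      # variance of the batch statistic shrinks as n grows
```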
Because this paper additionally analyzes the phenomenon of increasing negative-pair similarity variance, a sentence on this was added at the end.
In contrastive learning, training data consists of paired inputs $(x_i, y_i)$ for $i = 1, \dots, n$.
Without loss of generality, we focus our analysis on the unimodal case, noting that the same results apply to the multimodal setting.
An encoder $f$ maps each input to an embedding vector, and each pair is scored by the similarity of its two embeddings.
If the pair originates from the same instance, it is considered a positive pair, and the embeddings are encouraged to be similar. If the pair comes from different instances, it is treated as a negative pair, and the embeddings are pushed apart.
Throughout this work, we assume that all embeddings are $\ell_2$-normalized, i.e., they lie on the unit hypersphere.
To improve readability, the explanation was reorganized in the order problem setup → unimodal/multimodal distinction → encoder structure → positive/negative pair definition → normalization, and redundant wording was trimmed for conciseness.
To formalize our analysis, we define the joint distributions of positive and negative pairs. Let $p_{\text{pos}}$ and $p_{\text{neg}}$ denote the joint distributions of positive and negative pairs, respectively, and require that both share the same marginals over $x$ and $y$.
These constraints ensure consistency between the positive and negative pair distributions over the data space.
Let $\hat{p}_{\text{pos}}^{(i)}$ denote the empirical distribution associated with the $i$-th positive pair; the overall empirical positive-pair distribution is their average:
$$ \hat{p}_{\text{pos}} = \frac{1}{n} \sum_{i=1}^{n} \hat{p}_{\text{pos}}^{(i)}. $$
Similarly, let $\hat{p}_{\text{neg}}$ denote the empirical distribution of negative pairs, formed from the cross pairs $(x_i, y_j)$ with $i \neq j$, and let $\hat{p}_x$ and $\hat{p}_y$ denote the empirical marginal distributions of the two views.
Following the notation of Koromilas et al. (2024), we denote the pushforward distributions of these empirical distributions under the encoder $f$ as
$$ f_{\#} \hat{p}_x, \quad f_{\#} \hat{p}_y, \quad f_{\#} \hat{p}_{\text{pos}}, \quad f_{\#} \hat{p}_{\text{neg}}. $$
For example, $f_{\#} \hat{p}_{\text{neg}}$ represents the empirical distribution of negative embedding pairs, i.e., the distribution of $(f(x), f(y))$ where $(x, y) \sim \hat{p}_{\text{neg}}$.
These empirical and pushforward distributions form the basis of our theoretical analysis of similarity variance in contrastive learning.
Because the original sentences presented mathematical symbols before the concepts and were hard to follow, the structure was changed to introduce each concept first and its notation afterwards, and the explanations of the formulas were made clearer and more concise.
Let $\mathcal{L}$ be a contrastive loss that takes a batch of embedding pairs $(U, V)$ as input. Let $\hat{p}_{\text{pos}}^{[n]}$ denote the empirical distribution of a mini-batch of $n$ positive pairs. We aim to learn the optimal encoder $f^*$ that minimizes the expected loss over such batches:
$$ f^* := \arg\min_f \; \mathbb{E}_{(U, V) \sim f_{\#} \hat{p}_{\text{pos}}^{[n]}}\left[\mathcal{L}(U, V)\right] $$
The equations were long and ran directly into their explanations, which hurt readability, so the definitions and the accompanying commentary were separated.
**Definition 3.1 (InfoNCE family).** For a subset $I \subseteq [n]$ of batch indices, the symmetric loss is defined as
$$ \mathcal{L}_{\text{info-sym}}(U_I, V_I) := \frac{1}{2} \mathcal{L}_{\text{info}}(U_I, V_I) + \frac{1}{2} \mathcal{L}_{\text{info}}(V_I, U_I) $$
The asymmetric part is:
$$ \mathcal{L}_{\text{info}}(U_I, V_I) := \frac{1}{|I|} \sum_{i \in I} \psi \left( c_1 \sum_{j \in I \setminus \{i\}} \phi\big((v_j - v_i)^\top u_i\big) + c_2 \sum_{j \in I \setminus \{i\}} \phi\big((u_j - v_i)^\top u_i\big) \right) $$
where:
- $(c_1, c_2) \in \{(0,1), (1,0), (1,1)\}$: controls which negative-pair types are included
- $\phi$: convex and increasing
- $\psi$: convex and increasing
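The following PyTorch sketch implements the asymmetric term of Definition 3.1 for generic $\phi$ and $\psi$, assuming L2-normalized embeddings; it is illustrative rather than the paper's reference implementation. For example, choosing $\phi(t) = \exp(t/\tau)$, $\psi(t) = \log(1 + t)$, and $(c_1, c_2) = (1, 0)$ reduces each summand to the familiar negative log-softmax InfoNCE term with temperature $\tau$.

```python
import torch

def info_family(U, V, phi, psi, c1=1.0, c2=0.0):
    """Asymmetric term of the Definition 3.1 family (illustrative sketch).

    U, V: (n, d) L2-normalized embeddings; rows with the same index are positives.
    phi, psi: elementwise callables; c1, c2 select the negative-pair types.
    """
    n = U.size(0)
    diag = torch.eye(n, dtype=torch.bool, device=U.device)
    pos = (U * V).sum(dim=-1, keepdim=True)                  # u_i^T v_i, shape (n, 1)
    cross = phi(U @ V.t() - pos).masked_fill(diag, 0.0)      # phi((v_j - v_i)^T u_i), j != i
    within = phi(U @ U.t() - pos).masked_fill(diag, 0.0)     # phi((u_j - v_i)^T u_i), j != i
    return psi(c1 * cross.sum(dim=-1) + c2 * within.sum(dim=-1)).mean()

def info_family_sym(U, V, phi, psi, c1=1.0, c2=0.0):
    """Symmetrized loss of Definition 3.1."""
    return 0.5 * (info_family(U, V, phi, psi, c1, c2)
                  + info_family(V, U, phi, psi, c1, c2))

# Example instantiation recovering a cross-view InfoNCE with temperature tau:
tau = 0.1
loss_fn = lambda U, V: info_family_sym(
    U, V, phi=lambda t: torch.exp(t / tau), psi=torch.log1p, c1=1.0, c2=0.0)
```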
**Definition 3.2 (Ind-Add family).** For the same index set, taken here to be the full batch of $n$ samples, the loss is defined as
$$ \mathcal{L}_{\text{ind-add}}(U_I, V_I) := - \frac{1}{n} \sum_{i=1}^{n} \phi(u_i^\top v_i) + \frac{c_1}{n(n-1)} \sum_{i \ne j} \psi(u_i^\top v_j) + \frac{c_2}{2n(n-1)} \sum_{i \ne j} \left[\psi(u_i^\top u_j) + \psi(v_i^\top v_j)\right] $$
where:
- $\phi$: concave, increasing, differentiable
- $\psi$: convex, increasing, differentiable
- $c_1 = 1$: includes cross-view negatives $(u_i, v_j)$
- $c_2 = 1$: includes within-view negatives $(u_i, u_j)$ and $(v_i, v_j)$
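The sketch below implements Definition 3.2 for generic $\phi$ and $\psi$ (PyTorch, L2-normalized embeddings). The sigmoid-style instantiation at the end is only indicative of how a SigLIP-like loss fits this template; the exact SigLIP parameterization and normalization constants differ, and the temperature `t` and bias `b` values are hypothetical.

```python
import torch
import torch.nn.functional as F

def ind_add_loss(U, V, phi, psi, c1=1.0, c2=0.0):
    """Definition 3.2 family (illustrative sketch).

    U, V: (n, d) L2-normalized embeddings; phi rewards positives, psi penalizes negatives.
    """
    n = U.size(0)
    off = ~torch.eye(n, dtype=torch.bool, device=U.device)
    pos_term = -phi((U * V).sum(dim=-1)).mean()
    cross_neg = c1 * psi((U @ V.t())[off]).sum() / (n * (n - 1))
    within_neg = c2 * (psi((U @ U.t())[off]).sum()
                       + psi((V @ V.t())[off]).sum()) / (2 * n * (n - 1))
    return pos_term + cross_neg + within_neg

# A sigmoid-style instantiation in the spirit of SigLIP (hypothetical t and b):
t, b = 10.0, -10.0
phi = lambda s: -F.softplus(-(t * s + b))     # log sigmoid(t*s + b) for positive pairs
psi = lambda s: F.softplus(t * s + b)         # -log sigmoid(-(t*s + b)) for negative pairs
```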
A brief bullet list explaining each component was added below each definition.
| Category    | Definition 3.1 (InfoNCE) | Definition 3.2 (Ind-Add) |
|-------------|--------------------------|--------------------------|
| Computation | Joint, normalized        | Independent, additive    |
| Scalability | Small/medium batch       | Large batch, efficient   |
| Examples    | InfoNCE, SimCLR, DCL     | SigLIP, Spectral CL      |
Because the distinction between the two cases was described only in prose, which made it less intuitive, a comparison table was added.
- Def. 3.1 includes: InfoNCE, SimCLR, DCL, DHEL (see Appendix A.1)
- Def. 3.2 includes: SigLIP, Spectral CL (see Appendix A.2)
The key difference lies in computational cost: Def. 3.1 requires computing all pairwise similarities jointly due to normalization, which becomes inefficient at large batch sizes. Def. 3.2 allows separate computation for each term and scales better for large datasets.
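To illustrate the scalability difference, the sketch below accumulates the additive cross-view negative term of Definition 3.2 over column chunks of the similarity matrix; this streaming form is possible because each $\psi(u_i^\top v_j)$ contributes independently, whereas the row-wise normalization in Definition 3.1 admits no such simple decomposition. The function and chunk size are illustrative, not part of the paper's method.

```python
import torch

def chunked_cross_negative_term(U, V, psi, chunk=1024):
    """Accumulate (1 / (n(n-1))) * sum_{i != j} psi(u_i^T v_j) over column chunks,
    without materializing the full n x n similarity matrix at once."""
    n = U.size(0)
    rows = torch.arange(n, device=U.device).unsqueeze(1)        # (n, 1) row indices
    total = U.new_zeros(())
    for start in range(0, n, chunk):
        cols = torch.arange(start, min(start + chunk, n), device=U.device)
        block = psi(U @ V[start:start + chunk].t())             # (n, len(cols)) psi-values
        total = total + (block * (rows != cols)).sum()          # drop positive pairs i == j
    return total / (n * (n - 1))
```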
The description was revised so that the comparison between Def. 3.1 and Def. 3.2 is made more explicit.