Decoupled Contrastive Learning

1IIS, Academia Sinica 2UC Berkeley 3National Taiwan University 4Meta AI Research 5New York University
ECCV 2022
NPC multiplier visualization
TL;DR
  • The problem: InfoNCE has a hidden coupling between positive and negative samples (the NPC effect) that reduces learning efficiency, especially at small batch sizes.
  • Our solution: DCL removes the positive term from the denominator, decoupling gradients and eliminating batch-size sensitivity.
  • The payoff: +6.4% over the SimCLR baseline on ImageNet-1K with batch size 256, and a new contrastive-learning SOTA of 72.3% when combined with NNCLR.

Abstract

Contrastive learning (CL) is one of the most successful paradigms for self-supervised learning (SSL). In a principled way, it treats two augmented views of the same image as a positive pair to be pulled closer, and all other images as negatives to be pushed apart. However, behind the impressive success of CL-based techniques, their formulation often relies on heavy-computation settings, including large batch sizes and long training schedules. We are thus motivated to tackle these issues and establish a simple, efficient, yet competitive baseline for contrastive learning. Specifically, we identify, through theoretical and empirical study, a noticeable negative-positive-coupling (NPC) effect in the widely used InfoNCE loss, which makes learning efficiency dependent on the batch size. By removing the NPC effect, we propose the decoupled contrastive learning (DCL) loss, which drops the positive term from the denominator and significantly improves learning efficiency. DCL achieves competitive performance, requiring neither the large batches of SimCLR, the momentum encoding of MoCo, nor long training schedules. We demonstrate this on various benchmarks, where DCL also proves far less sensitive to suboptimal hyperparameters. Notably, SimCLR with DCL achieves 68.2% ImageNet-1K top-1 accuracy using batch size 256 within 200 epochs of pre-training, outperforming its SimCLR baseline by 6.4%. Further, DCL can be combined with the SOTA contrastive learning method NNCLR to reach 72.3% ImageNet-1K top-1 accuracy with batch size 512 in 400 epochs, a new SOTA for contrastive learning. We believe DCL provides a valuable baseline for future contrastive SSL studies.

Key Findings


The NPC Effect Hurts Small Batches

The InfoNCE loss has a hidden coupling multiplier that suppresses gradients when positives are close or negatives are far, making training inefficient at small batch sizes.

Decoupling Beats Complexity

Simply removing one term from the InfoNCE denominator outperforms complex solutions like momentum encoders and memory queues, achieving +6.4% on ImageNet-1K.

Plug-and-Play Improvement

DCL is a drop-in replacement for InfoNCE that works with SimCLR, MoCo, and NNCLR without any architectural changes, achieving 72.3% SOTA.

Key Insight: The NPC Problem

The standard InfoNCE loss used in contrastive learning (e.g., SimCLR) is:

$$\mathcal{L}_i^{(k)} = -\log \frac{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau)}{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau) + U_{i,k}} \tag{1}$$

where $U_{i,k}$ denotes the summation of negative terms for the view $k$ of sample $i$:

$$U_{i,k} = \sum_{l \in \{1,2\},\; j \in [\![1,N]\!],\; j \neq i} \exp(\langle \mathbf{z}_i^{(k)}, \mathbf{z}_j^{(l)} \rangle / \tau) \tag{2}$$

and $\tau$ is the temperature parameter.
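For concreteness, Eq. (1)-(2) can be sketched in NumPy. This is our own minimal illustration, not the authors' released code; the function and variable names are ours, and the embeddings are assumed L2-normalized as in SimCLR:

```python
import numpy as np

def info_nce_loss(z1, z2, tau=0.5):
    """InfoNCE loss of Eq. (1), averaged over both views and all samples.

    z1, z2: (N, d) arrays holding the two augmented views of N samples,
    each row L2-normalized.
    """
    N = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)           # (2N, d): all views stacked
    sim = z @ z.T / tau                            # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                 # exclude <z, z> self terms
    # The positive of row i is its other view: i <-> i + N.
    pos_idx = np.concatenate([np.arange(N) + N, np.arange(N)])
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * N), pos_idx].mean()
```

Each row's denominator contains the positive plus the $2N-2$ negatives of Eq. (2), matching the coupled form that the next proposition analyzes.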

Proposition 1. There exists a Negative-Positive Coupling (NPC) multiplier $q_{B,i}^{(1)}$ in the gradient of $\mathcal{L}_i^{(1)}$:

$$\left\{\begin{array}{l} -\nabla_{\mathbf{z}_{i}^{(1)}}\mathcal{L}_{i}^{(1)} = \frac{q_{B,i}^{(1)}}{\tau} \left( \mathbf{z}_i^{(2)} - \sum_{l \in \{1,2\},\; j \neq i}{\frac{\exp(\langle \mathbf{z}_i^{(1)},\mathbf{z}_j^{(l)} \rangle/\tau)}{U_{i,1}}}\cdot \mathbf{z}_j^{(l)}\right) \\[8pt] -\nabla_{\mathbf{z}_{i}^{(2)}}\mathcal{L}_{i}^{(1)} = \frac{q_{B,i}^{(1)}}{\tau}\cdot \mathbf{z}_i^{(1)}\\[8pt] -\nabla_{\mathbf{z}_{j}^{(l)}}\mathcal{L}_{i}^{(1)} = - \frac{q_{B,i}^{(1)}}{\tau}\cdot\frac{\exp(\langle \mathbf{z}_i^{(1)},\mathbf{z}_j^{(l)} \rangle/\tau)}{U_{i,1}}\cdot \mathbf{z}_i^{(1)} \end{array} \right. \tag{3}$$

where the NPC multiplier $q_{B,i}^{(1)}$ is:

$$q_{B,i}^{(1)} = 1 - \frac{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau)}{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau) + U_{i,1}} \tag{4}$$

Due to symmetry, a similar NPC multiplier $q_{B,i}^{(k)}$ exists in the gradient of $\mathcal{L}_i^{(k)}$, $k \in \{1,2\}$, $i \in [\![1,N]\!]$.

All partial gradients in Eq. (3) are modulated by the common NPC multiplier. This coupling is problematic because:

  • When a positive sample is close (easy positive), the gradient from informative negatives gets suppressed.
  • When negative samples are far (easy negatives), the gradient from the informative positive is reduced.
  • With smaller batch sizes, the classification task becomes simpler, causing $q_B$ to cluster near 0 and drastically reducing learning efficiency.
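The shrinkage of $q_B$ in Eq. (4) is easy to check numerically. The sketch below is our own illustration with made-up similarity values, not an experiment from the paper:

```python
import numpy as np

def npc_multiplier(pos_sim, neg_sims, tau=0.5):
    """NPC multiplier q_B of Eq. (4), from raw cosine similarities."""
    pos = np.exp(pos_sim / tau)
    U = np.exp(np.asarray(neg_sims) / tau).sum()   # U_{i,k} of Eq. (2)
    return 1.0 - pos / (pos + U)

# An easy positive (similarity 0.9) against mildly similar negatives (0.1).
# A batch of N samples contributes 2N - 2 negative terms.
q_small_batch = npc_multiplier(0.9, [0.1] * 6)     # N = 4  -> 6 negatives
q_large_batch = npc_multiplier(0.9, [0.1] * 510)   # N = 256 -> 510 negatives
assert q_small_batch < q_large_batch  # fewer negatives shrink q_B toward 0
```

With fewer negatives, the positive term dominates the denominator, so $q_B$ collapses toward 0 and every gradient in Eq. (3) is scaled down accordingly.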
NPC coupling visualization

Figure 2. (a) SimCLR framework. (b) The gradient is modulated by the NPC multiplier $q_B$. (c) Two failure cases: easy positives suppress negative gradients (top), and easy negatives suppress positive gradients (bottom).

Method: Decoupled Contrastive Loss

Proposition 2 (the DCL Loss). Removing the positive pair from the denominator of Eq. (1) leads to a decoupled contrastive learning loss. If we remove the NPC multiplier $q_{B,i}^{(k)}$ from Eq. (3), we reach $\mathcal{L}_{DC} = \sum_{k \in \{1,2\},\; i \in [\![1,N]\!]} \mathcal{L}_{DC,i}^{(k)}$:

$$\mathcal{L}_{DC,i}^{(k)} = -\log \frac{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau)}{\cancel{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau)} + U_{i,k}} = -\frac{\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle}{\tau} + \log U_{i,k} \tag{6}$$

We further generalize DCL by weighting the positive pairs with a function $w$, yielding the weighted variant $\mathcal{L}_{DCW}$ (DCLW):

$$\mathcal{L}_{DCW,i}^{(k)} = -w(\mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)}) \cdot \frac{\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle}{\tau} + \log U_{i,k} \tag{7}$$

where $w(\mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)}) = 2 - \frac{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \sigma)}{\mathbb{E}_i[\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \sigma)]}$ gives larger weight to hard positives (pairs that are far apart), with $\mathbb{E}[w] = 1$.

Key properties of DCL:

  • Plug-and-play: Replace InfoNCE loss in any contrastive method (SimCLR, MoCo, NNCLR)
  • No additional components: No momentum encoder, memory queue, or stop-gradient needed
  • Batch-size invariant: Performance is stable across batch sizes from 32 to 4096
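The decoupled losses of Eq. (6) and Eq. (7) can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming, not the authors' released implementation; the weighting temperature `sigma` is an assumption taken from the $\sigma$ in the definition of $w$:

```python
import numpy as np

def dcl_loss(z1, z2, tau=0.1, weighted=False, sigma=0.5):
    """Decoupled contrastive loss, Eq. (6); Eq. (7) when weighted=True.

    z1, z2: (N, d) L2-normalized embeddings of the two augmented views.
    """
    N = z1.shape[0]
    pos = (z1 * z2).sum(axis=1)                        # <z_i^{(1)}, z_i^{(2)}>
    z_all = np.concatenate([z1, z2], axis=0)           # (2N, d)
    losses = []
    for k, za in enumerate([z1, z2]):
        sim = za @ z_all.T / tau                       # (N, 2N)
        mask = np.ones_like(sim, dtype=bool)
        mask[np.arange(N), np.arange(N) + k * N] = False        # self term
        mask[np.arange(N), np.arange(N) + (1 - k) * N] = False  # positive term
        U = np.where(mask, np.exp(sim), 0.0).sum(axis=1)        # Eq. (2)
        w = 1.0
        if weighted:  # harder (more distant) positives get larger weight
            e = np.exp(pos / sigma)
            w = 2.0 - e / e.mean()                     # empirical E[w] = 1
        losses.append((-w * pos / tau + np.log(U)).mean())
    return sum(losses) / 2
```

Note that, unlike Eq. (1), the positive term never enters the denominator, so the gradient carries no $q_B$ multiplier; this is the entire change needed to plug DCL into an existing SimCLR-style pipeline.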

Analysis

Batch size comparison

Figure 3. ImageNet-1K top-1 accuracy across different batch sizes. DCL maintains stable performance while baselines degrade significantly at small batch sizes.

CIFAR10 convergence

Figure 4a. Convergence on CIFAR-10.

STL10 convergence

Figure 4b. Convergence on STL-10.

t-SNE visualization

Figure 4c. t-SNE visualization showing stronger cluster separation with DCL.

Quantitative Results

SimCLR with DCL/DCLW at batch size 256, 200 epochs pre-training. DCL consistently improves over the SimCLR baseline across all benchmarks, with DCLW achieving the best results.

| Method | ImageNet-1K | ImageNet-100 | CIFAR-10 | CIFAR-100 | STL-10 |
|---|---|---|---|---|---|
| SimCLR (Baseline) | 61.8 | 80.7 | 81.4 | 52.0 | 80.7 |
| + DCL | 65.9 | 83.1 | 84.2 | 54.9 | 81.2 |
| + DCLW | 68.2 | 84.2 | 85.7 | 57.1 | 81.3 |
| Δ Improvement | ↑6.4 | ↑3.5 | ↑4.3 | ↑5.1 | ↑0.6 |

Batch size sensitivity (ImageNet-1K Linear Top-1, 200 epochs). DCL maintains stable performance across batch sizes while the SimCLR baseline degrades significantly at smaller batches.

| Method | BS 32 | BS 64 | BS 128 | BS 256 | BS 512 |
|---|---|---|---|---|---|
| SimCLR | 56.8 | 58.9 | 60.6 | 61.8 | 64.0 |
| SimCLR + DCL | 61.5 | 63.4 | 64.3 | 65.9 | 65.8 |
| Δ Improvement | ↑4.7 | ↑4.5 | ↑3.7 | ↑4.1 | ↑1.8 |

Comparison with state-of-the-art SSL methods (ImageNet-1K Linear Top-1, ResNet-50). DCL combined with NNCLR achieves 72.3% with significantly smaller batch size and fewer epochs than competing methods.

| Method | Batch Size | Epochs | Top-1 (%) |
|---|---|---|---|
| MoCo-v2 | 256 | 200 | 67.5 |
| SiMo | 256 | 200 | 68.0 |
| SwAV | 4096 | 200 | 69.1 |
| SimSiam | 256 | 200 | 70.0 |
| InfoMin | 256 | 200 | 70.1 |
| BYOL | 4096 | 200 | 70.6 |
| SimCLR + DCL | 256 | 200 | 67.8 |
| SimCLR + DCLW | 256 | 200 | 68.2 |
| SimCLR | 4096 | 1000 | 69.3 |
| MoCo-v2 | 256 | 400 | 71.0 |
| Barlow Twins | 256 | 300 | 70.7 |
| SimSiam | 256 | 400 | 70.8 |
| SwAV | 4096 | 400 | 70.7 |
| BYOL | 4096 | 400 | 73.2 |
| NNCLR | 512 | 1000 | 71.7 |
| SimCLR + DCL | 256 | 400 | 69.5 |
| NNCLR + DCL | 256 | 400 | 71.1 |
| NNCLR + DCL | 512 | 400 | 72.3 |

BibTeX

@inproceedings{yeh2022decoupled,
  title={Decoupled Contrastive Learning},
  author={Yeh, Chun-Hsiao and Hong, Cheng-Yao and Hsu, Yen-Chi and Liu, Tyng-Luh and Chen, Yubei and LeCun, Yann},
  booktitle={European Conference on Computer Vision (ECCV)},
  pages={668--684},
  year={2022}
}