Decoupled Contrastive Learning

1IIS, Academia Sinica 2UC Berkeley 3National Taiwan University 4Meta AI Research 5New York University
ECCV 2022
NPC multiplier visualization
TL;DR
  • The problem: InfoNCE has a hidden coupling between positive and negative samples (the NPC effect) that reduces learning efficiency, especially at small batch sizes.
  • Our solution: DCL removes the positive term from the denominator, decoupling gradients and eliminating batch-size sensitivity.
  • The payoff: +6.4% over the SimCLR baseline on ImageNet-1K with batch size 256, and a new contrastive-learning SOTA of 72.3% when combined with NNCLR.

Abstract

Contrastive learning (CL) is one of the most successful paradigms for self-supervised learning (SSL). In a principled way, it treats two augmented views of the same image as a positive pair to be pulled closer, and all other images as negatives to be pushed apart. However, behind the impressive success of CL-based techniques, their formulation often relies on heavy-computation settings, including large batch sizes and long training schedules. We are thus motivated to tackle these issues and establish a simple, efficient, yet competitive baseline for contrastive learning. Specifically, we identify, through theoretical and empirical study, a noticeable negative-positive-coupling (NPC) effect in the widely used InfoNCE loss, which makes learning efficiency dependent on the batch size. By removing the NPC effect, we propose the decoupled contrastive learning (DCL) loss, which drops the positive term from the denominator and significantly improves learning efficiency. DCL achieves competitive performance, requiring neither the large batches of SimCLR, the momentum encoding of MoCo, nor long training schedules. We demonstrate this on various benchmarks, where DCL also proves far less sensitive to suboptimal hyperparameters. Notably, SimCLR with DCL achieves 68.2% ImageNet-1K top-1 accuracy using batch size 256 within 200 epochs of pre-training, outperforming its SimCLR baseline by 6.4%. Further, DCL can be combined with the SOTA contrastive learning method NNCLR to reach 72.3% ImageNet-1K top-1 accuracy with batch size 512 in 400 epochs, a new SOTA for contrastive learning. We believe DCL provides a valuable baseline for future contrastive SSL studies.

Key Findings


The NPC Effect Hurts Small Batches

The InfoNCE loss has a hidden coupling multiplier that suppresses gradients when positives are close or negatives are far, making training inefficient at small batch sizes.

Decoupling Beats Complexity

Simply removing one term from the InfoNCE denominator outperforms complex solutions like momentum encoders and memory queues, achieving +6.4% on ImageNet-1K.

Plug-and-Play Improvement

DCL is a drop-in replacement for InfoNCE that works with SimCLR, MoCo, and NNCLR without any architectural changes, achieving 72.3% SOTA.

Key Insight: The NPC Problem

The standard InfoNCE loss used in contrastive learning (e.g., SimCLR) is:

$$\mathcal{L}_i^{(k)} = -\log \frac{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau)}{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau) + U_{i,k}} \tag{1}$$

where $U_{i,k}$ denotes the summation of negative terms for the view $k$ of sample $i$:

$$U_{i,k} = \sum_{l \in \{1,2\},\; j \in [\![1,N]\!],\; j \neq i} \exp(\langle \mathbf{z}_i^{(k)}, \mathbf{z}_j^{(l)} \rangle / \tau) \tag{2}$$

and $\tau$ is the temperature parameter.
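For concreteness, Eq. (1)-(2) can be sketched in NumPy. This is our own minimal illustration, not the authors' released code; the function and variable names are ours, and the embeddings are assumed L2-normalized as in SimCLR:

```python
import numpy as np

def info_nce_loss(z1, z2, tau=0.5):
    """InfoNCE loss of Eq. (1), averaged over both views and all samples.

    z1, z2: (N, d) arrays holding the two augmented views of N samples,
    each row L2-normalized.
    """
    N = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)           # (2N, d): all views stacked
    sim = z @ z.T / tau                            # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                 # exclude <z, z> self terms
    # The positive of row i is its other view: i <-> i + N.
    pos_idx = np.concatenate([np.arange(N) + N, np.arange(N)])
    logits = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * N), pos_idx].mean()
```

Each row's denominator contains the positive plus the $2N-2$ negatives of Eq. (2), matching the coupled form that the next proposition analyzes.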

Proposition 1. There exists a Negative-Positive Coupling (NPC) multiplier $q_{B,i}^{(1)}$ in the gradient of $\mathcal{L}_i^{(1)}$:

$$\left\{\begin{array}{l} -\nabla_{\mathbf{z}_{i}^{(1)}}\mathcal{L}_{i}^{(1)} = \frac{q_{B,i}^{(1)}}{\tau} \left( \mathbf{z}_i^{(2)} - \sum_{l \in \{1,2\},\; j \neq i}{\frac{\exp(\langle \mathbf{z}_i^{(1)},\mathbf{z}_j^{(l)} \rangle/\tau)}{U_{i,1}}}\cdot \mathbf{z}_j^{(l)}\right) \\[8pt] -\nabla_{\mathbf{z}_{i}^{(2)}}\mathcal{L}_{i}^{(1)} = \frac{q_{B,i}^{(1)}}{\tau}\cdot \mathbf{z}_i^{(1)}\\[8pt] -\nabla_{\mathbf{z}_{j}^{(l)}}\mathcal{L}_{i}^{(1)} = - \frac{q_{B,i}^{(1)}}{\tau}\cdot\frac{\exp(\langle \mathbf{z}_i^{(1)},\mathbf{z}_j^{(l)} \rangle/\tau)}{U_{i,1}}\cdot \mathbf{z}_i^{(1)} \end{array} \right. \tag{3}$$

where the NPC multiplier $q_{B,i}^{(1)}$ is:

$$q_{B,i}^{(1)} = 1 - \frac{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau)}{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau) + U_{i,1}} \tag{4}$$

Due to symmetry, a similar NPC multiplier $q_{B,i}^{(k)}$ exists in the gradient of $\mathcal{L}_i^{(k)}$, $k \in \{1,2\}$, $i \in [\![1,N]\!]$.

All partial gradients in Eq. (3) are modulated by the common NPC multiplier. This coupling is problematic because:

  • When a positive sample is close (easy positive), the gradient from informative negatives gets suppressed.
  • When negative samples are far (easy negatives), the gradient from the informative positive is reduced.
  • With smaller batch sizes, the classification task becomes simpler, causing $q_B$ to cluster near 0 and drastically reducing learning efficiency.
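The shrinkage of $q_B$ in Eq. (4) is easy to check numerically. The sketch below is our own illustration with made-up similarity values, not an experiment from the paper:

```python
import numpy as np

def npc_multiplier(pos_sim, neg_sims, tau=0.5):
    """NPC multiplier q_B of Eq. (4), from raw cosine similarities."""
    pos = np.exp(pos_sim / tau)
    U = np.exp(np.asarray(neg_sims) / tau).sum()   # U_{i,k} of Eq. (2)
    return 1.0 - pos / (pos + U)

# An easy positive (similarity 0.9) against mildly similar negatives (0.1).
# A batch of N samples contributes 2N - 2 negative terms.
q_small_batch = npc_multiplier(0.9, [0.1] * 6)     # N = 4  -> 6 negatives
q_large_batch = npc_multiplier(0.9, [0.1] * 510)   # N = 256 -> 510 negatives
assert q_small_batch < q_large_batch  # fewer negatives shrink q_B toward 0
```

With fewer negatives, the positive term dominates the denominator, so $q_B$ collapses toward 0 and every gradient in Eq. (3) is scaled down accordingly.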
NPC coupling visualization

Figure 2. (a) SimCLR framework. (b) The gradient is modulated by the NPC multiplier $q_B$. (c) Two failure cases: easy positives suppress negative gradients (top), and easy negatives suppress positive gradients (bottom).

Method: Decoupled Contrastive Loss

Proposition 2 (the DCL Loss). Removing the positive pair from the denominator of Eq. (1) leads to a decoupled contrastive learning loss. If we remove the NPC multiplier $q_{B,i}^{(k)}$ from Eq. (3), we reach $\mathcal{L}_{DC} = \sum_{k \in \{1,2\},\; i \in [\![1,N]\!]} \mathcal{L}_{DC,i}^{(k)}$:

$$\mathcal{L}_{DC,i}^{(k)} = -\log \frac{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau)}{\cancel{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \tau)} + U_{i,k}} = -\frac{\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle}{\tau} + \log U_{i,k} \tag{6}$$

We further generalize DCL by weighting the positive pairs with a function $w$, yielding the weighted variant $\mathcal{L}_{DCW}$ (DCLW):

$$\mathcal{L}_{DCW,i}^{(k)} = -w(\mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)}) \cdot \frac{\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle}{\tau} + \log U_{i,k} \tag{7}$$

where $w(\mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)}) = 2 - \frac{\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \sigma)}{\mathbb{E}_i[\exp(\langle \mathbf{z}_i^{(1)}, \mathbf{z}_i^{(2)} \rangle / \sigma)]}$ gives larger weight to hard positives (pairs that are far apart), with $\mathbb{E}[w] = 1$.

Key properties of DCL:

  • Plug-and-play: Replace InfoNCE loss in any contrastive method (SimCLR, MoCo, NNCLR)
  • No additional components: No momentum encoder, memory queue, or stop-gradient needed
  • Batch-size invariant: Performance is stable across batch sizes from 32 to 4096
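The decoupled losses of Eq. (6) and Eq. (7) can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming, not the authors' released implementation; the weighting temperature `sigma` is an assumption taken from the $\sigma$ in the definition of $w$:

```python
import numpy as np

def dcl_loss(z1, z2, tau=0.1, weighted=False, sigma=0.5):
    """Decoupled contrastive loss, Eq. (6); Eq. (7) when weighted=True.

    z1, z2: (N, d) L2-normalized embeddings of the two augmented views.
    """
    N = z1.shape[0]
    pos = (z1 * z2).sum(axis=1)                        # <z_i^{(1)}, z_i^{(2)}>
    z_all = np.concatenate([z1, z2], axis=0)           # (2N, d)
    losses = []
    for k, za in enumerate([z1, z2]):
        sim = za @ z_all.T / tau                       # (N, 2N)
        mask = np.ones_like(sim, dtype=bool)
        mask[np.arange(N), np.arange(N) + k * N] = False        # self term
        mask[np.arange(N), np.arange(N) + (1 - k) * N] = False  # positive term
        U = np.where(mask, np.exp(sim), 0.0).sum(axis=1)        # Eq. (2)
        w = 1.0
        if weighted:  # harder (more distant) positives get larger weight
            e = np.exp(pos / sigma)
            w = 2.0 - e / e.mean()                     # empirical E[w] = 1
        losses.append((-w * pos / tau + np.log(U)).mean())
    return sum(losses) / 2
```

Note that, unlike Eq. (1), the positive term never enters the denominator, so the gradient carries no $q_B$ multiplier; this is the entire change needed to plug DCL into an existing SimCLR-style pipeline.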

Analysis

Batch size comparison

Figure 3. ImageNet-1K top-1 accuracy across different batch sizes. DCL maintains stable performance while baselines degrade significantly at small batch sizes.

CIFAR10 convergence

Figure 4a. Convergence on CIFAR-10.

STL10 convergence

Figure 4b. Convergence on STL-10.

t-SNE visualization

Figure 4c. t-SNE visualization showing stronger cluster separation with DCL.

Quantitative Results

SimCLR with DCL/DCLW at batch size 256, 200 epochs pre-training. DCL consistently improves over the SimCLR baseline across all benchmarks, with DCLW achieving the best results.

| Method | ImageNet-1K | ImageNet-100 | CIFAR-10 | CIFAR-100 | STL-10 |
|---|---|---|---|---|---|
| SimCLR (Baseline) | 61.8 | 80.7 | 81.4 | 52.0 | 80.7 |
| + DCL | 65.9 | 83.1 | 84.2 | 54.9 | 81.2 |
| + DCLW | 68.2 | 84.2 | 85.7 | 57.1 | 81.3 |
| Δ Improvement | ↑6.4 | ↑3.5 | ↑4.3 | ↑5.1 | ↑0.6 |

Batch size sensitivity (ImageNet-1K Linear Top-1, 200 epochs). DCL maintains stable performance across batch sizes while the SimCLR baseline degrades significantly at smaller batches.

| Method | BS 32 | BS 64 | BS 128 | BS 256 | BS 512 |
|---|---|---|---|---|---|
| SimCLR | 56.8 | 58.9 | 60.6 | 61.8 | 64.0 |
| SimCLR + DCL | 61.5 | 63.4 | 64.3 | 65.9 | 65.8 |
| Δ Improvement | ↑4.7 | ↑4.5 | ↑3.7 | ↑4.1 | ↑1.8 |

Comparison with state-of-the-art SSL methods (ImageNet-1K Linear Top-1, ResNet-50). DCL combined with NNCLR achieves 72.3% with significantly smaller batch size and fewer epochs than competing methods.

| Method | Batch Size | Epochs | Top-1 (%) |
|---|---|---|---|
| MoCo-v2 | 256 | 200 | 67.5 |
| SiMo | 256 | 200 | 68.0 |
| SwAV | 4096 | 200 | 69.1 |
| SimSiam | 256 | 200 | 70.0 |
| InfoMin | 256 | 200 | 70.1 |
| BYOL | 4096 | 200 | 70.6 |
| SimCLR + DCL | 256 | 200 | 67.8 |
| SimCLR + DCLW | 256 | 200 | 68.2 |
| SimCLR | 4096 | 1000 | 69.3 |
| MoCo-v2 | 256 | 400 | 71.0 |
| Barlow Twins | 256 | 300 | 70.7 |
| SimSiam | 256 | 400 | 70.8 |
| SwAV | 4096 | 400 | 70.7 |
| BYOL | 4096 | 400 | 73.2 |
| NNCLR | 512 | 1000 | 71.7 |
| SimCLR + DCL | 256 | 400 | 69.5 |
| NNCLR + DCL | 256 | 400 | 71.1 |
| NNCLR + DCL | 512 | 400 | 72.3 |

BibTeX

@inproceedings{yeh2022decoupled,
  title={Decoupled Contrastive Learning},
  author={Yeh, Chun-Hsiao and Hong, Cheng-Yao and Hsu, Yen-Chi and Liu, Tyng-Luh and Chen, Yubei and LeCun, Yann},
  booktitle={European Conference on Computer Vision (ECCV)},
  pages={668--684},
  year={2022}
}