1. Bibliographic Information

1.1. Title

The paper is titled "Selective Synergistic Learning for Video Object-Centric Learning". The name directly hints at the core idea: a selective (rather than dense) mutual‑teaching scheme between the encoder and decoder of a video object‑centric model, designed to combine the strengths of each module while avoiding error propagation.

1.2. Authors

The authors are WonJun Moon and Jae‑Pil Heo. Based on the acknowledgment numbers in the original manuscript and related prior works (such as the ICLR 2026 paper SRL), they are affiliated with Sungkyunkwan University (SKKU), South Korea. Jae‑Pil Heo is a professor, and WonJun Moon is a researcher in the same group.

1.3. Journal / Conference

The paper was released on arXiv on June 14, 2026 (preprint identifier 2606.15527). At the time of writing, no specific conference or journal venue is explicitly mentioned beyond the arXiv publication. However, the research context and comparison baselines (e.g., ICLR, CVPR papers) strongly suggest it targets top‑tier machine learning and computer vision venues.

1.4. Publication Year

2026 (June).

1.5. Abstract

The abstract summarises the following points:

Problem: Typical video object‑centric learning (VOCL) uses slot‑based encoder‑decoder architectures where learning relies on two spatial maps: attention maps from the encoder (sharp but noisy) and object maps from the decoder (smooth but with blurred boundaries). Dense alignment methods that enforce agreement across all spatio‑temporal patches can propagate the weaknesses of each module and suffer from quadratic memory costs.
Proposed solution: Selective Synergistic Learning (SSync) distills only the most reliable cues: the encoder’s sharp boundaries to refine the decoder, and the decoder’s coherent interiors to denoise the encoder. This is implemented via pseudo‑labeling with linear complexity.
Additional mechanism: A transitive pseudo‑label merging consolidates overlapping (redundant) slots using spatio‑temporal activation consistency.
Results: SSync improves decomposition quality, is plug‑and‑play, robust to slot configurations, and memory‑efficient.

1.6. Original Source Link

The paper is available on arXiv: https://arxiv.org/abs/2606.15527
PDF link: https://arxiv.org/pdf/2606.15527v1
Code: github.com/wjun0830/SSync

The work is a self‑contained preprint; no official conference publication has been indicated apart from the arXiv date.

2. Executive Summary

2.1. Background & Motivation

In video object‑centric learning (VOCL), a neural network tries to parse a video into slots—latent representations that each correspond to an object or a coherent region, without any human annotations. The dominant framework uses a slot attention encoder to group image patch features into slots and a decoder that reconstructs each frame pixel‑wise, implicitly producing an object map that indicates which slot is responsible for which patch.

A fundamental tension exists: the encoder’s attention map $\mathbf{A}$ tends to be sharp at object boundaries but also introduces noisy, fragmented assignments; the decoder’s object map $\mathbf{D}$ is spatio‑temporally smooth yet suffers from blurred object boundaries due to the low‑frequency inductive bias of lightweight decoders. Ideally these two maps should be consistent, but in practice they persistently misalign.

A previous method, SRL (Synergistic Representation Learning, ICLR 2026), attempted to fix this by applying a dense contrastive loss between all pairs of spatio‑temporal patches in $\mathbf{A}$ and $\mathbf{D}$ . While partially effective, this dense alignment has two critical drawbacks:

It propagates noise and blur because it treats every patch as an equally reliable teacher, inadvertently reinforcing the architectural weaknesses of both modules.
The pairwise comparison across $T\times H\times W$ patches results in $\mathcal{O}((T\cdot H\cdot W)^2)$ memory cost, severely limiting scalability to longer or higher‑resolution videos.

SSync is motivated by a selective alignment principle: instead of aligning everything everywhere, we should only enforce mutual supervision where each module is structurally reliable. This simple insight (that the encoder’s sharpness is trustworthy at boundaries and the decoder’s coherence is trustworthy inside objects) enables a lightweight, error‑resistant, and linearly scaling alignment scheme.

2.2. Main Contributions / Findings

Reliability‑based selective distillation: The paper identifies the spatial complementarity between the encoder (boundaries) and decoder (interiors) and formulates mutual learning as a cross‑distillation via pseudo‑labels applied only in carefully selected regions.
Redundancy control via transitive merging: To prevent on‑the‑fly pseudo‑labels from reinforcing slot fragmentation (where a single object is split into multiple slots), a transitive pseudo‑label merging strategy is introduced. It groups overlapping slots into a single identity using a spatio‑temporal overlap graph.
Strong performance & practical generalization: SSync achieves state‑of‑the‑art object decomposition on standard VOCL benchmarks (MOVi‑C, MOVi‑E, YouTube‑VIS), as well as on image‑only benchmarks. It significantly reduces memory consumption compared to dense alignment, works as a plug‑and‑play module with no additional parameters, and shows exceptional robustness to the number of slots, a common hyperparameter that usually requires careful dataset‑specific tuning.

3.1. Foundational Concepts

3.1.1. Object‑Centric Learning and Slot Attention

Object‑centric learning aims to decompose visual scenes into a set of object representations without supervision. The most influential architecture is Slot Attention (Locatello et al., NeurIPS 2020). In its image‑level version, an input image is split into patches, each represented by a feature vector (e.g., from a Vision Transformer). A set of S slots (randomly initialised vectors) compete to attend to the patches through an iterative attention mechanism:

Each slot produces a query; each patch produces a key.
Dot‑product attention yields an attention map $\mathbf{A}\in\mathbb{R}^{P\times S}$ (with $P = H\times W$ patches), giving the probability that patch $p$ is assigned to slot $s$ .
The slots are updated by aggregating patch features weighted by this attention map.
A decoder reconstructs the image using the slot representations, which normally also produces a “rendering” or object map $\mathbf{D}$ that indicates per‑pixel, per‑slot responsibility.
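
The competition and update steps listed above can be sketched in a few lines of NumPy. This is a simplified illustration, not the original Slot Attention implementation: the projection matrices `Wq`, `Wk`, `Wv` are random placeholders, and the learned GRU/MLP slot update is replaced by a plain weighted mean (an assumption made to keep the example short).

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, feats, Wq, Wk, Wv, eps=1e-8):
    """One simplified competition-and-update step.

    slots: (S, d) slot vectors, feats: (P, d) patch features.
    Returns the updated slots (S, d) and the attention map A of shape (P, S).
    """
    q = slots @ Wq                                  # (S, d): one query per slot
    k = feats @ Wk                                  # (P, d): one key per patch
    v = feats @ Wv                                  # (P, d): one value per patch
    logits = k @ q.T / np.sqrt(q.shape[1])          # (P, S)
    A = softmax(logits, axis=1)                     # normalise over slots: slots compete per patch
    w = A / (A.sum(axis=0, keepdims=True) + eps)    # per-slot weights over patches
    new_slots = w.T @ v                             # (S, d): attention-weighted mean of values
    return new_slots, A

rng = np.random.default_rng(0)
P, S, d = 16, 3, 8                                  # toy sizes
feats = rng.normal(size=(P, d))
slots = rng.normal(size=(S, d))                     # randomly initialised slots
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
for _ in range(3):                                  # a few refinement iterations
    slots, A = slot_attention_step(slots, feats, Wq, Wk, Wv)
```

The key design point is the softmax axis: normalising over slots rather than over patches makes the slots compete for each patch, so every row of `A` is a distribution over slots.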

This framework has been extended to videos by incorporating temporal recurrence: slots are updated from frame to frame, and the decoder may also use temporal information.

3.1.2. The Discrepancy between $\mathbf{A}$ and $\mathbf{D}$

Because the encoder (which produces $\mathbf{A}$) usually operates on features from a high‑capacity backbone (e.g., DINOv2), its attention maps can capture high‑frequency details (sharp edges), but they may also contain salt‑and‑pepper noise due to imperfect grouping. The decoder (often a lightweight spatial broadcast MLP) imposes a smooth spatial prior; its $\mathbf{D}$ is temporally coherent and low‑noise, but blurs across object boundaries. This asymmetry means that neither map is reliable everywhere, but together they contain complementary information.

3.1.3. Pseudo‑Labeling

Pseudo‑labeling is a semi‑supervised technique where a model’s own confident predictions are used as training targets. In the SSync context, the model generates pseudo‑labels from the relatively reliable parts of one spatial map to supervise the other. The challenge is that early‑training predictions are noisy; therefore SSync incorporates warm‑up and merging mechanisms to prevent error amplification.

3.2. Previous Works

3.2.1. Video Object‑Centric Learning

SAVi (Kipf et al., ICLR 2022) and SAVi++ (Elsayed et al., NeurIPS 2022) extended slot attention to video by conditioning on motion cues like optical flow and depth.
STEVE (Singh et al., NeurIPS 2022) used a transformer‑based slot encoder to model complex object interactions.
VideoSAUR (Zadaianchuk et al., NeurIPS 2023) trained on real‑world videos by predicting temporal feature similarity, encouraging grouping of motion‑consistent regions.
SlotContrast (Manasyan et al., CVPR 2025) added a temporal contrastive objective that pulls slot representations of the same object across frames closer and pushes different objects apart, improving slot discriminability.
SRL (Seong et al., ICLR 2026) directly tackled the encoder‑decoder map discrepancy via dense alignment. It introduced a ternary contrastive loss that aligns all spatio‑temporal patches from $\mathbf{A}$ and $\mathbf{D}$ . However, this indiscriminate alignment propagated noise/blur and had $\mathcal{O}((T\cdot H\cdot W)^2)$ complexity.

3.2.2. Dense Alignment in SRL

SRL uses a contrastive loss that treats pairs of patches that share the same slot assignment as positives and all others as negatives. Concretely, for each patch, its representation from $\mathbf{A}$ (projected into an embedding space) and the corresponding one from $\mathbf{D}$ are attracted, while being repelled from all other patches. This requires computing a similarity matrix of size $(T\cdot P)\times(T\cdot P)$ , consuming memory quadratic in the total number of patches.

SSync’s differentiation: Instead of dense, unselective alignment, SSync performs asymmetric selective distillation only in boundary (where encoder is reliable) and interior (where decoder is reliable) regions. It avoids the quadratic complexity by using hard pseudo‑labels and an MSE loss, not pairwise comparisons.
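
To make the scaling argument concrete, the following back‑of‑the‑envelope sketch counts the entries of a dense patch‑to‑patch similarity matrix against the entries needed when each patch is compared only with its own slot label. The sizes are hypothetical and not taken from the paper; the per‑patch count of $T\cdot P\cdot S$ assumes a fixed number of slots $S$.

```python
def dense_pairwise_entries(T, H, W):
    """Entries in a (T*H*W) x (T*H*W) patch-to-patch similarity matrix."""
    n = T * H * W
    return n * n

def per_patch_entries(T, H, W, S):
    """Entries touched when each patch is compared only with its own S-way label."""
    return T * H * W * S

T, H, W, S = 8, 24, 24, 11  # hypothetical sizes, not taken from the paper
dense = dense_pairwise_entries(T, H, W)
linear = per_patch_entries(T, H, W, S)
print(f"dense: {dense:,} entries (~{dense * 4 / 2**20:.0f} MiB in float32)")
print(f"per-patch: {linear:,} entries")
```

Doubling the number of frames quadruples the dense count but only doubles the per‑patch count, which is the practical meaning of quadratic versus linear memory growth.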

3.3. Technological Evolution

The field progressed from single‑image slot attention to video models by adding temporal recurrence (SAVi), motion‑based grouping (SAVi++, VideoSAUR), contrastive learning (SlotContrast), and explicit encoder‑decoder alignment (SRL). SSync represents the next step: shifting from dense to selective alignment, and from passive to active redundancy handling (via transitive merging). It also emphasises practical scalability.

3.4. Differentiation Analysis

Relative to the closest prior work, SRL, the core differences are:

Alignment scope: Dense (all patches) vs. selective (only reliable boundary/interior patches).
Alignment method: Ternary contrastive loss requiring extra projector modules vs. simple pseudo‑label MSE with no additional parameters.
Redundancy handling: Fixed warm‑up regularisation vs. adaptive transitive merging that runs throughout training.
Computational complexity: $\mathcal{O}((T\cdot H\cdot W)^2)$ vs. $\mathcal{O}(T\cdot H\cdot W)$ , making SSync much more memory‑efficient.

4. Methodology

4.1. Principles

The core idea is selective synergy: use each branch (encoder and decoder) to teach the other only in the spatial regions where the teacher is most reliable. The encoder’s sharp attention map is exploited to refine decoder boundaries; the decoder’s coherent object map is used to denoise encoder interiors. This simple rule, together with a mechanism to merge redundant slots, achieves robust alignment without the pitfalls of dense, indiscriminate supervision.

4.2. Core Methodology In‑depth (Layer by Layer)

4.2.1. Preliminaries: Video Slot Learning and Spatial Maps

Consider a video of $T$ frames, each divided into $P = H\times W$ patches. Let $z_{t,p}$ be the feature of patch $p$ at time $t$ (extracted by a frozen Vision Transformer). The encoder (a recurrent slot attention module) maintains $S$ slot representations and projects them to keys $\mathbf{k}_{t,s}$ and the patch features to queries $\mathbf{q}_{t,p}$ . The encoder attention map $\mathbf{A}\in\mathbb{R}^{T\times P\times S}$ is obtained via:

$ \mathbf{A}_{t,p} = \mathrm{Softmax}\big( \mathbf{q}_{t,p}^\top \mathbf{k}_{t,1:S} \big), \qquad\qquad\qquad\text{(Eq. 1)} $

where $\mathbf{A}_{t,p,s}$ is the probability that patch (t,p) is assigned to slot $s$ . This map inherits the sharpness of the backbone but also its local noise.

The decoder reconstructs the input and outputs a decoder object map $\mathbf{D}\in\mathbb{R}^{T\times P\times S}$ , normalised over slots such that $\sum_{s=1}^S \mathbf{D}_{t,p,s}=1$ for all t,p. Due to the decoder’s smoothness, $\mathbf{D}$ tends to be coherent inside objects but blurry at boundaries.

Bias misalignment: Ideally $\mathbf{A}$ and $\mathbf{D}$ should agree, but in practice they exhibit a systematic discrepancy. Quantitative evidence (Table 1 of the paper) shows that the symmetric Adjusted Rand Index (ARI) between $\mathbf{A}$ and $\mathbf{D}$ is only around 73–77% for prior methods, indicating substantial inconsistency.

4.2.2. Selective Alignment Principle

The key insight is that the reliability of each map is region‑dependent. Therefore, instead of trying to enforce agreement everywhere, we should only align in regions where the respective map is naturally trustworthy. This prevents error propagation and drastically reduces computational cost.

SSync implements this by first extracting hard slot assignments from both maps:

$ \hat s_{t,p}^{A} = \underset{s}{\mathrm{argmax}}\,\mathbf{A}_{t,p,s}, \qquad \hat s_{t,p}^{D} = \underset{s}{\mathrm{argmax}}\,\mathbf{D}_{t,p,s}. \qquad\text{(Eq. 2)} $

These hard pseudo‑labels will act as teacher signals, but only after we refine them and determine where they should be applied.
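
As a small illustration of Eq. 2, the hard assignments are a per‑patch argmax over the slot axis. The toy maps below are made up for the example; the `agreement` ratio is only an illustrative diagnostic, not a quantity defined in the paper.

```python
import numpy as np

# Toy maps with T=1 frame, P=2 patches, S=3 slots; each row sums to 1.
A = np.array([[[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1]]])
D = np.array([[[0.4, 0.5, 0.1],
               [0.2, 0.6, 0.2]]])

s_hat_A = A.argmax(axis=-1)   # hard encoder labels, shape (T, P)
s_hat_D = D.argmax(axis=-1)   # hard decoder labels, shape (T, P)

# Fraction of patches on which the two hard labelings coincide.
agreement = float((s_hat_A == s_hat_D).mean())
```

In this toy case the two maps disagree on the first patch and agree on the second.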

4.2.3. Structuring Pseudo‑Labels: Boundary and Interior Selection

To identify reliable regions, the method analyses local slot consistency. First, a spatio‑temporal neighbourhood $\mathcal{N}(t,p)$ is defined for each patch:

$ \mathcal{N}(t,p) = \{ (t,q) \mid q\in\mathcal{N}_{\mathrm{sp}}(p) \} \;\cup\; \{(t-1,p), (t+1,p)\}, \qquad\qquad\text{(Eq. 3)} $

where $\mathcal{N}_{\mathrm{sp}}(p)$ is the 8‑neighbourhood on the spatial grid. This includes adjacent frames at the same spatial location, helping to detect motion edges.

Next, for a map $\mathbf{X}\in\{\mathbf{A},\mathbf{D}\}$ , the model counts the number of neighbouring patches with a different hard assignment (disagreement) and the number with the same hard assignment (agreement):

$ c_{(t,p)}^{\neq,\mathbf{X}} = \sum_{(t',q)\in\mathcal{N}(t,p)} \mathbb{I}\big[\hat s_{t',q}^{\mathbf{X}} \neq \hat s_{t,p}^{\mathbf{X}}\big],\qquad c_{(t,p)}^{=,\mathbf{X}} = \sum_{(t',q)\in\mathcal{N}(t,p)} \mathbb{I}\big[\hat s_{t',q}^{\mathbf{X}} = \hat s_{t,p}^{\mathbf{X}}\big], \qquad\qquad\text{(Eq. 4)} $

where $\mathbb{I}[\cdot]$ is the indicator function.

Boundary region selection: Boundary patches are characterised by local disagreement while still having at least one agreeing neighbour (to avoid isolated noise). The encoder’s attention map $\mathbf{A}$ is sharper, so boundary candidates are detected from $\mathbf{A}$ :

$ \mathcal{P}_{\mathrm{bd}} = \left\{ (t,p) \mid c_{(t,p)}^{\neq,\mathbf{A}} > n_{\mathrm{bd}} \;\wedge\; c_{(t,p)}^{=,\mathbf{A}} \ge 1 \right\}, \qquad\qquad\text{(Eq. 5)} $

where $n_{\mathrm{bd}}$ is a sensitivity threshold (set to 1 in all experiments). This selects patches where at least two neighbours disagree and at least one agrees, capturing true object transitions while discarding solitary noisy pixels.

Interior region selection: Interior patches exhibit high local consistency. The decoder map $\mathbf{D}$ is smooth and therefore more reliable in these regions:

$ \mathcal{P}_{\mathrm{nbd}} = \left\{ (t,p) \mid c_{(t,p)}^{\neq,\mathbf{D}} < n_{\mathrm{nbd}} \right\}, \qquad\qquad\text{(Eq. 6)} $

where $n_{\mathrm{nbd}}$ (also set to 1) controls strictness; a patch is included in $\mathcal{P}_{\mathrm{nbd}}$ only if the number of differing neighbour assignments is strictly smaller than $n_{\mathrm{nbd}}$. With the default value of 1, this means that no neighbour disagrees.

Intuitively, this selection acts like a relaxed morphological erosion with noise filtering, parameterised in space and time.
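
The neighbourhood counts and the two selection rules (Eq. 3 to Eq. 6) can be sketched as follows. This is a minimal illustration rather than the authors' code: labels are laid out on a (T, H, W) grid, and neighbours that fall outside the video volume are simply skipped, which is an assumption since the border handling is not specified here. The function names are hypothetical.

```python
import numpy as np

def neighbour_counts(labels):
    """labels: (T, H, W) int array of hard slot assignments.

    Returns (c_neq, c_eq), each (T, H, W), counted over the 8 spatial
    neighbours in the same frame plus the same location at t-1 and t+1.
    Neighbours outside the video volume are skipped (assumption).
    """
    T, H, W = labels.shape
    offsets = [(0, dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)
               if (dy, dx) != (0, 0)]
    offsets += [(-1, 0, 0), (1, 0, 0)]
    c_neq = np.zeros(labels.shape, dtype=int)
    c_eq = np.zeros(labels.shape, dtype=int)
    for dt, dy, dx in offsets:
        # Centre patches whose neighbour at this offset exists ...
        ct = slice(max(0, -dt), T - max(0, dt))
        cy = slice(max(0, -dy), H - max(0, dy))
        cx = slice(max(0, -dx), W - max(0, dx))
        # ... and the corresponding neighbour patches.
        nt = slice(max(0, dt), T - max(0, -dt))
        ny = slice(max(0, dy), H - max(0, -dy))
        nx = slice(max(0, dx), W - max(0, -dx))
        same = labels[ct, cy, cx] == labels[nt, ny, nx]
        c_eq[ct, cy, cx] += same
        c_neq[ct, cy, cx] += ~same
    return c_neq, c_eq

def select_regions(s_hat_A, s_hat_D, n_bd=1, n_nbd=1):
    """Boundary set from the encoder labels, interior set from the decoder labels."""
    neq_A, eq_A = neighbour_counts(s_hat_A)
    neq_D, _ = neighbour_counts(s_hat_D)
    P_bd = (neq_A > n_bd) & (eq_A >= 1)   # Eq. 5
    P_nbd = neq_D < n_nbd                 # Eq. 6
    return P_bd, P_nbd

# Toy scene: one frame, left half is slot 0, right half is slot 1.
labels_D = np.zeros((1, 4, 6), dtype=int)
labels_D[:, :, 3:] = 1
labels_A = labels_D.copy()
labels_A[0, 1, 0] = 2   # isolated noisy patch in the encoder labels
P_bd, P_nbd = select_regions(labels_A, labels_D)
```

In this toy scene only the two columns on either side of the slot transition are selected as boundary patches. The isolated noisy patch is rejected because it has no agreeing neighbour, and its neighbours are rejected because only one of their neighbours disagrees.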

4.2.4. Transitive Pseudo‑Label Merging

Although the above selection restricts supervision to structurally reliable regions, the pseudo‑labels themselves come from on‑the‑fly model predictions and can still exhibit over‑fragmentation, where a single object is split into multiple slot identities. If such fragmented assignments are used as supervision targets, they reinforce the fragmentation.

To combat this, SSync merges redundant slots based on their spatio‑temporal activation overlap. This is done before the pseudo‑labels are used for alignment.

Step 1: Compute binary active‑region masks for each slot.

For each slot $s$ , a thresholded mask is obtained by comparing its attention values to the slot‑wise mean activation across all patches:

$ \mathbf{M}_{t,p,s} = \mathbb{I}\big[ \mathbf{A}_{t,p,s} > \mu_s \big], \qquad \mu_s = \frac{1}{T P}\sum_{t,p} \mathbf{A}_{t,p,s}. \qquad\qquad\text{(Eq. 7)} $

Thus, $\mathbf{M}_{t,p,s}=1$ if slot $s$ has above‑average attention on patch (t,p), identifying its active spatial footprint.

Step 2: Measure pairwise overlap via frame‑averaged IoU.

For any two slots s,s', the overlap is quantified by the average Intersection‑over‑Union (IoU) of their active masks across frames:

$ \mathrm{IoU}(s,s') = \frac{1}{T}\sum_{t=1}^{T} \frac{ \sum_{p} \mathbf{M}_{t,p,s}\,\mathbf{M}_{t,p,s'} }{ \sum_{p} \mathbf{M}_{t,p,s} + \sum_{p} \mathbf{M}_{t,p,s'} - \sum_{p} \mathbf{M}_{t,p,s}\,\mathbf{M}_{t,p,s'} }, \qquad\qquad\text{(Eq. 8)} $

where the denominator is the union of the two masks. This yields a value between 0 and 1.
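
Steps 1 and 2 (Eq. 7 and Eq. 8) can be sketched as below. The function names are hypothetical, and treating a frame in which both masks are empty as IoU 0 is an assumption, since that corner case is not discussed here.

```python
import numpy as np

def active_masks(A):
    """A: (T, P, S). Eq. 7: threshold each slot at its own mean activation."""
    mu = A.mean(axis=(0, 1))            # (S,) slot-wise mean over all patches and frames
    return A > mu                       # broadcasts over (T, P)

def frame_averaged_iou(M):
    """M: (T, P, S) boolean masks. Returns the (S, S) matrix of Eq. 8."""
    Mf = M.astype(float)
    inter = np.einsum('tps,tpr->tsr', Mf, Mf)               # (T, S, S) per-frame intersections
    area = Mf.sum(axis=1)                                   # (T, S) per-frame mask sizes
    union = area[:, :, None] + area[:, None, :] - inter     # (T, S, S)
    # Frames where both masks are empty get IoU 0 (assumption).
    iou_t = np.divide(inter, union, out=np.zeros_like(inter), where=union > 0)
    return iou_t.mean(axis=0)

# Toy map: slots 0 and 1 share patches 0-1, slot 2 owns patches 2-3.
A = np.array([[[0.45, 0.45, 0.10],
               [0.45, 0.45, 0.10],
               [0.05, 0.05, 0.90],
               [0.05, 0.05, 0.90]]])   # shape (T=1, P=4, S=3)
M = active_masks(A)
iou = frame_averaged_iou(M)
```

Here slots 0 and 1 have identical active footprints, so their IoU is 1, while slot 2 does not overlap with either of them.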

Step 3: Build a redundancy graph and find connected components.

A graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ is constructed with nodes $\mathcal{V}$ representing the slots. An undirected edge (s,s') is added if

$ \mathrm{IoU}(s,s') > \tau_{\mathrm{merge}}, \qquad\qquad\text{(Eq. 9)} $

where $\tau_{\mathrm{merge}}$ is a threshold (e.g., 0.7). This indicates that the two slots significantly overlap and likely represent the same object.

Because redundancy can be transitive (slot A overlaps with B, and B overlaps with C, but A and C may not directly overlap very much), the method groups slots by computing the connected components of $\mathcal{G}$ . Each component $\mathcal{C}_k$ is a set of mutually redundant slots.

Step 4: Choose a dominant slot identity per component and relabel.

For each connected component $\mathcal{C}_k$ , a representative slot $s_k^*$ is chosen as the one with the largest spatial footprint (most patches assigned in the hard attention map):

$ s_k^* = \underset{s\in\mathcal{C}_k}{\mathrm{argmax}} \sum_{t,p} \mathbb{I}\big[\hat s_{t,p}^{A} = s\big]. \qquad\qquad\text{(Eq. 10)} $

A relabeling function $\phi(\cdot)$ consolidates all slots in the component to this representative:

$ \phi(s) = \begin{cases} s_k^* & \text{if } s\in\mathcal{C}_k, \\ s & \text{otherwise}. \end{cases} \qquad\qquad\text{(Eq. 11)} $

Both pseudo‑label maps are then relabeled:

$ \hat s_{t,p}^{A} \leftarrow \phi(\hat s_{t,p}^{A}), \qquad \hat s_{t,p}^{D} \leftarrow \phi(\hat s_{t,p}^{D}). \qquad\qquad\text{(Eq. 12)} $

After merging, the local consistency counts (Eq. 4) and the boundary/interior sets $\mathcal{P}_{\mathrm{bd}}$ , $\mathcal{P}_{\mathrm{nbd}}$ are recomputed using the updated hard assignments. This ensures that the supervision is based on semantically coherent, de‑fragmented identities.
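
Steps 3 and 4 (Eq. 9 to Eq. 12) amount to finding connected components in a thresholded overlap graph and relabeling each component to its largest member. The sketch below uses a small union‑find for the components; this is one possible implementation choice, not necessarily the one used by the authors, and the IoU values and labels are made up for the example.

```python
import numpy as np

def merge_map(iou, s_hat_A, tau_merge=0.7):
    """Return phi as an array of length S, where phi[s] is the representative of slot s."""
    S = iou.shape[0]
    parent = list(range(S))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Eq. 9: connect slots whose overlap exceeds the threshold.
    for s in range(S):
        for r in range(s + 1, S):
            if iou[s, r] > tau_merge:
                parent[find(s)] = find(r)

    # Eq. 10: the representative is the slot with the largest hard-label footprint.
    footprint = np.bincount(s_hat_A.ravel(), minlength=S)
    phi = np.arange(S)
    for root in {find(s) for s in range(S)}:
        comp = [s for s in range(S) if find(s) == root]
        rep = max(comp, key=lambda s: footprint[s])
        phi[comp] = rep                 # Eq. 11
    return phi

# Slots 0-1 and 1-2 overlap strongly, 0-2 do not: the merge must be transitive.
iou = np.array([[1.0, 0.8, 0.3, 0.0],
                [0.8, 1.0, 0.75, 0.0],
                [0.3, 0.75, 1.0, 0.0],
                [0.0, 0.0, 0.0, 1.0]])
s_hat_A = np.array([[0, 0, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3]])
s_hat_D = np.array([[0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]])

phi = merge_map(iou, s_hat_A)
merged_A = phi[s_hat_A]                 # Eq. 12 applied to both label maps
merged_D = phi[s_hat_D]
```

Slots 0 and 2 end up with the same identity even though their direct overlap (0.3) is below the threshold, because both are linked to slot 1. A purely pairwise rule would leave them separate.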

4.2.5. Training Objective: Asymmetric Cross‑Distillation

With the refined pseudo‑labels and the selected boundary/interior regions, the model uses a simple MSE loss to teach each map using the other’s reliable cues.

For boundary patches $(t,p)\in\mathcal{P}_{\mathrm{bd}}$ , the decoder object map $\mathbf{D}$ is supervised with the encoder‑derived pseudo‑label (one‑hot vector):

$ \mathcal{L}_{\mathrm{bd}} = \frac{1}{|\mathcal{P}_{\mathrm{bd}}|} \sum_{(t,p)\in\mathcal{P}_{\mathrm{bd}}} \big\| \mathbf{D}_{t,p} - \mathrm{onehot}(\hat s_{t,p}^{A}) \big\|_2^2, \qquad\qquad\text{(Eq. 13)} $

where $\mathrm{onehot}(\cdot)$ converts the scalar slot index into a vector with 1 at that index and 0 elsewhere. This loss forces the decoder to sharpen its boundaries using the sharp boundary cues from the encoder.

For interior patches $(t,p)\in\mathcal{P}_{\mathrm{nbd}}$ , the encoder attention map $\mathbf{A}$ is supervised with the decoder‑derived pseudo‑label:

$ \mathcal{L}_{\mathrm{nbd}} = \frac{1}{|\mathcal{P}_{\mathrm{nbd}}|} \sum_{(t,p)\in\mathcal{P}_{\mathrm{nbd}}} \big\| \mathbf{A}_{t,p} - \mathrm{onehot}(\hat s_{t,p}^{D}) \big\|_2^2, \qquad\qquad\text{(Eq. 14)} $

This denoises the encoder assignment inside consistent regions by distilling the decoder’s smooth, coherent labels.

The MSE loss is chosen (rather than cross‑entropy) because its scale is naturally compatible with the reconstruction loss, and its bounded gradients improve robustness to imperfect pseudo‑labels early in training.
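
The two losses in Eq. 13 and Eq. 14 share one form, a masked squared distance to a one‑hot teacher label, so a single helper covers both. This NumPy sketch only evaluates the loss values; in an actual training loop the same expression would be written in an autodiff framework, with the hard labels treated as constants. Returning 0 for an empty region is an assumption.

```python
import numpy as np

def onehot(idx, S):
    """Map integer slot indices to one-hot rows of length S."""
    return np.eye(S)[idx]

def selective_mse(pred, target_idx, region):
    """pred: (T, P, S) soft map of the student; target_idx: (T, P) hard labels
    from the teacher; region: (T, P) boolean mask of selected patches."""
    if not region.any():
        return 0.0                      # assumption: an empty region contributes no loss
    diff = pred[region] - onehot(target_idx[region], pred.shape[-1])
    return float((diff ** 2).sum(axis=-1).mean())

# Toy maps with T=1, P=3, S=2.
D = np.array([[[0.5, 0.5], [1.0, 0.0], [0.2, 0.8]]])
A = np.array([[[0.9, 0.1], [0.6, 0.4], [0.0, 1.0]]])
s_hat_A = np.array([[0, 0, 1]])         # encoder labels (teacher at boundaries)
s_hat_D = D.argmax(axis=-1)             # decoder labels (teacher in interiors)
P_bd = np.array([[True, True, False]])
P_nbd = np.array([[True, False, True]])

L_bd = selective_mse(D, s_hat_A, P_bd)      # decoder learns from the encoder
L_nbd = selective_mse(A, s_hat_D, P_nbd)    # encoder learns from the decoder
```

Because every patch is compared only with its own label, the cost grows with the number of selected patches rather than with the number of patch pairs.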

Finally, these selective alignment losses are combined with the base objective $\mathcal{L}_{\mathrm{base}}$ (which includes the reconstruction loss and the temporal slot contrastive loss from SlotContrast). After a warm‑up phase spanning the first 30% of the total iterations $\eta_{\mathrm{t}}$ , the complete training objective is:

$ \mathcal{L} = \mathcal{L}_{\mathrm{base}} + \mathbb{I}[\eta > 0.3\,\eta_{\mathrm{t}}] \cdot \lambda_{\mathrm{SSync}} \big( \mathcal{L}_{\mathrm{bd}} + \mathcal{L}_{\mathrm{nbd}} \big), \qquad\qquad\text{(Eq. 15)} $

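where $\eta$ is the current training iteration, $\eta_{\mathrm{t}}$ is the total number of iterations, and $\lambda_{\mathrm{SSync}}$ is a balancing coefficient (set to 1.0 by default). The warm‑up ensures that the alignment starts only after the model has acquired a reasonably stable representation.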
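
The gating in Eq. 15 is a hard switch on the iteration counter. A minimal sketch, with hypothetical argument names and made‑up loss values:

```python
def ssync_objective(l_base, l_bd, l_nbd, step, total_steps,
                    lam=1.0, warmup_frac=0.3):
    """Eq. 15: add the selective alignment losses only after the warm-up phase."""
    gate = 1.0 if step > warmup_frac * total_steps else 0.0
    return l_base + gate * lam * (l_bd + l_nbd)

# During warm-up only the base loss is used; afterwards the alignment terms are added.
loss_early = ssync_objective(1.0, 0.2, 0.3, step=100, total_steps=1000)
loss_late = ssync_objective(1.0, 0.2, 0.3, step=301, total_steps=1000)
```

With these illustrative numbers the early loss equals the base loss alone, and the late loss adds the two alignment terms at full weight.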

Overall, the process is intentionally simple: no learnable projectors, no pairwise comparisons, no heavy post‑processing. The complexity is $\mathcal{O}(T\cdot P)$ (linear in the number of patches), a stark contrast to SRL’s $\mathcal{O}((T\cdot P)^2)$ .

5. Experimental Setup

5.1. Datasets

The paper evaluates on several standard VOCL benchmarks and also tests generalisation to images.

MOVi‑C (Kubric‑based synthetic videos): 1000 videos, each 24 frames, resolution $336\times 336$ . The scenes contain 3–10 objects with complex interactions. The main challenge is over‑fragmentation due to frequent occlusions and object interactions.
MOVi‑E (Kubric‑based synthetic videos): Same format but contains a large number of small‑scale objects (up to 23), making boundary precision critical.
YouTube‑VIS 2021 (real‑world video instance segmentation): 2885 training videos, 474 validation videos, resolution $518\times 518$ . The videos feature severe appearance variation, non‑rigid motion, and occlusions. Only a sparse set of foreground objects have ground‑truth masks; the rest are unannotated background or secondary objects.
RandSF.Q protocol (lower‑resolution variants of MOVi‑C, MOVi‑D, and High‑Quality YTVIS): Used to benchmark under a different training protocol (random slot‑feature pairs). Images resized to $224\times 224$ .
Image benchmarks: MOVi‑E (single‑frame) and COCO2017 (real‑world image dataset). These test spatial consistency without motion.

Sample data: A MOVi‑C video consists of a sequence like: a red cube and a yellow sphere moving on a textured floor against a sky background. The ground‑truth masks are provided per object per frame. YouTube‑VIS videos contain real‑world scenes; for example, a person walking a dog in a park, with pixel‑level masks for “person” and “dog” but not for trees or grass.

5.2. Evaluation Metrics

The paper primarily uses FG‑ARI and mBO, and in some tables also reports ARI and mIoU. For boundary analysis, it uses boundary F‑score.

5.2.1. Foreground Adjusted Rand Index (FG‑ARI)

Conceptual definition: FG‑ARI measures the clustering agreement between predicted slot assignments and ground‑truth object masks, considering only foreground pixels (i.e., pixels belonging to any object). It is the standard Adjusted Rand Index computed on the foreground after ignoring the background. Like ARI itself, it is permutation‑invariant (the slot numbering does not matter). Higher is better.
Mathematical formula: The Adjusted Rand Index (ARI) for a clustering $\mathcal{C}$ and a ground‑truth partition $\mathcal{G}$ is: $ \mathrm{ARI}(\mathcal{C},\mathcal{G}) = \frac{ \sum_{ij} \binom{n_{ij}}{2} - \big[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \big] / \binom{n}{2} } { \frac12 \big[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \big] - \big[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \big] / \binom{n}{2} } $ where $n_{ij}$ is the number of pixels simultaneously in cluster (slot) $i$ and ground‑truth object $j$, $a_i$ is the total number of pixels assigned to slot $i$, $b_j$ is the total number of pixels in ground‑truth object $j$, and $n$ is the total number of foreground pixels. FG‑ARI is computed by restricting $\mathcal{C}$ and $\mathcal{G}$ to foreground only. A value of 1 indicates perfect agreement and values near 0 indicate chance‑level agreement; ARI can be negative when agreement is worse than chance.
Symbol explanation: $i$ indexes predicted slots, $j$ indexes ground‑truth objects, $n_{ij}$ is the intersection count, $a_i = \sum_j n_{ij}$ , $b_j = \sum_i n_{ij}$ , $\binom{x}{2} = x(x-1)/2$ .
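
The formula can be checked on small inputs with a direct pure‑Python implementation. This is a sketch for illustration: it takes flat label lists, assumes at least two foreground pixels, and uses label 0 for background (a hypothetical convention). Returning 1.0 in the degenerate case where the denominator vanishes is also an assumption.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(pred, gt):
    """ARI between two flat label sequences of equal length (n >= 2)."""
    n = len(pred)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(pred, gt)).values())
    sum_a = sum(comb(c, 2) for c in Counter(pred).values())
    sum_b = sum(comb(c, 2) for c in Counter(gt).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0                      # degenerate case; convention assumed here
    return (sum_ij - expected) / (max_index - expected)

def fg_ari(pred, gt, background=0):
    """ARI restricted to pixels whose ground-truth label is not background."""
    keep = [(p, g) for p, g in zip(pred, gt) if g != background]
    return adjusted_rand_index([p for p, _ in keep], [g for _, g in keep])

# Slot ids differ from object ids, but the grouping is identical on the foreground.
perfect = fg_ari([9, 2, 2, 5, 5, 9], [0, 1, 1, 2, 2, 0])
# A grouping that systematically cuts across the objects scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The first example shows the permutation invariance (slot ids 2 and 5 versus object ids 1 and 2), and the second shows that ARI is not bounded below by 0.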

5.2.2. mean Best Overlap (mBO)

Conceptual definition: mBO evaluates how well each ground‑truth object is covered by the best‑matching predicted slot. For each GT object, we find the predicted slot with the highest Intersection‑over‑Union (IoU) and average these best IoU values over all objects. It ranges from 0 to 1; higher means better object coverage.
Mathematical formula: $ \mathrm{mBO} = \frac{1}{|\mathcal{O}|} \sum_{j\in\mathcal{O}} \max_{i} \frac{ |\mathrm{mask}_j \cap \mathrm{pred}_i| }{ |\mathrm{mask}_j \cup \mathrm{pred}_i| } $ where $\mathcal{O}$ is the set of ground‑truth objects (foreground).
Symbol explanation: $\mathrm{mask}_j$ is the binary mask for GT object $j$ , $\mathrm{pred}_i$ is the binarised mask for predicted slot $i$ (typically by argmax across slots per pixel).
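
A minimal sketch of mBO on integer label maps follows. It assumes label 0 marks background in the ground truth and computes overlaps on whatever array is passed in; whether masks are matched per frame or over the whole video is not specified here, so that choice is left to the caller.

```python
import numpy as np

def mean_best_overlap(pred_labels, gt_labels, background=0):
    """pred_labels, gt_labels: integer arrays of the same shape."""
    pred_ids = np.unique(pred_labels)
    best = []
    for j in np.unique(gt_labels):
        if j == background:
            continue                    # only foreground objects are scored
        gt_mask = gt_labels == j
        ious = [(gt_mask & (pred_labels == i)).sum() /
                (gt_mask | (pred_labels == i)).sum() for i in pred_ids]
        best.append(max(ious))          # best-matching slot for this object
    return float(np.mean(best)) if best else 0.0

gt = np.array([1, 1, 1, 1, 2, 2, 0, 0])
pred = np.array([3, 3, 3, 4, 4, 4, 5, 5])
mbo = mean_best_overlap(pred, gt)
```

In this toy case object 1 is best covered by slot 3 (IoU 3/4) and object 2 by slot 4 (IoU 2/3), so the score is the mean of those two values. Unlike a one‑to‑one matching, the same slot may be the best match for several objects.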

5.2.3. (Full‑scene) ARI and mIoU

ARI (all pixels): Same as FG‑ARI but including background as an additional class. Used in RandSF.Q experiments.
mIoU: Mean Intersection‑over‑Union computed per class (including background). Often used in the RandSF.Q protocol.

5.2.4. Boundary F‑score

Conceptual definition: Evaluates the quality of object boundaries by computing precision and recall of predicted boundary pixels against ground‑truth boundary pixels, within a small tolerance distance. It complements region‑based metrics like mBO.
Mathematical formula: For a tolerance $d$ , F‑score = $\frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$ where Precision = correctly predicted boundary pixels / total predicted boundary pixels, Recall = correctly predicted boundary pixels / total ground‑truth boundary pixels. Boundary pixels are obtained by contour extraction from the masks (e.g., morphological or gradient‑based edge detection such as Sobel). The paper uses the standard evaluation from the DAVIS benchmark.

5.3. Baselines

The paper compares against the following, which represent the evolutionary line of VOCL:

SAVi (ICLR 2022): early video slot attention with motion cues.
STEVE (NeurIPS 2022): transformer‑based video slot model.
VideoSAUR / VideoSAURv2 (NeurIPS 2023): real‑world video slot learning via feature similarity.
SlotContrast (CVPR 2025): temporal contrastive slot learning.
SRL (ICLR 2026): dense encoder‑decoder alignment with contrastive loss.
SlotCurri (CVPR 2026): curriculum‑based slot learning.
RandSF.Q variants (baseline using random slot‑feature queries) are also used to demonstrate plug‑and‑play capability.

All baselines are trained following their original settings, and SSync is integrated on top of the SlotContrast/SRL training pipeline for a fair comparison.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Video Benchmarks (Table 2)

The following are the results from Table 2 of the original paper:

Method	Venue	MOVi‑C		MOVi‑E		YouTube‑VIS
		FG‑ARI↑	mBO↑	FG‑ARI↑	mBO↑	FG‑ARI↑	mBO↑
SAVi	ICLR'22	22.2	13.6	42.8	16.0	-	-
STEVE	NeurIPS'22	36.1	26.5	50.6	26.6	15.0	19.1
VideoSAUR	NeurIPS'23	64.8	38.9	73.9	35.6	28.9	26.3
VideoSAURv2	NeurIPS'23	-	-	77.1	34.4	31.2	29.7
SlotContrast	CVPR'25	69.3	32.7	82.9	29.2	38.0	33.7
SRL	ICLR'26	74.3	34.5	81.9	29.3	42.9	35.6
SlotCurri	CVPR'26	77.6±0.9	32.8±0.2	83.7±0.2	28.9±0.7	44.8±1.2	35.5±2.2
SSync (Ours)		79.4±0.6	39.5±0.1	84.0±0.9	34.8±1.9	42.6±0.2	38.7±0.6

Analysis:

MOVi‑C: SSync achieves the highest FG‑ARI (79.4) and the highest mBO (39.5). The mBO margin is narrow over VideoSAUR (38.9) but large over the more recent methods (32.7–34.5 for SlotContrast, SRL, and SlotCurri), which is consistent with boundary refinement via $\mathcal{L}_{\mathrm{bd}}$ and interior denoising via $\mathcal{L}_{\mathrm{nbd}}$ together producing tighter object masks. The transitive merging further reduces fragmentation, helping both metrics.
MOVi‑E: SSync leads in FG‑ARI (84.0) and reaches an mBO of 34.8, second only to VideoSAUR (35.6) and well above SlotContrast, SRL, and SlotCurri (28.9–29.3). The mBO gain over these recent methods is notable because MOVi‑E contains many tiny objects where blurry boundaries severely hurt coverage, which suggests that SSync’s boundary‑refinement stage is beneficial.
YouTube‑VIS: SSync obtains competitive FG‑ARI (42.6, behind SlotCurri’s 44.8 and marginally behind SRL’s 42.9) but achieves the best mBO (38.7). This indicates that SSync, while slightly less accurate in foreground clustering (possibly due to the dataset’s sparse annotations), provides markedly better object coverage. The authors note that FG‑ARI underrepresents the model’s handling of unannotated objects; qualitative examples show SSync still gives coherent background and secondary object masks.

6.1.2. RandSF.Q Protocol (Table 3)

The following are the results from Table 3 of the original paper (full transcription with averaged values):

Method	MOVi‑C				MOVi‑D				HQ‑YTVIS
	ARI	FGARI	mBO	mIoU	ARI	FGARI	mBO	mIoU	ARI	FGARI	mBO	mIoU
RandSF.Qtsim	70.7	63.3	31.1	28.1	39.3	70.5	25.6	24.3	39.2	56.3	37.2	37.0
RandSF.Qtsim+SSync	73.0	67.3	32.3	29.9	45.5	71.2	27.6	25.7	42.9	58.6	39.4	39.2
RandSF.Qssc	52.7	67.8	24.2	22.1	37.3	85.8	28.0	26.8	40.7	57.2	38.3	37.8
RandSF.Qssc+SSync	55.5	71.4	25.1	23.1	39.4	86.5	27.9	26.8	48.6	57.5	42.1	41.9

Adding SSync as a plug‑in improves nearly all metrics across both base methods (RandSF.Qtsim and RandSF.Qssc) and all datasets. The only exceptions occur on MOVi‑D with RandSF.Qssc, where mBO dips slightly (28.0 to 27.9) and mIoU is unchanged (26.8). This indicates that SSync is not tied to a specific base objective; its selective alignment logic generalises.

6.1.3. Image Benchmarks (Table 4)

Image‑only results are shown in Table 4:

(a) Results on MOVi‑E (single‑frame)
Method	FG‑ARI↑
VideoSAUR	78.4
SOLV	80.8
SlotContrast	84.8
SlotCurri	84.9
SSync	86.0

(b) Results on COCO2017
Method	FG‑ARI	mBO
Baseline	40.5	28.8
SRL	42.8	29.4
SlotCurri	43.4	28.9
SSync	47.9	33.1

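SSync outperforms all listed baselines on both tasks: by 1.1 FG‑ARI points over SlotCurri on single‑frame MOVi‑E (86.0 vs. 84.9), and by 4.5 FG‑ARI and 3.7 mBO points over the best baseline on COCO2017 (47.9 vs. 43.4 and 33.1 vs. 29.4), where object appearances vary widely. This suggests that the encoder‑decoder synergy is not only about motion; the spatial selective distillation alone is effective.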

6.1.4. Memory Efficiency (Table 5)

The following are the results from Table 5 of the original paper:

Frames (T)	Batch Size per GPU
	32	64
T = 4	70 / 27	OOM / 59
T = 6	OOM / 48	OOM / 89
T = 8	OOM / 60	OOM / 93

Each entry shows “SRL memory / SSync memory” in GB on an NVIDIA RTX PRO 6000 Blackwell GPU (max 97 GB). OOM = Out‑Of‑Memory. SSync uses only ~27 GB at T=4, batch size 32, whereas SRL already requires 70 GB. At larger T or batch sizes, SRL fails, while SSync remains viable. This directly stems from the linear $\mathcal{O}(T\cdot H\cdot W)$ complexity of SSync versus SRL’s $\mathcal{O}((T\cdot H\cdot W)^2)$ .

6.2. Ablation Studies / Parameter Analysis

6.2.1. Component Ablation (Table 6)

The following are the results from Table 6 of the original paper:

Selected Components			MOVi‑C
L_bd	L_nbd	T.M. (transitive merging)	FG‑ARI	mBO
			69.0	30.6
✓			72.9	33.4
	✓		71.4	33.3
✓	✓		77.1	38.0
✓	✓	✓	79.4	39.5

Baseline (no SSync): 69.0 FG‑ARI, 30.6 mBO.
Adding only $\mathcal{L}_{\mathrm{bd}}$ (boundary supervision) boosts to 72.9/33.4, confirming that decoder boundaries benefit from sharp encoder cues.
Only $\mathcal{L}_{\mathrm{nbd}}$ (interior supervision) yields 71.4/33.3, showing that denoising the encoder assignment helps.
Combining both losses gives a large jump to 77.1/38.0, indicating they are complementary.
Adding transitive merging further improves to 79.4/39.5, suggesting that resolving slot redundancy stabilises the pseudo‑labels and prevents fragmentation from degrading the alignment signals.
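The complementarity claim can be checked with simple arithmetic on the Table 6 numbers. The sketch below compares the gain from using both losses with the sum of the two single‑loss gains; reading "complementary" as "more than additive" is this summary's interpretation, not a definition from the paper.

```python
# (FG-ARI, mBO) on MOVi-C, copied from Table 6 above.
table6 = {
    "baseline": (69.0, 30.6),
    "L_bd only": (72.9, 33.4),
    "L_nbd only": (71.4, 33.3),
    "L_bd + L_nbd": (77.1, 38.0),
    "L_bd + L_nbd + T.M.": (79.4, 39.5),
}


def gain(name):
    """Improvement of a configuration over the baseline, per metric."""
    base = table6["baseline"]
    return tuple(round(v - b, 1) for v, b in zip(table6[name], base))


bd = gain("L_bd only")
nbd = gain("L_nbd only")
both = gain("L_bd + L_nbd")
sum_individual = tuple(round(a + b, 1) for a, b in zip(bd, nbd))

print("L_bd only gain:      ", bd)
print("L_nbd only gain:     ", nbd)
print("sum of single gains: ", sum_individual)
print("joint gain:          ", both)
```

The joint gain exceeds the sum of the single‑loss gains on both metrics, so the two losses do more together than they do separately.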

6.2.2. Robustness to Number of Slots (Table 7)

The following are the results from Table 7 of the original paper:

Method	slot=7		slot=11		slot=15
	FG‑ARI↑	mBO↑	FG‑ARI↑	mBO↑	FG‑ARI↑	mBO↑
SlotContrast	74.9	27.9	69.3	32.7	61.8	31.2
SRL	76.5	31.6	74.3	34.5	72.8	31.1
SSync (Ours)	76.9	39.8	79.4	39.5	78.8	41.0

With 15 slots (well beyond the average number of objects), SlotContrast drops to 61.8 FG‑ARI and SRL degrades moderately to 72.8, while SSync remains stable at 78.8 FG‑ARI and even improves mBO to 41.0. The transitive merging consolidates superfluous, overlapping slots, making SSync much less sensitive to the chosen slot count, which is a practical advantage.

6.2.3. Transitive Merging Strategies (Table 8)

The following are the results from Table 8 of the original paper:

Strategy		FG‑ARI↑	mBO↑
(Ours default)		79.4	39.5
Varying criteria for merging
(a0)	D	77.5	38.7
(a1)	A ∨ D	78.6	39.8
(a2)	A ∧ D	78.9	39.1
(a3)	Avg(A, D)	79.2	40.0
Merging Range
(b)	Pairwise	78.2	39.3
Comparison to Slot Reg.
(c)	Slot Reg.	74.6	35.8

The main finding: building the overlap graph from either map or from their combinations gives similar results (77.5 to 79.4 FG‑ARI across the default and variants (a0) to (a3)), indicating the redundancy signal is intrinsic. Transitive merging over connected components outperforms pairwise merging (79.4 vs. 78.2 FG‑ARI), showing that global consolidation is needed. Standard slot regularisation (as in SRL) is clearly worse than the proposed merging strategy, at 74.6 FG‑ARI and 35.8 mBO.
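The idea of merging over connected components can be illustrated with a small union‑find sketch. This is a minimal illustration, not the authors' implementation: the overlap matrix is made up, and thresholding a generic overlap score at 0.7 stands in for the paper's spatio‑temporal activation consistency criterion.

```python
def transitive_merge(overlap, tau=0.7):
    """Group slots into connected components of the graph overlap[i][j] >= tau.

    Returns one representative index per slot (the smallest index in its group).
    """
    n = len(overlap)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if overlap[i][j] >= tau:
                ri, rj = find(i), find(j)
                if ri != rj:
                    # Keep the smaller index as the root of the merged group.
                    parent[max(ri, rj)] = min(ri, rj)

    return [find(i) for i in range(n)]


# Hypothetical symmetric overlap scores between 4 slots (illustrative only).
overlap = [
    [1.0, 0.8, 0.3, 0.1],
    [0.8, 1.0, 0.75, 0.0],
    [0.3, 0.75, 1.0, 0.2],
    [0.1, 0.0, 0.2, 1.0],
]
labels = transitive_merge(overlap)
print(labels)  # [0, 0, 0, 3]
```

Slots 0 and 2 overlap only 0.3 directly, below the threshold, yet they end up in the same group because both overlap strongly with slot 1. A rule that inspects each pair in isolation would leave that chain unresolved.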

6.2.4. Hyperparameter Sensitivity (Table 9)

The following are the results from Table 9 of the original paper:

(a) Impact of λSSync			(b) Impact of nbd			(c) Impact of nnbd			(d) Impact of τmerge
λSSync	FG‑ARI	mBO	nbd	FG‑ARI	mBO	nnbd	FG‑ARI	mBO	τmerge	FG‑ARI	mBO
0.1	74.1	37.8	1	79.4	39.5	1	79.4	39.5	0.65	78.3	40.1
0.5	79.1	39.0	2	78.9	39.3	2	78.9	41.2	0.7	79.4	39.5
1.0	79.4	39.5	3	78.2	39.4	3	78.8	40.1	0.75	79.1	39.5
1.2	78.9	40.2	4	76.6	36.7	4	76.7	36.6	0.8	78.2	39.0

SSync is relatively robust across a range of these hyperparameters. FG‑ARI peaks at $\lambda_{\mathrm{SSync}}=1.0$, $n_{\mathrm{bd}}=n_{\mathrm{nbd}}=1$, and $\tau_{\mathrm{merge}}=0.7$, with the clearest drops at $\lambda_{\mathrm{SSync}}=0.1$ (74.1) and at $n_{\mathrm{bd}}=4$ or $n_{\mathrm{nbd}}=4$ (76.6 and 76.7). Across the tested $\tau_{\mathrm{merge}}$ range of 0.65 to 0.8, FG‑ARI stays between 78.2 and 79.4, suggesting that transitive merging needs little tuning.

6.3. Deeper Analysis: Denoising, Deblurring, and Synergy

6.3.1. Fragmentation and Boundary Leakage (Table 10 and Table 11)

The following are the results from Table 10 of the original paper (fragmentation and leakage diagnostics):

Method	Encoder FCC8↓	Decoder Match90↑	Leak (×10³)↓
SlotContrast	33.20	639	98.95
SRL	21.03	684	86.26
SSync	8.79	702	72.02

SSync reduces encoder fragmentation (FCC₈) from 33.20 for SlotContrast and 21.03 for SRL to 8.79, close to the ground‑truth value of 6.27. At the same time, the decoder’s outside leakage (pixels spilling beyond the true object) drops from 98.95×10³ for SlotContrast to 72.02×10³, and more slots achieve ≥90% GT coverage (702, vs. 639 for SlotContrast and 684 for SRL). This quantitatively supports the dual effect: encoder denoising and decoder deblurring.

Additionally, boundary F‑score (Table 11) improves from 0.184 (SlotContrast) and 0.222 (SRL) to 0.255, consistent with sharper object contours.

The paper also tracks the evolution during training: the decoder’s outside leakage falls from 93.96×10³ at 30% iterations to 71.03×10³ at convergence, while the encoder’s FCC₈ drops from 40.75 to 8.79. Concurrently, the overlap between the boundary set $\mathcal{P}_{\mathrm{bd}}$ and the non‑boundary set $\mathcal{P}_{\mathrm{nbd}}$ (initially 0.277 IoU) rapidly decreases to 0.059, indicating that the model learns to properly separate boundary and interior supervisions.
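A fragmentation diagnostic of the FCC₈ kind can be sketched as a count of 8‑connected regions in a hard slot‑assignment map. This summary does not give the paper's exact definition, so the reading "number of 8‑connected components" is an assumption, and the two small maps below are invented for illustration.

```python
from collections import deque


def count_components_8(label_map):
    """Count 8-connected regions of equal label in a 2-D grid of slot indices."""
    h, w = len(label_map), len(label_map[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            # New region found: flood-fill it with BFS over 8 neighbours.
            count += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and not seen[ny][nx]
                                and label_map[ny][nx] == label_map[cy][cx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Hypothetical 3x4 assignment maps with two slots (0 and 1).
clean = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
noisy = [
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 0, 1, 1],
]
print(count_components_8(clean), count_components_8(noisy))  # 2 4
```

Two isolated mis‑assigned patches are enough to double the count in the noisy map, which is the kind of encoder noise a lower FCC₈ indicates has been removed.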

6.4. Qualitative Observations

The paper shows qualitative comparisons (Figures 3, 6–7). SSync consistently exhibits: (1) sharper object boundaries, (2) more coherent object‑interior assignments, and (3) temporally stable slot identities for background and unannotated objects. In contrast, SRL and SlotContrast often exhibit fragmented background slots or blurry contours. The boundary and interior selection visualisations (Figure 5) show that the detected boundary patches lie along true object transitions, even adapting to motion‑induced temporal boundaries.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper presents SSync, a selective synergy framework for video object‑centric learning. Its core innovation is the asymmetric selective cross‑distillation that exploits the encoder’s boundary sharpness and the decoder’s interior coherence to mutually refine each map, strictly limiting supervision to reliable regions. A transitive pseudo‑label merging mechanism further resolves slot redundancy dynamically. The method achieves state‑of‑the‑art decomposition quality, reduces memory complexity from quadratic to linear, requires no additional parameters, and is robust to hyperparameter choices (especially the number of slots). It works as a plug‑and‑play loss on top of existing slot‑based models.

7.2. Limitations & Future Work (as discussed in the paper)

Early‑frame under‑fragmentation: When two objects start very close or have similar appearance, they may be initially merged. SSync relies partly on motion cues; if objects separate only later, the model may take a few frames to split them. Future work could incorporate bidirectional temporal modelling (e.g., using future frames).
Part‑level over‑fragmentation of large objects: For large objects with strong intra‑object appearance variation (e.g., a truck with cabin and container), SSync can still split them into separate slots because spatio‑temporal overlap alone may not be enough. This suggests a need for stronger part‑to‑whole grouping priors, potentially through higher‑level semantic reasoning.
Threshold sensitivity for real‑world videos: Although the merging threshold $\tau_{\mathrm{merge}}$ is shown to be robust around 0.7, in extremely complex real‑world scenes the optimal value might shift. However, the authors note that even without dataset‑specific tuning, performance remains competitive.

7.3. Personal Insights & Critique

Innovation and Elegance: The shift from dense to selective alignment is conceptually simple yet highly effective. By clearly characterising the spatial reliability of the two maps, the authors avoid the common trap of “align everything” and instead design a targeted, interpretable mechanism. The transitive merging elegantly solves the often‑ignored problem of pseudo‑label instability in self‑supervised object grouping.
Plug‑and‑Play Nature: Because SSync only adds a loss term and requires no architectural changes, it can be retrofitted into many existing video (and even image) slot‑attention frameworks. This greatly increases its practical impact.
Scalability: The $\mathcal{O}(T\cdot P)$ complexity is a major win. Many recent self‑supervised video methods struggle with high‑resolution or long sequences; SSync’s linear scaling makes it feasible where SRL fails.
Potential Weaknesses: The method fundamentally relies on the assumption that encoder boundaries are sharp and decoder interiors are smooth. In situations where the encoder itself cannot capture sharp boundaries (e.g., with very low‑resolution features or heavy occlusion), the boundary supervision may be less effective. Similarly, if the decoder over‑smooths interiors beyond recognition, the interior pseudo‑labels could be wrong. However, the experiments show robustness.
Transferability: The principle of selective synergy could be applied to other multi‑branch self‑supervised systems where asymmetric reliability exists, e.g., multi‑modality alignment (RGB vs. depth), temporal prediction vs. reconstruction, etc. This opens a broader research direction: identifying and exploiting module‑specific reliability gradients.
Critique on Metrics: FG‑ARI and mBO, while standard, have known biases. For instance, mBO scores only the best‑matching predicted mask for each ground‑truth object, so surplus predictions go unpenalised, and FG‑ARI ignores background pixels while heavily penalising over‑fragmentation of foreground objects. SSync’s high mBO but sometimes slightly lower FG‑ARI on YouTube‑VIS might indicate a tendency to produce slightly larger masks that better cover objects but occasionally mix adjacent ones. The qualitative results partially mitigate this concern, but future work could consider metrics that equally weigh precision and recall, e.g., Panoptic Quality.
Conclusion on the Paper: Overall, SSync is a well‑motivated, thoroughly validated, and practical contribution to the field of object‑centric learning. Its emphasis on error‑resistant, low‑complexity alignment sets a new baseline for future research in video (and image) decomposition.
