1. Bibliographic Information

1.1. Title

The central topic of the paper is "ViT-Up: Faithful Feature Upsampling for Vision Transformers". This title highlights the method's name (ViT-Up) and its primary function: performing faithful upsampling of features generated by Vision Transformers (ViT).

1.2. Authors

The authors are Krispin Wandel, Jingchuan Wang, and Hesheng Wang.

Affiliation: All authors are associated with Shanghai Jiao Tong University, Shanghai, China.
Research Background: Based on the affiliations and the topic (computer vision, deep learning), the authors are likely researchers in the fields of Computer Vision and Pattern Recognition, focusing on visual representation learning and Vision Transformers.

1.3. Journal/Conference

The paper is currently an arXiv preprint.

Status: It was published on arXiv on June 12, 2026. As a preprint, it has not yet undergone the full peer-review process of a top-tier conference like CVPR, ICCV, or NeurIPS, though it is structured as a full research paper.
Influence: arXiv is the primary repository for pre-publication research in computer science and physics, making it the standard venue for disseminating cutting-edge work before formal conference acceptance.

1.4. Publication Year

The paper was published in 2026 (specifically June 12, 2026).

1.5. Abstract

The paper introduces ViT-Up, an implicit feature upsampling framework designed for Vision Transformers.

Objective: To address the bottleneck where ViTs operate on small patch-token grids due to computational costs, limiting their use in dense prediction tasks like segmentation and depth estimation.
Methodology: ViT-Up replaces external image guidance (used in prior works) with layer-wise query construction derived from intermediate ViT hidden states. This allows for feature prediction at arbitrary continuous image coordinates while maintaining alignment with the backbone feature space.
Results: ViT-Up consistently outperforms state-of-the-art image-guided upsamplers (like NAF and UpLiFT) on dense prediction and semantic correspondence tasks. For instance, on the DINOv3-S+ backbone, it improves mIoU on Cityscapes by up to +2.07 and PCK@0.10 on SPair-71k by +4.17. Gains are even larger with the DINOv3-B backbone, demonstrating favorable scaling.

1.6. Original Source Link

ArXiv Link: https://arxiv.org/abs/2606.14024
PDF Link: https://arxiv.org/pdf/2606.14024v1
Publication Status: Preprint.

2. Executive Summary

2.1. Background & Motivation

Core Problem: Vision Transformers (ViT) have become the dominant architecture for visual representation learning. However, due to the quadratic computational cost of global self-attention, they are typically operated on relatively low-resolution patch-token grids (e.g., $14 \times 14$ or $28 \times 28$). This low resolution creates a bottleneck for dense prediction tasks (like semantic segmentation and monocular depth estimation) which require pixel-level precision.
Importance & Challenges: Simply increasing the input resolution is computationally expensive and can degrade feature quality by moving the backbone out-of-distribution. While task-specific decoders exist, they undermine the efficiency of foundation models. Recent "task-agnostic" upsamplers (e.g., JAFAR, NAF) rely on shallow image encoders for guidance. The authors identify a critical flaw in this approach: the shallow encoders lack the semantic capacity to distinguish visually similar but semantically distinct regions, leading to feature leakage, fragmentation, and blur.
Innovative Idea: The paper proposes shifting away from external image guidance. Instead, it leverages the intrinsic hierarchical structure of the ViT itself. Shallow layers in a ViT retain high-resolution spatial details, while deeper layers hold global semantics. By utilizing these intermediate hidden states directly, the method can perform upsampling that is natively aligned with the model's learned representation.

2.2. Main Contributions / Findings

ViT-Up Framework: A task-agnostic, implicit feature upsampling method that treats upsampling as coordinate-conditioned feature prediction. It constructs dense query features layer-by-layer from intermediate ViT representations, preserving the backbone's feature-space structure and mitigating artifacts like leakage.
Multi-scale Supervision Strategy: A training strategy using a student-teacher distillation setup. A frozen teacher ViT provides multi-scale feature targets, while a student (with a LoRA-adapted backbone) learns to recover fine details from downscaled images using feature reconstruction and relational losses.
Performance & Scaling: ViT-Up consistently outperforms state-of-the-art baselines (NAF, UpLiFT, AnyUp, JAFAR) on semantic segmentation, depth estimation, and semantic correspondence. Crucially, unlike image-guided methods bottlenecked by fixed shallow encoders, ViT-Up scales effectively with larger backbone capacities (e.g., DINOv3-B), showing significant performance gains as the backbone grows.

3. Background & Related Work

3.1. Foundational Concepts

To understand this paper, one must grasp several core concepts in deep learning and computer vision:

Vision Transformer (ViT): Unlike Convolutional Neural Networks (CNN) that process images using sliding windows, a ViT splits an image into fixed-size non-overlapping patches (e.g., $16 \times 16$ pixels). Each patch is flattened into a vector and linearly embedded to create a "patch token". These tokens are processed by a standard Transformer encoder, which relies on Self-Attention.
Self-Attention: The mechanism that allows the model to weigh the importance of different parts of the input sequence. The standard formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Here, $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input tokens, and $d_k$ is the dimension of the keys. This operation has quadratic complexity $O(N^2)$ with respect to the number of tokens $N$, which is why ViTs are often limited to low resolutions (small $N$).
Dense Prediction: Tasks that require a prediction for every pixel in an image, such as Semantic Segmentation (classifying each pixel) and Monocular Depth Estimation (predicting distance for each pixel from a single image).
Feature Upsampling: The process of increasing the spatial resolution of a feature map. Simple methods like bilinear interpolation lack semantic fidelity. Advanced methods try to predict high-frequency details.
Implicit Neural Representations: A paradigm where a neural network (usually an MLP) represents a signal as a continuous function of coordinates (e.g., mapping (x, y) coordinates to RGB color or feature vectors). This allows querying the signal at arbitrary resolutions.
LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning technique. Instead of updating all weights in a large matrix $W$, it adds trainable low-rank decomposition matrices $A$ and $B$ such that $W_{\text{new}} = W + BA$. This allows adapting large models with significantly fewer trainable parameters.
FiLM (Feature-wise Linear Modulation): A technique to condition a neural network on auxiliary information. It applies an affine transformation to features: $\gamma \odot x + \beta$, where $\gamma$ (scale) and $\beta$ (shift) are generated from the conditioning input.
RoPE (Rotary Position Embedding): A positional encoding scheme that rotates the query and key vectors through position-dependent angles, so that their dot product depends on the relative offset between the two positions. This lets the attention mechanism distinguish tokens by their relative position.
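
To make the quadratic cost of global self-attention concrete, the following sketch counts patch tokens and pairwise attention scores for a few square input sizes. The patch size of 16 and the function names are illustrative assumptions, and special tokens such as a class token are ignored.

```python
def token_grid(height: int, width: int, patch: int = 16) -> tuple[int, int]:
    """Return the (h, w) patch-token grid, assuming non-overlapping square patches."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return height // patch, width // patch


def attention_pairs(height: int, width: int, patch: int = 16) -> int:
    """Number of query-key score entries in one global self-attention map (N^2)."""
    h, w = token_grid(height, width, patch)
    n = h * w  # number of patch tokens N (class/register tokens ignored)
    return n * n


for side in (224, 448, 896):
    h, w = token_grid(side, side)
    pairs = attention_pairs(side, side)
    print(f"{side}x{side} image -> {h}x{w} tokens, {pairs:,} score entries")
```

Doubling the image side quadruples the token count and multiplies the number of score entries by 16, which is the cost that motivates upsampling features instead of enlarging the input.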

3.2. Previous Works

The paper categorizes prior work into four main areas:

Task-Dependent Feature Upsamplers: Methods like DPT (Dense Prediction Transformer) and ViT-Det reassemble intermediate ViT states into multi-scale features for specific tasks. CARAFE and DySample use dynamic kernels or sampling offsets. These are task-specific, unlike the task-agnostic goal of this paper.
Image Super-Resolution & Local Implicit Functions: LIIF (Local Implicit Image Function) treats super-resolution as a continuous coordinate decoding problem. It assumes local latent codes contain enough info for reconstruction. The authors argue this assumption might break for high-dimensional semantic features where information is bottlenecked in a patch token.
Implicit Feature Upsampling: FeatUp fits an implicit field per image (slow). LoftUp is a learnable alternative using cross-attention where high-res queries attend to all low-res tokens. The authors note LoftUp's shallow projection of RGB to semantic space may cause incorrect attention (leakage).
Guided Feature Upsampling: Methods like JAFAR, AnyUp, NAF, and UpLiFT use a separate, high-resolution image encoder to guide the upsampling of low-res features. While producing sharp edges, the authors critique their reliance on shallow encoders which lack the semantic depth to prevent feature leakage between similar-looking objects.

3.3. Technological Evolution

The field has evolved from simple interpolation (bilinear) to dynamic, learned upsampling (CARAFE) and then to foundation-model era approaches. Initially, upsampling was part of the task decoder (e.g., U-Net). With the rise of ViT backbones, the focus shifted to "task-agnostic" upsampling to reuse expensive backbone features. Early attempts used external guidance (images), but the trend is moving towards leveraging the internal richness of the foundation models themselves (e.g., LoftUp, and now ViT-Up).

3.4. Differentiation Analysis

The core difference between ViT-Up and prior state-of-the-art methods (like NAF, UpLiFT) is the source of guidance.

Prior Work: Uses a separate, shallow image encoder (e.g., a small CNN) to guide the upsampling. This is "backbone-agnostic" but semantically weak, leading to leakage.
ViT-Up: Is specialized for ViTs. It uses the intermediate hidden states of the ViT backbone itself as the source of guidance. By constructing queries layer-by-layer through the ViT hierarchy, it maintains alignment with the backbone's semantic space and exploits the rich multi-scale information already computed by the backbone, avoiding the semantic weakness of shallow external encoders.

4. Methodology

4.1. Principles

The core principle of ViT-Up is to treat feature upsampling as a coordinate-conditioned implicit feature decoding problem. Instead of processing a grid, the model takes a continuous coordinate $x_q$ as input and predicts the feature vector at that exact location. It does this by traversing the ViT's layer hierarchy: it initializes a query embedding and progressively refines it using information from specific intermediate layers of the frozen ViT backbone. This approach ensures the output features are "faithful" to the backbone's representation space.

4.2. Core Methodology In-depth (Layer by Layer)

Problem Formulation

Given a Vision Transformer with $L$ layers and patch size $p$, processing an image of size $(H, W)$ yields low-resolution patch tokens $H_l \in \mathbb{R}^{h \times w \times C}$, where $h=H/p$ and $w=W/p$. The goal is to obtain substantially denser last hidden states $H_L^{\text{up}} \in \mathbb{R}^{h^* \times w^* \times C}$ where $h^*, w^* \gg h, w$, while preserving the feature space structure.

Architecture Overview

The architecture consists of three main stages: Query Embedding, ViT-Up Blocks (iterative refinement), and a final Decoder.

Figure 1: Overview of the ViT-Up framework. It illustrates the process of taking an input image and a query coordinate to predict a high-resolution feature, which can be applied to tasks like segmentation and depth estimation.

Step 1: Query Embedding

The first step is to create an initial query embedding $q_0$ for a given continuous coordinate $x_q$. ViT-Up reuses the patch embedding layer of the ViT backbone (typically a convolution with kernel size $p$ and stride $p$).

Process: The input image is resized so that the resulting patch-token grid has a resolution of $224 \times 224$. This high-resolution grid is computed once and cached.
Interpolation: To get the feature at an arbitrary continuous coordinate $x_q$, the model performs bilinear interpolation on this cached $224 \times 224$ grid at the location $x_q$.
Result: This yields the initial query embedding $q_0 \in \mathbb{R}^C$. This keeps the query aligned with the backbone's patch-embedding space.
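
The interpolation step can be illustrated with a minimal pure-Python sketch of bilinear sampling at a continuous coordinate. The coordinate convention (grid units, with x indexing columns and y indexing rows), the border clamping, and the toy grid are assumptions made for illustration; the paper's implementation may differ.

```python
import math


def bilinear_sample(grid, x, y):
    """Bilinearly interpolate a feature vector at continuous grid coordinate (x, y).

    grid[row][col] is a list of C floats; x indexes columns and y indexes rows,
    both in grid units (an assumed convention). Coordinates are clamped to the grid.
    """
    rows, cols = len(grid), len(grid[0])
    x = min(max(x, 0.0), cols - 1)
    y = min(max(y, 0.0), rows - 1)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    tx, ty = x - x0, y - y0  # fractional position inside the cell
    channels = len(grid[0][0])
    return [
        (1 - ty) * ((1 - tx) * grid[y0][x0][c] + tx * grid[y0][x1][c])
        + ty * ((1 - tx) * grid[y1][x0][c] + tx * grid[y1][x1][c])
        for c in range(channels)
    ]


# Toy 2x2 grid with 2 channels; the cached grid described above is far larger.
toy = [[[0.0, 10.0], [1.0, 10.0]],
       [[2.0, 20.0], [3.0, 20.0]]]
q0 = bilinear_sample(toy, 0.5, 0.5)  # query at the center of the four cells
```

Because the grid is computed once and cached, each additional query coordinate costs only a four-neighbor lookup and a weighted sum.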

Step 2: ViT-Up Blocks

The initial query $q_0$ is passed through a sequence of $T$ ViT-Up blocks (denoted as $U_t$). In the main configuration, $T=6$, and these blocks consume hidden states from specific backbone layers (e.g., layers 2, 4, 6, 8, 10, 12).

Each block $U_t$ takes the previous query $q_{t-1}$, the coordinate $x_q$, and the low-resolution hidden state $H_{l[t]}$ from a specific backbone layer as input. The update rule is: $ q_t = U_t(q_{t-1}, x_q, H_{l[t]}), \quad t=1, \ldots, T $
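
The update rule amounts to a simple loop over blocks, sketched here with toy stand-ins. The names `refine_query` and `make_toy_block`, and the scalar "hidden states", are hypothetical and only show the control flow; real blocks operate on feature vectors and token grids.

```python
def refine_query(q0, coord, hidden_states, blocks, layer_ids):
    """Apply q_t = U_t(q_{t-1}, x_q, H_{l[t]}) for t = 1..T and return q_T."""
    if len(blocks) != len(layer_ids):
        raise ValueError("need one backbone layer index per block")
    q = q0
    for block, layer in zip(blocks, layer_ids):
        q = block(q, coord, hidden_states[layer])
    return q


def make_toy_block(weight):
    # Stand-in for a ViT-Up block: mixes the query with a scalar "hidden state".
    def block(q, coord, hidden):
        return q + weight * hidden
    return block


layer_ids = [2, 4, 6, 8, 10, 12]  # the example layers from the main configuration
hidden_states = {layer: float(layer) for layer in range(1, 13)}
blocks = [make_toy_block(0.5) for _ in layer_ids]
q_final = refine_query(0.0, (10.3, 20.7), hidden_states, blocks, layer_ids)
```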

Inside each block, four operations occur sequentially:

2.1 Transition MLP

Since the block might skip layers in the backbone, the query feature space must be aligned to the current backbone layer. A transition MLP with a residual connection is applied: $ x = q_{t-1} + \mathrm{MLP}_{\mathrm{transition}}(\mathrm{LN}(q_{t-1})) $ Here, $\mathrm{LN}$ denotes Layer Normalization. This adapts the query to the semantic level of the current backbone layer.

2.2 Cross-Window Attention

The query aggregates context from the spatial backbone tokens $H_{l[t]}$ using a cross-window multi-head attention mechanism. First, the query and backbone tokens are normalized: $ \hat{x} = \mathrm{LN}_Q(x), \quad \hat{H}_{l[t]} = \mathrm{LN}_{KV}(H_{l[t]}) $ Then, the attention output is computed. Crucially, the queries and keys are modulated with continuous 2D RoPE (Rotary Position Embedding). The query rotation $R_q$ is evaluated at the continuous coordinate $x_q$, while key rotations $R_{\mathbf{X}}$ are evaluated at the discrete patch-token centers. $ z^{\mathrm{attn}} = \mathrm{CW}\text{-}\mathrm{MHA}\left(R_q W_Q \hat{x}, R_{\mathbf{X}} W_K \hat{H}_{l[t]}, W_V \hat{H}_{l[t]}\right) $

$W_Q, W_K, W_V$ are projection weights.
$\mathrm{CW}\text{-}\mathrm{MHA}$ stands for Cross-Window Multi-Head Attention, which restricts attention to a local window around the query for efficiency.
The output is then projected: $ x^{\mathrm{attn}} = W_O z^{\mathrm{attn}} $
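
The reason rotary embeddings work at continuous coordinates can be checked numerically: the rotation angle is a smooth function of position, and the query-key score depends only on the offset between the two positions. The sketch below is a generic 2D RoPE, not the paper's exact parameterization; the channel split (first half for x, second half for y) and the frequency schedule `base` are assumptions.

```python
import math


def rope_2d(vec, x, y, base=100.0):
    """Rotate consecutive channel pairs of `vec` by angles proportional to (x, y).

    The first half of the channels is rotated by the x coordinate and the second
    half by the y coordinate. The frequency schedule (`base`) is an assumption.
    """
    d = len(vec)
    if d % 4:
        raise ValueError("dimension must be divisible by 4")
    half = d // 2
    n_pairs = half // 2
    out = list(vec)
    for start, coord in ((0, x), (half, y)):
        for i in range(n_pairs):
            angle = coord * base ** (-i / n_pairs)
            a, b = vec[start + 2 * i], vec[start + 2 * i + 1]
            out[start + 2 * i] = a * math.cos(angle) - b * math.sin(angle)
            out[start + 2 * i + 1] = a * math.sin(angle) + b * math.cos(angle)
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 0.5, 1.1, -0.4, 0.2, 0.9]
k = [1.0, 0.1, -0.6, 0.8, -0.3, 0.5, 0.4, -1.0]

# Query at a continuous coordinate, key at a patch-token center (grid units).
score = dot(rope_2d(q, 3.25, 1.6), rope_2d(k, 3.5, 1.5))
# Shifting both positions by the same amount leaves the score unchanged.
shifted = dot(rope_2d(q, 13.25, 6.6), rope_2d(k, 13.5, 6.5))
```

Since nothing in the rotation requires integer positions, the same keys can be reused for any query location between patch centers.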

2.3 FeatX (Sub-Token Detail Extraction)

While cross-attention gathers token-level context, it can blur high-frequency details. To recover sub-token details (information within a patch), ViT-Up uses FeatX. Let $\mathbf{X}$ be the grid of patch-token center coordinates. The algorithm finds the index $k_{\mathrm{nn}}$ of the patch whose center is closest to the query coordinate $x_q$. The nearest neighbor's coordinate and feature are: $ x_{\mathrm{nn}} = \mathbf{X}[k_{\mathrm{nn}}], \quad h_{\mathrm{nn}} = H_{l[t]}[k_{\mathrm{nn}}] $ It computes the relative offset between the query and the patch center, normalized by patch size $p$: $ \Delta x = (x_q - x_{\mathrm{nn}}) / p $ This offset is embedded using a sinusoidal positional encoding $E_{\text{pos}}$: $ p_{\Delta x} = E_{\text{pos}}(\Delta x) \in \mathbb{R}^{64} $ An MLP predicts FiLM parameters (scale $\gamma$ and shift $\beta$) from this offset encoding: $ (\gamma, \beta) = \mathrm{MLP}_{\text{FiLM}}(p_{\Delta x}) $ These parameters modulate the normalized nearest-neighbor feature: $ \bar{h}_{\mathrm{nn}} = (1 + \gamma) \odot \mathrm{LN}(h_{\mathrm{nn}}) + \beta $ Finally, another MLP extracts the sub-token feature: $ x^{\text{sub-token}} = \mathrm{MLP}_{\text{sub-token}}(\bar{h}_{\mathrm{nn}}) $
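
The non-learned parts of FeatX (nearest-patch lookup, normalized offset, sinusoidal encoding, FiLM modulation) can be sketched in a few lines. This is an illustrative sketch under stated assumptions: coordinates are in pixels, patch centers sit at `(index + 0.5) * patch`, the encoding frequencies are a guess, and the two MLPs are omitted, so `gamma` and `beta` are passed in directly.

```python
import math


def nearest_patch(xq, yq, patch, grid_h, grid_w):
    """Return (row, col) of the patch whose center is closest to pixel (xq, yq)."""
    col = min(max(int(xq // patch), 0), grid_w - 1)
    row = min(max(int(yq // patch), 0), grid_h - 1)
    return row, col


def relative_offset(xq, yq, row, col, patch):
    """Offset from the patch center to the query, in units of the patch size."""
    cx, cy = (col + 0.5) * patch, (row + 0.5) * patch
    return (xq - cx) / patch, (yq - cy) / patch


def sinusoidal_encoding(dx, dy, dim=64):
    """Encode a 2D offset with sin/cos pairs; the frequency choice is an assumption."""
    n_freq = dim // 4  # sin and cos for each of the two axes
    enc = []
    for value in (dx, dy):
        for i in range(n_freq):
            freq = (2.0 ** i) * math.pi
            enc.extend((math.sin(freq * value), math.cos(freq * value)))
    return enc


def layer_norm(h, eps=1e-6):
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    return [(v - mean) / math.sqrt(var + eps) for v in h]


def film(h, gamma, beta):
    """Apply (1 + gamma) * LN(h) + beta channel-wise."""
    return [(1.0 + g) * v + b for v, g, b in zip(layer_norm(h), gamma, beta)]


row, col = nearest_patch(37.0, 20.0, patch=16, grid_h=14, grid_w=14)
dx, dy = relative_offset(37.0, 20.0, row, col, patch=16)
code = sinusoidal_encoding(dx, dy)  # would be fed to the FiLM MLP
```

The `(1 + gamma)` form means that zero FiLM outputs leave the normalized token unchanged, so the modulation starts from an identity-like mapping.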

Figure 2: Detailed architecture of ViT-Up. (B) shows the query embedding initialization. (C) details the ViT-Up block combining transition, cross-attention, and FeatX. (D) illustrates the FeatX mechanism for extracting sub-token details using relative offsets and FiLM modulation.

2.4 Fusion MLP

The transition output, attention output, and sub-token output are summed: $ x^{\text{fused}} = x + x^{\mathrm{attn}} + x^{\text{sub-token}} $ A final fusion MLP produces the query representation for the next stage: $ q_t = x^{\text{fused}} + \mathrm{MLP}_{\text{fusion}}(\mathrm{LN}(x^{\text{fused}})) $

Step 3: Decoder

After passing through all $T$ blocks, the final query representation $q_T$ is mapped back to the ViT feature space using a decoder $D_T$ (a single-layer MLP with LayerNorm and linear projection) to produce the output feature $o_T^q$.

Backbone Adaptation

To provide extra capacity without full fine-tuning, the low-resolution backbone is adapted using LoRA. For a linear projection $W$, the adapted projection is: $ W_{\text{adapted}} = W + \frac{\alpha}{r} BA $ where $A$ and $B$ are low-rank matrices (rank $r$), and $\alpha$ is the scale. This is applied to patch embeddings and attention projections.
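
A small worked example of the adapted projection, written with plain lists so the arithmetic is visible. The matrix sizes, the rank, and the value of $\alpha$ are arbitrary illustrative choices, not the paper's settings.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A for W (out x in), B (out x rank), A (rank x in)."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def param_counts(d_out, d_in, rank):
    """Trainable parameters: full fine-tuning of W versus the two LoRA factors."""
    return d_out * d_in, rank * (d_out + d_in)


w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen 2 x 3 weight
a = [[0.5, -0.5, 1.0]]                    # rank-1 factor A: 1 x 3
b_zero = [[0.0], [0.0]]                   # B initialized to zero, as in standard LoRA
b = [[2.0], [-1.0]]                       # B after some hypothetical training

unchanged = lora_weight(w, a, b_zero, alpha=8, rank=1)
adapted = lora_weight(w, a, b, alpha=2, rank=1)
full, lora = param_counts(768, 768, rank=8)  # illustrative sizes
```

With `B` at zero the adapted weight equals the frozen one, so training starts from the original backbone; for a 768 x 768 projection at rank 8 the factors hold 12,288 parameters instead of 589,824.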

4.3. Training Strategy

Multi-scale Feature Supervision

Training is challenging because obtaining high-resolution ground truth features is expensive. ViT-Up uses a student-teacher distillation strategy.

Teacher: A frozen ViT processes the training image at multiple resolutions (e.g., $224^2, 448^2, 896^2$), producing feature targets at token grids $14 \times 14$, $28 \times 28$, and $56 \times 56$.
Student: Receives a downscaled version of the image (scale factor $s \sim \mathcal{U}(0.1, 1.0)$) pasted into a fixed $448 \times 448$ canvas. It samples a $56 \times 56$ grid of query coordinates over the visible image region.
Prediction: The student predicts features at these coordinates, producing a dense map $\hat{H}^{56}_t$.
Multi-scale Comparison: The student's predictions are supervised at the finest teacher resolution ($56 \times 56$) and also after average-pooling to coarser resolutions ($28 \times 28, 14 \times 14$). This forces the student to be consistent across scales.
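
The pooling step that produces the coarser comparison targets can be sketched as repeated 2x2 average pooling. For brevity the example pools a single channel of scalars; the function names are hypothetical and the toy values stand in for real student predictions.

```python
def avg_pool2x(grid):
    """Average-pool a 2D grid of scalars with a 2x2 window and stride 2."""
    rows, cols = len(grid), len(grid[0])
    if rows % 2 or cols % 2:
        raise ValueError("grid sides must be even")
    return [
        [(grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
         for c in range(0, cols, 2)]
        for r in range(0, rows, 2)
    ]


def pyramid(grid, levels=3):
    """Return [finest, ..., coarsest] by repeated 2x2 average pooling."""
    out = [grid]
    for _ in range(levels - 1):
        out.append(avg_pool2x(out[-1]))
    return out


# One channel of a toy 56x56 student prediction.
pred56 = [[float(r * 56 + c) for c in range(56)] for r in range(56)]
p56, p28, p14 = pyramid(pred56)
```

Each pooled map is then compared against the teacher features computed at the matching input resolution, so a single dense prediction is checked at three scales.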

Losses

The model is optimized using three complementary losses.

Target-Normalized L2 Loss: This loss normalizes the target feature vector $\mathbf{f}$ to have zero mean and unit variance before computing the L2 distance. Let $\mu(\mathbf{f})$ and $\sigma(\mathbf{f})$ be the mean and std dev of $\mathbf{f}$. $ \mathcal{L}_{\text{L2}} = \left\| \frac{\hat{\mathbf{f}} - \mu(\mathbf{f})}{\sigma(\mathbf{f})} - \frac{\mathbf{f} - \mu(\mathbf{f})}{\sigma(\mathbf{f})} \right\|_2^2 $
Cosine Similarity Loss: Encourages angular alignment. $ \mathcal{L}_{\text{cos}} = 1 - \frac{\hat{\mathbf{f}}^\top \mathbf{f}}{\|\hat{\mathbf{f}}\|_2 \|\mathbf{f}\|_2 + \epsilon} $
Relational KL Loss: Preserves the pairwise structure of the feature space. It computes the pairwise similarity matrices $S$ (teacher) and $\hat{S}$ (student) using L2-normalized features and a temperature $\tau$, then minimizes the KL divergence between their softmax distributions. $ \mathcal{L}_{\mathrm{rel}} = \mathrm{KL}\left(\mathrm{softmax}(S) \,\|\, \mathrm{softmax}(\hat{S})\right) $

The total loss is a weighted sum over all selected layers and resolutions: $ \mathcal{L} = \sum_{\ell \in \mathcal{L}} \sum_{n \in \mathcal{N}} \left( \lambda_{\mathrm{L2}}\mathcal{L}_{\mathrm{L2}}^{\ell, n} + \lambda_{\mathrm{cos}}\mathcal{L}_{\mathrm{cos}}^{\ell, n} + \lambda_{\mathrm{rel}}\mathcal{L}_{\mathrm{rel}}^{\ell, n} \right) $
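
The three losses can be written out for small lists of feature vectors as follows. This is a sketch under assumptions the summary does not settle: the softmax is taken row-wise over the pairwise similarity matrix, the KL term is averaged over rows, and the temperature and epsilon values are placeholders.

```python
import math


def normalized_l2(pred, target, eps=1e-6):
    """Squared L2 distance after standardizing both vectors with the target's mean/std."""
    mu = sum(target) / len(target)
    sigma = math.sqrt(sum((t - mu) ** 2 for t in target) / len(target)) + eps
    return sum(((p - mu) / sigma - (t - mu) / sigma) ** 2 for p, t in zip(pred, target))


def cosine_loss(pred, target, eps=1e-8):
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return 1.0 - dot / (norm + eps)


def _unit(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def _softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_rows(feats, tau):
    """Row-wise softmax over pairwise cosine similarities divided by temperature tau."""
    units = [_unit(f) for f in feats]
    return [_softmax([sum(a * b for a, b in zip(u, v)) / tau for v in units]) for u in units]


def relational_kl(student, teacher, tau=0.1):
    """Mean over rows of KL(softmax(S) || softmax(S_hat)), with S from the teacher."""
    p_rows = similarity_rows(teacher, tau)
    q_rows = similarity_rows(student, tau)
    total = 0.0
    for p_row, q_row in zip(p_rows, q_rows):
        total += sum(p * math.log(p / q) for p, q in zip(p_row, q_row) if p > 0.0)
    return total / len(p_rows)


teacher = [[1.0, 0.0, 2.0], [0.5, 1.5, -1.0], [-1.0, 0.2, 0.3]]
student = [[0.9, 0.1, 1.8], [0.4, 1.6, -0.9], [-1.2, 0.1, 0.5]]
```

Because both terms of the L2 loss share the target's mean and standard deviation, the mean cancels and the loss reduces to the squared error divided by the target variance, which keeps high-variance targets from dominating.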

Figure 3: The training strategy. A frozen teacher provides multi-scale targets. The student (with LoRA-adapted backbone) learns to predict features at continuous coordinates from a downscaled input, supervised by reconstruction and relational losses.

5. Experimental Setup

5.1. Datasets

The experiments utilize a mix of training and evaluation datasets:

Training: ImageNet-1K is used to train the upsampler, consistent with prior work.
Dense Linear Probing Evaluation:
- VOC (PASCAL Visual Object Classes): A standard dataset for object detection and segmentation.
- COCO (Common Objects in Context): A large-scale dataset for object detection, segmentation, and captions.
- ADE20K: A scene parsing dataset with extensive class variety.
- Cityscapes: Focuses on semantic urban scene understanding (street scenes), containing many thin structures (poles, traffic lights) which are ideal for testing upsampling fidelity.
Correspondence Evaluation:
- SPair-71k: A large-scale benchmark for semantic correspondence, requiring matching parts of objects (e.g., matching a dog's ear in one image to the same ear in another).
- NAVI: A dataset for category-agnostic image collections with 3D shape and pose annotations, used for geometric correspondence.
  
These datasets are chosen because they rigorously test different aspects of feature quality: semantic segmentation (class consistency), depth estimation (geometric consistency), and correspondence (fine-grained structural fidelity).

5.2. Evaluation Metrics

The paper uses several standard metrics for computer vision tasks.

mIoU (mean Intersection over Union):
- Concept: Measures the overlap between the predicted segmentation and the ground truth. It calculates the Intersection over Union for each class and averages them.
- Formula: $ \text{mIoU} = \frac{1}{n_{classes}} \sum_{i=1}^{n_{classes}} \frac{TP_i}{TP_i + FP_i + FN_i} $
- Symbols: $TP_i$ (True Positives for class $i$), $FP_i$ (False Positives), $FN_i$ (False Negatives).
Pixel Accuracy:
- Concept: The percentage of pixels correctly classified.
- Formula: $ \text{Acc} = \frac{\sum_{i} n_{ii}}{\sum_{i} t_i} $
- Symbols: $n_{ii}$ is the number of correct pixels for class $i$, $t_i$ is the total number of pixels for class $i$.
PCK (Percentage of Correct Keypoints):
- Concept: Used in correspondence tasks. A predicted keypoint is correct if it falls within a threshold distance of the ground truth keypoint.
- Formula: $ \text{PCK} = \frac{1}{N} \sum_{i=1}^N \mathbb{I}(\| \mathbf{p}_i^{\text{pred}} - \mathbf{p}_i^{\text{gt}} \|_2 < \text{threshold}) $
- Symbols: $\mathbf{p}$ are points, $\mathbb{I}$ is the indicator function (1 if true, 0 otherwise).
Depth Accuracy ($\delta_1$) and RMSE:
- Concept: $\delta_1$ measures the percentage of pixels for which the ratio between predicted and ground-truth depth, taken in whichever direction is larger, stays below a threshold (usually 1.25). RMSE is the Root Mean Squared Error.
- Formula ($\delta_1$): $ \delta_1 = \frac{1}{N} \sum_{i} \mathbb{I}\left(\max\left(\frac{d_i^{\text{pred}}}{d_i^{\text{gt}}}, \frac{d_i^{\text{gt}}}{d_i^{\text{pred}}}\right) < 1.25\right) $
- Formula (RMSE): $ \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^N (d_i^{\text{pred}} - d_i^{\text{gt}})^2} $
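
The metric formulas above translate directly into short reference functions. These are a minimal sketch operating on flat Python lists; two choices are assumptions rather than the paper's protocol: classes absent from both prediction and ground truth are skipped in mIoU, and the PCK threshold is passed as an absolute distance (benchmarks commonly scale it by object or image size).

```python
import math


def mean_iou(pred, gt, n_classes):
    """mIoU over flat label lists; classes absent from both pred and gt are skipped."""
    ious = []
    for c in range(n_classes):
        tp = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        fp = sum(1 for p, g in zip(pred, gt) if p == c and g != c)
        fn = sum(1 for p, g in zip(pred, gt) if p != c and g == c)
        if tp + fp + fn:
            ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious)


def pixel_accuracy(pred, gt):
    return sum(1 for p, g in zip(pred, gt) if p == g) / len(gt)


def pck(pred_pts, gt_pts, threshold):
    """Fraction of keypoints whose Euclidean error is strictly below `threshold`."""
    hits = sum(1 for (px, py), (gx, gy) in zip(pred_pts, gt_pts)
               if math.hypot(px - gx, py - gy) < threshold)
    return hits / len(gt_pts)


def delta1(pred, gt, bound=1.25):
    """Fraction of positive depths whose max ratio to the ground truth is below `bound`."""
    return sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < bound) / len(gt)


def rmse(pred, gt):
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


seg_pred = [0, 0, 1, 1, 2, 2]
seg_gt = [0, 1, 1, 1, 2, 0]
miou = mean_iou(seg_pred, seg_gt, n_classes=3)   # per-class IoU: 1/3, 2/3, 1/2
acc = pixel_accuracy(seg_pred, seg_gt)           # 4 of 6 pixels match
```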

5.3. Baselines

ViT-Up is compared against:

Bilinear Interpolation: A standard non-parametric baseline.
JAFAR: An image-guided upsampler using cross-attention.
AnyUp: A feature-agnostic upsampler designed to work across different backbones.
UpLiFT: A recursive upsampler using local attenders.
NAF: A zero-shot upsampler using neighborhood attention filtering.

These baselines represent the state of the art in feature upsampling, particularly those relying on image guidance or recursive refinement.

6. Results & Analysis

6.1. Core Results Analysis

The paper demonstrates that ViT-Up achieves state-of-the-art performance across multiple benchmarks.

Dense Linear Probing: On DINOv3-S+, ViT-Up outperforms the baselines on all four segmentation datasets (COCO, VOC, ADE20K, Cityscapes). The gain is largest on Cityscapes (+2.07 mIoU), a dataset rich in thin structures, which points to ViT-Up's ability to recover fine spatial detail. It also improves depth estimation on COCO ($\delta_1$ higher by 0.55, RMSE lower by 1.33).

Semantic Correspondence: On SPair-71k, ViT-Up substantially outperforms baselines (+4.17 PCK@0.10). This is a critical result because correspondence is sensitive to feature blur and leakage. The success here indicates that ViT-Up produces semantically coherent and structurally precise features, unlike image-guided methods that suffer from leakage between similar-looking objects.

Scaling: When moving to the larger DINOv3-B backbone, ViT-Up's performance gains increase significantly (e.g., +3.36 mIoU on Cityscapes, +8.09 PCK on SPair-71k). Conversely, baselines like NAF see smaller gains or even degradation because their external shallow encoders cannot scale with the backbone.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Method	COCO mIoU ↑	COCO Acc ↑	VOC mIoU ↑	VOC Acc ↑	ADE20K mIoU ↑	ADE20K Acc ↑	Cityscapes mIoU ↑	Cityscapes Acc ↑	COCO depth δ1 ↑	COCO depth RMSE ↓
Bilinear	63.10	81.85	84.88	96.45	43.27	76.17	61.36	93.44	61.52	62.80
JAFAR	62.50	81.50	83.88	96.16	42.48	75.81	57.78	92.47	60.64	64.92
AnyUp	63.03	81.83	84.54	96.34	42.77	76.02	58.96	92.93	61.66	62.62
UpLiFT	63.79	82.28	85.69	96.72	44.24	76.71	63.08	93.94	61.84	61.79
NAF	63.86	82.33	85.84	96.72	44.17	76.69	63.34	94.13	62.17	61.15
ViT-Up (Ours)	64.09	82.49	87.47	97.14	44.73	77.06	65.41	94.73	62.72	59.82
Gain vs. best baseline	+0.23	+0.16	+1.63	+0.42	+0.49	+0.35	+2.07	+0.60	+0.55	+1.33

The following are the results from Table 2 of the original paper:

Method	SPair-71k (semantic) @0.10 ↑	SPair-71k (semantic) @0.05 ↑	SPair-71k (semantic) @0.01 ↑	NAVI (geometric) @0.10 ↑	NAVI (geometric) @0.05 ↑	NAVI (geometric) @0.01 ↑
Bilinear	51.27	33.74	3.83	80.16	51.18	33.58
JAFAR	36.82	18.59	1.89	79.04	47.02	26.60
AnyUp	37.63	19.31	1.97	80.31	48.78	28.37
UpLiFT	46.87	29.15	3.43	79.35	49.05	30.49
NAF	48.68	33.96	2.89	80.03	50.29	31.62
ViT-Up (Ours)	55.44	39.07	7.30	80.81	51.59	33.83
Gain vs. best baseline	+4.17	+5.11	+3.47	+0.50	+0.41	+0.25
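
The "Gain vs. best baseline" rows can be reproduced from the table values. The short check below does so for the SPair-71k column at threshold 0.10 and for the COCO depth RMSE column of Table 1; the helper name is hypothetical. It also makes explicit that the strongest baseline on SPair-71k at 0.10 is plain bilinear interpolation rather than a learned upsampler.

```python
spair_pck_010 = {
    "Bilinear": 51.27, "JAFAR": 36.82, "AnyUp": 37.63,
    "UpLiFT": 46.87, "NAF": 48.68, "ViT-Up": 55.44,
}
coco_depth_rmse = {
    "Bilinear": 62.80, "JAFAR": 64.92, "AnyUp": 62.62,
    "UpLiFT": 61.79, "NAF": 61.15, "ViT-Up": 59.82,
}


def gain_vs_best_baseline(scores, ours="ViT-Up", higher_is_better=True):
    """Return (best baseline name, improvement of `ours` over it, rounded to 2 decimals)."""
    baselines = {name: value for name, value in scores.items() if name != ours}
    pick = max if higher_is_better else min
    best = pick(baselines, key=baselines.get)
    diff = scores[ours] - baselines[best]
    return best, round(diff if higher_is_better else -diff, 2)


best_pck, gain_pck = gain_vs_best_baseline(spair_pck_010)
best_rmse, gain_rmse = gain_vs_best_baseline(coco_depth_rmse, higher_is_better=False)
print(best_pck, gain_pck)
print(best_rmse, gain_rmse)
```

For lower-is-better metrics such as RMSE the reported gain is the reduction, which is why the table lists it with a plus sign.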

The following are the results from Table 4 of the original paper, demonstrating scaling to DINOv3-B:

Method	VOC mIoU ↑	Cityscapes mIoU ↑	SPair-71k PCK@0.10 ↑	SPair-71k PCK@0.05 ↑	SPair-71k PCK@0.01 ↑
Bilinear	87.20	64.21	50.01	34.03	3.89
NAF	88.07	66.45	47.19	29.09	3.41
ViT-Up (Ours)	89.18	69.81	58.10	42.00	7.93
Gain vs. best baseline	+1.11	+3.36	+8.09	+7.97	+4.04

The following are the results from Table 9 of the original paper, showing runtime and memory:

Method	Params	Chunk	Time [ms] ↓ at 112²	Time [ms] ↓ at 224²	Time [ms] ↓ at 448²	VRAM [MiB] ↓ at 112²	VRAM [MiB] ↓ at 224²	VRAM [MiB] ↓ at 448²
JAFAR	0.6M	-	41.3	43.5	52.9	541.0	1721.6	6448.5
AnyUp	0.8M	-	6.8	15.3	59.8	485.9	774.1	2760.7
UpLiFT	0.8M	-	6.4	7.2	10.1	606.6	765.7	1391.5
NAF	0.7M	-	29.1	30.7	37.4	602.6	602.9	1780.5
ViT-Up (Ours)	24.9M	112²	14.2	24.1	62.6	503.7	503.9	626.7
ViT-Up (Ours)	24.9M	224²	-	22.1	55.4	-	1069.6	1143.6
ViT-Up (Ours)	24.9M	448²	-	-	52.0	-	-	3709.3

6.3. Ablation Studies / Parameter Analysis

The authors conducted extensive ablation studies to validate the design choices.

Individual Components (Table 5):

Cross-Attention & FeatX: Removing either component significantly degrades performance. Cross-Attention is crucial for context, while FeatX is vital for recovering fine sub-token details (texture).
LoRA: Removing LoRA slightly reduces performance, confirming that adapting the backbone provides useful capacity.
KL Loss: Removing the relational KL loss lowers SPair-71k performance, highlighting its importance in preserving feature geometry.

Figure 4: Qualitative ablation of cross-attention and FeatX. Removing FeatX loses texture (arm ends of sea star). Removing Cross-Attention introduces pixelation artifacts. The full model combines both for coherent, detailed features.

Number of Refinement Layers (Table 6):
Increasing the number of ViT-Up blocks (depth) consistently improves performance, especially on Cityscapes and strict PCK thresholds. This confirms that progressive, layer-wise refinement is more effective than a single-step prediction.

Output Feature Resolution (Table 7):

Performance saturates after a certain resolution (e.g., $224 \times 224$ or $448 \times 448$). Most semantic information is captured before full image resolution, suggesting a trade-off between resolution and compute.

Runtime and Memory:

ViT-Up has more parameters (24.9M) than the baselines (0.6M to 0.8M), but parameter count is not the main bottleneck.
Memory: ViT-Up supports "query chunking" (processing queries in batches). With a 112² chunk it needs only 626.7 MiB at 448², less than every baseline at that resolution (from 1391.5 MiB for UpLiFT to 6448.5 MiB for JAFAR). Larger chunks trade memory for speed: a 448² chunk uses 3709.3 MiB and runs in 52.0 ms.
Runtime: ViT-Up is competitive at high resolution. At 448² it takes 52.0 to 62.6 ms depending on chunk size, comparable to JAFAR (52.9 ms) and AnyUp (59.8 ms) but slower than NAF (37.4 ms) and UpLiFT (10.1 ms).
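
Query chunking works whenever each query is processed independently of the others, which is assumed here because a query only attends to cached backbone tokens. The sketch below shows the pattern with a toy predictor; `predict_in_chunks` and `toy_predict` are hypothetical names, not the paper's API.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("chunk size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def predict_in_chunks(coords, predict_fn, chunk_size):
    """Run `predict_fn` on query coordinates chunk by chunk and concatenate the outputs."""
    outputs = []
    for chunk in chunked(coords, chunk_size):
        outputs.extend(predict_fn(chunk))  # only one chunk is in flight at a time
    return outputs


def toy_predict(chunk):
    # Stand-in for the upsampler: each query is handled independently.
    return [(x + y, x - y) for x, y in chunk]


side = 448
coords = [(x + 0.5, y + 0.5) for y in range(0, side, 8) for x in range(0, side, 8)]
full = toy_predict(coords)
chunked_out = predict_in_chunks(coords, toy_predict, chunk_size=196)
```

Peak activation memory then scales with the chunk size instead of the total number of queries, at the price of more sequential passes.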

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully addresses the limitation of low-resolution features in Vision Transformers by proposing ViT-Up, a faithful, implicit feature upsampling framework. By replacing external image guidance with layer-wise query construction from the ViT's internal hierarchy, ViT-Up avoids feature leakage and preserves the semantic integrity of the backbone features. The method demonstrates state-of-the-art performance on dense prediction (segmentation, depth) and semantic correspondence, with the added benefit of scaling effectively as the backbone capacity increases.

7.2. Limitations & Future Work

Information Bottleneck: Like all post-hoc upsamplers, ViT-Up is bounded by the information in the backbone's hidden states. If the backbone hasn't encoded a detail, ViT-Up cannot recover it. The authors suggest feeding the backbone moderately higher-resolution inputs (e.g., enough for a $56 \times 56$ token grid) as a practical mitigation.
Backbone Coupling: ViT-Up is currently trained separately for each backbone because components like the transition MLP and FeatX are tightly coupled to the backbone's internal layer organization. This makes it less "plug-and-play" than backbone-agnostic methods like AnyUp. Future work could explore joint training with the backbone to encourage features that are more amenable to continuous reconstruction.

7.3. Personal Insights & Critique

Innovation: The shift from "external guidance" to "internal hierarchy utilization" is a significant conceptual leap. It correctly identifies that the shallow encoders used in prior works are the weak link. By treating the ViT as a hierarchical feature generator, ViT-Up leverages the massive investment already made in training the backbone.
Methodology: The use of FeatX with FiLM modulation based on relative offsets is a clever, lightweight way to inject sub-token spatial awareness without requiring a full high-resolution encoder.
Results: The gains on SPair-71k are particularly compelling. Semantic correspondence is a "stress test" for feature quality; success here strongly suggests that ViT-Up features are structurally sound, not just visually sharp.
Critique: The reliance on LoRA for backbone adaptation is a practical choice for efficiency, but the "Backbone Coupling" limitation is real. If ViT-Up could be made truly backbone-agnostic (or if a standard "upsampling-friendly" ViT architecture emerged), its adoption would likely skyrocket. Additionally, while the memory efficiency is good, the runtime is not strictly better than all baselines, though the quality trade-off justifies it.
Transferability: The concept of "hierarchical implicit decoding" could potentially be applied to other domains where hierarchical features exist but are low-resolution, such as video processing or medical imaging, where high-resolution detail is critical but expensive to compute directly.
