Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models

1 Peking University, 2 Shanghai AI Laboratory,
3 Shandong University, 4 Beihang University, 5 Tsinghua University
Preprint on arXiv

* Equal Contribution   

Abstract

Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored integrating native resolution visual encoding to improve model performance, such efforts remain fragmented and lack a systematic framework within the open-source community. Moreover, existing benchmarks fall short in evaluating VLMs under varied visual conditions, often neglecting resolution as a critical factor. To address the "Resolution Dilemma" stemming from both model design and benchmark limitations, we introduce RC-Bench, a novel benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, with an emphasis on resolution and aspect ratio variations. In conjunction, we propose NativeRes-LLaVA, an open-source training framework that empowers VLMs to effectively process images at their native resolutions and aspect ratios. Based on RC-Bench and NativeRes-LLaVA, we conduct comprehensive experiments on existing visual encoding strategies. The results show that Native Resolution Visual Encoding significantly improves the performance of VLMs on RC-Bench as well as other resolution-centric benchmarks.


Architecture

Architecture Diagram
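
To make the encoding strategy concrete, here is a minimal PyTorch sketch of native-resolution visual encoding. It is an illustration under assumptions, not the released implementation: the patch size, embedding width, and the helper `encode_native` are hypothetical. The point it demonstrates is that each image keeps its native size and aspect ratio (snapped to the patch grid), so the encoder emits a variable-length token sequence rather than a fixed-size one.

```python
import torch
import torch.nn.functional as F

PATCH = 14          # ViT patch size (assumed)
EMBED_DIM = 1152    # embedding width (assumed)

# Patchify via a strided convolution, as in standard ViTs.
proj = torch.nn.Conv2d(3, EMBED_DIM, kernel_size=PATCH, stride=PATCH)

def encode_native(image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) at its native resolution and aspect ratio."""
    _, h, w = image.shape
    # Snap H and W to the nearest multiple of the patch size so the image
    # tiles exactly into patches; the aspect ratio is almost unchanged.
    h2 = max(PATCH, round(h / PATCH) * PATCH)
    w2 = max(PATCH, round(w / PATCH) * PATCH)
    x = F.interpolate(image[None], size=(h2, w2),
                      mode="bilinear", align_corners=False)
    tokens = proj(x)                           # (1, D, h2/PATCH, w2/PATCH)
    return tokens.flatten(2).transpose(1, 2)   # (1, N, D); N varies per image

# Different inputs yield different token counts:
wide  = encode_native(torch.randn(3, 1080, 1920))  # 77 x 137 = 10549 tokens
small = encode_native(torch.randn(3,  378,  378))  # 27 x 27  =   729 tokens
```

Because the token count scales with image area, this crop-free encoding spends detail where the image has it, instead of forcing every input through one fixed square.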

Results

| Method | #Data | MaxRes | RS | Acc. | ACV↓ | RCV↓ |
|---|---|---|---|---|---|---|
| LLaVA-NeXT-SigLip | 1.22M | 768×768 | Crop | 22.3 | 40.9 | 34.5 |
| LLaVA-NeXT-QwenViT | 1.22M | 728×728 | Crop | 42.4 | 27.5 | 27.6 |
| NativeRes-LLaVA | 1.22M | 378×378 | Fixed | 40.1 | 32.6 | 32.4 |
| NativeRes-LLaVA | 1.22M | 728×728 | Native | 49.6 | 21.8 | 15.4 |
| NativeRes-LLaVA | 1.22M | 1260×1260 | Native | 51.9 | 18.0 | 12.5 |
| LLaVA-NeXT-SigLip | 1.34M | 768×768 | Crop | 27.7 | 28.6 | 36.6 |
| LLaVA-NeXT-QwenViT | 1.34M | 728×728 | Crop | 47.9 | 23.5 | 25.0 |
| NativeRes-LLaVA | 1.34M | 378×378 | Fixed | 45.4 | 25.5 | 25.2 |
| NativeRes-LLaVA | 1.34M | 728×728 | Native | 53.6 | 19.1 | 16.8 |
| NativeRes-LLaVA | 1.34M | 1792×1792 | Native | 59.6 | 14.2 | 7.2 |

RS denotes the resolution strategy (Crop / Fixed / Native); ↓ marks metrics where lower is better.
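
The RS column covers three preprocessing strategies: Crop (LLaVA-NeXT-style tiling into fixed squares), Fixed (a single fixed-size resize), and Native (keeping the native resolution and aspect ratio). A rough sketch of how the three differ, with hypothetical function names and a deliberately simplified grid heuristic:

```python
from PIL import Image

def fixed_res(img: Image.Image, size: int = 378) -> Image.Image:
    # "Fixed": resize everything to one square, distorting the aspect ratio.
    return img.resize((size, size))

def crop_res(img: Image.Image, tile: int = 378, max_tiles: int = 4):
    # "Crop" (AnyRes-style): resize to a grid of fixed tiles and encode each
    # tile independently; real grid selection is more involved than this.
    w, h = img.size
    cols = min(max_tiles, max(1, round(w / tile)))
    rows = min(max_tiles, max(1, round(h / tile)))
    img = img.resize((cols * tile, rows * tile))
    return [img.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            for r in range(rows) for c in range(cols)]

def native_res(img: Image.Image, patch: int = 14,
               max_side: int = 1792) -> Image.Image:
    # "Native": keep the original aspect ratio; only snap dimensions to the
    # patch grid and cap the longer side at MaxRes.
    w, h = img.size
    scale = min(1.0, max_side / max(w, h))
    w2 = max(patch, round(w * scale / patch) * patch)
    h2 = max(patch, round(h * scale / patch) * patch)
    return img.resize((w2, h2))
```

Only the Native path preserves both the aspect ratio and the image as a single contiguous token grid, consistent with its lower ACV and RCV in the table above.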
[Figure: comparison of VLMs — legend: InternVL2_5-8B, Qwen2-VL-7B-Instruct, Cambrian, GPT-4o, Kimi-VL-A3B-Instruct, llava-onevision-qwen2-7b-ov, NativeResLLaVA-FixedRes, SEED-VL-1.5, deepseek-vl2, InternVL3-8B, NativeResLLaVA-AnyRes, NativeResLLaVA-NativeRes, Qwen2.5-VL-7B-Instruct.]

Data Distribution

To provide an intuitive view of the dataset's characteristics, we visualize the distribution of images by aspect ratio and resolution.
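
A minimal sketch of one way to produce such a visualization (the directory layout `images/*.jpg`, the binning, and the figure style are assumptions; the paper's figure may be generated differently):

```python
from pathlib import Path
from PIL import Image
import matplotlib.pyplot as plt

# Collect (width, height) for every image; PIL reads only the header here,
# so this stays cheap even for large datasets.
sizes = [Image.open(p).size for p in Path("images").glob("*.jpg")]
ratios = [w / h for w, h in sizes]
pixels = [w * h for w, h in sizes]  # total resolution in pixels

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ratios, bins=50)
ax1.set_xlabel("aspect ratio (w / h)")
ax1.set_ylabel("#images")
ax2.hist(pixels, bins=50, log=True)  # log counts: resolutions are long-tailed
ax2.set_xlabel("resolution (total pixels)")
fig.tight_layout()
fig.savefig("data_distribution.png", dpi=200)
```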

BibTeX

BibTeX code here