Native Visual Understanding: Resolving Resolution Dilemmas in Vision-Language Models

1 Peking University, 2 Shanghai AI Laboratory,
3 Shandong University, 4 Beihang University, 5 Tsinghua University
Preprint on arXiv

* Equal Contribution   

Abstract

Vision-Language Models (VLMs) face significant challenges when dealing with the diverse resolutions and aspect ratios of real-world images, as most existing models rely on fixed, low-resolution inputs. While recent studies have explored integrating native resolution visual encoding to improve model performance, such efforts remain fragmented and lack a systematic framework within the open-source community. Moreover, existing benchmarks fall short in evaluating VLMs under varied visual conditions, often neglecting resolution as a critical factor. To address the "Resolution Dilemma" stemming from both model design and benchmark limitations, we introduce RC-Bench, a novel benchmark specifically designed to systematically evaluate VLM capabilities under extreme visual conditions, with an emphasis on resolution and aspect ratio variations. In conjunction, we propose NativeRes-LLaVA, an open-source training framework that empowers VLMs to effectively process images at their native resolutions and aspect ratios. Based on RC-Bench and NativeRes-LLaVA, we conduct comprehensive experiments on existing visual encoding strategies. The results show that Native Resolution Visual Encoding significantly improves the performance of VLMs on RC-Bench as well as other resolution-centric benchmarks.


Architecture

Architecture Diagram
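
To make the encoding strategy concrete, here is a minimal PyTorch sketch of native-resolution visual encoding. It is an illustration under assumptions, not the released implementation: the patch size, embedding width, and the helper `encode_native` are hypothetical. The point it demonstrates is that each image keeps its native size and aspect ratio (snapped to the patch grid), so the encoder emits a variable-length token sequence rather than a fixed-size one.

```python
import torch
import torch.nn.functional as F

PATCH = 14          # ViT patch size (assumed)
EMBED_DIM = 1152    # embedding width (assumed)

# Patchify via a strided convolution, as in standard ViTs.
proj = torch.nn.Conv2d(3, EMBED_DIM, kernel_size=PATCH, stride=PATCH)

def encode_native(image: torch.Tensor) -> torch.Tensor:
    """image: (3, H, W) at its native resolution and aspect ratio."""
    _, h, w = image.shape
    # Snap H and W to the nearest multiple of the patch size so the image
    # tiles exactly into patches; the aspect ratio is almost unchanged.
    h2 = max(PATCH, round(h / PATCH) * PATCH)
    w2 = max(PATCH, round(w / PATCH) * PATCH)
    x = F.interpolate(image[None], size=(h2, w2),
                      mode="bilinear", align_corners=False)
    tokens = proj(x)                           # (1, D, h2/PATCH, w2/PATCH)
    return tokens.flatten(2).transpose(1, 2)   # (1, N, D); N varies per image

# Different inputs yield different token counts:
wide  = encode_native(torch.randn(3, 1080, 1920))  # 77 x 137 = 10549 tokens
small = encode_native(torch.randn(3,  378,  378))  # 27 x 27  =   729 tokens
```

Because the token count scales with image area, this crop-free encoding spends detail where the image has it, instead of forcing every input through one fixed square.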

Results

| Method | #Data | MaxRes | RS | Acc. | ACV↓ | RCV↓ |
|---|---|---|---|---|---|---|
| LLaVA-NeXT-SigLip | 1.22M | 768×768 | Crop | 22.3 | 40.9 | 34.5 |
| LLaVA-NeXT-QwenViT | 1.22M | 728×728 | Crop | 42.4 | 27.5 | 27.6 |
| NativeRes-LLaVA | 1.22M | 378×378 | Fixed | 40.1 | 32.6 | 32.4 |
| NativeRes-LLaVA | 1.22M | 728×728 | Native | 49.6 | 21.8 | 15.4 |
| NativeRes-LLaVA | 1.22M | 1260×1260 | Native | 51.9 | 18.0 | 12.5 |
| LLaVA-NeXT-SigLip | 1.34M | 768×768 | Crop | 27.7 | 28.6 | 36.6 |
| LLaVA-NeXT-QwenViT | 1.34M | 728×728 | Crop | 47.9 | 23.5 | 25.0 |
| NativeRes-LLaVA | 1.34M | 378×378 | Fixed | 45.4 | 25.5 | 25.2 |
| NativeRes-LLaVA | 1.34M | 728×728 | Native | 53.6 | 19.1 | 16.8 |
| NativeRes-LLaVA | 1.34M | 1792×1792 | Native | 59.6 | 14.2 | 7.2 |

RS denotes the resolution strategy (Crop / Fixed / Native); ↓ marks metrics where lower is better.
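
The RS column covers three preprocessing strategies: Crop (LLaVA-NeXT-style tiling into fixed squares), Fixed (a single fixed-size resize), and Native (keeping the native resolution and aspect ratio). A rough sketch of how the three differ, with hypothetical function names and a deliberately simplified grid heuristic:

```python
from PIL import Image

def fixed_res(img: Image.Image, size: int = 378) -> Image.Image:
    # "Fixed": resize everything to one square, distorting the aspect ratio.
    return img.resize((size, size))

def crop_res(img: Image.Image, tile: int = 378, max_tiles: int = 4):
    # "Crop" (AnyRes-style): resize to a grid of fixed tiles and encode each
    # tile independently; real grid selection is more involved than this.
    w, h = img.size
    cols = min(max_tiles, max(1, round(w / tile)))
    rows = min(max_tiles, max(1, round(h / tile)))
    img = img.resize((cols * tile, rows * tile))
    return [img.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
            for r in range(rows) for c in range(cols)]

def native_res(img: Image.Image, patch: int = 14,
               max_side: int = 1792) -> Image.Image:
    # "Native": keep the original aspect ratio; only snap dimensions to the
    # patch grid and cap the longer side at MaxRes.
    w, h = img.size
    scale = min(1.0, max_side / max(w, h))
    w2 = max(patch, round(w * scale / patch) * patch)
    h2 = max(patch, round(h * scale / patch) * patch)
    return img.resize((w2, h2))
```

Only the Native path preserves both the aspect ratio and the image as a single contiguous token grid, consistent with its lower ACV and RCV in the table above.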
[Figure: comparison of VLMs — legend: InternVL2_5-8B, Qwen2-VL-7B-Instruct, Cambrian, GPT-4o, Kimi-VL-A3B-Instruct, llava-onevision-qwen2-7b-ov, NativeResLLaVA-FixedRes, SEED-VL-1.5, deepseek-vl2, InternVL3-8B, NativeResLLaVA-AnyRes, NativeResLLaVA-NativeRes, Qwen2.5-VL-7B-Instruct.]

Data Distribution

To provide an intuitive view of the dataset's characteristics, we visualize the distribution of images by aspect ratio and resolution.
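
A minimal sketch of one way to produce such a visualization (the directory layout `images/*.jpg`, the binning, and the figure style are assumptions; the paper's figure may be generated differently):

```python
from pathlib import Path
from PIL import Image
import matplotlib.pyplot as plt

# Collect (width, height) for every image; PIL reads only the header here,
# so this stays cheap even for large datasets.
sizes = [Image.open(p).size for p in Path("images").glob("*.jpg")]
ratios = [w / h for w, h in sizes]
pixels = [w * h for w, h in sizes]  # total resolution in pixels

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(ratios, bins=50)
ax1.set_xlabel("aspect ratio (w / h)")
ax1.set_ylabel("#images")
ax2.hist(pixels, bins=50, log=True)  # log counts: resolutions are long-tailed
ax2.set_xlabel("resolution (total pixels)")
fig.tight_layout()
fig.savefig("data_distribution.png", dpi=200)
```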

BibTeX

BibTeX code here