arxiv:2504.20996

X-Fusion: Introducing New Modality to Frozen Large Language Models

Published on Apr 29 · Submitted by Sichengmo on Apr 30

Authors: Sicheng Mo, Yong Jae Lee, Bolei Zhou

Abstract

X-Fusion extends pretrained LLMs to multimodal tasks with a dual-tower design that integrates vision-specific information while keeping the LLM's parameters frozen, supporting both image understanding and image generation.

AI-generated summary

We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. Our experiments demonstrate that X-Fusion consistently outperforms alternative architectures on both image-to-text and text-to-image tasks. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
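The dual-tower idea can be illustrated with a toy sketch. This is not the authors' implementation: the class names (`Linear`, `DualTowerLayer`), the per-token routing rule, and the scaled-identity weights are all assumptions made for illustration. It only shows the two properties the abstract states, namely that the language weights stay frozen and that separate, trainable weights handle vision tokens.

```python
from dataclasses import dataclass
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def scaled_identity(dim: int, scale: float) -> Matrix:
    """Build a dim x dim matrix with `scale` on the diagonal (toy weights)."""
    return [[scale if i == j else 0.0 for j in range(dim)] for i in range(dim)]


@dataclass
class Linear:
    """Hypothetical toy linear layer y = W x with a freeze flag."""
    weight: Matrix
    trainable: bool = True

    def __call__(self, x: Vector) -> Vector:
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]

    def sgd_step(self, grad: Matrix, lr: float = 0.1) -> None:
        # Frozen layers ignore updates, so the language tower never changes.
        if not self.trainable:
            return
        self.weight = [
            [w - lr * g for w, g in zip(w_row, g_row)]
            for w_row, g_row in zip(self.weight, grad)
        ]


class DualTowerLayer:
    """Assumed layout: one frozen text tower, one trainable vision tower."""

    def __init__(self, dim: int) -> None:
        # Stand-in for pretrained LLM weights; kept frozen.
        self.text_tower = Linear(scaled_identity(dim, 1.0), trainable=False)
        # Modality-specific weights; these are the only ones that train.
        self.vision_tower = Linear(scaled_identity(dim, 2.0), trainable=True)

    def forward(self, tokens: List[Vector], is_image: List[bool]) -> List[Vector]:
        # Assumption: each token is routed to the tower matching its modality.
        return [
            (self.vision_tower if img else self.text_tower)(tok)
            for tok, img in zip(tokens, is_image)
        ]


layer = DualTowerLayer(dim=2)
outputs = layer.forward([[1.0, 2.0], [3.0, 4.0]], is_image=[False, True])
```

In this sketch, calling `sgd_step` on `layer.text_tower` is a no-op, while the same call on `layer.vision_tower` changes its weights. That mirrors the claim that language capabilities are preserved because the pretrained parameters are never updated.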

Community

Sichengmo (paper author, paper submitter), 24 days ago

This paper proposes X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities.