MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
Abstract
MultiFinBen is a multilingual and multimodal benchmark for financial domain tasks, evaluating LLMs across modalities and linguistic settings, revealing challenges in complex cross-lingual and multimodal financial reasoning.
Recent advances in large language models (LLMs) have accelerated progress in financial NLP and applications, yet existing benchmarks remain limited to monolingual and unimodal settings, often over-relying on simple tasks and failing to reflect the complexity of real-world financial communication. We introduce MultiFinBen, the first multilingual and multimodal benchmark tailored to the global financial domain, evaluating LLMs across modalities (text, vision, audio) and linguistic settings (monolingual, bilingual, multilingual) on domain-specific tasks. We introduce two pairs of novel tasks: PolyFiQA-Easy and PolyFiQA-Expert, the first multilingual financial benchmarks requiring models to perform complex reasoning over mixed-language inputs; and EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks challenging models to extract and reason over information from visual-text financial documents. Moreover, we propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark rather than simply aggregating existing datasets. Extensive evaluation of 22 state-of-the-art models reveals that even the strongest models, despite their general multimodal and multilingual capabilities, struggle dramatically when faced with complex cross-lingual and multimodal tasks in the financial domain. MultiFinBen is publicly released to foster transparent, reproducible, and inclusive progress in financial studies and applications.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Aya Vision: Advancing the Frontier of Multilingual Multimodality (2025)
- SpaCE-10: A Comprehensive Benchmark for Multimodal Large Language Models in Compositional Spatial Intelligence (2025)
- Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging (2025)
- Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models (2025)
- Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval (2025)
- IndicSQuAD: A Comprehensive Multilingual Question Answering Dataset for Indic Languages (2025)
- Multilingual Prompt Engineering in Large Language Models: A Survey Across NLP Tasks (2025)