Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis • Paper • 2505.13227 • Published May 19 • 45
OneIG-Bench: Omni-dimensional Nuanced Evaluation for Image Generation • Paper • 2506.07977 • Published 23 days ago • 39
LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming? • Paper • 2506.11928 • Published 19 days ago • 22
SciVer: Evaluating Foundation Models for Multimodal Scientific Claim Verification • Paper • 2506.15569 • Published 14 days ago • 11
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation • Paper • 2506.14028 • Published 16 days ago • 90
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents • Paper • 2506.11763 • Published 19 days ago • 60
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning • Paper • 2506.09049 • Published 22 days ago • 33