MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues Paper • 2510.17722 • Published Oct 20, 2025 • 20
IF-VidCap: Can Video Caption Models Follow Instructions? Paper • 2510.18726 • Published Oct 21, 2025 • 26
DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation Paper • 2604.14683 • Published 6 days ago • 32
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization Paper • 2604.12290 • Published 8 days ago • 16
ContextBench: A Benchmark for Context Retrieval in Coding Agents Paper • 2602.05892 • Published Feb 5 • 4
The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook Paper • 2604.02029 • Published 20 days ago • 143