SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks Paper • 2602.12670 • Published 13 days ago • 52
DSBench: How Far Are Data Science Agents to Becoming Data Science Experts? Paper • 2409.07703 • Published Sep 12, 2024 • 66
FAITHSCORE: Evaluating Hallucinations in Large Vision-Language Models Paper • 2311.01477 • Published Nov 2, 2023 • 1