IBM and UC Berkeley Diagnose Why Enterprise Agents Fail Using IT-Bench and MAST
Ayhan Sebin, Saurabh Jha, Rohan Arora, Daby Sow, Mert Cemri, Melissa Pan, Ion Stoica

ITBench HF Space, ITBench HF Dataset, MAST HF Dataset, ITBench Github, MAST Github

IBM Research and UC Berkeley collaborated to study how agentic LLM systems break in real-world IT automation, for tasks involving incident triage, logs/metrics queries, and Kubernetes actions in long-horizon tool loops. Benchmarks typically reduce performance to a single number, telling you whether an agent failed but never why. To solve this black-box […]