Agent Evaluation
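Evaluating agents is harder than evaluating text generation because an agent produces an open-ended trajectory of actions rather than a single output, and several different trajectories can solve the same task.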
Benchmarks
- AgentBench: Evaluates agents on OS interaction, databases, and web navigation.
- GAIA: A benchmark for General AI Assistants requiring reasoning, tool use, and web browsing.
Custom Failure Analysis
When building agents, you must track:
- Tool Error Rate: How often does the agent emit a malformed tool call, such as invalid JSON arguments?
- Trajectory Length: How many steps did the agent take? Shorter is usually better, provided the task still succeeds.
- Looping: Did the agent get stuck repeating the same failed action?
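The three metrics above can be computed from a logged trajectory with a few lines of code. The following is a minimal sketch, not a standard implementation: the `Step` record, its fields, and the `min_repeats` threshold are all hypothetical names chosen for illustration, and it assumes each step stores the raw argument string the model emitted plus a success flag.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One agent action (hypothetical log record, not a real library type)."""
    tool: str        # name of the tool the agent tried to call
    raw_args: str    # argument string exactly as the model emitted it
    succeeded: bool  # whether the tool call completed without error


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def tool_error_rate(steps: list[Step]) -> float:
    """Fraction of steps whose arguments are not valid JSON."""
    if not steps:
        return 0.0
    malformed = sum(1 for s in steps if not is_valid_json(s.raw_args))
    return malformed / len(steps)


def has_loop(steps: list[Step], min_repeats: int = 3) -> bool:
    """True if the same failed call occurs min_repeats times in a row.

    The threshold of 3 is an arbitrary assumption; tune it per agent.
    """
    run = 0
    prev = None
    for s in steps:
        key = (s.tool, s.raw_args)
        if s.succeeded:
            run, prev = 0, None      # any success breaks the streak
            continue
        run = run + 1 if key == prev else 1
        prev = key
        if run >= min_repeats:
            return True
    return False


def summarize(steps: list[Step]) -> dict:
    """Collect the three metrics for one trajectory."""
    return {
        "tool_error_rate": tool_error_rate(steps),
        "trajectory_length": len(steps),
        "looping": has_loop(steps),
    }


# Example: one good call, then the same truncated call retried three times.
trajectory = [
    Step("search", '{"query": "weather"}', True),
    Step("fetch", '{"url": ', False),
    Step("fetch", '{"url": ', False),
    Step("fetch", '{"url": ', False),
]
print(summarize(trajectory))
# {'tool_error_rate': 0.75, 'trajectory_length': 4, 'looping': True}
```

This sketch only checks JSON syntax. A stricter version would also validate the arguments against each tool's schema, and a looser loop detector might compare normalized arguments so that near-identical retries are caught as well.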
Use LangSmith or similar tracing tools to visualize agent trajectories.