VAKRA — Multi-Hop, Multi-Source, Multi-Tool Agent Benchmark

Leaderboard

Agent Evaluation Overview

Rank	Agent	Model	Overall	API chaining	Tool selection	Multihop Reasoning	MultiHop MultiSource Policy adherence

Core benchmark capabilities

Four increasingly complex capabilities

We evaluate core agent capabilities (entity resolution, schema alignment, retrieval, tool selection, and policy adherence) through four tasks.

8,000+ executable APIs

60+ diverse domains

3–7 reasoning steps per task

Trace: trajectory-level verification

Figure: VAKRA benchmark capabilities diagram showing API Chaining, Tool Selection, MultiHop Reasoning over structured APIs, and MultiHop MultiSource with Policy Adherence. Caption: Capability overview spanning diverse API interaction styles and complex reasoning workflows.

Submission

Submit to the live leaderboard

Ready to add your model to the public VAKRA leaderboard? Follow the submission flow below.

01

Run the benchmark

Evaluate your agent on the released VAKRA capabilities using the benchmark runner and evaluator in the GitHub repository.

02

Validate the output

Outputs are expected in a specific folder structure, together with a metadata file created by the validator script. Run the validate_output.py script on the output folder and include the validator's output with your submission.

03

Open the submission template

Use the leaderboard submission issue template to share your outputs, agent description, and a link to your code or system. Submissions are reviewed, and approved entries are added to the live leaderboard.

Quick links

Submission resources

Everything needed to prepare a leaderboard entry is linked here.

Open submission template | View benchmark repo | View dataset

Model name

Agent setup

Capability scores

Run details

Resources

Get started

Code

Framework and scripts

Setup docs, baseline agents, and reproducible experiments.

Environment

Executable backends

Set up MCP-based local APIs and the retrieval stack.

Submission

Leaderboard and submissions

To submit for evaluation, create a GitHub issue using the Leaderboard Submission Template.

Paper

Benchmark paper

A link to the paper or blog page describing dataset construction is coming soon.

Evaluation

Replay-based evaluation and scoring

The evaluator replays predicted tool trajectories against the live VAKRA MCP environment, injects fresh tool responses, and scores whether the final answer is correct and grounded in executed tool outputs.
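The replay idea can be illustrated with a minimal Python sketch. It is not the evaluator from the VAKRA repository: the `replay_trajectory` function, the `(name, arguments)` trajectory format, and the in-memory `tools` dictionary are all hypothetical stand-ins for the live MCP environment. The sketch only shows the principle that each predicted call is re-executed, so the responses used for scoring come from the environment and not from the agent's log.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical trajectory format: a list of (tool_name, arguments) pairs.
ToolCall = Tuple[str, Dict[str, Any]]


def replay_trajectory(
    trajectory: List[ToolCall],
    tools: Dict[str, Callable[..., Any]],
) -> List[Any]:
    """Re-execute every predicted tool call and collect fresh responses.

    `tools` stands in for the live environment; a real evaluator would
    send these calls to the MCP servers instead of local callables.
    """
    fresh_responses: List[Any] = []
    for name, arguments in trajectory:
        tool = tools.get(name)
        if tool is None:
            # Unknown tools are recorded as errors rather than aborting the replay.
            fresh_responses.append({"error": f"unknown tool: {name}"})
            continue
        try:
            fresh_responses.append(tool(**arguments))
        except Exception as exc:  # a failing call still yields a response entry
            fresh_responses.append({"error": str(exc)})
    return fresh_responses


if __name__ == "__main__":
    demo_tools = {"add": lambda a, b: a + b}
    predicted = [("add", {"a": 1, "b": 2}), ("missing_tool", {})]
    print(replay_trajectory(predicted, demo_tools))
```

Keeping one response entry per predicted call, including failed calls, means the later checks can compare the answer against exactly what the environment returned.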

01

Scoring pipeline

The current scorer evaluates the last turn for each dialogue. A waterfall judge is constructed from the following three judges:

Policy check: used only for the multi-source with policy adherence capability. The output is programmatically validated against the policy for each data sample.
Exact-match tool-response check: looks for expected ground-truth tool responses in the predicted response set.
Correctness check: if exact match fails, an LLM judge compares the predicted final answer to the ground-truth final answer.
Groundedness check: acceptable answers are then checked against the executed tool outputs.
Aggregation: turn scores are combined into a dialogue score using the default mean policy.
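The waterfall described above can be sketched in Python as follows. This is an illustrative reading of the list, not the scorer shipped in the repository. It makes several assumptions: the `Turn` fields and function names are hypothetical, a failed policy check yields a score of 0, the groundedness check is applied to answers accepted by either the exact-match or the LLM check, and the LLM judges are passed in as plain callables.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional, Sequence


@dataclass
class Turn:
    predicted_answer: str
    gold_answer: str
    predicted_tool_responses: Sequence[str]  # responses obtained during replay
    gold_tool_responses: Sequence[str]       # expected ground-truth responses


def exact_match(turn: Turn) -> bool:
    """True if every expected tool response appears in the predicted set."""
    predicted = set(turn.predicted_tool_responses)
    return bool(turn.gold_tool_responses) and all(
        response in predicted for response in turn.gold_tool_responses
    )


def score_turn(
    turn: Turn,
    correctness_judge: Callable[[str, str], bool],
    groundedness_judge: Callable[[str, Sequence[str]], bool],
    policy_check: Optional[Callable[[Turn], bool]] = None,
) -> float:
    # Policy check: supplied only for the policy adherence capability.
    if policy_check is not None and not policy_check(turn):
        return 0.0
    # Exact match first; the LLM correctness judge runs only if it fails.
    acceptable = exact_match(turn) or correctness_judge(
        turn.predicted_answer, turn.gold_answer
    )
    if not acceptable:
        return 0.0
    # Acceptable answers must also be grounded in the executed tool outputs.
    grounded = groundedness_judge(
        turn.predicted_answer, turn.predicted_tool_responses
    )
    return 1.0 if grounded else 0.0


def score_dialogue(turn_scores: Sequence[float]) -> float:
    """Combine turn scores with the default mean policy."""
    return mean(turn_scores)
```

Because `or` short-circuits, the correctness judge is never called when the exact-match check already passes, which mirrors the "if exact match fails" condition in the list.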

02

Aggregate Benchmark Score

The four core capabilities are weighted equally to obtain the overall benchmark score.
Within the multi-source with policy adherence capability, multi-source queries carry twice the weight of queries that need only APIs or only retrievers to answer.

\begin{aligned}
\mathrm{Benchmark\ Score} &= \frac{1}{4} \Bigg(
  \frac{\#\ \mathrm{correct}_{\mathrm{api\_chaining}}}{\#\ \mathrm{total}_{\mathrm{api\_chaining}}} \\[8pt]
&\qquad+ \frac{\#\ \mathrm{correct}_{\mathrm{tool\_selection}}}{\#\ \mathrm{total}_{\mathrm{tool\_selection}}} \\[8pt]
&\qquad+ \frac{\#\ \mathrm{correct}_{\mathrm{multihop\_reasoning}}}{\#\ \mathrm{total}_{\mathrm{multihop\_reasoning}}} \\[8pt]
&\qquad+ \frac{
  \begin{aligned}
  \left(\#\ \mathrm{correct\ multi\text{-}source\ queries} \times 2\right) \\
  +\left(\#\ \mathrm{correct\ API\ only\ or\ RAG\ only\ queries}\right)
  \end{aligned}
}{
  \begin{aligned}
  \left(\#\ \mathrm{total\ multi\text{-}source\ queries} \times 2\right) \\
  +\left(\#\ \mathrm{total\ API\ only\ or\ RAG\ only\ queries}\right)
  \end{aligned}
} \Bigg)
\end{aligned}
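As a worked illustration of the formula, the following Python sketch computes the overall score from per-capability counts. The `Counts` class, the function names, and the example numbers are hypothetical; only the weighting scheme (four equally weighted capabilities, with multi-source queries counted twice inside the fourth) comes from the formula above. Returning 0 for an empty capability is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    correct: int
    total: int


def accuracy(counts: Counts) -> float:
    # Assumption: a capability with no samples contributes 0.
    return counts.correct / counts.total if counts.total else 0.0


def policy_capability_score(
    multi_source: Counts,
    single_source: Counts,  # API-only or RAG-only queries
    multi_source_weight: float = 2.0,
) -> float:
    """Weighted accuracy where multi-source queries count twice."""
    numerator = multi_source_weight * multi_source.correct + single_source.correct
    denominator = multi_source_weight * multi_source.total + single_source.total
    return numerator / denominator if denominator else 0.0


def benchmark_score(
    api_chaining: Counts,
    tool_selection: Counts,
    multihop_reasoning: Counts,
    multi_source: Counts,
    single_source: Counts,
) -> float:
    parts = [
        accuracy(api_chaining),
        accuracy(tool_selection),
        accuracy(multihop_reasoning),
        policy_capability_score(multi_source, single_source),
    ]
    return sum(parts) / 4  # each capability is weighted equally


if __name__ == "__main__":
    score = benchmark_score(
        api_chaining=Counts(50, 100),        # 0.50
        tool_selection=Counts(75, 100),      # 0.75
        multihop_reasoning=Counts(25, 100),  # 0.25
        multi_source=Counts(8, 24),
        single_source=Counts(16, 16),        # (2*8 + 16) / (2*24 + 16) = 0.50
    )
    print(score)  # 0.5
```

In this made-up example the fourth capability scores 32/64 = 0.5, whereas an unweighted count would give 24/40 = 0.6. The double weight pulls the score toward the multi-source accuracy.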