VAKRA brand mark
VAKRA
eValuating API and Knowledge Retrieval Agents using multi-hop, multi-source dialogues
Get started
Leaderboard

Agent Evaluation Overview

Columns: Rank · Agent · Model · Overall · API Chaining · Tool Selection · MultiHop Reasoning · MultiHop MultiSource with Policy Adherence
Core benchmark capabilities

Four increasingly complex capabilities

We evaluate core agent capabilities, including entity resolution, schema alignment, retrieval, tool selection, and policy adherence, through four tasks.

8,000+ executable APIs
60+ diverse domains
3–7 reasoning steps per task
Trajectory-level trace verification
Figure: VAKRA benchmark capabilities diagram showing API Chaining, Tool Selection, MultiHop Reasoning over structured APIs, and MultiHop MultiSource with Policy Adherence. Capability overview spanning diverse API interaction styles and complex reasoning workflows.
Submission

Submit to the live leaderboard

Ready to add your model to the public VAKRA leaderboard? Follow the submission flow below.

01

Run the benchmark

Evaluate your agent on the released VAKRA capabilities using the benchmark runner and evaluator provided in the GitHub repository.

02

Validate the output

The outputs are expected in a particular folder structure, together with a metadata file created by the validator script. Run the validate_output.py script on the output folder and include the validator's output with your submission.
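The checks below are a minimal sketch of what a validator like validate_output.py might perform; the actual folder layout, file names (metadata.json, per-capability .jsonl files), and capability identifiers are assumptions, not the repository's real format.

```python
import json
from pathlib import Path

# Assumed layout: one predictions file per capability plus a metadata file.
EXPECTED_CAPABILITIES = [
    "api_chaining",
    "tool_selection",
    "multihop_reasoning",
    "multihop_multisource",
]

def validate_output_dir(output_dir: str) -> list:
    """Return a list of problems found; an empty list means the folder passes."""
    problems = []
    root = Path(output_dir)
    meta_path = root / "metadata.json"  # assumed file name
    if not meta_path.is_file():
        problems.append("missing metadata.json")
    else:
        try:
            json.loads(meta_path.read_text())
        except json.JSONDecodeError:
            problems.append("metadata.json is not valid JSON")
    for cap in EXPECTED_CAPABILITIES:
        # assumed one predictions file per capability
        if not (root / f"{cap}.jsonl").is_file():
            problems.append(f"missing predictions file {cap}.jsonl")
    return problems
```

Running such a check locally before opening a submission issue avoids a review round-trip over folder-structure problems.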

03

Open the submission template

Use the leaderboard submission issue template to share your outputs, an agent description, and a code or system link. Approved submissions are added to the live leaderboard after review.

Quick links

Submission resources

Everything needed to prepare a leaderboard entry is linked here.

Model name
Agent setup
Capability scores
Run details
Resources

Get started

Evaluation

Replay-based evaluation and scoring

The evaluator replays predicted tool trajectories against the live VAKRA MCP environment, injects fresh tool responses, and scores whether the final answer is correct and grounded in executed tool outputs.

01

Scoring pipeline

The current scorer evaluates the last turn of each dialogue. A waterfall judge is constructed from the following checks:

  • Policy check: used only for the multi-source with policy adherence capability. The output is programmatically validated against the per-sample policy.
  • Exact-match tool-response check: looks for expected ground-truth tool responses in the predicted response set.
  • Correctness check: if exact match fails, an LLM judge compares the predicted final answer to the ground-truth final answer.
  • Groundedness check: acceptable answers are then checked against the executed tool outputs.
  • Aggregation: turn scores are combined into a dialogue score using the default mean policy.
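The waterfall above can be sketched as follows. The judge functions passed in (`policy_ok`, `llm_judge_correct`, `grounded_in`) and the turn-record field names are placeholders for the real VAKRA components, which this sketch does not reproduce.

```python
def score_turn(turn, policy_check_enabled=False,
               policy_ok=None, llm_judge_correct=None, grounded_in=None):
    """Score one dialogue turn with the waterfall of checks."""
    # 1. Policy check (multi-source with policy adherence capability only).
    if policy_check_enabled and not policy_ok(turn):
        return 0.0
    # 2. Exact-match tool-response check: every expected ground-truth tool
    #    response must appear in the predicted response set.
    if all(r in turn["predicted_tool_responses"]
           for r in turn["expected_tool_responses"]):
        return 1.0
    # 3. Correctness check: if exact match fails, fall back to an LLM judge
    #    comparing the predicted final answer to the ground truth.
    if not llm_judge_correct(turn["predicted_answer"], turn["gold_answer"]):
        return 0.0
    # 4. Groundedness check: the accepted answer must be supported by the
    #    tool outputs that were actually executed.
    return 1.0 if grounded_in(turn["predicted_answer"],
                              turn["predicted_tool_responses"]) else 0.0

def score_dialogue(turn_scores):
    # 5. Aggregation: default mean policy over turn scores.
    return sum(turn_scores) / len(turn_scores)
```

Note the short-circuit ordering: a full exact match on tool responses accepts the turn without ever invoking the LLM judge, so the more expensive checks run only on the ambiguous cases.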
02

Aggregate Benchmark Score

  • Each of the four core capabilities is equally weighted to obtain the overall benchmark score.
  • Within the multi-source with policy adherence capability, multi-source queries carry twice the weight of queries answerable with only APIs or only retrievers.
\[ \begin{aligned} \mathrm{Benchmark\ Score} &= \frac{1}{4} \Bigg( \frac{\#\ \mathrm{correct}_{\mathrm{api\_chaining}}}{\#\ \mathrm{total}_{\mathrm{api\_chaining}}} \\[8pt] &\qquad+ \frac{\#\ \mathrm{correct}_{\mathrm{tool\_selection}}}{\#\ \mathrm{total}_{\mathrm{tool\_selection}}} \\[8pt] &\qquad+ \frac{\#\ \mathrm{correct}_{\mathrm{multihop\_reasoning}}}{\#\ \mathrm{total}_{\mathrm{multihop\_reasoning}}} \\[8pt] &\qquad+ \frac{ \begin{aligned} \left(\#\ \mathrm{correct\ multi\text{-}source\ queries} \times 2\right) \\ +\left(\#\ \mathrm{correct\ API\ only\ or\ RAG\ only\ queries}\right) \end{aligned} }{ \begin{aligned} \left(\#\ \mathrm{total\ multi\text{-}source\ queries} \times 2\right) \\ +\left(\#\ \mathrm{total\ API\ only\ or\ RAG\ only\ queries}\right) \end{aligned} } \Bigg) \end{aligned} \]
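The formula above reduces to a few lines of arithmetic. This sketch takes (correct, total) count pairs per capability, with the fourth capability split into its multi-source and single-source (API-only or RAG-only) query types; the argument names are illustrative.

```python
def benchmark_score(api_chaining, tool_selection, multihop_reasoning,
                    multi_source, single_source):
    """Aggregate benchmark score from (correct, total) count pairs.

    multi_source / single_source are the query-type splits of the
    multi-source with policy adherence capability; multi-source
    queries count twice in both numerator and denominator.
    """
    def frac(correct, total):
        return correct / total

    ms_correct, ms_total = multi_source
    ss_correct, ss_total = single_source  # API-only or RAG-only queries
    fourth = (2 * ms_correct + ss_correct) / (2 * ms_total + ss_total)

    # Equal 1/4 weight per capability.
    return (frac(*api_chaining) + frac(*tool_selection)
            + frac(*multihop_reasoning) + fourth) / 4
```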