Enabling Systematic Evaluation of Agentic AI in Real Web Environments
Delivering Human Preference Signals for Web-Grounded, Multimodal Systems

Deccan delivered a training-grade human evaluation dataset to support development of agentic, multimodal AI systems operating over real web environments. The work focused on producing structured preference signals that expose grounding, reasoning, and execution failures across multi-turn tasks, with evaluation aligned to training relevance rather than surface response quality.

Client: Fortune 10 BigTech Hyperscaler
Dataset Type: Human Preference Evaluation
Domain: Agentic, web-aware AI
Dataset Scale: 40,000+ evaluations
Capability: Multimodal Agentic Evaluation
Delivery Highlights
  • 40,000+ training-grade human preference evaluations delivered
  • 9-dimension behavioral rubric applied to every sample
  • 45-90 minutes average handling time (AHT) per sample

The Problem

Evaluating agentic, multimodal AI systems differs from evaluating single-turn or text-only models. Each evaluation must account for how an agent interacts with real websites, interprets page-level visual and textual content, and maintains intent across multi-turn trajectories.

Fluent final responses often mask failures in grounding, source usage, or multi-step reasoning. From outputs alone, it is difficult to determine whether an agent accessed the correct webpages or relied on visible content. This limits the usefulness of traditional evaluation for model iteration.

Evaluation complexity increases further when tasks span multiple webpages, screenshots, and extended interaction history, requiring sustained human judgment rather than checklist-based annotation.

Deccan’s Approach

Deccan combined preference-based evaluation with a structured behavioral rubric and a severity-based taxonomy, supported by an execution model designed for high-complexity work.

Each task was evaluated through pairwise human comparison using full conversational context and webpage state. Evaluators produced a preference ranking and a short justification tied to explicit behavioral criteria, preserving both directionality and cause.
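A pairwise evaluation of this kind can be represented as a structured record that keeps the context, the preference direction, and the rationale together. The sketch below is illustrative only; field names and values are assumptions, not the engagement's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record for one pairwise human comparison.
# All field names are illustrative assumptions, not the client schema.
@dataclass
class PreferenceRecord:
    task_id: str
    conversation_context: list[str]  # full multi-turn history shown to the evaluator
    webpage_state: list[str]         # pages / screenshots the agent interacted with
    response_a: str
    response_b: str
    preference: str                  # "A", "B", or "tie" -- preserves directionality
    justification: str               # short rationale tied to explicit behavioral criteria

record = PreferenceRecord(
    task_id="task-001",
    conversation_context=["user: find the cheapest flight", "agent: searching results page"],
    webpage_state=["https://example.com/flights"],
    response_a="Booked the flight for $120.",
    response_b="Found a $95 flight listed on the visible results page.",
    preference="B",
    justification="B is grounded in on-page content; A asserts an unverified action.",
)
```

Keeping the justification alongside the ranking is what preserves "both directionality and cause": downstream consumers can see not only which response was preferred but why.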

Behavioral Rubric and Signal Structure

Agent behavior was evaluated across nine dimensions selected to reflect how web-grounded, multi-turn agents succeed or fail in practice.

Each dimension was tagged using severity labels (no / minor / major issues), enabling aggregation by failure type without false numerical precision.
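Categorical severity labels of this kind can be aggregated by counting issues per dimension rather than averaging them into a single score. The sketch below illustrates the idea; the dimension names are invented for illustration and are not the engagement's actual nine dimensions.

```python
from collections import Counter

# Each sample carries a severity label per rubric dimension:
# "no", "minor", or "major". Dimension names here are hypothetical.
sample_tags = [
    {"grounding": "major", "reasoning": "no",    "execution": "minor"},
    {"grounding": "no",    "reasoning": "minor", "execution": "no"},
    {"grounding": "major", "reasoning": "no",    "execution": "no"},
]

# Aggregate by failure type: count major issues per dimension instead of
# collapsing labels into a numeric score, avoiding false numerical precision.
major_issues = Counter(
    dim for tags in sample_tags for dim, sev in tags.items() if sev == "major"
)
print(major_issues)
```

Counting by failure type keeps the signal actionable: a model team can see, for example, that grounding failures dominate without pretending the labels form an interval scale.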

Execution and Evaluation

Tasks frequently involved multiple webpages, screenshots, and extended interaction histories. Deccan trained evaluators specifically on web-dependent agent behavior, including access verification, visual interpretation, and trajectory-level failure identification.

Quality control focused on judgment calibration, with evaluator decisions reviewed against rubric definitions and severity criteria. This allowed continuous delivery of preference data without degradation in signal quality.

Key Takeaways

  • Agentic AI evaluation requires structured, training-aligned preference signals
  • Behavioral decomposition enables localization of failure modes
  • High-complexity evaluation depends on calibrated human judgment
  • Preference data supports incremental model iteration

Conclusion

This engagement demonstrates that training-grade evaluation for agentic, multimodal AI systems can be delivered at scale when grounded in explicit behavioral rubrics, preference-based judgment, and disciplined execution. Deccan produced a consistent evaluation dataset suitable for frontier model development without relaxing standards as volume increased.
