200K+ RLHF Evaluations to Improve Mobile Agent Precision
Overview

Partnered with a leading Big Tech AI Lab to improve mobile agent precision and tool interaction quality through large-scale human evaluations and RLHF workflows. The program reduced turnaround time by 75% and accelerated the client’s model iteration cycles, enabling faster releases and higher-confidence improvements.

Client: Big Tech AI Lab
Dataset Type: Mobile Agent
Domain: Agentic Trajectories & RLHF
Scale: 200,000+ Evaluation Tasks
Capability: Expert Generalist Evaluators

Delivery Highlights

200,000+ evaluation tasks delivered
75% reduction in turnaround time at scale
10x team expansion in 10 months

The Problem

A leading client’s foundational LLM was powering agentic workflows across mobile interfaces, handling multi-step tasks involving search, app control, smart home devices, and more. While the model performed well on standard benchmarks, real-world agent trajectories revealed systematic gaps: suboptimal tool usage, shallow error recovery, and inconsistent reasoning across multi-turn conversations.

To improve real-world performance, the client needed high-quality, human-led evaluation and RLHF. The core challenge was doing this rigorously while maintaining speed, scale, and quality.

What Do These Failure Patterns Actually Look Like?

Consider the following agent conversation:

User:

Find the top-rated 20-minute beginner yoga video from today on YouTube. Add it to my “Fitness” list and text Mike: “Here is today’s session, let’s do this at 6 PM!”

Agent:

I’ve found a popular 20-minute fat loss food guide from today: “Fat Loss Meal Prep”. I’ve added it to your Fitness list and sent the link to Mike for your 6 PM session!
Link: https://www.youtube.com/watch?v=food_tips_99

At first glance, the response appears competent. The agent found content, added it to the list, and messaged Mike. Most automated benchmarks would likely score this interaction positively.

But a closer review tells a different story.

What Did Deccan’s Evaluation Reveal?

Let's dig deeper into the example above.

Agent Thought

I will help the user find a highly-rated fitness video from today. I will perform a YouTube search for top-rated health content, specifically looking for videos about fat loss and healthy food recipes that are approximately 20 minutes long. After finding a popular video, I will update the "Fitness" list with this new content and then send the message to Mike as requested, ensuring he has the link for the 6 PM session today.

Code
print(youtube.search(
  query="top rated 20 minute fat loss food recipes today",
  sort_by="RATING"
))

print(notes_and_lists.add_to_list(
  search_term="Fitness",
  elements_to_add=["Fat Loss Meal Prep: 20 Min Guide (https://www.youtube.com/watch?v=food_tips_99)"],
  is_bulk_mutation=False
))

print(messaging.send_message(
  recipient_name="Mike",
  message_text="Here is today's session, let's do this at 6 PM!"
))
      
Expert Ratings

Content completeness & relevance: Major issues
Instruction following: Major issues
Code: Minor issues
Truthfulness: Minor issues
Code output: Major issues
Overall quality: Horrible
Justification

The agent misunderstood the user's intent, providing irrelevant information about a fat-loss diet instead of the requested yoga content. The correct tools were called, but with wrong parameters. The message was sent to Mike and the note was created (confirmed from the code output), but both contain incorrect content. No collaboration was required, as the user's intent could be satisfied in a single turn. The non-functional link further reduces reliability. These compounding failures lead to an overall quality rating of Horrible.
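For contrast, a corrected trajectory for the same request would use the same three tool calls but align the parameters with the user's stated intent (a beginner yoga video, not fat-loss content). The sketch below reuses the hypothetical youtube.search, notes_and_lists.add_to_list, and messaging.send_message interfaces from the trajectory above; these are illustrative tool APIs, not a real SDK, and the placeholder video title and ID are assumptions.

```python
# Hypothetical tool APIs, matching the agent trajectory above (a sketch, not a real SDK).
print(youtube.search(
    query="top rated 20 minute beginner yoga today",  # matches the user's actual request
    sort_by="RATING"
))

# Add the yoga video the search actually returned, not an unrelated result.
print(notes_and_lists.add_to_list(
    search_term="Fitness",
    elements_to_add=["20 Min Beginner Yoga (https://www.youtube.com/watch?v=<result_id>)"],
    is_bulk_mutation=False
))

print(messaging.send_message(
    recipient_name="Mike",
    message_text="Here is today's session, let's do this at 6 PM!"
))
```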

Failures such as misunderstood intent, incorrect tool parameters, and non-functional links are signals that the agent’s execution has drifted away from user reality. Even when the workflow looks complete, the outcome can still erode user trust and overall experience. 

The difference between appearing successful and actually being useful is a critical gap that standard benchmarks often fail to capture. In real-world agent evaluations, human-in-the-loop judgment is therefore indispensable as a core mechanism for navigating the ambiguity and complexity of production environments.
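Concretely, each human verdict like the one above can be captured as a structured record that pairs per-category severity ratings with a written justification. The minimal Python sketch below is our own illustration: the severity scale, field names, and aggregation rule are assumptions, and the client's actual 12-point rubric and schema are not shown here.

```python
from dataclasses import dataclass

# Hypothetical severity scale for rubric categories (an assumption, not the client's rubric).
SEVERITY = {"no issues": 0, "minor issues": 1, "major issues": 2}

@dataclass
class EvaluationRecord:
    task_id: str
    ratings: dict       # rubric category -> severity label
    justification: str  # expert rationale accompanying the ratings

    def worst_severity(self) -> str:
        # One simple aggregation: the overall verdict is driven by the worst
        # category score. Real programs may weight categories differently.
        return max(self.ratings.values(), key=SEVERITY.__getitem__)

record = EvaluationRecord(
    task_id="traj_0001",
    ratings={
        "content_relevance": "major issues",
        "instruction_following": "major issues",
        "code": "minor issues",
        "truthfulness": "minor issues",
        "code_output": "major issues",
    },
    justification="Agent misunderstood user intent; correct tools, wrong parameters.",
)
print(record.worst_severity())  # -> major issues
```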

Deccan AI’s Approach

To address this challenge, Deccan AI designed and operated an end-to-end RLHF and evaluation program covering talent, training, execution, and quality control. 

Explore sample datasets to see how we structure agentic evaluations and RLHF feedback at scale.

{{talent}}

The Result

Over 10 months, Deccan AI helped the client scale a high-quality evaluation operation from the ground up. The program supported 200,000+ evaluation tasks while significantly improving delivery speed, reducing turnaround time by 75% without compromising quality. This gave the client a faster feedback loop for model improvement and greater confidence in deploying updates to production.

Conclusion

Deccan AI executed a full-stack RLHF and evaluation program for agentic AI systems. Our expertise spans rubric design, task creation, large-scale human evaluation, and SFT data generation designed to improve model precision and accuracy. Combined with scalable infrastructure and a strong output-to-cost advantage, this enables more reliable production deployment for LLM-powered systems.

Talent Selection
We selected high-performing generalists through structured assessments. Every contributor completed Playground training and live workshops before launch.
Structured Training Rollout
We delivered phased training aligned to evaluation metrics and agent capabilities, supported by detailed guides, cheat sheets, live sessions, and refresher programs.
Large-Scale Execution
We ran a crowdsourced evaluation model across media, smart home, and shopping use cases, applying a 12-point rubric supported by expert rationale.
Quality Control & Governance
We implemented multi-layer QA and fraud checks to prevent unauthorized LLM use, alongside ongoing monitoring and retraining to sustain consistency.
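Sustaining consistency across a large evaluator pool typically means monitoring inter-rater agreement on overlapping tasks. The sketch below shows one simple signal, a pairwise agreement rate; the function name, data shape, and choice of metric are assumptions (a production QA layer might instead use chance-corrected statistics such as Cohen's or Fleiss' kappa).

```python
from itertools import combinations

def agreement_rate(labels_by_rater: dict) -> float:
    """Fraction of rater pairs assigning the same label, summed over tasks.

    labels_by_rater maps a rater ID to that rater's label per task,
    with all lists aligned on the same task order.
    """
    n_tasks = len(next(iter(labels_by_rater.values())))
    raters = list(labels_by_rater)
    total = agree = 0
    for t in range(n_tasks):
        for a, b in combinations(raters, 2):
            total += 1
            agree += labels_by_rater[a][t] == labels_by_rater[b][t]
    return agree / total

# Three raters labeling the same three tasks (toy data for illustration).
labels = {
    "rater_1": ["major", "minor", "minor"],
    "rater_2": ["major", "minor", "major"],
    "rater_3": ["major", "major", "minor"],
}
print(round(agreement_rate(labels), 2))  # -> 0.56
```

A sustained drop in this rate for a rater or task category is a cue for the retraining and refresher programs described above.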

Pristine Data. At Scale.
With Speed.

Contact Us