200K+ RLHF Evaluations to Improve Mobile Agent Precision
Improving mobile agent precision through large-scale, high-quality RLHF and human evaluation, reducing turnaround time while accelerating model iteration cycles.

200,000+ evaluation tasks
75% reduction in turnaround time
10x
Partnered with a leading Big Tech AI Lab to improve mobile agent precision and tool interaction quality through large-scale human evaluations and RLHF workflows. The program reduced turnaround time by 75% and accelerated the client’s model iteration cycles, enabling faster releases and higher-confidence improvements.
The Problem
A leading client’s foundational LLM was powering agentic workflows across mobile interfaces, handling multi-step tasks involving search, app control, smart home devices, and more. While the model performed well on standard benchmarks, real-world agent trajectories revealed systematic gaps: suboptimal tool usage, shallow error recovery, and inconsistent reasoning across multi-turn conversations.
To improve real-world performance, the client needed high-quality, human-led evaluation and RLHF. The core challenge was doing this rigorously while maintaining speed, scale, and quality.
What Do These Failure Patterns Actually Look Like?
Consider the following agent conversation:
At first glance, the response appears competent. The agent found content, added it to the list, and messaged Mike.
But a closer review tells a different story.
What Did Deccan’s Evaluation Reveal?
Let's dig deeper into the example above.
Failures such as misunderstood intent, incorrect tool parameters, and non-functional links are signals that the agent’s execution has drifted away from user reality. Even when the workflow looks complete, the outcome can still erode user trust and overall experience.
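These failure modes can be made concrete as rubric checks. The sketch below is purely illustrative and not Deccan AI's actual tooling: all class names, fields, and the rubric itself are hypothetical, showing how a single agent trajectory might be scored against the three failure patterns named above (misunderstood intent, incorrect tool parameters, non-functional links).

```python
# Illustrative sketch only: a minimal rubric check for one agent trajectory.
# All names and fields here are hypothetical, not a real evaluation API.
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    params: dict

@dataclass
class Trajectory:
    user_intent: str                # what the user actually asked for
    agent_summary: str              # what the agent claims it did
    tool_calls: list[ToolCall] = field(default_factory=list)
    links_ok: bool = True           # did links shared with the user resolve?

def evaluate(traj: Trajectory, expected_calls: dict[str, dict]) -> list[str]:
    """Return the list of rubric failures; an empty list means the trajectory passes."""
    failures = []
    # 1. Misunderstood intent: the agent's summary should reflect the request.
    if traj.user_intent.lower() not in traj.agent_summary.lower():
        failures.append("misunderstood intent")
    # 2. Incorrect tool parameters: each expected call must appear with matching params.
    seen = {c.name: c.params for c in traj.tool_calls}
    for name, params in expected_calls.items():
        if seen.get(name) != params:
            failures.append(f"incorrect tool parameters: {name}")
    # 3. Non-functional links: artifacts handed to the user must actually work.
    if not traj.links_ok:
        failures.append("non-functional links")
    return failures
```

Even a toy check like this captures the key point: a trajectory can complete every step (all expected tool calls present, summary matches intent) and still fail review because the artifact it produced, such as a dead link, does not hold up for the user.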
The gap between appearing successful and actually being useful is one that standard benchmarks often fail to capture. In real-world agent evaluations, human-in-the-loop judgment is therefore indispensable as a core mechanism for navigating the ambiguity and complexity of production environments.
Deccan AI’s Approach
To address this challenge, Deccan AI designed and operated an end-to-end RLHF and evaluation program covering talent, training, execution, and quality control.
Explore sample datasets to see how we structure agentic evaluations and RLHF feedback at scale.
The Result
Over 10 months, Deccan AI helped the client scale a high-quality evaluation operation from the ground up. The program supported 200,000+ evaluation tasks while significantly improving delivery speed, reducing turnaround time by 75% without compromising quality. This gave the client a faster feedback loop for model improvement and greater confidence in deploying updates to production.
Conclusion
Deccan AI executed a full-stack RLHF and evaluation program for agentic AI systems. Our expertise spans rubric design, task creation, large-scale human evaluation, and SFT data generation designed to improve model precision and accuracy. Combined with scalable infrastructure and a strong output-to-cost advantage, this enables more reliable production deployment for LLM-powered systems.
