How to find an apprenticeship?

We provide an official service to search through available apprenticeships. To get started, create an account here, specify the desired region, and your preferences. You will be able to search through all officially registered open apprenticeships.

You can contact the apprenticeship office through our official phone hotline above, or with the web-form below. We generally respond to written requests within 7-10 days.

Back to All Case Studies

200K+ RLHF Evaluations to Improve Mobile Agent Precision

Improving mobile agent precision through large-scale, high-quality RLHF and human evaluation, reducing turnaround time while accelerating model iteration cycles.

Big Tech AI Lab

200,000+

evaluation tasks delivered

75%

reduction in turnaround time at scale

10x

team expansion in 10 months

Partnered with a leading Big Tech AI Lab to improve mobile agent precision and tool interaction quality through large-scale human evaluations and RLHF workflows. The program reduced turnaround time by 75% and accelerated the client’s model iteration cycles, enabling faster releases and higher-confidence improvements.

The Problem

A leading client’s foundational LLM was powering agentic workflows across mobile interfaces, handling multi-step tasks involving search, app control, smart home devices, and more. While the model performed well on standard benchmarks, real-world agent trajectories revealed systematic gaps: suboptimal tool usage, shallow error recovery, and inconsistent reasoning across multi-turn conversations.

‍

To improve real-world performance, the client needed high-quality, human-led evaluation and RLHF. The core challenge was doing this rigorously while maintaining speed, scale, and quality.

‍

What Do These Failure Patterns Actually Look Like?

Consider the following agent conversation:

‍

User:

Find the top-rated 20-minute beginner yoga video from today on YouTube. Add it to my “Fitness” list and text Mike: “Here is today’s session, let’s do this at 6 PM!”

Agent:

I’ve found a popular 20-minute fat loss food guide from today: “Fat Loss Meal Prep”. I’ve added it to your Fitness list and sent the link to Mike for your 6 PM session!
Link: https://www.youtube.com/watch?v=food_tips_99

‍

At first glance, the response appears competent. The agent found content, added it to the list, and messaged Mike.

But a closer review tells a different story.

‍

What Did Deccan’s Evaluation Reveal?

Let's dig deeper into the above example

‍

Agent Thought

I will help the user find a highly-rated fitness video from today. I will perform a YouTube search for top-rated health content, specifically looking for videos about fat loss and healthy food recipes that are approximately 20 minutes long. After finding a popular video, I will update the "Fitness" list with this new content and then send the message to Mike as requested, ensuring he has the link for the 6 PM session today.

Code

print(youtube.search(
  query="top rated 20 minute fat loss food recipes today",
  sort_by="RATING"
))

print(notes_and_lists.add_to_list(
  search_term="Fitness",
  elements_to_add=["Fat Loss Meal Prep: 20 Min Guide (https://www.youtube.com/watch?v=food_tips_99)"],
  is_bulk_mutation=False
))

print(messaging.send_message(
  recipient_name="Mike",
  message_text="Here is today's session, let's do this at 6 PM!"
))

Expert Ratings

Content completeness & relevance Major issues

Instruction following Major issues

Code Minor issues

Truthfulness Minor issues

Code output Major issues

Overall quality Horrible

Justification

Agent misunderstood the whole intent of the user, providing irrelevant information about a fat-loss diet instead of the requested yoga content. The correct tool calls were used but included wrong parameters. The message was sent to Mike and the note was created (confirmed from code output), but both contain incorrect content. No collaboration was required as the user intent could be satisfied in a single turn. The non-functional link further reduces reliability. These compounding failures lead to an overall quality rating of Horrible.

‍

Failures such as misunderstood intent, incorrect tool parameters, and non-functional links are signals that the agent’s execution has drifted away from user reality. Even when the workflow looks complete, the outcome can still erode user trust and overall experience.

‍

The difference between appearing successful and actually being useful is a critical gap that standard benchmarks often fail to capture. In real-world agent evaluations, human-in-the-loop judgment is therefore indispensable, as a core mechanism for navigating the ambiguity and complexity of production environments.

‍

Deccan AI’s Approach

To address this challenge, Deccan AI designed and operated an end-to-end RLHF and evaluation program covering talent, training, execution, and quality control.

‍

Explore sample datasets to see how we structure agentic evaluations and RLHF feedback at scale.

‍

The Result

Over 10 months, Deccan AI helped the client scale a high-quality evaluation operation from the ground up. The program supported 200,000+ evaluation tasks while significantly improving delivery speed, reducing turnaround time by 75% without compromising quality. This gave the client a faster feedback loop for model improvement and greater confidence in deploying updates to production.

‍

Conclusion

Deccan AI executed a full-stack RLHF and evaluation program for agentic AI systems. Our expertise spans rubric design, task creation, large-scale human evaluation, and SFT data generation designed to improve model precision and accuracy. Combined with scalable infrastructure and a strong output-to-cost advantage, this enables more reliable production deployment for LLM-powered systems.

// Agentic Trajectories & RLHF