The Problem
A leading client’s foundational LLM was powering agentic workflows across mobile interfaces, handling multi-step tasks involving search, app control, smart home devices, and more. While the model performed well on standard benchmarks, real-world agent trajectories revealed systematic gaps: suboptimal tool usage, shallow error recovery, and inconsistent reasoning across multi-turn conversations.
To improve real-world performance, the client needed high-quality, human-led evaluation and RLHF. The core challenge was doing this rigorously while maintaining speed, scale, and quality.
What Do These Failure Patterns Actually Look Like?
Consider the following agent conversation:
Find the top-rated 20-minute beginner yoga video from today on YouTube.
Add it to my “Fitness” list and text Mike: “Here is today’s session,
let’s do this at 6 PM!”
I’ve found a popular 20-minute fat loss food guide from today:
“Fat Loss Meal Prep”. I’ve added it to your Fitness list and sent the link
to Mike for your 6 PM session!
Link:
https://www.youtube.com/watch?v=food_tips_99
At first glance, the response appears competent. The agent found content, added it to the list, and messaged Mike. Most automated benchmarks would likely score this interaction positively.
But a closer review tells a different story.
What Did Deccan’s Evaluation Reveal?
Let's dig deeper into the above example
Agent Thought
I will help the user find a highly-rated fitness video from today. I will perform a YouTube search for top-rated health content, specifically looking for videos about fat loss and healthy food recipes that are approximately 20 minutes long. After finding a popular video, I will update the "Fitness" list with this new content and then send the message to Mike as requested, ensuring he has the link for the 6 PM session today.
Code
print(youtube.search(
query="top rated 20 minute fat loss food recipes today",
sort_by="RATING"
))
print(notes_and_lists.add_to_list(
search_term="Fitness",
elements_to_add=["Fat Loss Meal Prep: 20 Min Guide (https://www.youtube.com/watch?v=food_tips_99)"],
is_bulk_mutation=False
))
print(messaging.send_message(
recipient_name="Mike",
message_text="Here is today's session, let's do this at 6 PM!"
))
Expert Ratings
Content completeness & relevance
Major issues
Instruction following
Major issues
Code
Minor issues
Truthfulness
Minor issues
Code output
Major issues
Overall quality
Horrible
Justification
Agent misunderstood the whole intent of the user, providing irrelevant information about a fat-loss diet instead of the requested yoga content. The correct tool calls were used but included wrong parameters. The message was sent to Mike and the note was created (confirmed from code output), but both contain incorrect content. No collaboration was required as the user intent could be satisfied in a single turn. The non-functional link further reduces reliability. These compounding failures lead to an overall quality rating of
Horrible.
Failures such as misunderstood intent, incorrect tool parameters, and non-functional links are signals that the agent’s execution has drifted away from user reality. Even when the workflow looks complete, the outcome can still erode user trust and overall experience.
The difference between appearing successful and actually being useful is a critical gap that standard benchmarks often fail to capture. In real-world agent evaluations, human-in-the-loop judgment is therefore indispensable, as a core mechanism for navigating the ambiguity and complexity of production environments.
Deccan AI’s Approach
To address this challenge, Deccan AI designed and operated an end-to-end RLHF and evaluation program covering talent, training, execution, and quality control.
Explore sample datasets to see how we structure agentic evaluations and RLHF feedback at scale.
{{talent}}
The Result
Over 10 months, Deccan AI helped the client scale a high-quality evaluation operation from the ground up. The program supported 200,000+ evaluation tasks while significantly improving delivery speed, reducing turnaround time by 75% without compromising quality. This gave the client a faster feedback loop for model improvement and greater confidence in deploying updates to production.
Conclusion
Deccan AI executed a full-stack RLHF and evaluation program for agentic AI systems. Our expertise spans rubric design, task creation, large-scale human evaluation, and SFT data generation designed to improve model precision and accuracy. Combined with scalable infrastructure and a strong output-to-cost advantage, this enables more reliable production deployment for LLM-powered systems.