From Compliance to Foresight: Benchmarking Deep Research Agents
April 17, 2026

Overview
We ran 100+ PhD-level research prompts through ChatGPT Deep Research and Gemini Deep Research, scored by domain SMEs across a 7-rubric framework that includes two novel rubrics for forward-looking scientific reasoning. The study examines the “Proactive” research capabilities of today's state-of-the-art models.
Our hybrid tasking approach, pairing directive prompts (flexible) with prescriptive prompts (constrained), produced a counterintuitive insight: more specific instructions made outputs worse. Gemini's format compliance dropped 29 points under prescriptive constraints; ChatGPT dropped only 4, but paid for that discipline by losing 11 points on Prediction Accuracy and 13 on Future Scope. We also confirmed, statistically, that forecasting research outcomes is a distinct capability. Our study shows that neither model is simply better: they sit at opposite ends of a behavioral spectrum.
The Question No Existing Benchmark Asks
Every major AI benchmark tests retrieval - find the right paper, cite the correct number, reproduce the established finding. Good tests exist for this: ResearchRubrics, ExpertLongBench, DeepResearch Bench, ResearcherBench. We referenced all four.
But retrieval is what a “Reactive” research agent does.
A “Proactive” research agent does something harder:
- Explain why an outcome is likely
- Propose hypotheses that don't yet exist in the literature
- Recognize when no published answer exists, and construct a defensible one rather than hedging toward the closest approximation
These capabilities require rubrics that don't currently exist. So we built them.
What We Built and Why
Prompts: 100+ PhD-level tasks across Physics, Biology, Chemistry, and Mathematics. Each cleared a 5-criteria calibration gate: topic complexity, keyword density, constraint rigidity, inference necessity, and frontier inquiry. Prompts that didn't demand genuine reasoning didn't make the cut.
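The calibration gate can be sketched as a simple all-or-nothing filter. This is a hypothetical illustration, not the study's actual tooling; the criterion names come from the post, but the boolean-rating interface is an assumption.

```python
# Hypothetical sketch of the 5-criteria calibration gate described above.
# Criterion names follow the post; the boolean rating interface is illustrative.

CRITERIA = [
    "topic_complexity",
    "keyword_density",
    "constraint_rigidity",
    "inference_necessity",
    "frontier_inquiry",
]

def passes_calibration_gate(ratings: dict[str, bool]) -> bool:
    """A prompt enters the benchmark only if every criterion is satisfied."""
    return all(ratings.get(c, False) for c in CRITERIA)

candidate = {c: True for c in CRITERIA}
candidate["inference_necessity"] = False  # answerable by pure retrieval
print(passes_calibration_gate(candidate))  # False: cut from the benchmark
```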
Scoring: a 7-dimension rubric, with each dimension rated 1–3 by domain SMEs (Master's and PhD candidates).
QC: multi-layered review, combining domain-SME peer review with an LLM-as-Judge reference check to eliminate false positives.
The critical design decision: Prompts were split into two types:
- Directive Prompts (DP): open-ended, high structural freedom
- Prescriptive Prompts (PP): rigid formatting requirements (JSON schemas, Markdown tables)
This split became the instrument that revealed the study's central finding.
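To make the DP/PP distinction concrete, here is a hypothetical pair of prompts on the same topic. The wording, topic, and JSON schema are invented for illustration; only the directive/prescriptive split itself comes from the study design.

```python
# Invented examples of the two prompt types used in the study.
# Same research question; only the structural constraints differ.

directive_prompt = (
    "Assess the near-term prospects of room-temperature superconductivity "
    "in nickelate systems. Structure your answer however you see fit."
)

prescriptive_prompt = (
    "Assess the near-term prospects of room-temperature superconductivity "
    "in nickelate systems. Respond ONLY with JSON matching this schema: "
    '{"claims": [{"statement": str, "evidence": [str], "confidence": float}],'
    ' "forecast": {"horizon_years": int, "probability": float}}'
)
```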
Results: Two Poles, Not a Ranking
ChatGPT dominates on retrieval and attribution precision. Gemini dominates on reasoning, foresight, and outcome forecasting - the rubrics that distinguish a research associate from a research assistant.
Each response was scored 1 (inadequate), 2 (competent), or 3 (excellent). All percentages below reflect the share of excellent annotations (score = 3): the responses that didn't just pass, but excelled. This is a stricter bar than pass/fail and more meaningful for evaluating frontier model capability.
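The metric above is easy to state in code. A minimal sketch, with made-up scores, showing how the "% excellent" bar differs from an ordinary pass rate:

```python
# Minimal sketch of the "% excellent" metric: the share of annotations
# that scored exactly 3, not merely >= 2. Scores below are made up.

def excellence_rate(scores: list[int]) -> float:
    """Fraction of annotations rated 3 (excellent) on the 1-3 scale."""
    return sum(1 for s in scores if s == 3) / len(scores)

scores = [3, 2, 3, 1, 3, 2, 2, 3]        # one rubric, eight responses
print(f"{excellence_rate(scores):.0%}")   # 50%
pass_rate = sum(1 for s in scores if s >= 2) / len(scores)
print(f"{pass_rate:.0%}")                 # 88%, the laxer pass/fail bar
```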
[Figure: results tables]
The Compliance Paradox
Here's where it gets interesting — and counterintuitive.
The two tables reveal entirely different collapse profiles.
Gemini's format compliance drops 29 points under prescriptive prompts — but its Prediction Accuracy doesn't move at all, and Core Analysis barely shifts. It is actively sacrificing structure to protect its scientific reasoning. Source Integrity even improves slightly under PP (+7 pts), suggesting that structural constraints push it toward more careful attribution.
ChatGPT does the opposite: its format compliance holds (−4 pts), but Prediction Accuracy drops 11 points and Future Scope drops 13 points. It maintains discipline on structure while losing depth on the forecasting rubrics that distinguish a research associate from a research assistant. Under pressure, ChatGPT retreats to safe territory.
The drop from Directive to Prescriptive prompts is not noise. Fisher's exact test confirms significance for both models: Gemini (p-value = 0.001) and ChatGPT (p-value = 0.024).
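The significance test works on a 2×2 contingency table of excellent vs. not-excellent counts under each prompt type. A sketch with hypothetical counts (the post reports only the resulting p-values, not the underlying tables):

```python
# Illustrative Fisher's exact test on a 2x2 table of excellent vs.
# not-excellent counts under Directive vs. Prescriptive prompts.
# Counts below are hypothetical, chosen only to show the mechanics.
from scipy.stats import fisher_exact

#               excellent  not-excellent
table = [[40, 10],   # Directive prompts
         [22, 28]]   # Prescriptive prompts

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```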
The Forward-Looking Reasoning Cluster
The two novel rubrics are not just measuring “general model quality.” Spearman correlation analysis shows that Prediction Accuracy & Outcome Forecasting has near-zero correlation with every retrieval rubric (ρ ≈ 0.003 with Information Retrieval, ρ ≈ −0.007 with Core Analysis).
A model that retrieves facts perfectly is not systematically better at forecasting outcomes. The two novel rubrics correlate strongly with each other (Gemini ρ = 0.375, ChatGPT ρ = 0.500, both p-value < 0.001), forming a distinct Forward-Looking Reasoning cluster, independent of the retrieval cluster and confirmed across both models.
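The cluster analysis reduces to pairwise Spearman correlations over per-prompt rubric scores. The sketch below uses synthetic data, constructed so the two forecasting rubrics share a latent factor while retrieval is independent, mirroring the reported pattern; none of these numbers are the study's.

```python
# Sketch of the correlation analysis: Spearman's rho between per-prompt
# rubric scores. Data are synthetic: the two forecasting rubrics share a
# latent factor, while retrieval is generated independently.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 100
latent_foresight = rng.normal(size=n)
prediction_accuracy = latent_foresight + rng.normal(scale=0.8, size=n)
future_scope = latent_foresight + rng.normal(scale=0.8, size=n)
information_retrieval = rng.normal(size=n)  # independent of foresight

rho_cluster, p_cluster = spearmanr(prediction_accuracy, future_scope)
rho_cross, p_cross = spearmanr(prediction_accuracy, information_retrieval)
print(f"forecasting pair: rho = {rho_cluster:.2f} (p = {p_cluster:.3g})")
print(f"vs retrieval:     rho = {rho_cross:.2f} (p = {p_cross:.3g})")
```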
[Figure: Spearman correlation matrix]
What This Means for How You Prompt Research Agents
The DP/PP framework isn't just a study design tool — it's a diagnostic for any team deploying research agents in production.
Use ChatGPT Deep Research when:
- Format compliance is non-negotiable (downstream parsers, structured extraction, attribution logging)
- The task is retrieval-first: literature summarization, source verification, data extraction
Use Gemini Deep Research when:
- You need forward-looking reasoning: hypothesis generation, roadmap drafting, forecasting technical trade-offs at the frontier
- You have a human-in-the-loop for review — format compliance at 44% is not production-safe without it
For prompt design: Overly prescriptive prompt templates can actively undermine the most valuable outputs from research agents. If you're asking Gemini to forecast and you're also demanding rigid JSON output, the Compliance Paradox suggests you may be trading away the thing you actually need.
What This Is Not
This is not a model ranking. It evaluates a specific set of capabilities (PhD-level STEM research reasoning) under controlled conditions.
- Results do not generalize beyond PhD-level STEM research tasks.
- Two models were evaluated; no additional deep research agent (DRA) baselines were included.
- The behavioral patterns are statistically confirmed, but this is a benchmark of the current moment.
Why This Benchmark Exists
The question driving AI research agent deployment has shifted. It is no longer “can the model retrieve facts?” It is “can the model reason forward?”
The distinction between a proactive and a reactive research agent is not vocabulary; it is a different evaluation target. The Prediction Accuracy & Outcome Forecasting and Future Scope rubrics introduced here are a step toward an evaluation framework honest enough to measure it.
The Compliance Paradox is a reminder that behavioral stress-testing, not just aggregate scoring, is where the signal lives.
Extended results and methodology will be published in an upcoming academic venue.
Reach out to us at hey@deccan.ai for more information, work samples, etc.