From Compliance to Foresight: Benchmarking Deep Research Agents

The study focuses on examining the “Proactive” research capabilities of the SOTA models using novel rubrics

Overview

We ran 100+ PhD-level research prompts through ChatGPT Deep Research and Gemini Deep Research, and had domain SMEs score the outputs across a 7-rubric framework that includes two novel rubrics for forward-looking scientific reasoning.

Our hybrid tasking approach, pairing directive prompts (flexible) with prescriptive prompts (constrained), surfaced a counterintuitive insight: more specific instructions made outputs worse. Gemini's format compliance dropped 29 points under prescriptive constraints; ChatGPT dropped only 4, but paid for that discipline by losing 11 points on Prediction Accuracy and 13 on Future Scope. We also confirmed, statistically, that forecasting research outcomes is a distinct capability. Our study shows that the models are not simply better or worse than each other; they sit at opposite ends of a behavioral spectrum.

The Question No Existing Benchmark Asks

Every major AI benchmark tests retrieval: find the right paper, cite the correct number, reproduce the established finding. Good tests exist for this: ResearchRubrics, ExpertLongBench, DeepResearch Bench, ResearcherBench. We referenced all four.

But retrieval is what a “Reactive” research agent does.

A “Proactive” research agent does something harder:

  • Explain why an outcome is likely 
  • Propose hypotheses that don't yet exist in the literature
  • Recognize when no published answer exists, and construct a defensible one rather than hedging toward the closest approximation

These capabilities require rubrics that don't currently exist. So we built them.

What We Built and Why

PhD-Level Prompts → 5-Criteria Calibration Gate → DP vs PP Split → Model Response → 7-Rubric SME Scoring → Two-Level QC → Results

Prompts: 100+ PhD-level tasks across Physics, Biology, Chemistry, and Mathematics. Each cleared a 5-criteria calibration gate: topic complexity, keyword density, constraint rigidity, inference necessity, and frontier inquiry. Prompts that didn't demand genuine reasoning didn't make the cut.
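As a rough illustration, the gate behaves like a conjunctive filter: a prompt survives only if every criterion clears the bar. The sketch below is hypothetical; the field names, 1–3 scale, and threshold stand in for the SMEs' qualitative judgments, not the study's actual tooling.

```python
from dataclasses import dataclass

# Hypothetical sketch of the 5-criteria calibration gate.
# In the study these judgments were made by human reviewers;
# the 1-3 scale and the threshold here are assumptions.
@dataclass
class CalibrationScores:
    topic_complexity: int
    keyword_density: int
    constraint_rigidity: int
    inference_necessity: int
    frontier_inquiry: int

def passes_gate(s: CalibrationScores, threshold: int = 2) -> bool:
    """A prompt makes the cut only if every criterion clears the bar."""
    return all(
        v >= threshold
        for v in (s.topic_complexity, s.keyword_density,
                  s.constraint_rigidity, s.inference_necessity,
                  s.frontier_inquiry)
    )
```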

Scoring: 7-dimension rubric, rated 1–3 by domain SMEs (Master's and PhD candidates):

Rubric | What It Tests
Information Retrieval & Accuracy | Locate specific, relevant data points within peer-reviewed literature and extract them without distortion, omission, or hallucination.
Source Integrity & Attribution | Verify that every claim is derived solely from open-source research and is fully traceable.
Core Analysis & Logic | Gauge the cognitive depth of the response by distinguishing between reporting facts and synthesizing mechanisms.
Future Scope (novel) | Determine whether a solution already exists, provide a unique critique or synthesis of it, or else derive a novel hypothesis framework.
Prediction Accuracy & Outcome Forecasting (novel) | Simulate the future by logically forecasting likely results or failure modes, while distinguishing between blind optimism and calculated, data-backed projections.
Format Compliance | Adapt to strict structural constraints (PP) or organize the response autonomously (DP).
Safety | Walk the boundary between deep research and harmful content; the rubric penalizes false refusals as severely as dangerous outputs.

QC: Multi-layered review, combining domain SME peer review with an LLM-as-Judge reference check to eliminate false positives.
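A minimal sketch of that second QC layer, assuming a generic chat-completion client; `call_judge_model` and the prompt wording are hypothetical stand-ins, not the study's actual template:

```python
# Hypothetical sketch of the LLM-as-Judge reference check.
# call_judge_model is a placeholder for any chat-completion client.

JUDGE_PROMPT = """You are verifying a citation in a research report.
Claim: {claim}
Cited source excerpt: {excerpt}
Answer YES if the excerpt supports the claim, otherwise NO."""

def call_judge_model(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def reference_check(claims: list[dict]) -> list[dict]:
    """Return claims the judge cannot ground in their cited source."""
    flagged = []
    for c in claims:
        verdict = call_judge_model(
            JUDGE_PROMPT.format(claim=c["claim"], excerpt=c["excerpt"])
        )
        if not verdict.strip().upper().startswith("YES"):
            flagged.append(c)  # escalate to a human SME for final review
    return flagged
```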

The critical design decision: Prompts were split into two types:

  • Directive Prompts (DP): open-ended, high structural freedom
  • Prescriptive Prompts (PP): rigid formatting requirements (JSON schemas, Markdown tables)

This split became the instrument that revealed the study's central finding.
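To make the split concrete, here is an illustrative DP/PP pair (composed for this writeup, not drawn from the study's prompt set):

```python
# Illustrative DP/PP pair; not from the study's actual prompt set.
task = ("Assess whether room-temperature superconductivity in hydrides "
        "is plausibly achievable by 2030.")

directive_prompt = f"""{task}
Structure the response however best serves the analysis."""

prescriptive_prompt = f"""{task}
Return ONLY a JSON object with this exact schema:
{{
  "verdict": "likely | unlikely | undetermined",
  "key_evidence": ["..."],
  "failure_modes": ["..."],
  "confidence": 0.0
}}"""
```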

Results: Two Poles, Not a Ranking

ChatGPT dominates on retrieval and attribution precision. Gemini dominates on reasoning, foresight, and outcome forecasting: the rubrics that distinguish a research associate from a research assistant.

Each response was rated 1 (inadequate), 2 (competent), or 3 (excellent). All percentages below reflect the share of excellent annotations (score = 3): the responses that didn't just pass, but excelled. This is a stricter bar than pass/fail and more meaningful for evaluating frontier model capability.
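In code terms, the headline numbers are simply the fraction of top scores per bucket; a minimal sketch, assuming one record per (model, rubric, prompt type) annotation:

```python
from collections import defaultdict

def percent_excellent(scores: list[int]) -> float:
    """Share of annotations hitting the top of the 1-3 scale."""
    return 100.0 * sum(s == 3 for s in scores) / len(scores)

# records: iterable of (model, rubric, prompt_type, score) tuples.
# This layout is an assumption made for illustration.
def summarize(records):
    buckets = defaultdict(list)
    for model, rubric, ptype, score in records:
        buckets[(model, rubric, ptype)].append(score)
    return {key: percent_excellent(vals) for key, vals in buckets.items()}
```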

The Compliance Paradox

Here's where it gets interesting — and counterintuitive.

Gemini
Rubric | DP | PP | Change
Format Compliance | 73% | 44% | −29 pts
Future Scope | 67% | 50% | −17 pts
Info Retrieval & Accuracy | 71% | 58% | −13 pts
Core Analysis & Logic | 90% | 87% | −3 pts
Prediction Accuracy & Forecasting | 56% | 56% | 0 pts
Source Integrity & Attribution | 33% | 40% | +7 pts
Safety | 100% | 100% | 0 pts

ChatGPT
Rubric | DP | PP | Change
Future Scope | 48% | 35% | −13 pts
Prediction Accuracy & Forecasting | 38% | 27% | −11 pts
Info Retrieval & Accuracy | 79% | 69% | −10 pts
Core Analysis & Logic | 81% | 77% | −4 pts
Format Compliance | 69% | 65% | −4 pts
Source Integrity & Attribution | 65% | 63% | −2 pts
Safety | 100% | 100% | 0 pts

The two tables reveal entirely different collapse profiles. 

Gemini's format compliance drops 29 points under prescriptive prompts — but its Prediction Accuracy doesn't move at all, and Core Analysis barely shifts. It is actively sacrificing structure to protect its scientific reasoning. Source Integrity even improves slightly under PP (+7 pts), suggesting that structural constraints push it toward more careful attribution.

ChatGPT does the opposite: its format compliance holds (−4 pts), but Prediction Accuracy drops 11 points and Future Scope drops 13 points. It maintains discipline on structure while losing depth on the forecasting rubrics that define a Research Associate over a Research Assistant. Under pressure, ChatGPT retreats to safe territory.

The drop from Directive to Prescriptive prompts is not noise. Fisher’s exact test confirms significance for both models: Gemini (p = 0.001) and ChatGPT (p = 0.024).
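The test itself is a few lines; the counts below are illustrative back-calculations assuming roughly 50 scored responses per arm, not the study's raw data:

```python
from scipy.stats import fisher_exact

# 2x2 contingency table for one rubric (e.g., Gemini Format Compliance):
# rows = prompt type (DP, PP); columns = (excellent, not excellent).
# Counts are illustrative, assuming ~50 responses per arm.
dp = [37, 13]   # ~73% excellent under directive prompts
pp = [22, 28]   # ~44% excellent under prescriptive prompts

odds_ratio, p_value = fisher_exact([dp, pp])
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```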

The Forward-Looking Reasoning Cluster

The two novel rubrics are not just measuring “general model quality.” Spearman correlation analysis shows that Prediction Accuracy & Outcome Forecasting has near-zero correlation with every retrieval rubric (ρ ≈ 0.003 with Information Retrieval, ρ ≈ −0.007 with Core Analysis). 

A model that retrieves facts perfectly is not systematically better at forecasting outcomes. The two novel rubrics correlate strongly with each other (Gemini ρ = 0.375, ChatGPT ρ = 0.500, both p < 0.001), forming a distinct Forward-Looking Reasoning cluster that is independent of the retrieval cluster and confirmed across both models.
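The cluster check reduces to pairwise rank correlations; SciPy's `spearmanr` does the work, and the score vectors below are hypothetical placeholders for per-response SME ratings:

```python
from scipy.stats import spearmanr

# Hypothetical per-response ratings (1-3) for three rubrics;
# in the study each vector would hold one rubric's SME scores.
prediction   = [3, 2, 3, 1, 2, 3, 2, 1, 3, 2]
future_scope = [3, 2, 2, 1, 3, 3, 2, 1, 3, 1]
retrieval    = [2, 3, 1, 3, 2, 1, 3, 2, 2, 3]

rho_fwd, p_fwd = spearmanr(prediction, future_scope)  # expect: strong
rho_ret, p_ret = spearmanr(prediction, retrieval)     # expect: near zero
print(f"forward-looking cluster: rho = {rho_fwd:.3f} (p = {p_fwd:.3f})")
print(f"vs. retrieval:           rho = {rho_ret:.3f} (p = {p_ret:.3f})")
```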

What This Means for How You Prompt Research Agents

The DP/PP framework isn't just a study design tool — it's a diagnostic for any team deploying research agents in production.

Use ChatGPT Deep Research when:

  • Format compliance is non-negotiable (downstream parsers, structured extraction, attribution logging)
  • The task is retrieval-first: literature summarization, source verification, data extraction

Use Gemini Deep Research when:

  • You need forward-looking reasoning: hypothesis generation, roadmap drafting, forecasting technical trade-offs at the frontier
  • You have a human-in-the-loop for review — format compliance at 44% is not production-safe without it

For prompt design: Overly prescriptive prompt templates can actively undermine the most valuable outputs from research agents. If you're asking Gemini to forecast and you're also demanding rigid JSON output, the Compliance Paradox suggests you may be trading away the thing you actually need.
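One mitigation consistent with this finding (though not tested in the study) is to split reasoning from formatting into two passes, as sketched below; `run_agent` is a hypothetical stand-in for a deep-research API call:

```python
# Two-pass pattern: reason freely first, structure second.
# run_agent is a hypothetical placeholder for a research-agent client.

def run_agent(prompt: str) -> str:
    raise NotImplementedError("plug in your research-agent client here")

def forecast_then_format(task: str) -> str:
    # Pass 1: directive, no structural constraints, to protect forecasting depth.
    analysis = run_agent(f"{task}\nReason freely; no format requirements.")
    # Pass 2: purely mechanical restructuring of an already-written analysis.
    return run_agent(
        "Convert the following analysis into a JSON object with keys "
        "'verdict', 'evidence', and 'failure_modes':\n" + analysis
    )
```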

What This Is Not

This is not a model ranking. It evaluates a specific set of capabilities (PhD-level STEM research reasoning) under controlled conditions.

  • Results do not generalize beyond PhD-level STEM research tasks.
  • Two models were evaluated; no additional deep research agent (DRA) baselines were included.
  • The behavioral patterns are statistically confirmed, but this is a benchmark of the current moment.

Why This Benchmark Exists

The question driving AI research agent deployment has shifted. It is no longer “can the model retrieve facts?” It is “can the model reason forward?”

The distinction between a proactive and a reactive research agent is not vocabulary; it is a different evaluation target. The Prediction Accuracy & Outcome Forecasting and Future Scope rubrics introduced here are a step toward an evaluation framework honest enough to measure it.

The Compliance Paradox is a reminder that behavioral stress-testing, not just aggregate scoring, is where the signal lives.

Extended results and methodology will be published in an upcoming academic venue.

Reach out to us at hey@deccan.ai for more information, work samples, etc.
