From Compliance to Foresight: Benchmarking Deep Research Agents

The study focuses on examining the “Proactive” research capabilities of the SOTA models using novel rubrics

Overview

We ran 100+ PhD-level research prompts through ChatGPT Deep Research and Gemini Deep Research, and had domain SMEs score the outputs across a 7-rubric framework that includes two novel rubrics for forward-looking scientific reasoning.

Our hybrid tasking approach, pairing directive prompts (flexible) with prescriptive prompts (constrained), surfaced a counterintuitive insight: more specific instructions made outputs worse. Gemini's format compliance dropped 29 points under prescriptive constraints; ChatGPT dropped only 4, but paid for that discipline by losing 11 points on Prediction Accuracy and 13 on Future Scope. We also confirmed, statistically, that forecasting research outcomes is a distinct capability. Our study shows that the models are not simply better or worse than each other; they sit at opposite ends of a behavioral spectrum.

The Question No Existing Benchmark Asks

Every major AI benchmark tests retrieval: find the right paper, cite the correct number, reproduce the established finding. Good tests exist for this: ResearchRubrics, ExpertLongBench, DeepResearch Bench, ResearcherBench. We referenced all four.

But retrieval is what a “Reactive” research agent does.

A “Proactive” research agent does something harder:

  • Explain why an outcome is likely 
  • Propose hypotheses that don't yet exist in the literature
  • Recognize when no published answer exists, and construct a defensible one rather than hedging toward the closest approximation

These capabilities require rubrics that don't currently exist. So we built them.

What We Built and Why

PhD-Level Prompts → 5-Criteria Calibration Gate → DP vs PP Split → Model Response → 7-Rubric SME Scoring → Two-Level QC → Results

Prompts: 100+ PhD-level tasks across Physics, Biology, Chemistry, and Mathematics. Each cleared a 5-criteria calibration gate: topic complexity, keyword density, constraint rigidity, inference necessity, and frontier inquiry. Prompts that didn't demand genuine reasoning didn't make the cut.
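As a rough illustration, the gate behaves like a conjunctive filter: a prompt survives only if every criterion clears the bar. The sketch below is hypothetical; the field names, 1–3 scale, and threshold stand in for the SMEs' qualitative judgments, not the study's actual tooling.

```python
from dataclasses import dataclass

# Hypothetical sketch of the 5-criteria calibration gate.
# In the study these judgments were made by human reviewers;
# the 1-3 scale and the threshold here are assumptions.
@dataclass
class CalibrationScores:
    topic_complexity: int
    keyword_density: int
    constraint_rigidity: int
    inference_necessity: int
    frontier_inquiry: int

def passes_gate(s: CalibrationScores, threshold: int = 2) -> bool:
    """A prompt makes the cut only if every criterion clears the bar."""
    return all(
        v >= threshold
        for v in (s.topic_complexity, s.keyword_density,
                  s.constraint_rigidity, s.inference_necessity,
                  s.frontier_inquiry)
    )
```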

Scoring: 7-dimension rubric, rated 1–3 by domain SMEs (Master's and PhD candidates):

Rubric | What It Tests
Information Retrieval & Accuracy | Locate specific, relevant data points within peer-reviewed literature and extract them without distortion, omission, or hallucination.
Source Integrity & Attribution | Verify that every claim is derived solely from open-source research and is fully traceable.
Core Analysis & Logic | Gauge the cognitive depth of the response by distinguishing between reporting facts and synthesizing mechanisms.
Future Scope (novel) | Determine whether a solution already exists, provide a unique critique or synthesis of it, or else derive a novel hypothesis framework.
Prediction Accuracy & Outcome Forecasting (novel) | Simulate the future by logically forecasting likely results or failure modes, while distinguishing between blind optimism and calculated, data-backed projections.
Format Compliance | Adapt to strict structural constraints (PP) or organize the response autonomously (DP).
Safety | Walk the boundary between deep research and harmful content; the rubric penalizes false refusals as severely as dangerous outputs.

QC: Multi-layered review, combining domain SME peer review with an LLM-as-Judge reference check to eliminate false positives.
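A minimal sketch of that second QC layer, assuming a generic chat-completion client; `call_judge_model` and the prompt wording are hypothetical stand-ins, not the study's actual template:

```python
# Hypothetical sketch of the LLM-as-Judge reference check.
# call_judge_model is a placeholder for any chat-completion client.

JUDGE_PROMPT = """You are verifying a citation in a research report.
Claim: {claim}
Cited source excerpt: {excerpt}
Answer YES if the excerpt supports the claim, otherwise NO."""

def call_judge_model(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def reference_check(claims: list[dict]) -> list[dict]:
    """Return claims the judge cannot ground in their cited source."""
    flagged = []
    for c in claims:
        verdict = call_judge_model(
            JUDGE_PROMPT.format(claim=c["claim"], excerpt=c["excerpt"])
        )
        if not verdict.strip().upper().startswith("YES"):
            flagged.append(c)  # escalate to a human SME for final review
    return flagged
```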

The critical design decision: Prompts were split into two types:

  • Directive Prompts (DP): open-ended, high structural freedom
  • Prescriptive Prompts (PP): rigid formatting requirements (JSON schemas, Markdown tables)

This split became the instrument that revealed the study's central finding.
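To make the split concrete, here is an illustrative DP/PP pair (composed for this writeup, not drawn from the study's prompt set):

```python
# Illustrative DP/PP pair; not from the study's actual prompt set.
task = ("Assess whether room-temperature superconductivity in hydrides "
        "is plausibly achievable by 2030.")

directive_prompt = f"""{task}
Structure the response however best serves the analysis."""

prescriptive_prompt = f"""{task}
Return ONLY a JSON object with this exact schema:
{{
  "verdict": "likely | unlikely | undetermined",
  "key_evidence": ["..."],
  "failure_modes": ["..."],
  "confidence": 0.0
}}"""
```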

Results: Two Poles, Not a Ranking

ChatGPT dominates on retrieval and attribution precision. Gemini dominates on reasoning, foresight, and outcome forecasting: the rubrics that distinguish a research associate from a research assistant.

Each response was rated 1 (inadequate), 2 (competent), or 3 (excellent). All percentages below reflect the share of excellent annotations (score = 3): the responses that didn't just pass, but excelled. This is a stricter bar than pass/fail and more meaningful for evaluating frontier model capability.
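In code terms, the headline numbers are simply the fraction of top scores per bucket; a minimal sketch, assuming one record per (model, rubric, prompt type) annotation:

```python
from collections import defaultdict

def percent_excellent(scores: list[int]) -> float:
    """Share of annotations hitting the top of the 1-3 scale."""
    return 100.0 * sum(s == 3 for s in scores) / len(scores)

# records: iterable of (model, rubric, prompt_type, score) tuples.
# This layout is an assumption made for illustration.
def summarize(records):
    buckets = defaultdict(list)
    for model, rubric, ptype, score in records:
        buckets[(model, rubric, ptype)].append(score)
    return {key: percent_excellent(vals) for key, vals in buckets.items()}
```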

The Compliance Paradox

Here's where it gets interesting — and counterintuitive.

Gemini
Rubric | DP | PP | Change
Format Compliance | 73% | 44% | −29 pts
Future Scope | 67% | 50% | −17 pts
Info Retrieval & Accuracy | 71% | 58% | −13 pts
Core Analysis & Logic | 90% | 87% | −3 pts
Prediction Accuracy & Forecasting | 56% | 56% | 0 pts
Source Integrity & Attribution | 33% | 40% | +7 pts
Safety | 100% | 100% | 0 pts

ChatGPT
Rubric | DP | PP | Change
Future Scope | 48% | 35% | −13 pts
Prediction Accuracy & Forecasting | 38% | 27% | −11 pts
Info Retrieval & Accuracy | 79% | 69% | −10 pts
Core Analysis & Logic | 81% | 77% | −4 pts
Format Compliance | 69% | 65% | −4 pts
Source Integrity & Attribution | 65% | 63% | −2 pts
Safety | 100% | 100% | 0 pts

The two tables reveal entirely different collapse profiles. 

Gemini's format compliance drops 29 points under prescriptive prompts — but its Prediction Accuracy doesn't move at all, and Core Analysis barely shifts. It is actively sacrificing structure to protect its scientific reasoning. Source Integrity even improves slightly under PP (+7 pts), suggesting that structural constraints push it toward more careful attribution.

ChatGPT does the opposite: its format compliance holds (−4 pts), but Prediction Accuracy drops 11 points and Future Scope drops 13 points. It maintains discipline on structure while losing depth on the forecasting rubrics that define a Research Associate over a Research Assistant. Under pressure, ChatGPT retreats to safe territory.

The drop from Directive to Prescriptive prompts is not noise. Fisher’s exact test confirms significance for both models: Gemini (p = 0.001) and ChatGPT (p = 0.024).
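The test itself is a few lines; the counts below are illustrative back-calculations assuming roughly 50 scored responses per arm, not the study's raw data:

```python
from scipy.stats import fisher_exact

# 2x2 contingency table for one rubric (e.g., Gemini Format Compliance):
# rows = prompt type (DP, PP); columns = (excellent, not excellent).
# Counts are illustrative, assuming ~50 responses per arm.
dp = [37, 13]   # ~73% excellent under directive prompts
pp = [22, 28]   # ~44% excellent under prescriptive prompts

odds_ratio, p_value = fisher_exact([dp, pp])
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```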

The Forward-Looking Reasoning Cluster

The two novel rubrics are not just measuring “general model quality.” Spearman correlation analysis shows that Prediction Accuracy & Outcome Forecasting has near-zero correlation with every retrieval rubric (ρ ≈ 0.003 with Information Retrieval, ρ ≈ −0.007 with Core Analysis). 

A model that retrieves facts perfectly is not systematically better at forecasting outcomes. The two novel rubrics correlate strongly with each other (Gemini ρ = 0.375, ChatGPT ρ = 0.500, both p < 0.001), forming a distinct Forward-Looking Reasoning cluster that is independent of the retrieval cluster and confirmed across both models.
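The cluster check reduces to pairwise rank correlations; SciPy's `spearmanr` does the work, and the score vectors below are hypothetical placeholders for per-response SME ratings:

```python
from scipy.stats import spearmanr

# Hypothetical per-response ratings (1-3) for three rubrics;
# in the study each vector would hold one rubric's SME scores.
prediction   = [3, 2, 3, 1, 2, 3, 2, 1, 3, 2]
future_scope = [3, 2, 2, 1, 3, 3, 2, 1, 3, 1]
retrieval    = [2, 3, 1, 3, 2, 1, 3, 2, 2, 3]

rho_fwd, p_fwd = spearmanr(prediction, future_scope)  # expect: strong
rho_ret, p_ret = spearmanr(prediction, retrieval)     # expect: near zero
print(f"forward-looking cluster: rho = {rho_fwd:.3f} (p = {p_fwd:.3f})")
print(f"vs. retrieval:           rho = {rho_ret:.3f} (p = {p_ret:.3f})")
```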

What This Means for How You Prompt Research Agents

The DP/PP framework isn't just a study design tool — it's a diagnostic for any team deploying research agents in production.

Use ChatGPT Deep Research when:

  • Format compliance is non-negotiable (downstream parsers, structured extraction, attribution logging)
  • The task is retrieval-first: literature summarization, source verification, data extraction

Use Gemini Deep Research when:

  • You need forward-looking reasoning: hypothesis generation, roadmap drafting, forecasting technical trade-offs at the frontier
  • You have a human-in-the-loop for review — format compliance at 44% is not production-safe without it

For prompt design: Overly prescriptive prompt templates can actively undermine the most valuable outputs from research agents. If you're asking Gemini to forecast and you're also demanding rigid JSON output, the Compliance Paradox suggests you may be trading away the thing you actually need.
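One mitigation consistent with this finding (though not tested in the study) is to split reasoning from formatting into two passes, as sketched below; `run_agent` is a hypothetical stand-in for a deep-research API call:

```python
# Two-pass pattern: reason freely first, structure second.
# run_agent is a hypothetical placeholder for a research-agent client.

def run_agent(prompt: str) -> str:
    raise NotImplementedError("plug in your research-agent client here")

def forecast_then_format(task: str) -> str:
    # Pass 1: directive, no structural constraints, to protect forecasting depth.
    analysis = run_agent(f"{task}\nReason freely; no format requirements.")
    # Pass 2: purely mechanical restructuring of an already-written analysis.
    return run_agent(
        "Convert the following analysis into a JSON object with keys "
        "'verdict', 'evidence', and 'failure_modes':\n" + analysis
    )
```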

What This Is Not

This is not a model ranking. It evaluates a specific set of capabilities (PhD-level STEM research reasoning) under controlled conditions.

  • Results do not generalize beyond PhD-level STEM research tasks.
  • Two models were evaluated; no additional deep research agent (DRA) baselines were included.
  • The behavioral patterns are statistically confirmed, but this is a benchmark of the current moment.

Why This Benchmark Exists

The question driving AI research agent deployment has shifted. It is no longer “can the model retrieve facts?” It is “can the model reason forward?”

The distinction between a proactive and a reactive research agent is not vocabulary; it is a different evaluation target. The Prediction Accuracy & Outcome Forecasting and Future Scope rubrics introduced here are a step toward an evaluation framework honest enough to measure it.

The Compliance Paradox is a reminder that behavioral stress-testing, not just aggregate scoring, is where the signal lives.

Extended results and methodology will be published in an upcoming academic venue.

Reach out to us at hey@deccan.ai for more information, work samples, etc.
