Stargazer

A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints

Xinge Liu1*, Terry Jingchen Zhang2*, Bernhard Schölkopf3,4, Zhijing Jin1,2,3, Kristen Menou1
1University of Toronto, 2Vector Institute, 3Max Planck Institute for Intelligent Systems, Tübingen, Germany, 4ELLIS Institute Tübingen

Introduction

We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity time series data. The benchmark contains 120 tasks across three tiers, including 20 real archival cases, and covers settings ranging from high-SNR single-planet systems to complex multi-planet configurations requiring careful low-SNR analysis.

Across eight frontier agents, we observe a persistent gap between numerical optimization and physical recovery: agents often produce statistically good fits while failing to recover the correct planetary system parameters. Stargazer is designed to evaluate agentic workflows involving period search, iterative Keplerian fitting, model selection, and repeated submission under feedback.
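The forward model underlying these tasks is the standard single-planet Keplerian radial-velocity equation, v(t) = γ + K[cos(ν + ω) + e cos ω], with the true anomaly ν obtained by solving Kepler's equation. A minimal sketch (illustrative only, not the benchmark's own code; multi-planet systems are sums of such terms):

```python
import numpy as np

def solve_kepler(M, e, tol=1e-10, max_iter=50):
    """Solve Kepler's equation M = E - e*sin(E) for E via Newton iteration."""
    E = M.copy()
    for _ in range(max_iter):
        dE = (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        E -= dE
        if np.max(np.abs(dE)) < tol:
            break
    return E

def rv_keplerian(t, P, K, e, omega, t_peri, gamma=0.0):
    """Radial velocity of a star hosting one planet (standard Keplerian model).

    P: period, K: semi-amplitude, e: eccentricity, omega: argument of
    periastron, t_peri: time of periastron passage, gamma: systemic velocity.
    """
    M = 2.0 * np.pi * (t - t_peri) / P            # mean anomaly
    E = solve_kepler(np.mod(M, 2.0 * np.pi), e)   # eccentric anomaly
    nu = 2.0 * np.arctan2(np.sqrt(1 + e) * np.sin(E / 2),
                          np.sqrt(1 - e) * np.cos(E / 2))  # true anomaly
    return gamma + K * (np.cos(nu + omega) + e * np.cos(omega))
```

At periastron (t = t_peri, ν = 0) the model reduces to γ + K(1 + e)cos ω, a useful sanity check when fitting.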

Overview of Stargazer

The benchmark combines a scalable synthetic task generator with real radial-velocity systems from archival astronomy data. Each episode exposes an RV dataset and asks an agent to infer the underlying planetary configuration using tools for scientific computation and structured submission.

Overview of Stargazer

Overview of Stargazer. Left: 120 RV tasks controlled by six physical factors. Center: agents run a periodogram-to-Keplerian workflow and are graded on statistical and physical criteria. Right: strong statistical fit does not guarantee correct physical recovery.

Environment and Evaluation

Agents interact with the environment through a ReAct-style loop with two tools: a PythonREPL for analysis code and a submit_action interface for proposing candidate planetary systems. After each submission, the evaluator returns criterion-level feedback so agents can revise their hypothesis.

A task passes only if all four criteria are satisfied simultaneously: ok_delta_bic, ok_rms, ok_match, and ok_count. This conjunction gate is what makes the benchmark physically meaningful rather than a pure curve-fitting exercise.
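The pass condition above can be sketched as a simple conjunction (criterion names follow the text; the dataclass wrapper is ours for illustration):

```python
from dataclasses import dataclass

@dataclass
class Criteria:
    ok_delta_bic: bool  # submitted model statistically favored (ΔBIC test)
    ok_rms: bool        # residual RMS below tolerance
    ok_match: bool      # recovered parameters match the true planets
    ok_count: bool      # correct number of planets detected

def task_passes(c: Criteria) -> bool:
    # All four criteria must hold simultaneously; passing the statistical
    # criteria (ΔBIC, RMS) alone is not enough.
    return c.ok_delta_bic and c.ok_rms and c.ok_match and c.ok_count
```

Under this gate, the alias failure in the case study below (ΔBIC and RMS pass, count wrong) still fails the task.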

Stargazer framework

Framework of Stargazer. Left: task generation from synthetic physics or archival RV data. Center: agent iteration loop of analysis, submission, and diagnostic feedback. Right: evaluator forward-models submissions and grades with ΔBIC, RMS, Match, and Count.
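For Gaussian errors with unknown constant variance, the BIC of a fit reduces to a function of the residuals, BIC = n ln(RSS/n) + k ln n. A sketch of how an evaluator might score ΔBIC against a no-planet baseline (the formula is standard; the function names and baseline choice are our assumptions, not necessarily Stargazer's exact implementation):

```python
import numpy as np

def bic(residuals, n_params):
    """BIC for a Gaussian likelihood: n*ln(RSS/n) + k*ln(n)."""
    n = len(residuals)
    rss = np.sum(residuals**2)
    return n * np.log(rss / n) + n_params * np.log(n)

def delta_bic(rv, rv_model, n_params_model):
    """ΔBIC of a Keplerian model against a constant-velocity baseline.

    Strongly negative values mean the planet model is decisively favored.
    """
    null_resid = rv - np.mean(rv)   # baseline: constant systemic velocity
    model_resid = rv - rv_model
    return bic(model_resid, n_params_model) - bic(null_resid, 1)
```

Note that ΔBIC only compares likelihood-versus-complexity; an aliased two-planet model can still score well here, which is exactly why the Match and Count criteria exist.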


Interactive Case Study

This case study illustrates a representative failure mode in Stargazer: the agent achieves a strong statistical fit but still fails physical recovery, because it converges to a two-planet alias and misses the true planet count.

Failure Case Demo

Alias Convergence Hides the Missing Planet

This run reaches an excellent residual RMS, but still fails because the model converges to a two-planet alias and misses the true planet count. The benchmark rewards physical recovery, not just a visually tight fit.

Status: Fail. Diagnostics panel (values populated interactively): Match, ΔBIC, RMS, Detected planets, Tokens.

Ground Truth vs Alias Recovery

True Planets: table of Planet, Period, K, e (populated interactively).
Recovered Alias Fit: table of Planet, Period, K, e (populated interactively).

Evaluator Diagnostics

Agent Timeline: 6 selected steps

Interactive walkthrough of a failure case where fit quality passes, but the recovered planetary system is still wrong.

Main Results


We evaluate eight frontier agents and two deterministic baselines. Three findings stand out. First, statistical fit quality does not imply physical recovery. Second, increasing test-time compute yields only marginal gains and often reflects recursive failure loops rather than meaningful exploration. Third, all evaluated agents fail on the real-data subset, even though these systems are solved by human astronomers in the literature.

Skills help on Easy-tier tasks mainly by improving efficiency and letting agents reach submission more often, but they do not reliably transfer to Hard-tier reasoning. The benchmark therefore exposes a bottleneck in model selection and physically grounded inference rather than in raw numerical optimization.

Main Results Table

| Model | Pass Rate (%) Easy / Med / Hard | Env Done (%) Easy / Med / Hard | Pass@3 (%) Easy / Med / Hard | Real (20 tasks) |
|---|---|---|---|---|
| Classical Pipeline | 95.0 / 35.0 / 5.0 | 100 / 100 / 100 | 95.0 / 35.0 / 5.0 | – |
| Nested Sampling | 95.0 / 32.5 / 0.0 | 100 / 100 / 100 | 95.0 / 32.5 / 0.0 | – |
| o3-mini | 40.0 / 24.2 / 0.0 | 76.7 / 35.8 / 4.2 | 95.0 / 40.0 / 0.0 | 0.0 |
| GPT-5-mini | 76.7 / 33.6 / 2.5 | 76.7 / 46.7 / 5.0 | 95.0 / 35.0 / 2.5 | 0.0 |
| GPT-5.2 | 40.0 / 30.0 / 5.8 | 40.0 / 33.3 / 5.8 | 75.0 / 37.5 / 12.5 | 0.0 |
| Kimi-K2.5 | 13.3 / 17.9 / 0.8 | 13.3 / 19.2 / 2.5 | 40.0 / 35.0 / 2.5 | 0.0 |
| Qwen-3.5-Plus | 26.7 / 25.0 / 1.6 | 25.0 / 25.8 / 1.7 | 60.0 / 30.0 / 2.5 | 0.0 |
| Gemini-3.1-Pro | 71.7 / 35.0 / 5.0 | 71.7 / 35.0 / 5.8 | 95.0 / 35.0 / 7.5 | 0.0 |
| Claude-Sonnet-4.6 | 68.3 / 22.5 / 0.8 | 68.3 / 28.3 / 0.8 | 75.0 / 32.5 / 2.5 | 0.0 |
| GPT-5.3-codex | 80.0 / 30.8 / 4.2 | 88.3 / 48.3 / 7.5 | 95.0 / 40.0 / 7.5 | 0.0 |

Main results on Stargazer across tiers computed over three independent runs. Pass rates and Env Done rates are averaged; Pass@3 reports the fraction of tasks solved in at least one run. Bold = best per column; underlined = second best.
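The two aggregation rules in the caption (per-run averaging for Pass Rate, any-of-three for Pass@3) can be sketched as follows; the array layout is our illustrative assumption:

```python
import numpy as np

# passes[r, t] is True if run r solved task t (3 runs x T tasks).
def pass_rate(passes):
    """Pass rate (%) averaged over the three independent runs."""
    return 100.0 * passes.mean()

def pass_at_3(passes):
    """Fraction of tasks (%) solved in at least one of the three runs."""
    return 100.0 * passes.any(axis=0).mean()
```

Because Pass@3 credits a task on any successful run, it is always at least as high as the averaged Pass Rate, which is why its columns dominate in the table above.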

Statistical versus physical

Statistical criteria remain relatively high while physical recovery drops sharply on harder tiers.

Match threshold analysis

Match-threshold sensitivity analysis shows that model rankings remain stable across a broad threshold range.

Effect of Skills Injection

| Model | Pass Rate Default (Easy / Med / Hard) | Pass Rate + Skills (Easy / Med / Hard) | Pass@3 Default (Easy / Med / Hard) | Pass@3 + Skills (Easy / Med / Hard) |
|---|---|---|---|---|
| GPT-5-mini | 76.7 / 33.6 / 2.5 | 90.0 (+13.3) / 35.8 (+2.2) / 2.5 (+0.0) | 95.0 / 35.0 / 2.5 | 95.0 (+0.0) / 37.5 (+2.5) / 2.5 (+0.0) |
| Qwen-3.5-Plus | 26.7 / 25.0 / 1.6 | 48.3 (+21.6) / 23.3 (−1.7) / 0.0 (−1.6) | 60.0 / 30.0 / 2.5 | 90.0 (+30.0) / 35.0 (+5.0) / 0.0 (−2.5) |
| Gemini-3.1-Pro | 71.7 / 35.0 / 5.0 | 100.0 (+28.3) / 56.7 (+21.7) / 16.7 (+11.7) | 95.0 / 35.0 / 7.5 | 100.0 (+5.0) / 80.0 (+45.0) / 35.0 (+27.5) |
| GPT-5.3-codex | 80.0 / 30.8 / 4.2 | 74.6 (−5.4) / 53.2 (+22.4) / 25.6 (+21.4) | 95.0 / 40.0 / 7.5 | 100.0 (+5.0) / 67.5 (+27.5) / 31.6 (+24.1) |

Additional Figures

Failure correlation heatmap

Failure correlation heatmap across task drivers and evaluation outcomes.

Case study RV fits

Representative case-study trajectories illustrating successful recovery and local-minimum failure modes.

Per-Criterion Success Rates

| Model | Easy (ΔBIC / RMS / Match / Count) | Medium (ΔBIC / RMS / Match / Count) | Hard (ΔBIC / RMS / Match / Count) |
|---|---|---|---|
| Kimi-K2.5 | 100 / 50.0 / 50.0 / 100 | 89.8 / 72.7 / 27.3 / 77.3 | 84.9 / 64.2 / 1.9 / 35.8 |
| Qwen-3.5-Plus | 90.0 / 85.0 / 80.0 / 100 | 87.5 / 87.5 / 36.4 / 80.7 | 89.8 / 71.2 / 0.0 / 27.1 |
| o3-mini | 98.3 / 80.8 / 70.0 / 82.5 | 92.5 / 78.6 / 27.7 / 69.2 | 89.7 / 62.3 / 0.7 / 27.4 |
| GPT-5.2 | 100 / 93.1 / 86.2 / 100 | 94.0 / 96.4 / 43.4 / 77.1 | 95.2 / 90.5 / 11.9 / 59.5 |
| Claude-Sonnet-4.6 | 97.9 / 97.9 / 87.2 / 95.7 | 100 / 98.7 / 46.1 / 82.9 | 96.6 / 96.6 / 3.4 / 10.3 |
| Gemini-3.1-Pro | 100 / 95.8 / 91.7 / 100 | 95.5 / 97.3 / 39.6 / 76.6 | 88.8 / 86.7 / 3.1 / 58.2 |
| GPT-5-mini | 100 / 92.7 / 83.6 / 92.7 | 93.3 / 94.3 / 39.0 / 73.3 | 93.3 / 84.4 / 0.0 / 25.6 |
| GPT-5.3-codex | 94.7 / 84.2 / 78.9 / 94.7 | 92.6 / 90.7 / 38.9 / 79.6 | 100 / 100 / 33.3 / 33.3 |

BibTeX

@article{liu2026stargazer,
  title={Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints},
  author={Liu, Xinge and Zhang, Terry Jingchen and Sch{\"o}lkopf, Bernhard and Jin, Zhijing and Menou, Kristen},
  year={2026},
  journal={arXiv preprint arXiv:2604.15664},
  url={https://arxiv.org/abs/2604.15664}
}