We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative, physics-grounded model-fitting tasks: inferring planetary systems from radial-velocity (RV) time-series data. The benchmark contains 120 tasks across three difficulty tiers, including 20 real archival cases, and spans settings from high-SNR single-planet systems to complex multi-planet configurations that demand careful low-SNR analysis.
Across eight frontier agents, we observe a persistent gap between numerical optimization and physical recovery: agents often produce statistically good fits while failing to recover the correct planetary system parameters. Stargazer is designed to evaluate agentic workflows involving period search, iterative Keplerian fitting, model selection, and repeated submission under feedback.
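The Keplerian fitting step in this workflow rests on standard two-body physics: solve Kepler's equation for the eccentric anomaly, convert to the true anomaly, and evaluate the RV signature. A minimal sketch of that forward model (function names are ours for illustration, not the benchmark's API):

```python
import numpy as np

def kepler_solve(M, e, tol=1e-10):
    """Solve Kepler's equation E - e*sin(E) = M for E by Newton's method."""
    E = np.asarray(M, dtype=float).copy()  # E = M is a good start for small e
    for _ in range(50):
        dE = (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        E -= dE
        if np.max(np.abs(dE)) < tol:
            break
    return E

def keplerian_rv(t, P, K, e, omega, t_peri):
    """Single-planet RV signal: K * (cos(nu + omega) + e*cos(omega))."""
    M = np.mod(2.0 * np.pi * (np.asarray(t, dtype=float) - t_peri) / P,
               2.0 * np.pi)                      # mean anomaly
    E = kepler_solve(M, e)                       # eccentric anomaly
    nu = 2.0 * np.arctan2(np.sqrt(1.0 + e) * np.sin(E / 2.0),
                          np.sqrt(1.0 - e) * np.cos(E / 2.0))  # true anomaly
    return K * (np.cos(nu + omega) + e * np.cos(omega))
```

In a typical pipeline, a Lomb-Scargle periodogram seeds the trial period, and a nonlinear least-squares fit of this model refines (P, K, e, ω, t_peri) per planet.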
The benchmark combines a scalable synthetic task generator with real radial-velocity systems from archival astronomy data. Each episode exposes an RV dataset and asks an agent to infer the underlying planetary configuration using tools for scientific computation and structured submission.
Overview of Stargazer. Left: 120 RV tasks controlled by six physical factors. Center: agents run a periodogram-to-Keplerian workflow and are graded on statistical and physical criteria. Right: strong statistical fit does not guarantee correct physical recovery.
Agents interact with the environment through a ReAct-style loop with two tools: a PythonREPL for analysis code and a submit_action interface for proposing candidate planetary systems. After each submission, the evaluator returns criterion-level feedback so agents can revise their hypothesis.
A task passes only if all four criteria are satisfied simultaneously: ok_delta_bic, ok_rms, ok_match, and ok_count. This conjunction gate is what makes the benchmark physically meaningful rather than a pure curve-fitting exercise.
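The loop and the conjunction gate can be sketched as follows. The four criterion names come from the benchmark; the feedback schema and helper names are illustrative assumptions, not the environment's actual API:

```python
# Sketch of the submit-and-revise loop; the feedback dict schema and the
# `propose`/`submit_action` signatures here are assumptions for illustration.
CRITERIA = ("ok_delta_bic", "ok_rms", "ok_match", "ok_count")

def run_episode(propose, submit_action, max_submissions=5):
    """Alternate analysis and submission until all four criteria pass."""
    candidate, feedback = None, None
    for _ in range(max_submissions):
        candidate = propose(feedback)           # analysis step (e.g. in a PythonREPL)
        feedback = submit_action(candidate)     # evaluator returns criterion-level booleans
        if all(feedback[c] for c in CRITERIA):  # conjunction gate: every criterion must hold
            break
    return candidate, feedback
```

Because the gate is a conjunction, an agent can satisfy ok_delta_bic and ok_rms on every submission and still fail the episode on ok_match or ok_count.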
Framework of Stargazer. Left: task generation from synthetic physics or archival RV data. Center: agent iteration loop of analysis, submission, and diagnostic feedback. Right: evaluator forward-models submissions and grades with ΔBIC, RMS, Match, and Count.
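The two statistical criteria can be sketched with standard definitions. We assume a Gaussian BIC up to an additive constant; Stargazer's exact BIC form, null model, and thresholds are not specified here, so treat this as an illustrative grading sketch:

```python
import numpy as np

def rms(residuals):
    """Root-mean-square of the fit residuals."""
    return np.sqrt(np.mean(np.asarray(residuals) ** 2))

def bic(residuals, n_params):
    """Gaussian BIC up to a constant: n*ln(RSS/n) + k*ln(n)."""
    r = np.asarray(residuals)
    n = r.size
    rss = np.sum(r ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)

def delta_bic(resid_model, k_model, resid_null, k_null):
    """Negative values favor the submitted model over the null (e.g. no-planet) fit."""
    return bic(resid_model, k_model) - bic(resid_null, k_null)
```

A submission would then pass the statistical side of the gate when delta_bic falls below, and rms above, task-specific thresholds.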
A representative failure mode in Stargazer: the agent reaches an excellent residual RMS, yet still fails physical recovery because it converges to a two-planet alias and misses the true planet count. The benchmark rewards physical recovery, not just a visually tight fit.
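Such spurious solutions often arise from sampling aliases: a true frequency beats against the observing cadence, producing a second peak that fits the data nearly as well. The first-order alias relation f_alias = |f_true ± f_sampling| is standard; the helper below is our illustration, not part of the benchmark:

```python
def alias_periods(p_true, p_sampling=1.0):
    """First-order alias periods 1/|f_true ± f_sampling| (same units as inputs)."""
    f, fs = 1.0 / p_true, 1.0 / p_sampling
    out = [1.0 / (f + fs)]          # sum alias always exists
    if abs(f - fs) > 1e-12:
        out.append(1.0 / abs(f - fs))  # difference alias, undefined when f == fs
    return out
```

For a 3.1-day planet observed at a 1-day cadence, this predicts aliases near 0.76 and 1.48 days, either of which can masquerade as an additional planet.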
Interactive walkthrough of a failure case where fit quality passes, but the recovered planetary system is still wrong.
We evaluate eight frontier agents and two deterministic baselines. Three findings stand out. First, statistical fit quality does not imply physical recovery. Second, increasing test-time compute yields only marginal gains and often reflects recursive failure loops rather than meaningful exploration. Third, all evaluated agents fail on the real-data subset, even though these systems are solved by human astronomers in the literature.
Skills help on Easy-tier tasks mainly by improving efficiency and letting agents reach submission more often, but they do not reliably transfer to Hard-tier reasoning. The benchmark therefore exposes a bottleneck in model selection and physically grounded inference rather than in raw numerical optimization.
| Model | Pass Easy | Pass Med | Pass Hard | Done Easy | Done Med | Done Hard | Pass@3 Easy | Pass@3 Med | Pass@3 Hard | Real (20 tasks) |
|---|---|---|---|---|---|---|---|---|---|---|
| Classical Pipeline | 95.0 | 35.0 | 5.0 | 100 | 100 | 100 | 95.0 | 35.0 | 5.0 | --- |
| Nested Sampling | 95.0 | 32.5 | 0.0 | 100 | 100 | 100 | 95.0 | 32.5 | 0.0 | --- |
| o3-mini | 40.0 | 24.2 | 0.0 | 76.7 | 35.8 | 4.2 | 95.0 | 40.0 | 0.0 | 0.0 |
| GPT-5-mini | 76.7 | 33.6 | 2.5 | 76.7 | 46.7 | 5.0 | 95.0 | 35.0 | 2.5 | 0.0 |
| GPT-5.2 | 40.0 | 30.0 | 5.8 | 40.0 | 33.3 | 5.8 | 75.0 | 37.5 | 12.5 | 0.0 |
| Kimi-K2.5 | 13.3 | 17.9 | 0.8 | 13.3 | 19.2 | 2.5 | 40.0 | 35.0 | 2.5 | 0.0 |
| Qwen-3.5-Plus | 26.7 | 25.0 | 1.6 | 25.0 | 25.8 | 1.7 | 60.0 | 30.0 | 2.5 | 0.0 |
| Gemini-3.1-Pro | 71.7 | 35.0 | 5.0 | 71.7 | 35.0 | 5.8 | 95.0 | 35.0 | 7.5 | 0.0 |
| Claude-Sonnet-4.6 | 68.3 | 22.5 | 0.8 | 68.3 | 28.3 | 0.8 | 75.0 | 32.5 | 2.5 | 0.0 |
| GPT-5.3-codex | 80.0 | 30.8 | 4.2 | 88.3 | 48.3 | 7.5 | 95.0 | 40.0 | 7.5 | 0.0 |
Main results on Stargazer across tiers, computed over three independent runs; all values are percentages. Pass rates and Env Done rates are averaged; Pass@3 reports the fraction of tasks solved in at least one run. Bold = best per column; underlined = second best.
Statistical criteria remain relatively high while physical recovery drops sharply on harder tiers.
Match-threshold sensitivity analysis shows that model rankings remain stable across a broad threshold range.
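A period-match criterion of this kind can be sketched as a greedy one-to-one matching at a fractional tolerance; the tolerance value and matching rule below are assumptions, not the benchmark's exact definition:

```python
def periods_match(p_submitted, p_true, tol=0.01):
    """Greedy one-to-one matching of submitted to true periods at fractional tol."""
    remaining = list(p_true)
    for p in p_submitted:
        hit = next((q for q in remaining if abs(p - q) / q <= tol), None)
        if hit is None:
            return False        # a submitted planet matches nothing
        remaining.remove(hit)
    return not remaining        # every true planet must be accounted for
```

Sweeping tol over a range and re-scoring all submissions is one way to run the sensitivity analysis described above: if the ranking of models is unchanged across the sweep, the conclusion is robust to the threshold choice.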
| Model | Pass Easy | Pass Med | Pass Hard | Pass Easy (+Skills) | Pass Med (+Skills) | Pass Hard (+Skills) | Pass@3 Easy | Pass@3 Med | Pass@3 Hard | Pass@3 Easy (+Skills) | Pass@3 Med (+Skills) | Pass@3 Hard (+Skills) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5-mini | 76.7 | 33.6 | 2.5 | 90.0 +13.3 | 35.8 +2.2 | 2.5 +0.0 | 95.0 | 35.0 | 2.5 | 95.0 +0.0 | 37.5 +2.5 | 2.5 +0.0 |
| Qwen-3.5-Plus | 26.7 | 25.0 | 1.6 | 48.3 +21.6 | 23.3 -1.7 | 0.0 -1.6 | 60.0 | 30.0 | 2.5 | 90.0 +30.0 | 35.0 +5.0 | 0.0 -2.5 |
| Gemini-3.1-Pro | 71.7 | 35.0 | 5.0 | 100.0 +28.3 | 56.7 +21.7 | 16.7 +11.7 | 95.0 | 35.0 | 7.5 | 100.0 +5.0 | 80.0 +45.0 | 35.0 +27.5 |
| GPT-5.3-codex | 80.0 | 30.8 | 4.2 | 74.6 -5.4 | 53.2 +22.4 | 25.6 +21.4 | 95.0 | 40.0 | 7.5 | 100.0 +5.0 | 67.5 +27.5 | 31.6 +24.1 |
Failure correlation heatmap across task drivers and evaluation outcomes.
Representative case-study trajectories illustrating successful recovery and local-minimum failure modes.
| Model | ΔBIC Easy | RMS Easy | Match Easy | Count Easy | ΔBIC Med | RMS Med | Match Med | Count Med | ΔBIC Hard | RMS Hard | Match Hard | Count Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kimi-K2.5 | 100 | 50.0 | 50.0 | 100 | 89.8 | 72.7 | 27.3 | 77.3 | 84.9 | 64.2 | 1.9 | 35.8 |
| Qwen-3.5-Plus | 90.0 | 85.0 | 80.0 | 100 | 87.5 | 87.5 | 36.4 | 80.7 | 89.8 | 71.2 | 0.0 | 27.1 |
| o3-mini | 98.3 | 80.8 | 70.0 | 82.5 | 92.5 | 78.6 | 27.7 | 69.2 | 89.7 | 62.3 | 0.7 | 27.4 |
| GPT-5.2 | 100 | 93.1 | 86.2 | 100 | 94.0 | 96.4 | 43.4 | 77.1 | 95.2 | 90.5 | 11.9 | 59.5 |
| Claude-Sonnet-4.6 | 97.9 | 97.9 | 87.2 | 95.7 | 100 | 98.7 | 46.1 | 82.9 | 96.6 | 96.6 | 3.4 | 10.3 |
| Gemini-3.1-Pro | 100 | 95.8 | 91.7 | 100 | 95.5 | 97.3 | 39.6 | 76.6 | 88.8 | 86.7 | 3.1 | 58.2 |
| GPT-5-mini | 100 | 92.7 | 83.6 | 92.7 | 93.3 | 94.3 | 39.0 | 73.3 | 93.3 | 84.4 | 0.0 | 25.6 |
| GPT-5.3-codex | 94.7 | 84.2 | 78.9 | 94.7 | 92.6 | 90.7 | 38.9 | 79.6 | 100 | 100 | 33.3 | 33.3 |
@article{liu2026stargazer,
title={Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints},
author={Liu, Xinge and Zhang, Terry Jingchen and Sch{\"o}lkopf, Bernhard and Jin, Zhijing and Menou, Kristen},
year={2026},
journal={arXiv preprint arXiv:2604.15664},
url={https://arxiv.org/abs/2604.15664}
}