We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative, physics-grounded model-fitting tasks: inferring planetary systems from radial-velocity (RV) time-series data. The benchmark contains 120 tasks across three difficulty tiers, including 20 real archival cases, and spans settings from high-SNR single-planet systems to complex multi-planet configurations that demand careful low-SNR analysis.
Across eight frontier agents, we observe a persistent gap between numerical optimization and physical recovery: agents often produce statistically good fits while failing to recover the correct planetary system parameters. Stargazer is designed to evaluate agentic workflows involving period search, iterative Keplerian fitting, model selection, and repeated submission under feedback.
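The Keplerian fitting step in this workflow rests on standard two-body physics: solve Kepler's equation for the eccentric anomaly, convert to the true anomaly, and evaluate the RV signature. A minimal sketch of that forward model (function names are ours for illustration, not the benchmark's API):

```python
import numpy as np

def kepler_solve(M, e, tol=1e-10):
    """Solve Kepler's equation E - e*sin(E) = M for E by Newton's method."""
    E = np.asarray(M, dtype=float).copy()  # E = M is a good start for small e
    for _ in range(50):
        dE = (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        E -= dE
        if np.max(np.abs(dE)) < tol:
            break
    return E

def keplerian_rv(t, P, K, e, omega, t_peri):
    """Single-planet RV signal: K * (cos(nu + omega) + e*cos(omega))."""
    M = np.mod(2.0 * np.pi * (np.asarray(t, dtype=float) - t_peri) / P,
               2.0 * np.pi)                      # mean anomaly
    E = kepler_solve(M, e)                       # eccentric anomaly
    nu = 2.0 * np.arctan2(np.sqrt(1.0 + e) * np.sin(E / 2.0),
                          np.sqrt(1.0 - e) * np.cos(E / 2.0))  # true anomaly
    return K * (np.cos(nu + omega) + e * np.cos(omega))
```

In a typical pipeline, a Lomb-Scargle periodogram seeds the trial period, and a nonlinear least-squares fit of this model refines (P, K, e, ω, t_peri) per planet.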
The benchmark combines a scalable synthetic task generator with real radial-velocity systems from archival astronomy data. Each episode exposes an RV dataset and asks an agent to infer the underlying planetary configuration using tools for scientific computation and structured submission.
Overview of Stargazer. Left: 120 RV tasks controlled by six physical factors. Center: agents run a periodogram-to-Keplerian workflow and are graded on statistical and physical criteria. Right: strong statistical fit does not guarantee correct physical recovery.
Agents interact with the environment through a ReAct-style loop with two tools: a PythonREPL for analysis code and a submit_action interface for proposing candidate planetary systems. After each submission, the evaluator returns criterion-level feedback so agents can revise their hypothesis.
A task passes only if all four criteria are satisfied simultaneously: ok_delta_bic, ok_rms, ok_match, and ok_count. This conjunction gate is what makes the benchmark physically meaningful rather than a pure curve-fitting exercise.
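The loop and the conjunction gate can be sketched as follows. The four criterion names come from the benchmark; the feedback schema and helper names are illustrative assumptions, not the environment's actual API:

```python
# Sketch of the submit-and-revise loop; the feedback dict schema and the
# `propose`/`submit_action` signatures here are assumptions for illustration.
CRITERIA = ("ok_delta_bic", "ok_rms", "ok_match", "ok_count")

def run_episode(propose, submit_action, max_submissions=5):
    """Alternate analysis and submission until all four criteria pass."""
    candidate, feedback = None, None
    for _ in range(max_submissions):
        candidate = propose(feedback)           # analysis step (e.g. in a PythonREPL)
        feedback = submit_action(candidate)     # evaluator returns criterion-level booleans
        if all(feedback[c] for c in CRITERIA):  # conjunction gate: every criterion must hold
            break
    return candidate, feedback
```

Because the gate is a conjunction, an agent can satisfy ok_delta_bic and ok_rms on every submission and still fail the episode on ok_match or ok_count.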
Framework of Stargazer. Left: task generation from synthetic physics or archival RV data. Center: agent iteration loop of analysis, submission, and diagnostic feedback. Right: evaluator forward-models submissions and grades with ΔBIC, RMS, Match, and Count.
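The two statistical criteria can be sketched with standard definitions. We assume a Gaussian BIC up to an additive constant; Stargazer's exact BIC form, null model, and thresholds are not specified here, so treat this as an illustrative grading sketch:

```python
import numpy as np

def rms(residuals):
    """Root-mean-square of the fit residuals."""
    return np.sqrt(np.mean(np.asarray(residuals) ** 2))

def bic(residuals, n_params):
    """Gaussian BIC up to a constant: n*ln(RSS/n) + k*ln(n)."""
    r = np.asarray(residuals)
    n = r.size
    rss = np.sum(r ** 2)
    return n * np.log(rss / n) + n_params * np.log(n)

def delta_bic(resid_model, k_model, resid_null, k_null):
    """Negative values favor the submitted model over the null (e.g. no-planet) fit."""
    return bic(resid_model, k_model) - bic(resid_null, k_null)
```

A submission would then pass the statistical side of the gate when delta_bic falls below, and rms above, task-specific thresholds.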
A representative failure mode in Stargazer: the agent reaches an excellent residual RMS, yet still fails physical recovery because it converges to a two-planet alias and misses the true planet count. The benchmark rewards physical recovery, not just a visually tight fit.
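Such spurious solutions often arise from sampling aliases: a true frequency beats against the observing cadence, producing a second peak that fits the data nearly as well. The first-order alias relation f_alias = |f_true ± f_sampling| is standard; the helper below is our illustration, not part of the benchmark:

```python
def alias_periods(p_true, p_sampling=1.0):
    """First-order alias periods 1/|f_true ± f_sampling| (same units as inputs)."""
    f, fs = 1.0 / p_true, 1.0 / p_sampling
    out = [1.0 / (f + fs)]          # sum alias always exists
    if abs(f - fs) > 1e-12:
        out.append(1.0 / abs(f - fs))  # difference alias, undefined when f == fs
    return out
```

For a 3.1-day planet observed at a 1-day cadence, this predicts aliases near 0.76 and 1.48 days, either of which can masquerade as an additional planet.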
Interactive walkthrough of a failure case where fit quality passes, but the recovered planetary system is still wrong.
We evaluate eight frontier agents and two deterministic baselines. Three findings stand out. First, statistical fit quality does not imply physical recovery. Second, increasing test-time compute yields only marginal gains and often reflects recursive failure loops rather than meaningful exploration. Third, all evaluated agents fail on the real-data subset, even though these systems are solved by human astronomers in the literature.
Skills help on Easy-tier tasks mainly by improving efficiency and letting agents reach submission more often, but they do not reliably transfer to Hard-tier reasoning. The benchmark therefore exposes a bottleneck in model selection and physically grounded inference rather than in raw numerical optimization.
| Model | Pass Easy | Pass Med | Pass Hard | Done Easy | Done Med | Done Hard | Pass@3 Easy | Pass@3 Med | Pass@3 Hard | Real (20 tasks) |
|---|---|---|---|---|---|---|---|---|---|---|
| Classical Pipeline | 95.0 | 35.0 | 5.0 | 100 | 100 | 100 | 95.0 | 35.0 | 5.0 | --- |
| Nested Sampling | 95.0 | 32.5 | 0.0 | 100 | 100 | 100 | 95.0 | 32.5 | 0.0 | --- |
| o3-mini | 40.0 | 24.2 | 0.0 | 76.7 | 35.8 | 4.2 | 95.0 | 40.0 | 0.0 | 0.0 |
| GPT-5-mini | 76.7 | 33.6 | 2.5 | 76.7 | 46.7 | 5.0 | 95.0 | 35.0 | 2.5 | 0.0 |
| GPT-5.2 | 40.0 | 30.0 | 5.8 | 40.0 | 33.3 | 5.8 | 75.0 | 37.5 | 12.5 | 0.0 |
| Kimi-K2.5 | 13.3 | 17.9 | 0.8 | 13.3 | 19.2 | 2.5 | 40.0 | 35.0 | 2.5 | 0.0 |
| Qwen-3.5-Plus | 26.7 | 25.0 | 1.6 | 25.0 | 25.8 | 1.7 | 60.0 | 30.0 | 2.5 | 0.0 |
| Gemini-3.1-Pro | 71.7 | 35.0 | 5.0 | 71.7 | 35.0 | 5.8 | 95.0 | 35.0 | 7.5 | 0.0 |
| Claude-Sonnet-4.6 | 68.3 | 22.5 | 0.8 | 68.3 | 28.3 | 0.8 | 75.0 | 32.5 | 2.5 | 0.0 |
| GPT-5.3-codex | 80.0 | 30.8 | 4.2 | 88.3 | 48.3 | 7.5 | 95.0 | 40.0 | 7.5 | 0.0 |
Main results on Stargazer across tiers, computed over three independent runs; all values are percentages. Pass rates and Env Done rates are averaged; Pass@3 reports the fraction of tasks solved in at least one run. Bold = best per column; underlined = second best.
Statistical criteria remain relatively high while physical recovery drops sharply on harder tiers.
Match-threshold sensitivity analysis shows that model rankings remain stable across a broad threshold range.
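A period-match criterion of this kind can be sketched as a greedy one-to-one matching at a fractional tolerance; the tolerance value and matching rule below are assumptions, not the benchmark's exact definition:

```python
def periods_match(p_submitted, p_true, tol=0.01):
    """Greedy one-to-one matching of submitted to true periods at fractional tol."""
    remaining = list(p_true)
    for p in p_submitted:
        hit = next((q for q in remaining if abs(p - q) / q <= tol), None)
        if hit is None:
            return False        # a submitted planet matches nothing
        remaining.remove(hit)
    return not remaining        # every true planet must be accounted for
```

Sweeping tol over a range and re-scoring all submissions is one way to run the sensitivity analysis described above: if the ranking of models is unchanged across the sweep, the conclusion is robust to the threshold choice.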
| Model | Pass Easy | Pass Med | Pass Hard | Pass Easy (+Skills) | Pass Med (+Skills) | Pass Hard (+Skills) | Pass@3 Easy | Pass@3 Med | Pass@3 Hard | Pass@3 Easy (+Skills) | Pass@3 Med (+Skills) | Pass@3 Hard (+Skills) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-5-mini | 76.7 | 33.6 | 2.5 | 90.0 +13.3 | 35.8 +2.2 | 2.5 +0.0 | 95.0 | 35.0 | 2.5 | 95.0 +0.0 | 37.5 +2.5 | 2.5 +0.0 |
| Qwen-3.5-Plus | 26.7 | 25.0 | 1.6 | 48.3 +21.6 | 23.3 -1.7 | 0.0 -1.6 | 60.0 | 30.0 | 2.5 | 90.0 +30.0 | 35.0 +5.0 | 0.0 -2.5 |
| Gemini-3.1-Pro | 71.7 | 35.0 | 5.0 | 100.0 +28.3 | 56.7 +21.7 | 16.7 +11.7 | 95.0 | 35.0 | 7.5 | 100.0 +5.0 | 80.0 +45.0 | 35.0 +27.5 |
| GPT-5.3-codex | 80.0 | 30.8 | 4.2 | 74.6 -5.4 | 53.2 +22.4 | 25.6 +21.4 | 95.0 | 40.0 | 7.5 | 100.0 +5.0 | 67.5 +27.5 | 31.6 +24.1 |
Failure correlation heatmap across task drivers and evaluation outcomes.
Representative case-study trajectories illustrating successful recovery and local-minimum failure modes.
| Model | ΔBIC Easy | RMS Easy | Match Easy | Count Easy | ΔBIC Med | RMS Med | Match Med | Count Med | ΔBIC Hard | RMS Hard | Match Hard | Count Hard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Kimi-K2.5 | 100 | 50.0 | 50.0 | 100 | 89.8 | 72.7 | 27.3 | 77.3 | 84.9 | 64.2 | 1.9 | 35.8 |
| Qwen-3.5-Plus | 90.0 | 85.0 | 80.0 | 100 | 87.5 | 87.5 | 36.4 | 80.7 | 89.8 | 71.2 | 0.0 | 27.1 |
| o3-mini | 98.3 | 80.8 | 70.0 | 82.5 | 92.5 | 78.6 | 27.7 | 69.2 | 89.7 | 62.3 | 0.7 | 27.4 |
| GPT-5.2 | 100 | 93.1 | 86.2 | 100 | 94.0 | 96.4 | 43.4 | 77.1 | 95.2 | 90.5 | 11.9 | 59.5 |
| Claude-Sonnet-4.6 | 97.9 | 97.9 | 87.2 | 95.7 | 100 | 98.7 | 46.1 | 82.9 | 96.6 | 96.6 | 3.4 | 10.3 |
| Gemini-3.1-Pro | 100 | 95.8 | 91.7 | 100 | 95.5 | 97.3 | 39.6 | 76.6 | 88.8 | 86.7 | 3.1 | 58.2 |
| GPT-5-mini | 100 | 92.7 | 83.6 | 92.7 | 93.3 | 94.3 | 39.0 | 73.3 | 93.3 | 84.4 | 0.0 | 25.6 |
| GPT-5.3-codex | 94.7 | 84.2 | 78.9 | 94.7 | 92.6 | 90.7 | 38.9 | 79.6 | 100 | 100 | 33.3 | 33.3 |
@article{liu2026stargazer,
title={Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints},
author={Liu, Xinge and Zhang, Terry Jingchen and Sch{\"o}lkopf, Bernhard and Jin, Zhijing and Menou, Kristen},
year={2026},
journal={arXiv preprint arXiv:2604.15664},
url={https://arxiv.org/abs/2604.15664}
}