

ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas


🆕 Updates

| Date | Updates |
| --- | --- |
| 2026/01/30 | 📄 Paper Release |
| 2026/01/22 | 🎉 Release Code, Models, and Datasets |

📖 Overview

This repository provides an end-to-end pipeline for the fully automated, verifiable synthesis of high-quality data and environments, with native support for process-level rewards. It is designed for training models with multi-step reasoning and tool-use capabilities, and it is easy to scale to new tasks and tools. The two main modules are:

  • Trajectory Synthesis: Automatically generates high-quality, multi-step interactive trajectories, which are then verified by a reward system.

  • Environment Synthesis: Fully automatically synthesizes interactive environments, with no human labels required, that provide step-wise process rewards to enable RLVR (Reinforcement Learning with Verifiable Rewards).

| Module | Function | Directory |
| --- | --- | --- |
| Trajectory Synthesis | Tool graph construction → Task generation → Trajectory collection → Reward assessment | trajectory_synthesis/ |
| Environment Synthesis | Question decomposition → Automatic tool environment generation → RLVR training data | env_synthesis/ |
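At a high level, each module is a linear composition of stages, where each stage's output feeds the next. A minimal sketch of that idea for the trajectory-synthesis side (all function and field names here are illustrative, not the repository's real API; see trajectory_synthesis/ for the actual entry points):

```python
# Hypothetical sketch: a pipeline as a left-to-right composition of stages.
from functools import reduce

def compose(*stages):
    """compose(f, g)(x) == g(f(x)) -- run stages left to right."""
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)

# Trajectory synthesis: tool graph -> tasks -> trajectories -> reward check.
trajectory_synthesis = compose(
    lambda docs: {"graph": f"tool graph from {len(docs)} servers"},
    lambda state: {**state, "tasks": ["task-1", "task-2"]},
    lambda state: {**state, "trajectories": [f"traj:{t}" for t in state["tasks"]]},
    # Reward assessment: keep only trajectories that pass verification
    # (here, trivially all of them).
    lambda state: {**state, "verified": list(state["trajectories"])},
)

result = trajectory_synthesis(["server_a.json", "server_b.json"])
print(sorted(result))  # ['graph', 'tasks', 'trajectories', 'verified']
```

The same composition pattern applies to environment synthesis, with decomposition and tool-generation stages instead.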

🏆 Model Performance

We release two models, ASTRA-32B-Thinking-v1 and ASTRA-14B-Thinking-v1, trained with SFT and RL on our synthesized data. Below are the evaluation results on BFCL-V3-MT:

| Model | Base | Long Context | Miss Func | Miss Param | Average ↓ |
| --- | --- | --- | --- | --- | --- |
| Claude-Opus-4-5-20251101 | 81.5 | 70.5 | 64.0 | 58.0 | 68.5 |
| GLM-4.6 | 74.5 | 66.5 | 68.0 | 63.0 | 68.0 |
| ASTRA-32B-Thinking-v1 | 76.5 | 66.5 | 65.5 | 48.5 | 64.3 |
| Gemini-3-Pro-Preview | 69.0 | 64.0 | 63.0 | 56.5 | 63.1 |
| o3-2025-04-16 | 68.0 | 63.0 | 63.5 | 54.5 | 62.3 |
| Claude-Sonnet-4-5-20250929 | 69.0 | 59.0 | 65.0 | 52.5 | 61.4 |
| Grok-4-1-fast-reasoning | 70.5 | 62.5 | 59.5 | 43.0 | 58.9 |
| ASTRA-14B-Thinking-v1 | 67.0 | 61.0 | 56.0 | 48.5 | 58.1 |
| LoopTool-32B (reported in paper) | – | – | – | – | 57.8 |
| Claude-Haiku-4-5-20251001 | 63.5 | 56.0 | 42.5 | 52.5 | 53.6 |
| Kimi-K2-Instruct | 62.0 | 55.0 | 41.0 | 44.5 | 50.6 |
| Qwen3-32B | 59.0 | 51.5 | 47.5 | 40.5 | 49.6 |
| Qwen3-30B-A3B-Thinking-2507 | 66.0 | 58.0 | 31.5 | 35.5 | 47.8 |
| TouCan-32B (reported in paper) | – | – | – | – | 46.5 |
| Qwen3-14B | 50.5 | 48.0 | 39.5 | 40.0 | 44.5 |
| Qwen3-30B-A3B-Instruct-2507 | 43.5 | 41.0 | 10.5 | 25.0 | 30.0 |

🔄 Pipelines

Part 1: Trajectory Synthesis

SFT Pipeline

Starting from MCP Server tool documentation, the pipeline builds tool dependency graphs and generates high-quality SFT training data.

mcp_servers.jsonl → Graph construction → Task generation → LLM interaction → Reward assessment → SFT data
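The pipeline's input is mcp_servers.jsonl, a JSON Lines file with one server description per line. A sketch of what one such entry might look like and how it would be parsed back (the field names here are illustrative assumptions; the real schema is documented in trajectory_synthesis/README.md):

```python
import io
import json

# Hypothetical mcp_servers.jsonl entry: one server with its tool documentation.
line = json.dumps({
    "server": "weather",
    "tools": [{
        "name": "get_forecast",
        "description": "Return the forecast for a city.",
        "parameters": {"city": {"type": "string", "required": True}},
    }],
})

# JSONL is read back one JSON object per line (StringIO stands in for the file).
servers = [json.loads(l) for l in io.StringIO(line + "\n")]
print(servers[0]["tools"][0]["name"])  # get_forecast
```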

👉 For detailed usage instructions, please refer to trajectory_synthesis/README.md


Part 2: Environment Synthesis

Environment Synthesis Pipeline

Automatically generates executable tool environments from Q&A pairs, supporting RLVR training.

QA data → Question decomposition → Tool necessity check → Verification → Environment synthesis → Tool merging
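The key idea behind the step-wise process reward is that once a question is decomposed into verifiable sub-steps, each intermediate answer can be scored independently. A toy sketch under that assumption (the decomposition and reward functions below are hypothetical stand-ins, not the repository's implementation):

```python
def decompose(question):
    # Toy decomposition of a two-hop arithmetic question into sub-steps,
    # each paired with an executable checker that yields the verified value.
    # The real pipeline derives sub-questions and tools from QA data.
    return [
        ("12 * 3", lambda: 12 * 3),
        ("36 + 4", lambda: 36 + 4),
    ]

def process_rewards(predicted, gold_steps):
    """Step-wise reward: 1.0 per step whose prediction matches the check."""
    return [1.0 if p == check() else 0.0
            for p, (_, check) in zip(predicted, gold_steps)]

steps = decompose("What is 12 * 3 + 4?")
print(process_rewards([36, 40], steps))  # [1.0, 1.0]  (both steps correct)
print(process_rewards([35, 40], steps))  # [0.0, 1.0]  (first step wrong)
```

Because each step is checked against an executable ground truth rather than a human label, the rewards are verifiable by construction, which is what makes the environments usable for RLVR.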

👉 For detailed usage instructions, please refer to env_synthesis/README.md


📜 License

This project is licensed under the Apache 2.0 License.


📎 Citation

    @misc{tian2026astraautomatedsynthesisagentic,
      title={ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas},
      author={Xiaoyu Tian and Haotian Wang and Shuaiting Chen and Hao Zhou and Kaichi Yu and Yudian Zhang and Jade Ouyang and Junxi Yin and Jiong Chen and Baoyan Guo and Lei Zhang and Junjie Tao and Yuansheng Song and Ming Cui and Chengwei Liu},
      year={2026},
      eprint={2601.21558},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.21558},
    }

About

ASTRA is an end-to-end system for synthesizing agentic trajectories and rule-verifiable environments for SFT and RL training, developed by Beike Language and Intelligence (BLI).

