📖 Overview
SkillX is a fully automated framework that constructs a reusable, plug-and-play skill knowledge base for LLM agents from experience.
Instead of storing raw trajectories, workflows, or loosely structured reflections, SkillX distills agent experience into a three-level skill hierarchy:
- Planning Skills for high-level task organization
- Functional Skills for reusable tool-based subroutines
- Atomic Skills for execution-oriented tool usage patterns
Built with a strong backbone agent, SkillX produces a transferable skill library that can be directly plugged into weaker base agents and new environments. Across challenging long-horizon, user-interactive benchmarks such as AppWorld, BFCL-v3, and τ2-Bench, SkillX consistently improves both task success and execution efficiency.
Data Formats
Trajectory Input (JSONL)
SkillX expects trajectories in the following schema:
{ "trajectory_id": "traj_001", "task_id": "task_001", "user_task": "How many songs are in my Spotify library?", "task_history": [ {"role": "system", "content": "You are a helpful assistant..."}, {"role": "assistant", "content": "I'll help you count..."}, {"role": "user", "content": "Output:\n```\n{\"songs\": 150}\n```"} ], "reward": 1.0, "metadata": {} }
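A minimal sketch of reading such a file, assuming one JSON object per line and treating the six top-level keys shown above as required. The names `parse_trajectory` and `load_trajectories` and the `min_reward` threshold are illustrative assumptions, not part of the SkillX codebase.

```python
import json

# Top-level keys taken from the example record above.
REQUIRED_KEYS = {"trajectory_id", "task_id", "user_task", "task_history", "reward", "metadata"}


def parse_trajectory(line: str) -> dict:
    """Parse one JSONL line and check it has the fields shown in the schema."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for turn in record["task_history"]:
        if not {"role", "content"} <= turn.keys():
            raise ValueError("each task_history entry needs 'role' and 'content'")
    return record


def load_trajectories(lines, min_reward=None):
    """Yield parsed trajectories, optionally keeping only reward >= min_reward."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = parse_trajectory(line)
        if min_reward is None or record["reward"] >= min_reward:
            yield record
```

Passing an open file handle as `lines` with `min_reward=1.0` would keep only fully rewarded trajectories; whether SkillX itself uses that cutoff to define "successful" is an assumption here.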
🤖 Key Features
Hierarchical Multi-Level Skill Design
SkillX transforms raw trajectories into a structured three-tier skill space:
- Planning Skills capture high-level decomposition and ordering
- Functional Skills represent reusable multi-step tool subroutines
- Atomic Skills encode practical tool usage constraints and patterns
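One way to picture the three tiers is as a small typed container. This is only a sketch: the class names `SkillLevel`, `Skill`, and `SkillLibrary` and all field names are hypothetical and do not reflect the actual storage format of SkillX.

```python
from dataclasses import dataclass, field
from enum import Enum


class SkillLevel(Enum):
    PLANNING = "planning"      # high-level decomposition and ordering
    FUNCTIONAL = "functional"  # reusable multi-step tool subroutines
    ATOMIC = "atomic"          # tool usage constraints and patterns


@dataclass
class Skill:
    name: str
    level: SkillLevel
    description: str
    # Tools the skill refers to; may be empty for a pure plan.
    tools: list[str] = field(default_factory=list)


@dataclass
class SkillLibrary:
    skills: list[Skill] = field(default_factory=list)

    def add(self, skill: Skill) -> None:
        self.skills.append(skill)

    def by_level(self, level: SkillLevel) -> list[Skill]:
        """Return all skills stored at the given tier, in insertion order."""
        return [s for s in self.skills if s.level is level]
```

Keeping the level as an explicit field means the three tiers can be stored together and filtered when needed.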
Fully Automated Skill KB Construction
SkillX provides an end-to-end automated pipeline that:
- rolls out agents on training tasks,
- extracts reusable skills from successful trajectories,
- consolidates and filters low-quality skills,
- and builds a reusable plug-and-play skill knowledge base.
Iterative Skill Refinement
SkillX continuously improves the skill library through:
- skill merging for consolidating redundant behaviors,
- quality filtering for removing brittle or hallucinated skills,
- and iterative updates that add, modify, or keep skills based on execution feedback.
Exploratory Skill Expansion
Beyond seed demonstrations, SkillX proactively discovers new skills by:
- identifying under-used and failure-prone tools,
- guiding environment exploration,
- synthesizing new tasks from exploratory trajectories,
- and expanding skill coverage beyond the original training distribution.
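The first step above, picking out under-used and failure-prone tools, can be sketched with simple counting. This assumes tool calls have already been extracted from trajectories as `(tool_name, succeeded)` pairs; that representation, the function name `exploration_targets`, and both thresholds are assumptions made for illustration.

```python
from collections import Counter


def exploration_targets(calls, all_tools, min_calls=3, max_failure_rate=0.5):
    """Return (under_used, failure_prone) tool names, each sorted alphabetically.

    calls: iterable of (tool_name, succeeded) pairs.
    all_tools: every tool name the environment exposes.
    """
    total = Counter()
    failed = Counter()
    for tool, succeeded in calls:
        total[tool] += 1
        if not succeeded:
            failed[tool] += 1

    # Tools seen fewer than min_calls times (including never) are under-used.
    under_used = sorted(t for t in all_tools if total[t] < min_calls)
    # Tools with enough calls and a high failure share are failure-prone.
    failure_prone = sorted(
        t for t in all_tools
        if total[t] > 0 and total[t] >= min_calls
        and failed[t] / total[t] > max_failure_rate
    )
    return under_used, failure_prone
```

The two lists could then be used to steer exploration toward those tools.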
Plug-and-Play Transfer Across Agents
The resulting skill library can be directly injected into different base agents, enabling strong-to-weak transfer without retraining the underlying model.
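As a rough illustration of what injection could look like, the sketch below retrieves a few skills for a task and appends them to the system prompt once. Word-overlap scoring is a deliberately simple stand-in for whatever retriever SkillX uses, and the dictionary keys (`name`, `level`, `description`) and function names are hypothetical.

```python
def retrieve(skills, task, k=3):
    """Rank skills by word overlap with the task; return up to k with nonzero overlap."""
    task_words = set(task.lower().split())
    scored = []
    for skill in skills:
        overlap = len(task_words & set(skill["description"].lower().split()))
        if overlap:
            scored.append((overlap, skill))
    # Stable sort: skills with equal overlap keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [skill for _, skill in scored[:k]]


def inject(system_prompt, skills):
    """Append retrieved skills to the system prompt as one itemized block."""
    if not skills:
        return system_prompt
    lines = [f"- [{s['level']}] {s['name']}: {s['description']}" for s in skills]
    return system_prompt + "\n\nRelevant skills:\n" + "\n".join(lines)
```

Because the skills travel as plain prompt text, the same block can be handed to any base agent without touching model weights.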
Better Performance and Efficiency
SkillX consistently improves:
- task success rate on challenging benchmarks,
- execution efficiency by reducing unnecessary exploration and tool misuse,
- and generalization through structured, reusable experience abstraction.
📊 Highlights
- Roughly 10 points of absolute improvement for weaker base agents on multiple benchmarks
- Consistent gains on AppWorld, BFCL-v3, and τ2-Bench
- Stronger transferability than trajectory-based, workflow-based, and memory-based baselines
- Improved execution efficiency with fewer redundant steps
- Effective even when the skill library is built by a stronger model and used by weaker ones
🧠 Why SkillX?
Existing experience-learning methods often suffer from:
- Isolated learning: agents repeatedly rediscover similar behaviors
- Weak transferability: raw trajectories and reflections often do not generalize well
- Capability bottlenecks: self-extracted experience is limited by the agent’s own strength
SkillX addresses these issues by building a structured skill knowledge base that is:
- reusable across tasks
- transferable across agents
- lightweight to retrieve
- easy to inject into prompts
- more robust than long-context progressive skill formats
🏗️ Method Overview
SkillX consists of three core components:
1. Multi-Level Skills Extraction
From successful trajectories, SkillX automatically extracts:
- Planning skills: concise, reusable task plans
- Functional skills: reusable tool-composition procedures
- Atomic skills: tool-specific usage guidance, constraints, and failure notes
2. Iterative Skills Refinement
SkillX improves library quality through:
- Skills Merge: cluster and consolidate similar skills
- Skills Filter: remove non-portable, hallucinated, or invalid skills
- Skills Update: add, modify, or keep skills across iterations
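The merge and filter steps can be approximated with a toy example. This sketch substitutes Jaccard word overlap for the similarity measure and a known-tool check for the validity filter; neither is claimed to be what SkillX does, and the function names and the `tools` key are hypothetical. Consolidating each cluster into a single skill is left out.

```python
def tokens(text: str) -> set[str]:
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two descriptions."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def cluster_similar(descriptions: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Greedy clustering: join the first cluster whose first member is similar enough."""
    clusters: list[list[str]] = []
    for text in descriptions:
        for cluster in clusters:
            if jaccard(text, cluster[0]) >= threshold:
                cluster.append(text)
                break
        else:
            clusters.append([text])  # no similar cluster found, start a new one
    return clusters


def filter_valid(skills: list[dict], known_tools: set[str]) -> list[dict]:
    """Drop skills that reference tools absent from the environment."""
    return [s for s in skills if set(s["tools"]) <= known_tools]
```

Each resulting cluster is a candidate for consolidation, and any skill naming a tool that does not exist is treated as a likely hallucination.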
3. Exploratory Skills Expansion
SkillX expands beyond observed demonstrations by:
- guiding exploration toward under-covered tools and failure modes,
- synthesizing new tasks from exploration,
- and rerunning extraction and refinement to grow the skill library.
📈 Main Results
SkillX improves agentic performance across multiple LLM backbones and benchmarks.
Representative gains
- On Qwen3-32B, SkillX improves results by around 10 points on several benchmarks
- On Kimi-K2-Instruct-0905, SkillX yields clear gains, especially on AppWorld
- On GLM-4.6, SkillX still improves performance and execution efficiency despite the model already being strong
Benchmarks
- AppWorld
- BFCL-v3
- τ2-Bench
Key takeaway
SkillX outperforms the No-memory setting as well as strong experience-learning baselines:
- A-Mem
- AWM
- ExpeL
This shows that how experience is represented matters as much as, or more than, where it comes from.
🔍 What Makes SkillX Different?
Compared with prior experience formats:
- Raw trajectories are verbose and difficult to transfer
- Insights/reflections are often too abstract
- Workflows may miss low-level tool constraints
- Claude-style skills rely on long-context progressive disclosure and complex environment support
In contrast, SkillX offers:
- hierarchical, itemized, reusable skills
- one-time prompt injection
- lightweight retrieval
- strong transfer across agents and environments
🚀 Use Cases
SkillX is especially useful for:
- tool-using LLM agents
- long-horizon task execution
- interactive application environments
- cross-agent knowledge transfer
- building reusable agent skill libraries from experience
🧪 Benchmarks Used
AppWorld
A realistic ecosystem of apps and APIs for long-horizon agent execution.
BFCL-v3
A challenging benchmark for multi-turn function calling and tool use.
τ2-Bench
A user-interactive benchmark focused on conversational tool-using agents.
📦 Planned Release
We will publicly release:
- the SkillX codebase
- the automatically constructed skill knowledge base
- and supporting resources for skill extraction, refinement, and retrieval
🙏 Acknowledgement
We deeply appreciate the invaluable effort contributed by our dedicated team of developers, supportive users, and esteemed industry partners.
📚 Citation
If you find this work helpful, please consider citing:
@article{wang2026skillx, author = {Chenxi Wang and Zhuoyun Yu and Xin Xie and Wuguannan Yao and Runnan Fang and Shuofei Qiao and Kexin Cao and Guozhou Zheng and Xiang Qi and Peng Zhang and Shumin Deng}, title = {SkillX: Automatically Constructing Skill Knowledge Bases for Agents}, year = {2026}, eprint = {2604.04804}, archivePrefix = {arXiv}, primaryClass = {cs.CL}, url = {https://arxiv.org/abs/2604.04804} }
🙏 Acknowledgement
This repository builds upon code from ReMe and AgentEvolver. The baseline implementations are adapted from A-Mem, AWM, and ExpeL. We sincerely thank all contributors for their outstanding work!