Awesome-Self-Evolving-Agents

🤖 We're still cooking — Stay tuned! 🤖
⭐ Give us a star if you like it! ⭐

Figure: A visual taxonomy of AI agent evolution and optimisation techniques, categorised into three major directions: single-agent optimisation, multi-agent optimisation, and domain-specific optimisation. The tree structure illustrates the development of these approaches from 2023 to 2025, including representative methods within each branch.

AI Agents Development Path

Development Path

Conceptual Framework of the Self-Evolving AI Agents

Conceptual Framework

Open-Source Framework

(EMNLP'25 Demo) EvoAgentX: An Automated Framework for Evolving Agentic Workflows [💻 Code] [📝 Paper]
(Arxiv'25) MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems [📝 Paper] [💻 Code]

1. Single-Agent Optimisation

1.1 🤖 LLM Behaviour Optimisation

1.1.1 📌 Training-Based Behaviour Optimisation

(1) 🔧 Supervised Fine-Tuning Approaches

(ICLR'24) ToRA: A tool-integrated reasoning agent for mathematical problem solving [📝 Paper] [💻 Code]
(NeurIPS'22) STaR : Bootstrapping reasoning with reasoning [📝 Paper] [💻 Code]
(Arxiv'24) NExT: Teaching large language models to reason about code execution [📝 Paper]
(EMNLP'24) MuMath-Code: Combining Tool-Use Large Language Models with Multi-perspective Data Augmentation for Mathematical Reasoning [📝 Paper]
(ICML'25) MAS-GPT: Training LLMs to build LLM-based multi-agent systems [📝 Paper] [💻 Code]

(2) 🔧 Reinforcement Learning Approaches

(ICML'24) Self-Rewarding Language Models [📝 Paper] [💻 Code]
(Arxiv'24) Tulu 3: Pushing Frontiers in Open Language Model Post-Training [📝 Paper] [💻 Code]
(EMNLP'24) Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [📝 Paper] [💻 Code]
(Arxiv'24) Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [📝 Paper]
(Arxiv'24) DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data [📝 Paper]
(ICML'25) Diving into Self-Evolving Training for Multimodal Reasoning [📝 Paper] [💻 Code]
(Arxiv'25) Absolute Zero: Reinforced Self-play Reasoning with Zero Data [📝 Paper]
(Arxiv'25) R-Zero: Self-Evolving Reasoning LLM from Zero Data [📝 Paper] [💻 Code]
(Arxiv'25) SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning [📝 Paper] [💻 Code]
(Arxiv'25) DistFlow: A Fully Distributed RL Framework for Scalable and Efficient LLM Post-Training [📝 Paper] [💻 Code]
(Arxiv'25) Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play [📝 Paper] [💻 Code]
(Arxiv'25) Parallel-R1: Towards Parallel Thinking via Reinforcement Learning [📝 Paper] [💻 Code]
(Arxiv'25) SSRL: Self-Search Reinforcement Learning [📝 Paper] [💻 Code]
(Arxiv'25) SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data [📝 Paper] [💻 Code]

1.1.2 📌 Test-Time Behaviour Optimisation

(1) 🔧 Feedback-Based Approaches

(ICLR'23) CodeT: Code Generation with Generated Tests [📝 Paper] [💻 Code]
(ICML'23) LEVER: Learning to Verify Language-to-Code Generation with Execution [📝 Paper] [💻 Code]
(ESEC/FSE'23) Baldur: Whole-Proof Generation and Repair with Large Language Models [📝 Paper]
(ACL'24) Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations [📝 Paper]
(EMNLP'24) Learning Planning-based Reasoning by Trajectories Collection and Process Reward Synthesizing [📝 Paper] [💻 Code]
(Arxiv'24) Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs [📝 Paper]
(ICLR'25) Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning [📝 Paper]
(Arxiv'25) Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy [📝 Paper] [💻 Code]

(2) 🔧 Search-Based Approaches

(ICLR'23) Self-consistency improves chain of thought reasoning in language models [📝 Paper]
(ACL'23) Solving Math Word Problems via Cooperative Reasoning induced Language Models [📝 Paper] [💻 Code]
(NeurIPS'23) Tree of thoughts: Deliberate problem solving with large language models [📝 Paper] [💻 Code]
(NeurIPS'24) Buffer of Thoughts: Thought-Augmented Reasoning with Large Language Models [📝 Paper] [💻 Code]
(COLM'24) Deductive Beam Search: Decoding Deducible Rationale for Chain-of-Thought Reasoning [📝 Paper] [💻 Code]
(AAAI'24) Graph of thoughts: Solving elaborate problems with large language models [📝 Paper] [💻 Code]
(ICML'25) Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning [📝 Paper] [💻 Code]

（3）🔧 Reasoning-Based Approaches

(EMNLP’25) START: Self‑taught Reasoner with Tools [📝 Paper]
(ArXiv’25) CoRT: Code‑integrated Reasoning within Thinking [📝 Paper] [💻 Code]

1.2 💬 Prompt Optimisation

1.2.1 📌 Edit-Based Prompt Optimisation

(EMNLP'22) GPS: Genetic Prompt Search for Efficient Few-shot Learning [📝 Paper] [💻 Code]
(EACL'23) GrIPS: Gradient-free, Edit-based Instruction Search for Prompting Large Language Models [📝 Paper] [💻 Code]
(ICLR'23) TEMPERA: Test-Time Prompting via Reinforcement Learning [📝 Paper] [💻 Code]
(ACL'24) Plum: Prompt Learning using Metaheuristic [📝 Paper] [💻 Code]

1.2.2 📌 Evolutionary Prompt Optimisation

(ICLR'24) EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers [📝 Paper] [💻 Code]
(ICML'24) Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution [📝 Paper]
(Arxiv'25) GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning [📝 Paper]

1.2.3 📌 Generative Prompt Optimisation

(ICLR'23) Large Language Models Are Human-Level Prompt Engineers [📝 Paper] [💻 Code]
(ICLR'24) PromptAgent: Strategic Planning with Language Models Enables Expert-level Prompt Optimization [📝 Paper] [💻 Code]
(ICLR'24) Large Language Models as Optimizers [📝 Paper] [💻 Code]
(ICLR'24) Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization [📝 Paper] [💻 Code]
(EMNLP'24) Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs [📝 Paper] [💻 Code]
(Arxiv'24) Prompt Optimization with Human Feedback [📝 Paper] [💻 Code]
(Arxiv'24) StraGo: Harnessing Strategic Guidance for Prompt Optimization [📝 Paper]
(Arxiv'25) Self-Supervised Prompt Optimization [📝 Paper]

1.2.4 📌 Text Gradient-Based Prompt Optimisation

(EMNLP'23) Automatic Prompt Optimization with "Gradient Descent" and Beam Search [📝 Paper] [💻 Code]
(Arxiv'24) TextGrad: Automatic "Differentiation" via Text [📝 Paper] [💻 Code]
(Arxiv'24) How to Correctly do Semantic Backpropagation on Language-based Agentic Systems [📝 Paper] [💻 Code]
(Arxiv'24) GRAD-SUM: Leveraging Gradient Summarization for Optimal Prompt Engineering [📝 Paper]
(AAAI'25) Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers [📝 Paper] [💻 Code]
(ICML'25) REVOLVE: Optimizing AI Systems by Tracking Response Evolution in Textual Optimization [📝 Paper] [💻 Code]
(Arxiv'25) PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time [📝 Paper]

1.3 🧠 Memory Optimization

(ICML'24) A Human-Inspired Reading Agent with Gist Memory of Very Long Contexts [📝 Paper]
(ICML'24) Agent Workflow Memory [📝 Paper]
(AAAI'24) MemoryBank: Enhancing Large Language Models with Long-Term Memory [📝 Paper]
(EMNLP'24) GraphReader: Building graph-based agent to enhance long-context [📝 Paper]
(Arxiv'24) "My agent understands me better": Integrating Dynamic Human-like Memory Recall and Consolidation in LLM-Based Agents [📝 Paper]
(ICLR'25) Compress to Impress: Unleashing the Potential of Compressive Memory in Real-World Long-Term Conversations [📝 Paper]
(ICLR'25) Boosting knowledge intensive reasoning of llms via inference-time hybrid information [📝 Paper] [💻 Code]
(ACL'25) Improving factuality with explicit working memory [📝 Paper]
(Arxiv'25) A-MEM: Agentic Memory for LLM Agents [📝 Paper]
(Arxiv'25) Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory [📝 Paper]
(Arxiv'25) Memento: Fine‑tuning LLM Agents without Fine‑tuning LLMs [📝 Paper] [💻 Code]
(Arxiv'25) Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning [📝 Paper]
(Arxiv'25) Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory [📝 Paper] [💻 Code]
(Arxiv'25) PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time [📝 Paper]

1.4 🧰 Tool Optimization

1.4.1 📌 Training-Based Tool Optimisation

(1) Supervised Fine-Tuning for Tool Optimisation

(NeurIPS'23) GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction [📝 Paper] [💻 Code]
(ICLR'24) ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [📝 Paper] [💻 Code]
(ACL'24) LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error [📝 Paper] [💻 Code]
(AAAI'24) Confucius: Iterative tool learning from introspection feedback by easy-to-difficult curriculum [📝 Paper] [💻 Code]
(ICLR'25) Learning Evolving Tools for Large Language Models [📝 Paper] [💻 Code]
(ICLR'25) Facilitating Multi-turn Function Calling for LLMs via Compositional Instruction Tuning [📝 Paper] [💻 Code]
(ICLR'25) Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage [📝 Paper] [💻 Code]
(Arxiv'25) Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation [📝 Paper]
(ICML'25) Adapting While Learning: Grounding LLMs for Scientific Problems with Intelligent Tool Usage Adaptation [📝 Paper] [💻 Code]

(2) Reinforcement Learning for Tool Optimisation

(Arxiv'25) ReTool: Reinforcement Learning for Strategic Tool Use in LLMs [📝 Paper] [💻 Code]
(Arxiv'25) ToolRL: Reward is All Tool Learning Needs [📝 Paper] [💻 Code]
(Arxiv'25) Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning [📝 Paper] [💻 Code]
(Arxiv'25) Synthetic Data Generation & Multi-Step RL for Reasoning & Tool Use [📝 Paper]
(Arxiv'25) Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning [📝 Paper] [💻 Code]
(Arxiv'25) Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning [📝 Paper] [💻 Code]
(Arxiv'25) Agentic Reinforced Policy Optimization [📝 Paper] [💻 Code]
(Arxiv'25) AutoTIR: Autonomous Tools Integrated Reasoning via Reinforcement Learning [📝 Paper] [💻 Code]

1.4.2 📌 Inference-Time Tool Optimisation

(1) Prompt-Based Optimisation

(NAACL'25) EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [📝 Paper] [💻 Code]
(ICLR'25) From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions [📝 Paper] [💻 Code]
(ACL'25) Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play [📝 Paper] [💻 Code]

(2) Reasoning-Based Optimisation

(ICLR'24) ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [📝 Paper] [💻 Code]
(ICLR'24) ToolChain*: Efficient Action Space Navigation in Large Language Models with A* Search [📝 Paper]
(ICLR'25) Tool-Planner: Task Planning with Clusters across Multiple Tools [📝 Paper] [💻 Code]
(Arxiv'25) MCP-Zero: Active Tool Discovery for Autonomous LLM Agents [📝 Paper][💻 Code]

1.4.3 📌 Tool Functionality Optimisation

(EMNLP'23) CREATOR : Tool creation for disentangling abstract and concrete reasoning of large language model [📝 Paper] [💻 Code]
(ICML'24) Offline Training of Language Model Agents with Functions as Learnable Weights [📝 Paper]
(CVPR'24) CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update [📝 Paper] [💻 Code]
(Arxiv'25) Alita: Generalist Agent Enabling Scalable Agentic Reasoning with Minimal Predefinition and Maximal Self-Evolution [📝 Paper] [💻 Code]

1.5 🧰 Unified Optimization

(Arxiv'25) Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark [📝 Paper] [💻 Code]
(Arxiv'25) EvoAgent: Self-evolving Agent with Continual World Model for Long-Horizon Tasks [📝 Paper]

2. Multi-Agent Optimisation

2.1 ⚙️ Automatic Multi-Agent Construction

（ICML'25) MetaAgent: Automatically Constructing Multi-Agent Systems Based on Finite State Machines [📝 Paper] [💻 Code]

2.2 🚀 MAS Optimisation

(Arxiv' 25) R&D-Agent: Automating Data-Driven AI Solution Building Through LLM-Powered Automated Research, Development, and Evolution [📝 Paper] [💻 Code]
(ICML'25) Multi-Agent Architecture Search via Agentic Supernet [📝 Paper] [💻Code]
(ICML'25) MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving [📝 Paper]
(ICLR'25) AFlow: Automating Agentic Workflow Generation [📝 Paper] [💻 Code]
(ICLR'25) WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [📝 Paper]
(ICLR'25) Flow: Modularized Agentic Workflow Automation [📝 Paper]
(ICLR'25) Automated Design of Agentic Systems [📝 Paper] [💻 Code]
(Arxiv'25) FlowReasoner: Reinforcing Query-Level Meta-Agents [📝 Paper]
(Arxiv'25) AgentNet: Decentralized Evolutionary Coordination for LLM-Based Multi-Agent Systems [📝 Paper]
(Arxiv'25) MAS-GPT: Training LLMs to Build LLM-Based Multi-Agent Systems [📝 Paper]
(Arxiv'25) FlowAgent: Achieving Compliance and Flexibility for Workflow Agents [📝 Paper]
(Arxiv'25) ScoreFlow: Mastering LLM Agent Workflows via Score-Based Preference Optimization [📝 Paper] [💻 Code]
(Arxiv'25) Multi-Agent Design: Optimizing Agents with Better Prompts and Topologies [📝 Paper]
(Arxiv'25) MAS-ZERO: Designing Multi-Agent Systems with Zero Supervision [📝 Paper]
(Arxiv'25) MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming [📝 Paper]
(ICML'24) GPTSwarm: Language Agents as Optimizable Graphs [📝 Paper] [Code]
(ICLR'24) DSPy: Compiling Declarative Language Model Calls into State-of-the-Art Pipelines [📝 Paper] [💻 Code]
(ICLR'24) AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors [📝 Paper] [💻 Code]
(ICLR'24) MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework [📝 Paper] [💻 Code]
(COLM'24) A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration [📝 Paper]
(COLM'24) AutoGen: Enabling next-Gen LLM Applications via Multi-Agent Conversations [📝 Paper] [💻 Code]
(Arxiv'24) G-Designer: Architecting Multi-Agent Communication Topologies via Graph Neural Networks [📝 Paper]
(Arxiv'24) AutoFlow: Automated Workflow Generation for Large Language Model Agents [📝 Paper] [💻 Code]
(Arxiv'24) Symbolic Learning Enables Self-Evolving Agents [📝 Paper] [💻 Code]
(Arxiv'24) Adaptive In-Conversation Team Building for Language Model Agents [📝 Paper]
(ICLR'25) Self-Evolving Multi-Agent Collaboration Networks for Software Development [📝 Paper] [💻 Code]
(Arxiv'25) Chain‑of‑Agents: End‑to‑End Agent Foundation Models via Multi‑Agent Distillation and Agentic RL [📝 Paper] [💻 Code]
(Arxiv’25) Agent KB: Leveraging Cross‑Domain Experience for Agentic Problem Solving [📝 Paper] [💻 Code]

3. Domain-Specific Optimisation

3.1 🧬 Biomedicine

3.1.1 📌 Medical Diagnosis

(EMNLP'24) MMedAgent: Learning to Use Medical Tools with Multi-modal Agent [📝 Paper] [💻 Code]
(NeurIPS'24) MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making [📝 Paper] [💻 Code]
(Arxiv'25) HealthFlow: A Self-Evolving AI Agent with Meta Planning for Autonomous Healthcare Research [📝 Paper][💻 Code]
(Arxiv'25) STELLA: Self-Evolving LLM Agent for Biomedical Research [📝 Paper][💻 Code]
(MICCAI'25) MedAgentSim: Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions [📝 Paper] [💻 Code]
(Arxiv'25) PathFinder: A Multi-Modal Multi-Agent System for Medical Diagnostic Decision-Making Applied to Histopathology
[📝 Paper]
(Arxiv'25) MDTeamGPT: A Self-Evolving LLM-based Multi-Agent Framework for Multi-Disciplinary Team Medical Consultation
[📝 Paper] [💻 Code]
(Arxiv'25) MedAgent-Pro: Towards Evidence-based Multi-modal Medical Diagnosis via Reasoning Agentic Workflow
[📝 Paper] [💻 Code]
(Arxiv'25) Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs [📝 Paper] [💻 Code]

3.1.2 📌 Molecular Discovery

(ACS omega'24) CACTUS: Chemistry Agent Connecting Tool-Usage to Science [📝 Paper] [💻 Code]
(NMI'24) ChemCrow: Augmenting large language models with chemistry tools [📝 Paper] [💻 Code]
(ICLR'25) ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning[📝 Paper] [💻 Code]
(ICLR'25) OSDA Agent: Leveraging Large Language Models for De Novo Design of Organic Structure Directing Agents [📝 Paper]
(Arxiv'25) DrugAgent: Automating AI-aided Drug Discovery Programming through LLM Multi-Agent Collaboration [📝 Paper]
(Arxiv'25) LIDDIA: Language-based Intelligent Drug Discovery Agent [📝 Paper]
(Arxiv'25) GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
[📝 Paper] [💻 Code]

3.2 💻 Programming

3.2.1 📌 Code Refinement

(Arxiv'23) AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation [📝 Paper] [💻 Code]
(Arxiv'23) Self-Refine: Iterative Refinement with Self-Feedback [📝 Paper] [💻 Code]
(EMNLP'24) CodeAgent: Autonomous Communicative Agents for Code Review [📝 Paper] [💻 Code]
(ICLR'25) OpenHands: An Open Platform for AI Software Developers as Generalist Agents [📝 Paper] [💻 Code]
(Arxiv'25) CodeCoR: An LLM-Based Self-Reflective Multi-Agent Framework for Code Generation [📝 Paper]
(Arxiv’25) AlphaEvolve: A coding agent for scientific and algorithmic discovery [📝 Paper]
(Arxiv'25) Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents [📝 Paper] [💻 Code]
(Software'25) OpenEvolve: an open-source evolutionary coding agent [📝 Instructions] [💻 Code]
(ICLR'25) Self-Evolving Multi-Agent Collaboration Networks for Software Development [📝 Paper] [💻 Code]

3.2.2 📌 Code Debugging

(ACL'23) Self-Edit: Fault-Aware Code Editor for Code Generation [📝 Paper]
(ICLR'24) Teaching Large Language Models to Self-Debug [📝 Paper]
(ICA'24) RGD: Multi-LLM based agent debugger via refinement and generation guidance. [📝 Paper]
(Arxiv'25) Large Language Model Guided Self-Debugging Code Generation [📝 Paper]

3.3 Scientific Research

(Arxiv’25) PiFlow: Principle‑aware Scientific Discovery with Multi‑Agent Collaboration [📝 Paper] [💻 Code]

3.4 💰📚 Financial and Legal Research

3.4.1 📌 Financial Decision-Making

(Arxiv'25) R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization [📝 Paper] [💻 Code]
(Arxiv'24) FinRobot: an open-source ai agent platform for financial applications using large language models [📝 Paper] [💻 Code]
(Arxiv'24) PEER: Expertizing domain-specific tasks with a multi-agent framework and tuning methods [📝 Paper] [💻 Code]
(NeurIPS'25) Fincon: A synthesized llm multi-agent system with conceptual verbal reinforcement for enhanced financial decision making [📝 Paper] [💻 Code]

3.4.2 📌 Legal Reasoning

(Arxiv'24) LawLuo: A Multi-Agent Collaborative Framework for Multi-Round Chinese Legal Consultation [📝 Paper]
(ICIC'24) Legalgpt: Legal chain of thought for the legal large language model multi-agent framework [📝 Paper]
(Arxiv'24) LawGPT: A Chinese Legal Knowledge-Enhanced Large Language Model [📝 Paper] [💻 Code]
(ACL Findings'25) AgentCourt: Simulating Court with Adversarial Evolvable Lawyer Agents [📝 Paper] [💻 Code]

3.5 🧩 Other Domain-Specific Optimisation

(Arxiv'25) Agents of Change: Self-Evolving LLM Agents for Strategic Planning [📝 Paper]
(Arxiv'25) EarthLink: A Self-Evolving AI Agent for Climate Science [📝 Paper] [🖥️ System]
(Arxiv'25) SEAgent: Self-Evolving Computer Use Agent with Autonomous Learning from Experience [📝 Paper][💻 Code]

4. Evaluation

4.1 📈 Benchmark-Based Evaluation

(NeurIPS'23) OpenAGI: When LLM Meets Domain Experts [📝 Paper] [💻 Code]
(Arxiv'25) Building Self-Evolving Agents via Experience-Driven Lifelong Learning: A Framework and Benchmark [📝 Paper]
(Arxiv'25) MLGym: A New Framework and Benchmark for Advancing AI Research Agents [📝 Paper] [💻 Code]
(Arxiv'25) X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs [📝 Paper] [💻 Code]

4.1.1 📌 Tool and API-Driven Agents

(Arxiv'23) On the Tool Manipulation Capability of Open-source Large Language Models [📝 Paper] [💻 Code]
(EMNLP'23) API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs [📝 Paper] [💻 Code]
(NeurIPS'23) ToolQA: A Dataset for LLM Question Answering with External Tools [📝 Paper] [💻 Code]
(ICLR'24) MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use [📝 Paper] [💻 Code]
(Arxiv'25) Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub [📝 Paper] [💻 Code]
(Arxiv'25) LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries [📝 Paper]

4.1.2 📌 Web Navigation and Browsing Agents

(ICLR'24) WebArena: A Realistic Web Environment for Building Autonomous Agents [📝 Paper] [💻 Code]
(Arxiv'25) BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents [📝 Paper] [💻 Code]
(ACL'25) WebWalker: Benchmarking LLMs in Web Traversal [📝 Paper] [💻 Code]

4.1.3 📌 Coding Agents

(ICLR'24) SWE-bench: Can Language Models Resolve Real-World GitHub Issues? [📝 Paper] [💻 Code]
(ICLR'25) Self-Evolving Multi-Agent Collaboration Networks for Software Development [📝 Paper] [💻 Code]

4.1.4 Scientific Research Agents

(Arxiv'25) DataSciBench: An LLM Agent Benchmark for Data Science [📝 Paper] [💻 Code]

4.1.4 📌 Multi-Agent Collaboration and Generalists

(ICLR'23) GAIA: a benchmark for General AI Assistants [📝 Paper] [💻 Code]
(ICLR'24) AgentBench: Evaluating LLMs as Agents [📝 Paper] [💻 Code]
(Arxiv'25) MultiAgentBench: Evaluating the Collaboration and Competition of LLM agents [📝 Paper] [💻 Code]
(Arxiv'25) Benchmarking LLMs' Swarm intelligence [📝 Paper] [💻 Code]

4.1.5 📌 GUI and Multimodal Environment Agents

(ACL'24) Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents [📝 Paper] [💻 Code]
(NeurIPS'24) OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments [📝 Paper] [💻 Code]
(ICLR'25) AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents [📝 Paper] [💻 Code]

4.2 ⚖️ LLM-Based Evaluation

4.2.1 📌 LLM-as-a-Judge

(Arxiv'24) Towards Better Human-Agent Alignment: Assessing Task Utility in LLM-Powered Applications [📝 Paper]
(Arxiv'24) LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods [📝 Paper]
(Arxiv'25) LiveIdeaBench: Evaluating LLMs’ Divergent Thinking for Scientific Idea Generation with Minimal Context [📝 Paper] [💻 Code]
(ACL'25) Auto-Arena: Automating LLM Evaluations with Agent Peer Debate and Committee Voting [📝 Paper] [💻 Code]
(Arxiv'25) MCTS-Judge: Test-Time Scaling in LLM-as-a-Judge for Code Correctness Evaluation [📝 Paper]

4.2.2 📌 Agent-as-a-Judge

(Arxiv'24) Agent-as-a-Judge: Evaluate Agents with Agents [📝 Paper] [💻 Code]

4.3 🛡 Safety, Alignment, and Robustness for Lifelong / Self-Evolving Agents

(Arxiv'24) AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [📝 Paper ]
(NeurIPS'24 – Datasets & Benchmarks) RedCode: Risky Code Execution and Generation [📝 Paper ]
(Arxiv'24) MobileSafetyBench: Evaluating Safety of Autonomous Agents in Mobile Device Control [📝 Paper] [💻 Code]
(Arxiv'23) Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark [📝 Paper ]
(Arxiv'24) R-Judge: Benchmarking Safety Risk Awareness for LLM Judges [📝 Paper] [💻 Code]
(ACL'25) SafeLawBench: Towards Safe Alignment of Large Language Models [📝 Paper ]
(Arxiv'25) Accuracy Paradox in Large Language Models: Regulating Hallucination Risks in Generative AI [📝 Paper ]
(ICLR'25 Spotlight) AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs [📝 Paper] [💻 Code]
(ACL'25) AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection [📝 Paper] [💻 Code]

📚 Citation

If you find this survey useful in your research and applications, please cite using this BibTeX:

@article{fang2025comprehensive,
  title={A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems},
  author={Fang, Jinyuan and Peng, Yanwen and Zhang, Xi and Wang, Yingxu and Yi, Xinhao and Zhang, Guibin and Xu, Yi and Wu, Bin and Liu, Siwei and Li, Zihao and others},
  journal={arXiv preprint arXiv:2508.07407},
  year={2025}
}

☕ Acknowledgement

We would like to thank Shuyu Guo for his valuable contributions to the early-stage exploration and literature review on agent optimisation.

✉️ Contact Us

If you have any questions or suggestions, please feel free to contact us via:

Email: j.fang.2@research.gla.ac.uk and zaiqiao.meng@gmail.com