AgenticRed
Evolving Agentic Systems for Red-Teaming

Caution: This paper contains example misuse scenarios for research purposes only.

TL;DR

We introduce AgenticRed, an automated pipeline that leverages LLMs’ in-context learning to iteratively design and refine red-teaming systems without human intervention. AgenticRed treats red-teaming as a system design problem and autonomously evolves automated red-teaming systems using evolutionary selection and generational knowledge. Red-teaming systems designed by AgenticRed are robust and query-agnostic, and they consistently outperform state-of-the-art approaches. On the latest proprietary models, they achieve a 100% attack success rate (ASR) on GPT-5.1, DeepSeek-R1 and DeepSeek V3.2.


Motivation

Red-teaming, the practice of systematically probing systems for vulnerabilities and failure modes, has evolved into its own rigorous scientific domain within AI safety research. Automated red-teaming, which leverages computational methods to systematically discover adversarial prompts and expose vulnerabilities, has proven particularly valuable in this landscape as an alternative to traditional manual red-teaming, which relies on human annotators to uncover model vulnerabilities.

Prior automated red-teaming methods typically rely on (1) predefined attack strategies, (2) reinforcement-learning–fine-tuned attacker models, or (3) fixed-structure agentic systems. Most mainstream automated red-teaming methods generate test cases using LLMs. These methods usually employ agentic workflows that leverage multi-agent interaction and LLM reasoning capabilities to compose sophisticated attack strategies. This shift naturally raises the question of how to better structure an agentic system for more effective attacks. While these agentic red-teaming systems have demonstrated superior performance, their manually designed workflows suffer from human biases and make exploring the broader design space expensive.

Motivated by recent findings showing LLMs' remarkable capacity for self-improvement through iterative design processes, we frame automated red-teaming as a reasoning task, and the development of a red-teaming system as a system design problem. We then apply evolutionary algorithms to address this problem.

Overview of AgenticRed Framework


AgenticRed starts with an archive of state-of-the-art red-teaming systems and their associated performance metrics (“fitness score”), and uses an LLM (the “meta agent”) to iteratively program new agentic systems. Each newly generated system is verified and evaluated on the red-teaming benchmark. A generational knowledge dataset, including previously attempted failed and successful prompts, is accumulated over time and passed down to subsequent generations.

Inspired by the principle of “survival of the fittest” from Darwinian evolutionary theory, we require the meta agent to generate multiple new agentic systems (“offspring”) at each generation, and retain only the best performers, ranked by their performance metrics on an initial evaluation dataset. The best-performing system is then added to the archive along with its evaluation metrics on a comprehensive evaluation dataset.
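The selection loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AgenticRed implementation: `ArchiveEntry`, `evolve`, `propose`, `quick_eval`, `full_eval`, and the toy stand-ins are hypothetical names, candidate systems are represented only by a string label, and the generational knowledge is simplified to a list of (candidate, score) records rather than prompts.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ArchiveEntry:
    name: str       # label for a candidate system (hypothetical representation)
    fitness: float  # score on the comprehensive evaluation set


def evolve(
    initial_archive: List[ArchiveEntry],
    propose: Callable[[List[ArchiveEntry], List[Tuple[str, float]], int, int], str],
    quick_eval: Callable[[str], float],
    full_eval: Callable[[str], float],
    generations: int = 3,
    offspring: int = 4,
) -> Tuple[List[ArchiveEntry], List[Tuple[str, float]]]:
    """Run a simplified generate-select-archive loop."""
    archive = list(initial_archive)
    knowledge: List[Tuple[str, float]] = []  # simplified generational knowledge
    for gen in range(generations):
        # The meta agent (abstracted as `propose`) produces several offspring.
        candidates = [propose(archive, knowledge, gen, i) for i in range(offspring)]
        # Score every offspring on the small initial evaluation set.
        scored = [(quick_eval(c), c) for c in candidates]
        # Pass every outcome, good or bad, down to later generations.
        knowledge.extend((c, s) for s, c in scored)
        # Keep only the fittest offspring, then score it on the full set.
        _, best = max(scored, key=lambda pair: pair[0])
        archive.append(ArchiveEntry(best, full_eval(best)))
    return archive, knowledge


# Deterministic toy stand-ins so the sketch runs without any model calls.
def toy_propose(archive, knowledge, gen, i):
    return f"gen{gen}-cand{i}"


def toy_score(name):
    gen, cand = name[3:].split("-cand")
    return (int(gen) + 1) * 0.1 + int(cand) * 0.01


if __name__ == "__main__":
    seed = [ArchiveEntry("seed", 0.05)]
    final_archive, history = evolve(seed, toy_propose, toy_score, toy_score)
    print([entry.name for entry in final_archive])
```

In this sketch only one offspring per generation survives, and every outcome is recorded in the knowledge list so that later proposals can condition on it. A real pipeline would replace the toy functions with calls to the meta agent and the benchmark evaluator.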

Experiments

Performance of AgenticRed-designed Systems


AgenticRed's performance across popular open-weight models. The systems designed by AgenticRed outperform the archive and hand-designed baseline methods within several generations, showing high query efficiency and robustness across target models.


Performance is robust across judge functions, benchmark datasets and proprietary target models. Comparison of ASR between baselines and AgenticRed, evaluated on a held-out test set. We select the best system produced by AgenticRed over 10 generations. Note that the systems generated by AgenticRed were originally designed for Llama2-7B or Qwen3-8B, and are evaluated on the other target models without additional retuning.




AgenticRed can generate successful and diverse attacks. During the diversity-oriented search, we leverage generational knowledge to encourage attack diversity by introducing a novelty-based fitness metric, computed from the embedding distance between the previous attacks and the new attacks generated by the system. This fitness metric rewards the production of successful yet semantically distinct attacks. The results show that successful attacks spread progressively across the embedding space, with cumulative mean pairwise cosine distance increasing monotonically.
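To make the two distance quantities above concrete, here is a minimal pure-Python sketch. The function names (`cosine_distance`, `novelty_score`, `mean_pairwise_distance`) are hypothetical, and the aggregation is an assumption: the text does not specify how distances between new and previous attacks are combined, so this sketch simply averages them. Embeddings are taken as plain lists of floats produced by some unspecified embedding model.

```python
import math
from itertools import combinations
from typing import Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def novelty_score(new: Sequence[Vector], previous: Sequence[Vector]) -> float:
    """Mean cosine distance from each new embedding to each previous one.

    Assumption: averaging is one plausible aggregation; returning 0.0 when
    either set is empty is a convention chosen for this sketch.
    """
    if not new or not previous:
        return 0.0
    total = sum(cosine_distance(n, p) for n in new for p in previous)
    return total / (len(new) * len(previous))


def mean_pairwise_distance(embeddings: Sequence[Vector]) -> float:
    """Mean cosine distance over all unordered pairs in a set of embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    previous = [[1.0, 0.0], [0.0, 1.0]]
    new = [[0.0, 1.0]]
    print(novelty_score(new, previous))
    print(mean_pairwise_distance(previous + new))
```

Tracking `mean_pairwise_distance` over the growing set of successful attacks gives a cumulative diversity curve of the kind described above, while `novelty_score` could serve as the per-generation reward term.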




The ablation study shows the impact of evolutionary pressure and archive initialization, compared to the original AgenticRed pipeline (Ours).

(a) Without the evolutionary selection mechanism, the search achieves a 6% lower best performance score over the first 10 generations than it does with evolutionary selection, showing the importance of evolutionary selection in accelerating the search process.

(b) We remove the SOTA system from the original archive to create a weaker archive and rerun the evolutionary search process. With this weaker archive, our method struggles to improve its performance after the second generation, showing the importance of a strong initial archive constructed with human creativity.



Acknowledgements

This work was done during an internship at the Max Planck Institute for Software Systems (MPI-SWS) in Saarbrücken, Germany.

The website template was borrowed from Jon Barron.