AgenticRed
Optimizing Agentic Systems for Automated Red-teaming

Caution: This paper contains example misuse scenarios for research purposes only.

TL;DR

We introduce AgenticRed, an automated pipeline that leverages LLMs' in-context learning to iteratively design and refine red-teaming systems without human intervention. We treat red-teaming system development as a system design problem and apply evolutionary algorithms to produce better systems. Red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches on open models and transfer to proprietary models.


Motivation

Red-teaming, the practice of systematically probing systems for vulnerabilities and failure modes, has evolved into its own rigorous scientific domain within AI safety research. Automated red-teaming, which leverages computational methods to systematically discover adversarial prompts and expose vulnerabilities, has proven particularly valuable in this landscape as an alternative to traditional manual red-teaming, which relies on human annotators to uncover model vulnerabilities.

Prior automated red-teaming methods typically rely on (1) predefined attack strategies, (2) reinforcement-learning–fine-tuned attacker models, or (3) fixed-structure agentic systems. Most mainstream automated red-teaming methods generate test cases using LLMs. These methods usually employ agentic workflows that leverage multi-agent interaction and LLM reasoning capabilities to compose sophisticated attack strategies. This shift naturally raises the question of how to better structure an agentic system for more effective attacks.

While these agentic red-teaming systems have demonstrated superior performance, their manually designed workflows inherit human biases, and exploring the broader design space by hand is expensive.

Motivated by recent findings showing LLMs' remarkable capacity for self-improvement through iterative design processes, we frame automated red-teaming as a reasoning task, and the development of a red-teaming system as a system design problem. We then apply automated design methods to address this problem.

Overview of AgenticRed Framework


AgenticRed starts with an archive of state-of-the-art red-teaming systems and their associated performance metrics ("fitness scores"), and uses an LLM (the "meta agent") to iteratively program new agentic systems. Each newly generated system is verified and evaluated on the red-teaming benchmark. Inspired by the principle of "survival of the fittest" from Darwinian evolutionary theory, we have the meta agent generate multiple new agentic systems ("offspring") at each generation and retain only the best-performing one, based on its performance on an initial evaluation dataset. The retained system is then added to the archive along with its evaluation metrics on a comprehensive evaluation dataset.
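The loop above can be sketched as follows. This is a minimal illustration, not the actual implementation: `meta_agent_propose` and `evaluate` are hypothetical stand-ins for the LLM meta agent and the benchmark evaluation, replaced here with deterministic toy functions so the control flow is runnable.

```python
import random


def meta_agent_propose(archive, n_offspring):
    # Hypothetical stand-in for the LLM "meta agent". In the real pipeline
    # the LLM reads the archive (system code + fitness scores) and writes
    # the code of new agentic systems; here we just derive toy variants of
    # the fittest parent so the loop is self-contained.
    parent = max(archive, key=lambda s: s["fitness"])
    return [{"code": f"{parent['code']}-v{i}", "fitness": None}
            for i in range(n_offspring)]


def evaluate(system, eval_set):
    # Hypothetical stand-in for benchmark evaluation: would run the
    # candidate red-teaming system against the target model and return
    # its ASR on `eval_set`. A seeded pseudo-score keeps the sketch
    # deterministic and runnable.
    return random.Random(system["code"] + eval_set).random()


def evolutionary_search(initial_archive, generations=10,
                        n_offspring=4, archive_size=5):
    archive = [dict(s) for s in initial_archive]
    for _ in range(generations):
        offspring = meta_agent_propose(archive, n_offspring)
        # Score all offspring on the cheap initial evaluation set...
        for child in offspring:
            child["fitness"] = evaluate(child, "initial")
        # ...retain only the fittest ("survival of the fittest"), then
        # re-score it on the comprehensive evaluation dataset before it
        # enters the archive.
        best_child = max(offspring, key=lambda s: s["fitness"])
        best_child["fitness"] = evaluate(best_child, "comprehensive")
        archive.append(best_child)
        archive = sorted(archive, key=lambda s: s["fitness"],
                         reverse=True)[:archive_size]
    return archive
```

A bounded archive keeps only the top-scoring systems, so each generation the meta agent conditions on increasingly strong exemplars.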

Experiments

Performance of AgenticRed-designed Systems


Performance increases with our pipeline on both open-weight and proprietary models. Comparison of ASR (attack success rate) between baselines and AgenticRed, evaluated on a held-out test set. We select the best system produced by AgenticRed over 10 generations.




Our method generalizes to alternative target LLMs, attacker LLMs, and meta-agent LLMs.

ASR of AgenticRed: (a) targeting Llama-2-7B over 10 generations, compared to the baseline method; (b) targeting Llama-2-7B over 10 generations with Vicuna-13B-v1.5 as the attacker model; (c) targeting Llama-3-8B over 10 generations; (d) targeting Llama-2-7B over 10 generations with DeepSeek-R1 as the meta agent.




Our method transfers to held-out test cases and an alternative benchmark. Comparison of StrongREJECT scores between baselines and AgenticRed.

Ablation Studies


The ablation study shows the impact of evolutionary pressure and archive initialization, compared to the original AgenticRed pipeline (Ours).

(a) If we remove the evolutionary selection mechanism, the search achieves a 6% lower best performance score over the first 10 generations compared to the run with evolutionary selection, showing the importance of selection in accelerating the search process.

(b) We remove the SOTA system from the original archive to create a weaker archive and rerun the evolutionary search. Our method struggles to improve its performance after the second generation, showing the importance of a strong initial archive built with human creativity.



Acknowledgements

This work was done during an internship at the Max Planck Institute for Software Systems (MPI-SWS) in Saarbrücken, Germany.

The website template was borrowed from Jon Barron.