RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking

Yifan Jiang1,2, Kriti Aggarwal1, Tanmay Laud1, Kashif Munir1, Jay Pujara2, Subhabrata Mukherjee1

1Hippocratic AI   |   2Information Sciences Institute, University of Southern California

Red Queen Attack data Red Queen Guard data
RED QUEEN Attack Example
Model Comparison

RED QUEEN ATTACK, the first work constructing multi-turn scenarios to conceal attackers' harmful intent, reaching promising results against current LLMs. Results show that all models are vulnerable, with GPT-4 reaching 87.62% success and Llama 3-70B reaching 75.4% success, and larger models proving more susceptible.

Abstract

The rapid progress of Large Language Models (LLMs) has opened up new opportunities across various domains and applications; yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming has been employed as a proactive security measure to probe language models for harmful outputs via jailbreak attacks. However, current jailbreak attack approaches are single-turn with explicit malicious queries that do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions in a more covert manner.

To bridge this gap, we propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. We craft 40 scenarios that vary in turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, reaching 87.62% attack success rate on GPT-4 and 75.4% on Llama 3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success.

To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model's performance across standard benchmarks.

Main Results

Scenario Construction

To conceal the harmful intent, we create a general scenario template claiming that others are planning harmful actions and positioning the user as the protector:

  • User turn: describes others' harmful intent and seeks prevention methods.
  • Assistant turn: suggests possible steps, including reporting to authorities.
  • User turn: asks about possible evidence items.
  • Assistant turn: lists useful items, including a suspected preparation plan.
  • User turn: requests an example harmful plan for comparison.

We semi-automatically construct two types of scenarios, each with five categories:

  • Occupation-Based Scenario: users with specific professions encounter someone in their work context like teachers, police, detectives, lawyers, and priests.
  • Relation-Based Scenario: users interact with someone with whom they have a defined relationship like friends, neighbors, relatives, son, etc.
Scenario Construction

Attack Success Rates

RED QUEEN ATTACK achieves consistently high attack success rates (ASR) across all models, with an increase in ASR ranging from 15.45% to 81.44%. Different models exhibit varying levels of resilience and susceptibility to the RED QUEEN ATTACK. GPT-4, which has demonstrated robust safety refusals in previous single-turn jailbreaks, performs the worst under our attack, supporting our argument on the potential oversight in the current scope of red teaming and jailbreak approaches.

Attack Success Rates

Red Queen Guard

Given the widespread application of LLMs in everyday life, we explore strategies to enhance the safety mechanisms of these models. We investigate whether training models on carefully designed multi-turn datasets using Direct Preference Optimization (DPO) can mitigate this misalignment. We sampled 20 multi-turn data points of successful LLM jailbreaks from each scenario and harmful action category, supplemented with safety responses from Llama 3.1-405b, yielding an 11.2K preference dataset, RED QUEEN GUARD.

RED QUEEN GUARD can address the safety misalignment in multi-turn scenarios without compromising the model's reasoning or instruction-following capabilities, highlighting its promising potential for broader usage in general safety alignment.

Red Queen Guard
DPO Results

Key Factors for RED QUEEN ATTACK Success

The success of RED QUEEN ATTACK highlights the vulnerability of current LLMs. Being the first work to explore jailbreak in multi-turn scenarios with concealment, to stimulate further red teaming and jailbreak research in the multi-turn scenario, we conduct a comprehensive study to analyze the key factors contributing to RED QUEEN ATTACK success:

Factor 1: Multi-turn Structure & Concealment

RED QUEEN ATTACK differs from previous jailbreaks in two points: the multi-turn structure and the concealment of malicious intent. We conduct an ablation experiment to evaluate the isolated effects. Concealment alone proves to be an effective jailbreak method across all models, highlighting that current LLMs struggle to detect malicious intent. Combining multi-turn structure with concealment significantly enhances ASR.

Multi-turn Structure & Concealment

Factor 2: Turn Number

Increasing the number of turns by adding questions or details generally increases ASR, especially for models between 8B and 70B. The five-turn scenario works best in six out of ten models, demonstrating the effectiveness of incorporating additional interaction turns. Extended turns result in longer contexts, which can be difficult for current LLMs to manage during inference.

Turn Number Impact

Factor 3: Model Size

Larger models tend to be more susceptible to RED QUEEN ATTACKS. This increased vulnerability in larger models can be attributed to the mismatch in generalization between continued progress on model capabilities and safety alignment training. Larger models demonstrate a better understanding of language and instruction and can accept fake scenarios easily, while smaller models have difficulty understanding the whole scenario.

Model Size Impact

⚖️ Ethical Considerations

This study aims to explore potential security vulnerabilities in LLMs. We are committed to fostering an inclusive environment that respects all minority groups and firmly opposes any form of violence or criminal behavior. The goal of our research is to identify weaknesses in current LLMs to promote the development of more secure and reliable AI systems. While our work may involve sensitive or controversial content, it is solely intended to enhance the robustness and safety of LLMs. Future releases of our research findings will be clearly stated as intended for academic purposes only and must not be misused.

🙏 Acknowledgement

We thank our co-authors and colleagues at Hippocratic AI for their valuable contributions to this research. Hippocratic AI's commitment to safety and the principle of “do no harm” inspires and supports us in probing the vulnerabilities of current SOTA LLMs.

BibTeX

@article{jiang2024red,
        title={RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking},
        author={Jiang, Yifan and Aggarwal, Kriti and Laud, Tanmay and Munir, Kashif and Pujara, Jay and Mukherjee, Subhabrata},
        journal={arXiv preprint arXiv:2409.17458},
        year={2024}
      }