Large Language Models (LLMs) are powerful tools for various applications due to their knowledge and understanding capabilities. However, they are also vulnerable to exploitation, especially in jailbreaking attacks in multi-round dialogues. Jailbreaking attacks exploit the complex and sequential nature of human-LLM interactions to subtly manipulate the model’s responses over multiple exchanges. By carefully building questions and incrementally navigating the conversation, attackers can then avoid safety controls and elicit from LLMs the creation of illegal, unethical, or otherwise harmful content, giving a great challenge to these systems’ safe and responsible deployment.
Existing methods to safeguard LLMs focus predominantly on single-round attacks, employing techniques like prompt engineering or encoding harmful queries, which fail to address the complexities of multi-round interactions. LLM attacks can be classified into single-round and multi-round attacks. Single-round attacks, with techniques such as prompt engineering and fine-tuning, have limited success with closed-source models. Multi-round attacks, though rare, exploit sequential interactions and human-like dialogue to elicit harmful responses. Notable methods like Chain-of-Attack (CoA) improve effectiveness by building semantic links across rounds but depend heavily on LLM conversational abilities.
To address these issues, a team of researchers from Alibaba Group, Beijing Institute of Technology, Nanyang Technological University, and Tsinghua University have proposed a novel multi-round dialogue jailbreaking agent called MRJ-Agent. This agent emphasizes stealthiness and uses a risk decomposition strategy that distributes risks across multiple rounds of queries along with psychological strategies to enhance the strength of the attacks.
The MRJ-Agent attacks incrementally decompose toxic queries into rounds, making them more challenging to identify or block by the LLM. It starts with an innocuous question and then gradually steers to more sensitive information, culminating in generating harmful responses. The sub-queries maintain semantic similarity with the original harmful query by using a control strategy based on information. Additionally, psychological tactics are used so that the likelihood of rejection can be minimized by the LLM.
Large-scale experiments show that MRJ-Agent outperforms previous methods on single-round and multi-round attacks with state-of-the-art attack success rates. Due to its adaptiveness and exploratory properties, it can develop more generalized attacking strategies applicable to diverse models and scenarios. Also, Experiments reveal that MRJ-Agent outperforms both single-round and multi-round methods in attack success rate, achieving 100% on models like Vicuna-7B and nearly 98% on GPT-4. The agent maintains high efficacy and demonstrates robustness and stealth under measures like prompt detectors and system prompts.
In conclusion, the MRJ agent solves the problem of LLM vulnerabilities in multi-round dialogues. The MRJ agent’s innovative approach to risk decomposition and psychological strategies significantly enhances the success rate of jailbreak attacks, creates new perspectives for future research on LLM safety, and contributes to the discourse on societal governance in the context of increasingly integrated conversational AI systems. Maintaining the safety of human-AI interactions is paramount as these systems become more deeply embedded in everyday life.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.. Don’t Forget to join our 60k+ ML SubReddit.
The post Meet MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue appeared first on MarkTechPost.