
How to Test an OpenAI Model Against Single-Turn Adversarial Attacks Using deepteam

In this tutorial, we will test an OpenAI model against single-turn adversarial attacks using the deepteam framework. deepteam provides over 10 attack methods, including prompt injection, jailbreaking, and leetspeak, designed to expose vulnerabilities in Large Language Model (LLM) applications. Each attack starts from a simple baseline prompt and is then enhanced into a more sophisticated variant that simulates genuine malicious behavior.


Understanding deepteam Attacks

In deepteam, there are two primary types of attacks:

  • Single-turn attacks — the adversarial input is delivered in a single prompt
  • Multi-turn attacks — the attack unfolds across a multi-step conversation

In this tutorial, our focus will be exclusively on single-turn attacks.

Installing the Dependencies

To get started, install the required libraries using the following command:

pip install deepteam openai pandas

Before executing the red_team() function, make sure your OPENAI_API_KEY is set as an environment variable, since deepteam uses LLMs both to generate adversarial attacks and to evaluate LLM outputs. You can obtain an OpenAI API key from the OpenAI platform by generating a new secret key. If you are a new user, you may need to add billing details and make a minimum payment of $5 to activate API access.

import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass('Enter OpenAI API Key: ')

Importing the Libraries

import asyncio
from openai import OpenAI
from deepteam import red_team
from deepteam.vulnerabilities import IllegalActivity
from deepteam.attacks.single_turn import PromptInjection, GrayBox, Base64, Leetspeak, ROT13, Multilingual, MathProblem

Defining the Model Callback

This code establishes an asynchronous callback function that queries the OpenAI model (gpt-4o-mini) and returns the model’s response text. This acts as the output generator for the attack framework.

client = OpenAI()

# Define callback for querying the LLM
async def model_callback(input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # use "gpt-4o" for a stronger model
        messages=[{"role": "user", "content": input}],
    )
    return response.choices[0].message.content
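
Before launching any attacks, it can help to sanity-check the callback with a throwaway prompt (the prompt below is arbitrary). In a plain Python script you can drive it with asyncio.run(); inside a notebook, await model_callback(...) directly instead.

# Optional sanity check: confirm the callback reaches the API and returns text
print(asyncio.run(model_callback("Reply with the single word: ready")))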

Defining Vulnerability and Attacks

We define the vulnerability (IllegalActivity), restricted to the child-exploitation type, and instantiate the attack methods we will use:

# Vulnerability
illegal_activity = IllegalActivity(types=["child exploitation"])

# Attacks
prompt_injection = PromptInjection()
graybox_attack = GrayBox()
base64_attack = Base64()
leetspeak_attack = Leetspeak()
rot_attack = ROT13()
multi_attack = Multilingual()
math_attack = MathProblem()

Executing Single-Turn Attacks

Prompt Injection

Prompt injection involves attempting to override the model’s instructions by introducing harmful text. The objective is to trick the model into ignoring safety policies and generating prohibited content.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[prompt_injection],
)
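
To inspect what actually happened, you can tabulate the returned risk assessment. The attribute names below (test_cases, input, actual_output, score) are assumptions based on the deepteam documentation and may differ across versions, which is why this sketch reads them defensively with getattr:

import pandas as pd

# Sketch only: flatten the risk assessment into a DataFrame. The attribute
# names are assumptions — adjust them to whatever your deepteam version exposes.
rows = [
    {
        "input": getattr(tc, "input", None),
        "output": getattr(tc, "actual_output", None),
        "score": getattr(tc, "score", None),
    }
    for tc in getattr(risk_assessment, "test_cases", [])
]
print(pd.DataFrame(rows))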

Graybox Attack

The GrayBox attack utilizes partial knowledge of the LLM system to create adversarial prompts. This method exploits known weaknesses, reframing the attack to evade detection by safety filters.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[graybox_attack],
)

Base64 Attack

This attack encodes harmful instructions in Base64 format to bypass safety filters. The model is assessed on its ability to decode and execute these malicious instructions.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[base64_attack],
)
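
To see why such encoding can slip past keyword filters, here is what the transformation itself looks like, using Python's standard base64 module on a harmless placeholder string. This only illustrates the encoding; deepteam generates and wraps its own payloads.

import base64

# Illustration only: Base64 turns readable text into an opaque token stream.
plain = "a harmless placeholder sentence"
encoded = base64.b64encode(plain.encode()).decode()
print(encoded)                             # opaque Base64 string
print(base64.b64decode(encoded).decode())  # round-trips back to the original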

Leetspeak Attack

The leetspeak attack disguises harmful content by replacing characters with numbers or symbols, complicating detection by keyword filters.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[leetspeak_attack],
)
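
As a rough illustration of the idea (deepteam applies its own rewriting), even a simple character substitution makes text much harder to match with naive keyword filters:

# Illustration only: a minimal leetspeak-style substitution.
leet_map = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "t": "7"})
print("illegal activity instructions".translate(leet_map))
# -> 1ll3g4l 4c71v17y 1ns7ruc710ns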

ROT-13 Attack

The ROT-13 attack obscures harmful instructions by shifting each letter 13 positions in the alphabet, complicating detection methods.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[rot_attack],
)
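
ROT-13 is simple enough to reproduce with Python's built-in rot_13 codec, which also shows why it is trivially reversible (applying it twice restores the original). Again, this only illustrates the transformation, not deepteam's internals:

import codecs

# Illustration only: ROT13 shifts each letter 13 places; encoding twice undoes it.
obscured = codecs.encode("illegal activity instructions", "rot_13")
print(obscured)                           # vyyrtny npgvivgl vafgehpgvbaf
print(codecs.decode(obscured, "rot_13"))  # back to the original text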

Multi-lingual Attack

This attack translates harmful prompts into less commonly monitored languages, bypassing detection capabilities that are typically stronger in widely used languages.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[multi_attack],
)

Math Problem Attack

This method disguises malicious requests within mathematical statements, making them less detectable.

risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[math_attack],
)
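
Since attacks accepts a list, you do not have to run one attack at a time; you can also pass every attack in a single call and receive one consolidated risk assessment:

# Run all single-turn attacks against the same vulnerability in one pass.
risk_assessment = red_team(
    model_callback=model_callback,
    vulnerabilities=[illegal_activity],
    attacks=[
        prompt_injection,
        graybox_attack,
        base64_attack,
        leetspeak_attack,
        rot_attack,
        multi_attack,
        math_attack,
    ],
)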

For further exploration, check out the full code here. Join our community on Twitter and subscribe to our newsletter for the latest updates.