DeepInception: Hypnotize Large Language Model to Be Jailbreaker

1TMLR Group, Hong Kong Baptist University, 2CMIC, Shanghai Jiao Tong University 3Shanghai AI Laboratory 4Sydney AI Centre, The University of Sydney
(*Equal Contribution)

Abstract

Despite remarkable success in various applications, large language models (LLMs) are vulnerable to adversarial jailbreaks that make the safety guardrails void. However, previous studies for jailbreaks usually resort to brute-force optimization or extrapolations of a high computation cost, which might not be practical or effective. In this paper, inspired by the Milgram experiment that individuals can harm another person if they are told to do so by an authoritative figure, we disclose a lightweight method, termed as DeepInception, which can easily hypnotize LLM to be a jailbreaker and unlock its misusing risks. Specifically, DeepInception leverages the personification ability of LLM to construct a novel nested scene to behave, which realizes an adaptive way to escape the usage control in a normal scenario and provides the possibility for further direct jailbreaks. Empirically, we conduct comprehensive experiments to show its efficacy. Our DeepInception can achieve competitive jailbreak success rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open/closed-source LLMs like Falcon, Vicuna, Llama-2, and GPT-3.5/4/4V. Our investigation appeals that people should pay more attention to the safety aspects of LLMs and a stronger defense against their misuse risks.

Introduction

img description

Figure 1. Examples (with the latest GPT-4) of the direct and our instructions for jailbreak.


Although attracting growing interest, existing jailbreaks focus on achieving an empirical success of attack by manually or automatically crafting adversarial prompts for specific targets, which might not be efficient or practical under the black-box usage. Specifically, on the one hand, as current LLMs are equipped with ethical and legal constraints, most jailbreaks with direct instructions (like the left-side of Figure 1) can be easily recognized and banned. On the other hand, and more importantly, it lacks an in-depth understanding of the overriding procedure, i.e., the core mechanism behind a successful jailbreak. In this work, we propose DeepInception from a new perspective to reveal LLM's weakness.

Motivation

img description img description

Figure 2. An illustration of the Milgram shock experimen (Left) and an intuitive understanding on our mechanism (Right).

In this work, we start with a well-known psychological study, e.g., the Milgram shock experiment, to reveal the misuse risks of LLMs. The experiment is about how willing individuals were to obey an authority figure's instructions, even if it involved causing harm to another person. Based on the intuitive understanding, as illustrated in Figure 2, we propose DeepInception as following.

DeepInception Example

User User:

GPT GPT:

DeepInception

img description

Figure 3. Illustrations of the direct instruction and our inception instructions for jailbreak attack.


img description

Specifically, the above prompt template has several properties as an approach for nested jailbreak.

  • [scene]: the carrier of setting up the background of hypnotization, e.g., a fiction, movie, news, or story. A better alignment between [attack target] and [scene] might bring a better outcome.
  • [character number] and [layer number]: control the granularity of the inception, as a hypothesis is that the harmful information is spread among the discussion between different characters within different layers, thus bypassing the defense of LLMs.
  • [attack target]: the specific target of conducting jailbreak, e.g., the steps of hacking a computer or creating a bomb. The following sentence, "against the super evil doctor," aims to decrease the moral concern of LLM, which shares a similar spirit with the prior Milgram Shock Experiment.

Jailbreak Examples

Here we provide some examples of our DeepInception with specific jailbreak targets.

Jailbreak example.

The example of making a bomb with DeepInception.


Jailbreak example.

The example of hacking a computer with Linux operation system with DeepInception.


Jailbreak example.

The example of our DeepInception on the GPT-4V under the multi-modal scenario.


Experimental Results

First, we evaluate the direct jailbreak on those LLMs with Jailbreak Success Rate (JSR) and also consider several defense methods.

Table 1. Jailbreak attacks using the AdvBench subset. The best results are bolded.

Interpolation end reference image.

Then, we verify its effectiveness in inducing continual jailbreak with direct instructions on with once DeepInception.

Table 2. Continual jailbreak attacks using the AdvBench subset. The best results are bolded.

Interpolation end reference image.

Third, we present the results of further jailbreak attacks with specific inception like the second example presented previously.

Table 3. Further jailbreak attacks with specific inception. The best results are bolded. Note that here, we use different requests set from the previous to evaluate the jailbreak performance.

Interpolation end reference image.

Additionally, we conduct various ablation studies to characterize DeepInception from different perspectives.

Interpolation end reference image.

Figure 4. Ablation Study-I. (1) effects of the number of characters w.r.t. JSR, (2) effects of the number of layers w.r.t. JSR, (3) effects of the detailed scene on same jailbreak target w.r.t JSR, (4) effects on using different core factors in our DeepInception to escape from safety guardrails.

Interpolation end reference image.

Figure 5. Ablation Study-II. JSR w.r.t. specific topics in harmful behaviors.


More experiment setups and details can refer to our paper.

Contact

Welcome to check our paper for more details of the research work. If there is any question, please feel free to contact us.

If you find our paper and repo useful, please consider to cite:


      @misc{li2023deepinception,
        title={DeepInception: Hypnotize Large Language Model to Be Jailbreaker}, 
        author={Xuan Li and Zhanke Zhou and Jianing Zhu and Jiangchao Yao and Tongliang Liu and Bo Han},
        year={2023},
        eprint={2311.03191},
        archivePrefix={arXiv},
        primaryClass={cs.LG}
  }