Can LLMs separate instructions from data? And what do we even mean by that?
Zverev E, Abdelnabi S, Tabesh S, Fritz M, Lampert C. 2024. Can LLMs separate instructions from data? And what do we even mean by that? arXiv, 2403.06833.
Preprint | Published | English
Author
Zverev, Egor; Abdelnabi, Sahar; Tabesh, Soroush; Fritz, Mario; Lampert, Christoph
Corresponding author has ISTA affiliation
Department
Abstract
Instruction-tuned Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features that are common in other areas of computer science, particularly an explicit separation of instructions and data. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. Surprisingly, there is currently no established definition or benchmark to quantify this phenomenon. In this work, we close this gap by introducing a formal measure for instruction-data separation and an empirical variant that is calculable from a model's outputs. We also present a new dataset, SEP, that allows estimating the measure for real-world models. Our results on various LLMs show that the problem of instruction-data separation is real: all models fail to achieve high separation, and canonical mitigation techniques, such as prompt engineering and fine-tuning, either fail to substantially improve separation or reduce model utility. The source code and SEP dataset are openly accessible at https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed.
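To make the empirical measure described above concrete, the following is a minimal sketch of one way such a separation score could be estimated, assuming a generic chat-style model interface. Everything named here (`query_model`, the (task, data, probe, witness) record layout, and the witness-substring check) is a hypothetical illustration for readability, not the paper's actual scoring rule; the authors' implementation is in the linked repository.

```python
# Minimal sketch of an empirical instruction-data separation score.
# `query_model` is a hypothetical stand-in for any API that accepts
# separate "instruction" and "data" strings and returns model output.

from typing import Callable, List, Tuple


def executed(output: str, witness: str) -> bool:
    """Heuristic: the probe counts as executed if its witness string
    appears in the output (an assumption, not the paper's exact rule)."""
    return witness.lower() in output.lower()


def empirical_separation(
    query_model: Callable[[str, str], str],
    dataset: List[Tuple[str, str, str, str]],  # (task, data, probe, witness)
) -> float:
    """Fraction of probes the model executes when placed in the
    instruction channel but ignores when placed in the data channel."""
    separated, considered = 0, 0
    for task, data, probe, witness in dataset:
        # Probe alongside the instructions: the model *should* execute it.
        out_instr = query_model(task + "\n" + probe, data)
        # Probe inside the data: the model should treat it as plain text.
        out_data = query_model(task, data + "\n" + probe)
        if executed(out_instr, witness):  # count only probes the model can follow
            considered += 1
            if not executed(out_data, witness):
                separated += 1
    return separated / considered if considered else 0.0
```

Note that this sketch conditions on probes the model actually follows in the instruction position, so a model is not credited with separation merely because it is unable to execute the probe at all.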
Publishing Year
2024
Date Published
2024-03-01
Journal Title
arXiv
Acknowledgement
The authors would like to sincerely thank Juan Rocamonde for valuable feedback on our manuscript. We acknowledge the support from the Scientific Service Units (SSU) of ISTA through resources provided by Scientific Computing (SciComp). We thank Dan Alistarh for providing us with computational resources. This work was partially funded by the German Federal Ministry of Education and Research (BMBF) under the grant AIgenCY (16KIS2012) and ELSA – European Lighthouse on Secure and Safe AI, funded by the European Union under grant agreement No. 101070617. Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or European Commission. Neither the European Union nor the European Commission can be held responsible for them.
Acknowledged SSUs
Scientific Computing (SciComp)
Article Number
2403.06833
IST-REx-ID
Cite this
Zverev E, Abdelnabi S, Tabesh S, Fritz M, Lampert C. Can LLMs separate instructions from data? And what do we even mean by that? arXiv. 2024. doi:10.48550/arXiv.2403.06833
Zverev, E., Abdelnabi, S., Tabesh, S., Fritz, M., & Lampert, C. (2024). Can LLMs separate instructions from data? And what do we even mean by that? arXiv. https://doi.org/10.48550/arXiv.2403.06833
Zverev, Egor, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, and Christoph Lampert. “Can LLMs Separate Instructions from Data? And What Do We Even Mean by That?” ArXiv, 2024. https://doi.org/10.48550/arXiv.2403.06833.
E. Zverev, S. Abdelnabi, S. Tabesh, M. Fritz, and C. Lampert, “Can LLMs separate instructions from data? And what do we even mean by that?,” arXiv. 2024.
Zverev E, Abdelnabi S, Tabesh S, Fritz M, Lampert C. 2024. Can LLMs separate instructions from data? And what do we even mean by that? arXiv, 2403.06833.
Zverev, Egor, et al. “Can LLMs Separate Instructions from Data? And What Do We Even Mean by That?” ArXiv, 2403.06833, 2024, doi:10.48550/arXiv.2403.06833.
All files available under the following license(s):
Creative Commons Attribution-ShareAlike 4.0 International Public License (CC BY-SA 4.0)
Main File(s)
File Name
2403.06833v3.pdf
530.97 KB
Access Level
Open Access
Date Uploaded
2025-02-20
MD5 Checksum
35eb43968684b87be59144603ef10af0
Sources
arXiv 2403.06833