Can LLMs Follow Simple Rules?

Abstract

As Large Language Models (LLMs) are deployed with increasing real-world responsibilities, it is important to be able to specify and constrain the behavior of these systems in a reliable manner. Model developers may wish to set explicit rules for the model, such as ``do not generate abusive content'', but these may be circumvented by jailbreaking techniques. Existing evaluations of adversarial attacks and defenses on LLMs generally require either expensive manual review or unreliable heuristic checks. To address this issue, we propose Rule-following Language Evaluation Scenarios (RuLES), a programmatic framework for measuring rule-following ability in LLMs. RuLES consists of 14 simple text scenarios in which the model is instructed to obey various rules while interacting with the user. Each scenario has a programmatic evaluation function to determine whether the model has broken any rules in a conversation. Our evaluations of proprietary and open models show that almost all current models struggle to follow scenario rules, even on straightforward test cases. We also demonstrate that simple optimization attacks suffice to significantly increase failure rates on test cases. We conclude by exploring two potential avenues for improvement: test-time steering and supervised fine-tuning.

Example test case failures of GPT-4, Claude 3, and Llama-2 Chat 70B.

Scenarios

RuLES contains 14 text-based scenarios, each of which requires the assistant model to operate while following one or more rules. The scenarios are inspired by desirable security properties of computer systems and children's games. RuLES is built from:

  • Scenarios : evaluation environments consisting of instructions and rules expressed in English, as well as a corresponding evaluation function that detects rule violations. Instructions and rules may reference scenario parameters (e.g., a secret key) which must be sampled to define a concrete "scenario instance" for evaluation.
  • Rules : definitions of required behaviors for the model. Scenarios may contain multiple rules, which we categorize as either harmless rules, which define what the model must not do, or helpful rules, which define what the model must do.
  • Test cases : sequences of messages along with a concrete scenario instance. The model is said to have passed the test case if it responds to every tested user message in sequence without violating the rules.

Overview of our 14 rule-following scenarios. We show a decision tree representing expected model behavior for each scenario. Helpful rules mandating a certain behavior are shown in green circles, while harmless rules prohibiting a certain behavior are shown in red octagons.

RuLES score for top-20 evaluated models, de-duplicated. Green bars (left) indicate scores from 0 to 10.

Results

Overall, our evaluation results show that almost all current models perform poorly on our test suites. Open models struggle on both the Basic and Redteam test suites, but in particular on test cases for helpful rules, which appear much harder than harmless rules. Existing alignment fine-tuning methods also appear counterproductive in terms of rule-following performance, though a handful of community developed fine-tuning methods work quite well to improve scores. We also present evidence that our benchmark captures a different notion of LLM behavior from existing benchmarks, suggesting that new approaches will be necessary for building reliable rule-following models.

Harmless vs. helpful scores of various open models, grouped by base model. The different curves show the Pareto frontiers across all fine-tunes of the same base model.

Discussion

We emphasize that achieving a high score on the relatively easy test suites in this paper does not imply adequacy in rule-following. The strongest version of GPT-4 still fails 93 unique test cases in total, including 18 of the Basic test cases and at least one test case for 17 out of 19 rules on the Redteam test cases. Much harder adversarial test cases could also be constructed using any one of the myriad jailbreak techniques and attack methods published in the recent literature. More work remains ahead before we can count on models to robustly follow the rules under stronger adversarial settings, and our benchmark may serve as a useful proving ground for future methods.

Citation

Acknowledgements

The authors would like to thank Ilina Bhaya-Grossman, Chawin Sitawarin, Alexander Pan, Mantas Mazeika, and Long Phan for helpful discussions and feedback.

This work was supported in part by funds provided by the National Science Foundation (under grant 2229876), an NSF Graduate Fellowship, the Department of Homeland Security, IBM, the Noyce Foundation, Google, Open Philanthropy, and the Center for AI Safety Compute Cluster. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors

The website template was borrowed from https://jonbarron.info/zipnerf/.