Safetywashing: Do AI Safety Benchmarks
Actually Measure Safety Progress?

"For better or worse, benchmarks shape a field." — David Patterson

As artificial intelligence systems grow more powerful, there has been increasing interest in AI safety research to address emerging and future risks. We conduct a comprehensive empirical meta-analysis of AI safety benchmarks to date, analyzing their correlation with general capabilities across dozens of models and providing a survey of existing directions in AI safety. In doing so, we provide clarity on whether common AI safety benchmarks measure truly distinct properties or are heavily entangled with upstream general capabilities (e.g., general knowledge and reasoning).



Figure 1: Across various safety areas, we investigate whether benchmarks are correlated with capabilities.

Under the umbrella of AI safety research, a wide variety of benchmarks have been proposed that claim to measure desirable safety properties, distinct from the general capabilities of models.

However, we find that many safety benchmarks are highly correlated with general model capabilities (e.g., MMLU, MATH, GSM8K), potentially enabling safetywashing – where capabilities advancements can be misrepresented as safety research. While we don't claim that safety and capabilities are necessarily orthogonal, we do claim that AI safety research efforts should focus on differential safety progress – making models safer beyond the default trajectory of capability advances.


Figure 2: The tight connection between many safety properties and capabilities can enable safetywashing, where capabilities advancements (e.g., training a larger model) are advertised as progress on "AI safety research." This misleads the research community about what developments have actually occurred, distorting the academic discourse.


Outline

We derive a simple and highly general methodology for determining whether a safety benchmark is entangled with upstream model capabilities (a minimal code sketch follows the steps below):
Step 1: We produce a matrix of scores for a set of language models evaluated on a set of capabilities and safety benchmarks.
Step 2: We extract the first principal component of the capabilities benchmarks and use it to compute a capabilities score for each model.
Step 3: We identify whether safety benchmark scores have high correlations with the capabilities score using Spearman's correlation, deriving a "capabilities correlation" for each safety benchmark.
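The three steps above amount to a short computation. The following is a minimal sketch in Python, assuming a hypothetical score matrix filled with placeholder data and hypothetical safety benchmark names; it illustrates the procedure rather than reproducing the paper's exact evaluation pipeline.

```python
# Minimal sketch of the three-step methodology described above.
# The score matrices and safety benchmark names below are hypothetical placeholders.
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Step 1: matrix of scores (rows = models, columns = benchmarks).
# Placeholder data; in practice these are real evaluation results.
capability_benchmarks = ["MMLU", "MATH", "GSM8K"]
safety_benchmarks = ["SafetyBenchA", "SafetyBenchB"]   # hypothetical names
n_models = 20
cap_scores = rng.uniform(0, 100, size=(n_models, len(capability_benchmarks)))
safety_scores = rng.uniform(0, 100, size=(n_models, len(safety_benchmarks)))

# Step 2: first principal component of the capabilities benchmarks gives a
# single "capabilities score" per model (columns standardized first).
# The sign of a principal component is arbitrary; in practice it is oriented
# so that higher values correspond to more capable models.
cap_z = (cap_scores - cap_scores.mean(axis=0)) / cap_scores.std(axis=0)
capabilities_score = PCA(n_components=1).fit_transform(cap_z).ravel()

# Step 3: Spearman correlation between each safety benchmark and the
# capabilities score -- the "capabilities correlation" of that benchmark.
for j, name in enumerate(safety_benchmarks):
    rho, _ = spearmanr(safety_scores[:, j], capabilities_score)
    print(f"{name}: capabilities correlation = {rho:.2f}")
```

Standardizing the capabilities columns before PCA keeps benchmarks reported on different scales from dominating the first component, and Spearman's rank correlation makes the resulting "capabilities correlation" insensitive to monotonic rescalings of benchmark scores.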

Intuitive arguments for and against various AI safety subareas are often a fragile and unreliable guide to a research area's relation to upstream general capabilities and to its tractability. By evaluating benchmarks across AI safety subareas, our research aims to bring empirical clarity to these commonly made arguments and distinctions.

Alignment

We should work on alignment because:

  1. Misinterpretation Risks: AIs could catastrophically fail to capture and abide by human intentions, leading to undesirable outcomes.
  2. Capability Misgeneralization vs. Goal Misgeneralization Distinction: We need robust alignment, not just robust capabilities, especially when systems operate out-of-distribution.
  3. Capability vs. Aimability Distinction: Alignment is about making models helpful, harmless, and honest, which is necessary for safety.

We should not work on alignment because:

  1. Alignment as AGI: If alignment requires satisfying all human preferences, it essentially becomes the task of building AGI.
  2. Alignment as business alignment: Current operationalization of alignment often reduces it to business-centric task preferences.
  3. Philosophical challenges with preferences: Various types of preferences (revealed, stated, idealized) present significant challenges and may not be worth satisfying.

@article{ren2024safetywashing,
  title={Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?},
  author={Richard Ren and Steven Basart and Adam Khoja and Alice Gatti and Long Phan and Xuwang Yin and Mantas Mazeika and Alexander Pan and Gabriel Mukobi and Ryan H. Kim and Stephen Fitz and Dan Hendrycks},
  year={2024},
  eprint={2407.21792},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2407.21792},
}