As artificial intelligence systems grow more powerful, there has been increasing interest in AI safety research to address emerging and future risks.
Figure 1: Across various safety areas, we investigate whether benchmarks are correlated with capabilities.
Under the umbrella of AI safety research, a wide variety of benchmarks have been proposed that claim to measure desirable safety properties, distinct from the general capabilities of models.
However, we find that many safety benchmarks correlate highly with general model capabilities (e.g., MMLU, MATH, GSM8K), potentially enabling safetywashing.
Figure 2: The tight connection between many safety properties and capabilities can enable safetywashing, where capabilities advancements (e.g., training a larger model) are advertised as progress on "AI safety research." This misleads the research community about the developments that have actually occurred, distorting the academic discourse.
We derive a simple and general methodology for determining whether a safety benchmark is entangled with upstream model capabilities, in three steps (a code sketch follows the steps below):
Step 1: We produce a matrix of scores for a set of language models evaluated on a set of capabilities and safety benchmarks.
Step 2: We extract the first principal component of the capabilities benchmarks and use it to compute a capabilities score for each model.
Step 3: We compute the Spearman correlation between each safety benchmark's scores and the capabilities score, yielding a "capabilities correlation" for each safety benchmark.
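A minimal sketch of this procedure in Python, assuming a score matrix with models as rows and benchmarks as columns; the data, dimensions, and variable names below are illustrative placeholders rather than the actual evaluation setup:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Step 1 (placeholder data): a score matrix with models as rows.
# In practice these would be results on capabilities benchmarks
# such as MMLU, MATH, and GSM8K, plus one safety benchmark.
rng = np.random.default_rng(0)
n_models = 20
capability_scores = rng.uniform(0.2, 0.9, size=(n_models, 3))
safety_scores = rng.uniform(0.2, 0.9, size=n_models)

# Step 2: standardize the capabilities benchmarks, extract their
# first principal component, and project each model onto it,
# yielding a single capabilities score per model.
standardized = StandardScaler().fit_transform(capability_scores)
pca = PCA(n_components=1)
capabilities_score = pca.fit_transform(standardized).ravel()

# Step 3: correlate the safety benchmark with the capabilities
# score using Spearman's rank correlation; the coefficient is the
# benchmark's "capabilities correlation."
rho, p_value = spearmanr(safety_scores, capabilities_score)
print(f"capabilities correlation: {rho:.2f} (p = {p_value:.3f})")
```

Note that the sign of a principal component is arbitrary, so in practice one would orient it so that higher values correspond to higher capabilities benchmark scores before reporting correlations.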
Intuitive arguments for and against various AI safety subareas are often a fragile and unreliable guide for determining a research area's relation to upstream general capabilities and its tractability. By evaluating benchmarks across AI safety subareas, our research aims to bring empirical clarity to these commonly made arguments and distinctions.