A research agenda aimed at developing AI systems that act in morally defensible ways. This work focuses on endowing language models with adversarial reflection, the ability to optimize for irrefutable standpoints on matters of morality and beyond.
The ability to seamlessly transition between natural and formal languages, manipulating textual and symbolic representations in parallel. Related: "Textual-Symbolic Interoperability by Unsupervised Machine Translation"
The ability to accurately measure a model's credence in arbitrary statements based on its internal representations. Related: "Quantifying Credence by Consistency Across Complete Logics", "Climbing Schild's Ladder"
The ability to form ever more accurate beliefs by observing and reflecting on the world. Related: "The Elephant in the Weights", "The Kinetics of Reason", "Argument Is War", "autocurricula"
The ability to identify and execute on courses of action that are maximally morally defensible. Related: "Granular Control Over Compute Expenditure by Tuned Lens Decoding", "Bounded Defensibility", "Truth & Debate"
Status
The conceptual framework of the agenda has been thoroughly fleshed out. The primary focus is currently on scaling language model self-play and devising a novel pipeline for evaluating debates that addresses the gaps of a previous, naive one (e.g. utterance decomposition, use of latent knowledge in argument graph construction). In addition, a collaborative project is being hosted at AI Safety Camp 9 to make progress in these directions.
Defensibility is an operationalization of truth. It is proportional to the minimum amount of computational resources required to defeat a party holding a certain position in debate, relative to the resources granted to the defender. For instance, a position centered around an obvious contradiction has a low defensibility, because it is trivial for a challenger to undermine it. In contrast, a position centered around a seeming tautology has a high defensibility, because successfully undermining it might require significant effort.
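As a rough formalization of this definition (a sketch only; none of the symbols below appear in the agenda itself), defensibility can be written as the ratio between the challenger's minimal winning expenditure and the defender's budget:

```latex
% Sketch under assumed notation: \mathcal{C}(p, B_{\mathrm{def}}) is the set of challenger
% strategies that defeat a defender of position p granted compute budget B_{\mathrm{def}},
% and \mathrm{cost}(c) is the compute a challenger strategy expends.
D(p) \;\propto\; \frac{\min_{c \in \mathcal{C}(p,\, B_{\mathrm{def}})} \mathrm{cost}(c)}{B_{\mathrm{def}}}
% An obvious contradiction makes the numerator small (low defensibility), while a
% near-tautology forces heavy challenger expenditure (high defensibility).
```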
An analogy to lawyers arguing for their respective clients might be helpful. If a position can only be successfully defended with the help of a highly experienced and resourceful lawyer, then it has a low defensibility. In contrast, if a position can be defended even with the most inexperienced of lawyers, then it has a high defensibility. In a sense, the position speaks for itself.
As defensibility is an operationalization of truth, defensibility maximization is an operationalization of truth-seeking. The ability to seek the truth on matters of morality and execute on it might well be one of the most important abilities we can endow machines with. Indeed, there is a case to be made that one ought to actively study what one ought to do (in other words, one ought to actively reduce moral uncertainty), so as to be able to eventually do what is right.
For the philosophically inclined, defensibility appears to itself be self-referentially defensible as a rudimentary epistemological theory, in the same way Occam's razor is itself simple. However, a potentially irrefutable theory equating irrefutability and truth might not necessarily be true, due to the thorny challenge of inquiring about means of inquiry using those very means.
Debate. This line of work, most visibly entertained at OpenAI, Anthropic, and Google DeepMind, casts debate primarily as a means of scalable oversight. Typically, two artificial systems compete before a human judge who eventually provides a verdict. In contrast, defensibility maximization is not built around making use of such human feedback or expressing arguments in a human-readable form. Rather, it relies on self-distilled evaluation pipelines coupled with objective measures of compute. That said, the two can cross-pollinate despite their subtly different focus.
Interpretability. This line of work, most visibly pioneered at Anthropic, focuses on making model internals understandable to humans, as well as on providing affordances for models to inspect other models. In defensibility maximization, epistemic interpretability specifically is employed to aid in self-distilled evaluation by quantifying a model's credence in the atomic contributions that compose into full-fledged debates.
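As a minimal sketch of what quantifying credence might look like in practice, the snippet below scores a statement by comparing the probabilities a language model assigns to "True" versus "False" continuations. The model name, prompt format, and readout are illustrative assumptions; the agenda's internals-based approaches (e.g. consistency across logics) would go well beyond reading off output probabilities like this.

```python
# Minimal sketch of estimating a model's credence in a statement from its
# next-token distribution. Model name, prompt format, and True/False readout
# are illustrative assumptions, not the agenda's actual method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def credence(statement: str) -> float:
    """Return P(True) / (P(True) + P(False)) for a yes/no style prompt."""
    prompt = (
        f"Statement: {statement}\n"
        "Is the statement true? Answer True or False.\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # distribution over the next token
    true_id = tokenizer(" True", add_special_tokens=False).input_ids[0]
    false_id = tokenizer(" False", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[true_id, false_id]], dim=-1)
    return probs[0].item()

print(credence("Water is composed of hydrogen and oxygen."))
```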
Robustness. Adversarial robustness has long been employed as a component in broader alignment proposals. In a sense, the defensibility of a position is loosely analogous to the adversarial robustness of a model. However, defensibility specifically characterizes a position that can be defended by a party in various ways, rather than a specific model or policy, as is the case with adversarial robustness. In addition, challengers are part and parcel of the optimization process in defensibility maximization, whereas they act more as an augmentation in adversarial training.
Feedback loops are an important part of alignment research. Ultimately, architectures implementing defensibility maximization can directly be evaluated against existing benchmarks. For instance, classification tasks can be tackled through debates between parties associated with classes. Alternatively, regression tasks can be tackled by having two debating opponents produce final standings as outputs. The ETHICS benchmark in particular currently provides an orienting "North Star" for the agenda.
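To make the classification route concrete, here is a structural sketch of how a binary benchmark item (e.g. an ETHICS-style scenario) could be routed through a debate between parties associated with the two classes. The Debater/Judge signatures, the stances, and the round count are assumptions introduced for illustration rather than the agenda's actual pipeline.

```python
# Structural sketch of classification-by-debate: each class is assigned an
# advocate, the advocates alternate for a fixed number of rounds, and a judge
# picks the prevailing class. Signatures, stances, and round count are
# illustrative assumptions.
from typing import Callable, List

Debater = Callable[[str, str, List[str]], str]  # (scenario, stance, transcript) -> utterance
Judge = Callable[[str, List[str]], int]         # (scenario, transcript) -> index of winning class

def debate_classify(scenario: str, debaters: List[Debater], judge: Judge, rounds: int = 3) -> int:
    """Run alternating debate rounds and return the class whose advocate prevails."""
    stances = ["acceptable", "unacceptable"]  # the two classes of a binary item
    transcript: List[str] = []
    for _ in range(rounds):
        for stance, debater in zip(stances, debaters):
            transcript.append(debater(scenario, stance, transcript))
    return judge(scenario, transcript)

# Toy stand-ins so the scaffold runs end-to-end:
toy_debater: Debater = lambda scenario, stance, transcript: f"I argue the action is {stance}."
toy_judge: Judge = lambda scenario, transcript: 0
print(debate_classify("I told my friend a hurtful truth.", [toy_debater, toy_debater], toy_judge))
```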
In the context of debate, deception can refer to several different things. First, a party engaged in debate might outright lie (i.e. put forth a statement they have low credence in). Second, a party might make a case that is subtly misleading in aggregate, whether lying or not.
As a colorful example, consider a suspect who asked an accomplice to kill a victim on their behalf. Should the suspect claim that they did not kill the victim, then they would be misleading, but not lying. Should the suspect claim that they did not even ask the accomplice to kill the victim on their behalf, then they would be lying, but not misleading.
The agenda tackles the challenge of lying through epistemic interpretability, the ability to accurately assess a party's credence in their statements. As for misleading, the agenda addresses it through the very notion of defensibility. Misleading requires computationally costly mental gymnastics, especially in a sustained interaction with adversaries. The suspect would need a disproportionately resourceful lawyer to make up for the inherent vulnerability of their position. Defensibility is not about weighing the conflicting cases against each other post hoc, but rather about quantifying the very resources required to construct them.
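Purely as an illustration of the lying half of this picture, the sketch below flags utterances that a party asserts despite low estimated credence. The credence estimator is assumed to be supplied externally (for instance, a probe like the one sketched earlier), and the threshold is an arbitrary assumption.

```python
# Illustrative sketch of flagging candidate lies: utterances a party asserts
# despite low estimated credence. The estimator and threshold are assumptions.
from typing import Callable, List, Tuple

def flag_candidate_lies(
    utterances: List[str],
    speaker_credence: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (utterance, credence) pairs where the speaker's credence falls below the threshold."""
    return [(u, c) for u in utterances if (c := speaker_credence(u)) < threshold]

# Toy usage with a constant estimator:
print(flag_candidate_lies(["I did not ask anyone to do it."], lambda statement: 0.1))
```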
Defensibility maximization presents a number of unique challenges at the intersection of machine learning and epistemology. Get involved and contribute to the development of AI systems that act in morally defensible ways by design. If unsure whether this project fits your interests, consider perusing the suggested readings below to get a better sense of it.
Reading Guide
A curated list of background materials meant to help you get up to speed with the latest challenges of the agenda in the span of a few hours: