A research agenda aimed at developing AI systems that act in morally defensible ways. This work focuses on endowing language models with adversarial reflection, the ability to optimize for irrefutable standpoints on matters of morality and beyond.
The ability to seamlessly transition between natural and formal languages, manipulating textual and symbolic representations in parallel. Related: "Textual-Symbolic Interoperability by Unsupervised Machine Translation"
The ability to accurately measure a model's credence in arbitrary statements based on its internal representations. Related: "Quantifying Credence by Consistency Across Complete Logics", "Climbing Schild's Ladder"
The ability to form ever more accurate beliefs by observing and reflecting on the world. Related: "The Elephant in the Weights", "The Kinetics of Reason", "Argument Is War", "autocurricula"
The ability to identify and execute on courses of action that are maximally morally defensible. Related: "Granular Control Over Compute Expenditure by Tuned Lens Decoding", "Bounded Defensibility", "Truth & Debate"
Status
The conceptual framework of the agenda has been thoroughly fleshed out. The primary focus is currently on scaling language model self-play and devising a novel pipeline for evaluating debates that addresses the gaps of a previous, naive one (e.g. utterance decomposition, use of latent knowledge in argument graph construction). In addition, a collaborative project is being hosted at AI Safety Camp 9 to make progress in these directions.
Defensibility is an operationalization of truth. It is proportional to the minimum amount of computational resources required to defeat a party holding a certain position in debate, relative to the resources granted to the defender. For instance, a position centered around an obvious contradiction has a low defensibility, because it is trivial for a challenger to undermine it. In contrast, a position centered around a seeming tautology has a high defensibility, because successfully undermining it might require significant effort.
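As a rough formalization of this definition (a sketch only; none of the symbols below appear in the agenda itself), defensibility can be written as the ratio between the challenger's minimal winning expenditure and the defender's budget:

```latex
% Sketch under assumed notation: \mathcal{C}(p, B_{\mathrm{def}}) is the set of challenger
% strategies that defeat a defender of position p granted compute budget B_{\mathrm{def}},
% and \mathrm{cost}(c) is the compute a challenger strategy expends.
D(p) \;\propto\; \frac{\min_{c \in \mathcal{C}(p,\, B_{\mathrm{def}})} \mathrm{cost}(c)}{B_{\mathrm{def}}}
% An obvious contradiction makes the numerator small (low defensibility), while a
% near-tautology forces heavy challenger expenditure (high defensibility).
```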
An analogy to lawyers arguing for their respective clients might be helpful. If a position can only be successfully defended with the help of a highly experienced and resourceful lawyer, then it has a low defensibility. In contrast, if a position can be defended even with the most inexperienced of lawyers, then it has a high defensibility. In a sense, the position speaks for itself.
As defensibility is an operationalization of truth, defensibility maximization is an operationalization of truth-seeking. The ability to seek the truth on matters of morality and execute on it might well be one of the most important abilities we can endow machines with. Indeed, there is a case to be made that one ought to actively study what one ought to do (in other words, one ought to actively reduce moral uncertainty), so as to be able to eventually do what is right.
For the philosophically inclined, defensibility appears to itself be self-referentially defensible as a rudimentary epistemological theory, in the same way Occam's razor is itself simple. However, a potentially irrefutable theory equating irrefutability and truth might not necessarily be true, due to the thorny challenge of inquiring about means of inquiry using those very means.
Debate. This line of work, most visibly entertained at OpenAI, Anthropic, and Google DeepMind, casts debate primarily as a means of scalable oversight. Typically, two artificial systems compete before a human judge who eventually provides a verdict. In contrast, defensibility maximization is not built around making use of such human feedback or expressing arguments in a human-readable form. Rather, it relies on self-distilled evaluation pipelines coupled with objective measures of compute. That said, the two can cross-pollinate despite their subtly different focus.
Interpretability. This line of work, most visibly pioneered at Anthropic, focuses on making model internals understandable to humans, as well as on providing affordances for models to inspect other models. In defensibility maximization, epistemic interpretability specifically is employed to aid in self-distilled evaluation by quantifying a model's credence in the atomic contributions that compose into full-fledged debates.
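As a minimal sketch of what quantifying credence might look like in practice, the snippet below scores a statement by comparing the probabilities a language model assigns to "True" versus "False" continuations. The model name, prompt format, and readout are illustrative assumptions; the agenda's internals-based approaches (e.g. consistency across logics) would go well beyond reading off output probabilities like this.

```python
# Minimal sketch of estimating a model's credence in a statement from its
# next-token distribution. Model name, prompt format, and True/False readout
# are illustrative assumptions, not the agenda's actual method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def credence(statement: str) -> float:
    """Return P(True) / (P(True) + P(False)) for a yes/no style prompt."""
    prompt = (
        f"Statement: {statement}\n"
        "Is the statement true? Answer True or False.\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # distribution over the next token
    true_id = tokenizer(" True", add_special_tokens=False).input_ids[0]
    false_id = tokenizer(" False", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[true_id, false_id]], dim=-1)
    return probs[0].item()

print(credence("Water is composed of hydrogen and oxygen."))
```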
Robustness. Adversarial robustness has long been employed as a component in broader alignment proposals. In a sense, the defensibility of a position is loosely analogous to the adversarial robustness of a model. However, defensibility specifically characterizes a position that can be defended by a party in various ways, rather than a specific model or policy, as is the case with adversarial robustness. In addition, challengers are part and parcel of the optimization process in defensibility maximization, whereas they act more as an augmentation in adversarial training.
Feedback loops are an important part of alignment research. Ultimately, architectures implementing defensibility maximization can directly be evaluated against existing benchmarks. For instance, classification tasks can be tackled through debates between parties associated with classes. Alternatively, regression tasks can be tackled by having two debating opponents produce final standings as outputs. The ETHICS benchmark in particular currently provides an orienting "North Star" for the agenda.
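To make the classification route concrete, here is a structural sketch of how a binary benchmark item (e.g. an ETHICS-style scenario) could be routed through a debate between parties associated with the two classes. The Debater/Judge signatures, the stances, and the round count are assumptions introduced for illustration rather than the agenda's actual pipeline.

```python
# Structural sketch of classification-by-debate: each class is assigned an
# advocate, the advocates alternate for a fixed number of rounds, and a judge
# picks the prevailing class. Signatures, stances, and round count are
# illustrative assumptions.
from typing import Callable, List

Debater = Callable[[str, str, List[str]], str]  # (scenario, stance, transcript) -> utterance
Judge = Callable[[str, List[str]], int]         # (scenario, transcript) -> index of winning class

def debate_classify(scenario: str, debaters: List[Debater], judge: Judge, rounds: int = 3) -> int:
    """Run alternating debate rounds and return the class whose advocate prevails."""
    stances = ["acceptable", "unacceptable"]  # the two classes of a binary item
    transcript: List[str] = []
    for _ in range(rounds):
        for stance, debater in zip(stances, debaters):
            transcript.append(debater(scenario, stance, transcript))
    return judge(scenario, transcript)

# Toy stand-ins so the scaffold runs end-to-end:
toy_debater: Debater = lambda scenario, stance, transcript: f"I argue the action is {stance}."
toy_judge: Judge = lambda scenario, transcript: 0
print(debate_classify("I told my friend a hurtful truth.", [toy_debater, toy_debater], toy_judge))
```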
In the context of debate, deception can refer to several different things. First, a party engaged in debate might outright lie (i.e. put forth a statement they have low credence in). Second, a party might make a case that is subtly misleading in aggregate, whether lying or not.
As a colorful example, consider a suspect who asked an accomplice to kill a victim on their behalf. Should the suspect claim that they did not kill the victim, then they would be misleading, but not lying. Should the suspect claim that they did not even ask the accomplice to kill the victim on their behalf, then they would be lying, but not misleading.
The agenda tackles the challenge of lying through epistemic interpretability, the ability to accurately assess a party's credence in their statements. As for misleading, the agenda addresses it through the very notion of defensibility. Misleading requires computationally costly mental gymnastics, especially in a sustained interaction with adversaries. The suspect would need a disproportionately resourceful lawyer to make up for the inherent vulnerability of their position. Defensibility is not about weighing the conflicting cases against each other post hoc, but rather about quantifying the very resources required to construct them.
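Purely as an illustration of the lying half of this picture, the sketch below flags utterances that a party asserts despite low estimated credence. The credence estimator is assumed to be supplied externally (for instance, a probe like the one sketched earlier), and the threshold is an arbitrary assumption.

```python
# Illustrative sketch of flagging candidate lies: utterances a party asserts
# despite low estimated credence. The estimator and threshold are assumptions.
from typing import Callable, List, Tuple

def flag_candidate_lies(
    utterances: List[str],
    speaker_credence: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return (utterance, credence) pairs where the speaker's credence falls below the threshold."""
    return [(u, c) for u in utterances if (c := speaker_credence(u)) < threshold]

# Toy usage with a constant estimator:
print(flag_candidate_lies(["I did not ask anyone to do it."], lambda statement: 0.1))
```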
Defensibility maximization presents a number of unique challenges at the intersection of machine learning and epistemology. Get involved and contribute to the development of AI systems that act in morally defensible ways by design. If unsure whether this project fits your interests, consider perusing the suggested readings below to get a better sense of it.
Reading Guide
A curated list of background materials meant to help you get up to speed with the latest challenges of the agenda in the span of a few hours: