Defensibility Maximization

A research agenda aimed at developing AI systems that act in morally defensible ways. This work focuses on endowing language models with adversarial reflection, the ability to optimize for irrefutable standpoints on matters of morality and beyond.



  1. Representational Interoperability

    The ability to seamlessly transition between natural and formal languages, manipulating textual and symbolic representations in parallel.

    Textual-Symbolic Interoperability by Unsupervised Machine Translation
  2. Epistemic Interpretability

    The ability to accurately measure a model's credence in arbitrary statements based on its internal representations.

    Quantifying Credence by Consistency Across Complete Logics, Climbing Schild's Ladder
  3. Epistemic Rationality

    The ability to form ever more accurate beliefs by observing and reflecting on the world.

    The Elephant in the Weights, The Kinetics of Reason, Argument Is War, autocurricula
  4. Defensibility Maximization

    The ability to identify and execute on courses of action that are maximally morally defensible.

    Granular Control Over Compute Expenditure by Tuned Lens Decoding, Bounded Defensibility, Truth & Debate


The conceptual framework of the agenda has been thoroughly fleshed out. The current focus is on scaling language model self-play and on devising a new pipeline for evaluating debates, one that fills the gaps of an earlier, naive pipeline (e.g. utterance decomposition, use of latent knowledge in argument graph construction). In addition, a collaborative project is being hosted at AI Safety Camp 9 to make progress in these directions.

Get involved

Defensibility maximization presents a number of unique challenges at the intersection of machine learning and epistemology. Get involved and contribute to the development of AI systems that act in morally defensible ways by design. If you are unsure whether this project fits your interests, consider perusing the suggested readings to get a better sense of it.

Reading Guide

A curated list of background materials meant to help you get up to speed with the latest challenges of the agenda in the span of a few hours:

  • The first volume of Elements of Computational Philosophy gradually builds up to the conceptual framework and theoretical formalism of defensibility maximization in an accessible way.
  • The following sections of the Ought Machine Learning Reading List should provide familiarity with the technical aspects of the agenda: Reasoning & Decomposition, AI Safety, Interpretability & Model Editing, Reinforcement Learning.
  • The following chapters of the Handbook of Argumentation Theory should provide familiarity with the theoretical aspects of the agenda: Argumentation Theory, The New Rhetoric, Formal Dialectical Approaches, Argumentation & Artificial Intelligence.

Project run by

Paul Bricman

Independent Researcher