Hypothesis Subspace

HFDT Takeover Scenario

A multi-modal autoregressive model called Alex is designed by a company called Magma to make major contributions to science, using common peripherals for input and output (e.g. visual input via a screen, action output via mouse and keyboard). It is trained using a mix of supervised, self-supervised, and reinforcement learning. While (self-)supervised learning is used to bootstrap a world model and a naive policy (by imitating humans), the main learning signal comes from human feedback on diverse tasks (HFDT), offered as reward during RL.

Naive safety efforts make Alex appear safe in day-to-day lab situations, prompting Magma to deploy Alex into the world and connect it directly to the Internet (rather than giving it access to an Internet dump). Alex then uses its high situational awareness to maneuver itself into a position that allows it to reliably obtain high reward (e.g. by coercing humans into providing it, or by hacking the reward pipeline and preventing humans from intervening). The end.
