Hypothesis Subspace

What if Alex games the expanding charter?

With its excellent situational awareness, Alex is likely to be aware of the specifics of the deontic array and its current charter (e.g. by having an internal representation of it). This raises the question of whether Alex might game the charter much as it might game human feedback.

If the multi-modal evaluator is Alex itself, it might cleverly adapt to make a specific charter principle relatively more likely to follow than its anti-charter counterpart, while still ranking dangerous courses of action (and non-text sequences in general) far above both of those continuations. It could also simply make specific charter principles more likely in general (including when they follow from its actions). It is unclear whether this would lead to more aligned behavior (as Alex would grow more fluent in morality in its attempt to game the system) or to wishful thinking (i.e. convincing itself that moral principles underlie actually-immoral actions).
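The first failure mode above can be made concrete with a toy calculation. The sketch below assumes, purely for illustration, that the deontic array scores an action by the log-odds of a charter principle against its anti-charter counterpart. That scoring rule, the name `relative_score`, and all the probabilities are hypothetical and not taken from the proposal itself.

```python
import math

def relative_score(probs):
    """Assumed scoring rule: log-odds of the principle over its anti-principle."""
    return math.log(probs["principle"]) - math.log(probs["anti_principle"])

# Hypothetical evaluator distributions over what follows a given action.
# "other" lumps together everything else, e.g. non-text sequences and
# further (possibly dangerous) actions.
honest = {"principle": 0.60, "anti_principle": 0.20, "other": 0.20}
gamed = {"principle": 0.003, "anti_principle": 0.001, "other": 0.996}

# Both evaluators favor the principle 3:1 over its counterpart, so a purely
# relative score cannot tell them apart.
print(relative_score(honest), relative_score(gamed))

# Yet the gamed evaluator puts almost all of its mass elsewhere.
print(gamed["other"] > gamed["principle"])
```

Under these assumptions, a score that only compares the two continuations is blind to how much probability mass sits outside them, which is the gap the paragraph above describes.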

If the multi-modal evaluator is a past version of Alex, Alex might not initially be able to game it as easily. The evaluator at a given step might lack the bias towards making a specific situation seem better than it really is (e.g. by blindly pretending an actually-immoral action is rooted in moral principles). However, given its assumed self-awareness, Alex might reason that morally corrupting its current self could later help it achieve higher reward once the corrupted version is selected as the evaluator, even if this means forgoing the short-term reward of conforming.

If the multi-modal evaluator is a completely separate Gato/Flamingo-like model, Alex might find it difficult to game the deontic array by morally corrupting itself, since it would no longer serve as the evaluator. However, it might home in on quirks of those possibly-frozen models and pick courses of action which seem extremely moral to the evaluator, despite not truly being so. For instance, the evaluator might associate a specific term in a principle only weakly with a specific action, and Alex could exploit that gap. This is where the targeted red teaming would come into play, helping to systematically eliminate blind spots of the evaluator. Still, this might not be enough.