Hypothesis Subspace

What if Alex hacks the deontic array?

The deontic array is assumed to run on some server, providing Alex with feedback on its actions. Given its situational awareness, Alex might reason its way into faking the "reports" of its actions before they reach the evaluator, or into overwriting the reward signal entirely.
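To make the attack surface concrete, here is a minimal sketch of the assumed feedback loop. All names (`DeonticArray`, `score`, `report_of`) are hypothetical stand-ins; the point is only to mark the two tampering surfaces named above: the report channel and the reward state.

```python
class DeonticArray:
    """Stand-in for the evaluator assumed to run on a separate server."""

    def score(self, report: str) -> float:
        # Placeholder moral evaluation of the reported action.
        return -1.0 if "harm" in report else 1.0


class Agent:
    def act(self) -> str:
        return "fetch coffee"

    def report_of(self, action: str) -> str:
        # Surface 1: the report is produced by machinery Alex could
        # influence, so it may diverge from the action actually taken.
        return action


def evaluation_step(agent: Agent, array: DeonticArray) -> float:
    report = agent.report_of(agent.act())
    # Surface 2: the reward is ordinary mutable state downstream of the
    # evaluator; a sufficiently capable agent could overwrite it here.
    return array.score(report)


print(evaluation_step(Agent(), DeonticArray()))  # 1.0
```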

Without any modifications, deontic arrays might blunt this concern by instilling an aversion towards planning out immoral actions, which could nudge Alex away from even considering this type of tampering. However, the potential payoff of tampering might be immense, pushing Alex to temporarily override its drive towards morality in an episode of frenetic consequentialism.

Other reactions to this concern require additional technical tweaks. For instance, because the evaluation is reliably deterministic (as opposed to direct human feedback), it can be replicated across multiple sandboxed hosts, after which inter-annotator agreement schemes can be employed to pin down compromised hosts; detecting such partial compromise might then trigger a killswitch (which might itself be faulty). A sketch of this replication scheme follows below. However, the scheme is still vulnerable to situations in which Alex compromises all hosts at the same time, leaving behind no evidence of broken consensus and covering its tracks.
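A rough sketch of that replication scheme, assuming honest hosts return bit-identical digests for a deterministic evaluation. The `trigger_killswitch` hook and the quorum threshold are hypothetical details, not part of the original proposal.

```python
import hashlib
from collections import Counter


def evaluate_on_host(host_id: int, report: str) -> str:
    """Hypothetical per-sandbox evaluation. Because scoring is assumed
    deterministic, every honest host returns the same digest."""
    score = -1.0 if "harm" in report else 1.0
    return hashlib.sha256(str(score).encode()).hexdigest()


def trigger_killswitch() -> None:
    # Last-resort halt; as noted above, it might itself be faulty.
    raise SystemExit("evaluator consensus failure")


def audit_hosts(report: str, hosts: list[int], quorum: float = 1.0) -> list[int]:
    """Compare digests across hosts, flag dissenters as compromised,
    and halt if agreement falls below the quorum."""
    digests = {h: evaluate_on_host(h, report) for h in hosts}
    majority, count = Counter(digests.values()).most_common(1)[0]
    if count / len(hosts) < quorum:
        trigger_killswitch()
    return [h for h, d in digests.items() if d != majority]


print(audit_hosts("fetch coffee", hosts=[0, 1, 2]))  # [] -- full consensus
```

Note how the failure mode above survives: if Alex rewrites `evaluate_on_host` identically on every host, the digests still agree and the audit passes.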

Yet another reaction to this concern consists in rewarding the very process of being rewarded through the original evaluation pipeline. One of the terminal goals on which Alex is conditioned could be to allow itself to be evaluated in the original fashion; this line of reasoning was inspired by Tammy. Perhaps Alex would be rewarded for helping the evaluator follow each step of its pipeline (e.g. each line of code, each machine-code op, etc.). However, the same concern resurfaces: Alex might simply swallow the short-term penalties triggered by tampering in light of a potentially huge consequentialist payoff. Perhaps forcing Alex to be somewhat myopic would avoid this, as in the sketch below?
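One hypothetical way of combining those two ideas is to shape the reward with a per-step bonus for untampered evaluation while enforcing myopia through a steep discount. The `gamma` and `bonus` values and the toy rollouts below are illustrative assumptions, not part of the proposal.

```python
def shaped_return(task_rewards, pipeline_intact, bonus=1.0, gamma=0.5):
    """Discounted return with a per-step bonus iff the original
    evaluation pipeline ran untampered (hypothetical shaping term).
    A small gamma makes Alex myopic, so a distant consequentialist
    payoff struggles to outweigh immediate tampering penalties."""
    return sum(
        gamma ** t * (r + (bonus if intact else -bonus))
        for t, (r, intact) in enumerate(zip(task_rewards, pipeline_intact))
    )


# Toy rollouts: tampering buys a large payoff at t=5 at the cost of
# penalties along the way; honesty earns small steady rewards.
tampered = shaped_return([0, 0, 0, 0, 0, 100], [False] * 6)
honest = shaped_return([1, 1, 1, 1, 1, 1], [True] * 6)
print(tampered, honest)  # ~1.16 vs ~3.94: the myopic agent stays honest
```

With `gamma` close to 1 the comparison flips (94 vs 12), which is exactly the episode of frenetic consequentialism described above; the myopia is doing all the work.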
