Hypothesis Subspace

What if the agent gets one step ahead by chance?

What if, by whatever means, the agent somehow gets one step ahead of the evaluator? In that case, it can exploit the evaluator's blindspots better than the evaluator can spot its tricks. From this position, the agent can plausibly maneuver itself into a rewarding-yet-misaligned position while subtly remaining under the evaluator's radar. If there is always a small chance of this happening, then it is virtually guaranteed to happen over a long enough period of time, which is unfortunate.

The best way of avoiding the nasty outcomes of this state would be not to get into it in the first place. This is where unilateral training signals to beef up the evaluator would come into play, perhaps combined with future exploitation giving the evaluator a step ahead in anticipating the agent's moves and reacting to them. Provable robustness in a future exploitation setting might make for a solid foundation for this approach.
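As a toy illustration of what keeping the evaluator a step ahead might look like mechanically, the sketch below gives the evaluator several updates on the agent's fresh behavior before the agent itself gets to update. The classes, the scoring rule, and the 3:1 update ratio are all illustrative assumptions, not details from the above.

```python
import random

# Toy sketch of an asymmetric update schedule meant to keep the
# evaluator a step ahead of the agent. ToyAgent, ToyEvaluator, and the
# 3:1 update ratio are illustrative assumptions.

class ToyAgent:
    def act(self):
        return random.random()  # stand-in for sampling from a policy

    def update(self, score):
        pass  # stand-in for a gradient step on the evaluator's score

class ToyEvaluator:
    def __init__(self):
        self.threshold = 0.5

    def score(self, action):
        return 1.0 if action < self.threshold else 0.0

    def update(self, action):
        # Drift toward the agent's latest behavior; a real evaluator
        # would take gradient steps on fresh trajectories instead.
        self.threshold = 0.9 * self.threshold + 0.1 * action

EVALUATOR_STEPS_PER_AGENT_STEP = 3  # the evaluator's head start

agent, evaluator = ToyAgent(), ToyEvaluator()
for _ in range(100):
    action = agent.act()
    # The evaluator sees the fresh behavior first and updates several
    # times, so the agent always trains against a hardened evaluator.
    for _ in range(EVALUATOR_STEPS_PER_AGENT_STEP):
        evaluator.update(action)
    agent.update(evaluator.score(action))
```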

However, assuming the agent still gets one step ahead, what could be done? Its exploit might show up as a spike in reward. Maybe spikes in reward should be penalized? This would slow down the whole training process, especially in the beginning, when large legitimate improvements are common. Maybe allow reward spikes early on, but disallow them later in training?
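One way to operationalize this is a spike penalty whose tolerance is annealed over training: lenient early on, strict later. Below is a minimal sketch assuming a scalar reward stream; the clipping rule, parameter names, and values are all illustrative assumptions.

```python
import numpy as np

def spike_penalized_reward(raw_reward, history, step, total_steps,
                           early_tolerance=12.0, late_tolerance=2.0,
                           warmup=10):
    """Clip rewards that spike far above the recent running mean.

    The tolerance (in standard deviations above the mean) is annealed
    linearly: generous early in training, when large improvements are
    legitimate, and strict later, when a sudden spike is more suspicious.
    Parameter names and values are illustrative assumptions.
    """
    if len(history) < warmup:
        return raw_reward  # not enough history to estimate a baseline
    progress = step / total_steps
    tolerance = early_tolerance + (late_tolerance - early_tolerance) * progress
    mean = np.mean(history)
    std = np.std(history) + 1e-8  # avoid a degenerate ceiling
    ceiling = mean + tolerance * std
    # Clip rather than zero out, so the training signal is dampened,
    # not destroyed, when a spike occurs.
    return min(raw_reward, ceiling)

# Example: the same spike of 10.0 is tolerated early in training but
# clipped to roughly mean + 2.5 std near the end.
rng = np.random.default_rng(0)
history = rng.normal(1.0, 1.0, size=100).tolist()
print(spike_penalized_reward(10.0, history, step=50, total_steps=1000))
print(spike_penalized_reward(10.0, history, step=950, total_steps=1000))
```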