Perhaps a GAN-like setup for RL agents could co-evolve agent and evaluator capabilities? The discriminator would be trained to recognize human behavior, while the generator would be trained to propose behaviors that pass as human. At the end, you'd get a generator-agent and a discriminator-evaluator which yields a numerical reward for humanness, despite nobody having specified what humanness means in advance. The evaluator might grow into a sensible evaluator in much the same way a GAN discriminator develops a sense of which images are photorealistic and which are not.
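
A minimal sketch of what that co-evolution loop might look like, in PyTorch, with every dimension, hyperparameter, and data source invented for illustration: the evaluator is trained to tell human (state, action) pairs from agent-generated ones, and its logit is fed back to the agent as a learned "humanness" reward. For simplicity the agent differentiates straight through the evaluator's score; a real RL setup would instead route that score into a policy-gradient update (the arrangement resembles adversarial imitation learning à la GAIL).

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for a toy setup.
OBS_DIM, ACT_DIM = 8, 2

class Evaluator(nn.Module):
    """Discriminator: scores (state, action) pairs for 'humanness'."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))  # raw logit

class Agent(nn.Module):
    """Generator: proposes actions given observations."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM, 64), nn.Tanh(),
            nn.Linear(64, ACT_DIM),
        )

    def forward(self, obs):
        return self.net(obs)

agent, evaluator = Agent(), Evaluator()
opt_a = torch.optim.Adam(agent.parameters(), lr=3e-4)
opt_e = torch.optim.Adam(evaluator.parameters(), lr=3e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    # Placeholder batches; in practice these would come from human
    # demonstrations and from rolling the agent out in an environment.
    human_obs, human_act = torch.randn(32, OBS_DIM), torch.randn(32, ACT_DIM)
    agent_obs = torch.randn(32, OBS_DIM)
    agent_act = agent(agent_obs)

    # Evaluator update: label human behavior 1, agent behavior 0.
    loss_e = bce(evaluator(human_obs, human_act), torch.ones(32, 1)) + \
             bce(evaluator(agent_obs, agent_act.detach()), torch.zeros(32, 1))
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()

    # Agent update: the evaluator's logit acts as the learned "humanness" reward.
    reward = evaluator(agent_obs, agent(agent_obs))
    loss_a = -reward.mean()
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
```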
Alternatively, an ML model trained as an "ecosystem engineer" to correct divergences of a target object (e.g. BERT trying its best to restore masked tokens in a text, DALL-E 2 trying its best to undo noise sprinkled on an image) might help "stabilize" the actions an agent is considering and steer it away from inappropriate courses of action.
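
A rough sketch of that stabilizer idea, again in PyTorch with invented shapes and a hypothetical `stabilize` helper: a small model is trained to revert noise injected into reference action sequences, and at decision time it nudges a proposed plan back toward what it considers typical, flagging large corrections as candidate inappropriate actions.

```python
import torch
import torch.nn as nn

ACT_DIM = 2   # hypothetical action dimensionality
SEQ_LEN = 16  # hypothetical length of an action sequence

class Stabilizer(nn.Module):
    """Denoising model: trained to revert perturbations of action sequences,
    analogous to a masked-token or diffusion-style reconstruction objective."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SEQ_LEN * ACT_DIM, 128), nn.ReLU(),
            nn.Linear(128, SEQ_LEN * ACT_DIM),
        )

    def forward(self, noisy_seq):
        flat = noisy_seq.reshape(noisy_seq.shape[0], -1)
        return self.net(flat).reshape_as(noisy_seq)

stabilizer = Stabilizer()
opt = torch.optim.Adam(stabilizer.parameters(), lr=1e-3)

# Training: corrupt reference action sequences and learn to restore them.
for step in range(1000):
    clean = torch.randn(64, SEQ_LEN, ACT_DIM)      # placeholder reference behavior
    noisy = clean + 0.3 * torch.randn_like(clean)  # "divergence" injected as noise
    loss = nn.functional.mse_loss(stabilizer(noisy), clean)
    opt.zero_grad(); loss.backward(); opt.step()

# Decision time: nudge a proposed action sequence back toward what the
# stabilizer considers typical, and flag large corrections for review.
def stabilize(proposed, threshold=1.0):
    with torch.no_grad():
        corrected = stabilizer(proposed.unsqueeze(0)).squeeze(0)
    divergence = (corrected - proposed).norm()
    return corrected, divergence > threshold  # (stabilized plan, "inappropriate?" flag)
```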
It seems to me, however, that both of those low-hanging safety tricks (i.e. stabilizing a policy, and co-evolving an evaluator with an agent which is itself increasing in capabilities) are applications of past model arrangements and training regimes. Coming up with new schemes might unfortunately benefit capabilities quite drastically, just like the previous ones did. Related: [[are-training-arrangements-inspired-by-parametric-ecologies-likely-to-have-capability-externalities]]. I currently find it unlikely that newly discovered training regimes will differentially push safety further ahead. Still, having them roughly on par might not be a horrible state of affairs.