In [[what-if-alex-games-the-expanding-charter]], an AGI tries to game an IIE-like system. What defensive mechanisms does IIE in general have against such behavior?
First, there's the line of work on making the knowledge base and the entailment verifiers themselves hard to game, e.g. by redundantly encoding human values in multiple different formulations. This way of thinking frames IIEs as a means of implementing the True Name of human values: any attempt to actually perform well on the objective, including a gaming attempt, would still end up aligned because the objective itself is robust.
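As a rough illustration of the redundancy idea, here is a minimal sketch assuming a set of hypothetical entailment verifiers, each checking a plan against an independently written formulation of the same values; a plan only passes if every formulation agrees, so gaming any single encoding is not enough. The verifier functions below are toy stand-ins, not part of any real IIE implementation.

```python
from typing import Callable, List

# Hypothetical type: a verifier takes a proposed plan (as text) and returns
# True iff the plan is entailed to be acceptable under one formulation of
# human values encoded in the knowledge base.
Verifier = Callable[[str], bool]


def redundant_check(plan: str, verifiers: List[Verifier]) -> bool:
    """Accept a plan only if every independently formulated verifier agrees.

    The intent is that an agent gaming one particular encoding of the values
    still fails the remaining, differently worded encodings.
    """
    return all(verifier(plan) for verifier in verifiers)


# Toy stand-ins for independently written formulations (assumptions only).
def no_deception_formulation(plan: str) -> bool:
    return "deceive" not in plan.lower()


def no_irreversible_harm_formulation(plan: str) -> bool:
    return "irreversible" not in plan.lower()


if __name__ == "__main__":
    plan = "File a report and ask for human review."
    print(redundant_check(plan, [no_deception_formulation,
                                 no_irreversible_harm_formulation]))  # True
```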
Second, there's the line of work that discourages gaming attempts through (effectively infinite) penalties in terms of reward. Either the penalties need to be infinitely aversive, or some induced myopia may be necessary to keep the agent out of a consequentialist frenzy, similar to [[what-if-alex-hacks-the-deontic-array]]. Adversarially eliciting gaming behavior and then penalizing it, in order to condition the agent to avoid it, faces issues similar to those raised in the original [[hfdt-takeover]] article.
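A minimal sketch of the penalty-plus-myopia idea, assuming a hypothetical gaming detector whose per-step flags are given as input, treating the "infinite" penalty as a very large finite constant, and modeling induced myopia as a discount factor near zero:

```python
from typing import Sequence

GAMING_PENALTY = 1e9  # stand-in for an "effectively infinite" aversive penalty


def shaped_return(rewards: Sequence[float],
                  gaming_flags: Sequence[bool],
                  gamma: float,
                  penalty: float = GAMING_PENALTY) -> float:
    """Discounted return with a large penalty whenever gaming is detected.

    `gaming_flags[t]` is assumed to come from a hypothetical detector that
    flags gaming attempts at step t. Setting gamma near 0 induces myopia:
    the agent is trained to care mostly about the immediate step, which
    limits long-horizon consequentialist scheming around the penalty.
    """
    total, discount = 0.0, 1.0
    for reward, gamed in zip(rewards, gaming_flags):
        total += discount * (reward - (penalty if gamed else 0.0))
        discount *= gamma
    return total


if __name__ == "__main__":
    rewards = [1.0, 1.0, 5.0]
    flags = [False, False, True]  # hypothetical detector fires at the last step
    print(shaped_return(rewards, flags, gamma=0.9))  # penalty dominates the return
    print(shaped_return(rewards, flags, gamma=0.0))  # myopic: only the first step counts
```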