It is plausible that the evaluator learns to perfectly spot the behaviors or outcomes specified as desirable by humans, and simply flags anything else as inappropriate. In that situation, the agent would be incentivized to closely mimic human input verbatim. This seems problematic, especially since an agent's affordances might expand drastically as it gains capability. Behavior-driven evaluators might therefore be dead ends when considering [[what-are-concrete-evaluator-designs]].
Would the same apply to outcome-directed evaluators? The evaluator might only accept, verbatim, the outcomes humans specified as appropriate, which would be limiting relative to the agent's available capability. In image synthesis with GANs, the sheer size of human-image datasets means this issue doesn't show up that much compared to other failure modes like mode collapse.
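As a toy sketch of why a verbatim-only evaluator caps the agent at mimicry: if the evaluator rewards nothing but exact matches to human-specified outcomes, extra capability buys the agent nothing. The demonstration set and candidate outcomes below are entirely hypothetical, purely for illustration.

```python
# Toy sketch: an evaluator that only accepts outcomes humans specified
# word-for-word. The demonstration set and candidates are made up.

human_specified_outcomes = {
    "room is tidy",
    "report is summarized in one page",
}

def verbatim_evaluator(outcome: str) -> float:
    """Return 1.0 only for outcomes that match a human specification verbatim."""
    return 1.0 if outcome in human_specified_outcomes else 0.0

# A more capable agent can propose outcomes beyond the demonstrated set...
candidates = [
    "room is tidy",                           # verbatim copy of a demonstration
    "room is tidy and windows are cleaned",   # arguably better, but unseen
    "report is summarized in one paragraph",  # arguably better, but unseen
]

for c in candidates:
    print(f"{verbatim_evaluator(c):.1f}  {c}")

# Only the verbatim copy scores; the evaluator gives the agent no incentive
# to use any capability beyond reproducing human input.
```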