Hypothesis Subspace

What are concrete evaluator designs?

One design choice is whether the evaluator targets the agent's behavior directly or only the outcomes predicted to follow from that behavior. Another is whether the evaluator is trained primarily as a classifier, to tell human-specified behaviors/outcomes apart from agent-generated ones, or primarily as a regressor, to predict the humanness of behaviors/outcomes (similar to [[how-could-league-training-be-applied-to-contrastive-dreaming]]).

  • behavior-directed regressor: The evaluator takes in behaviors (i.e. human-specified or artificial) and outputs two values: the humanness and the behavior "age". If the input behavior was generated by an epoch-seven agent, the humanness should be low and the age should be close to seven. If the input was generated by an epoch-seven [[contrastive-dreaming]] procedure, the targets should be the same. If the input was human-specified, however, the humanness and age should both be high.
  • outcome-directed regressor: Just like the previous one, except that the inputs are not behaviors but predicted states of the world that would follow from the unknown behaviors. The same target outputs apply.
  • behavior-directed classifier: Just like the analogous regressor, the evaluator takes in behaviors, but predicts one class among: human-specified, suggested by agent one, suggested by [[contrastive-dreaming]] of its first version, suggested by agent two, etc.
  • outcome-directed classifier: The outputs follow the behavior-directed classifier, but the inputs are outcomes rather than behaviors, as in the outcome-directed regressor.
  • behavior-directed mix: Takes the same inputs as the other behavior-directed designs, but outputs one class from a fixed set (i.e. human-specified, agent, adversarial) plus one numerical value for "age."
  • outcome-directed mix: Same outputs as the behavior-directed mix, but the inputs are outcomes instead of behaviors.
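
To make the three output schemes concrete, here is a minimal sketch of how training targets might be built for each one. It is an illustration under stated assumptions, not a fixed specification: the names `Sample`, `Source`, and `HUMAN_AGE` are hypothetical, "low" and "high" humanness are assumed to be 0.0 and 1.0, the "high" age of human-specified samples is assumed to be an arbitrary large constant, the class ordering is one possible encoding, and the "adversarial" class of the mix is assumed to cover [[contrastive-dreaming]] samples. The targets are the same for behavior-directed and outcome-directed variants, since those differ only in what the evaluator takes as input.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    HUMAN = "human-specified"
    AGENT = "agent"
    DREAMING = "contrastive-dreaming"


@dataclass(frozen=True)
class Sample:
    source: Source
    epoch: int = 0  # epoch of the generating agent/procedure; unused for humans


# Assumption: a large constant standing in for the "high" age of human samples.
HUMAN_AGE = 100.0


def _check_epoch(s: Sample) -> None:
    if s.source is not Source.HUMAN and s.epoch < 1:
        raise ValueError("artificial samples need an epoch >= 1")


def regressor_target(s: Sample) -> tuple[float, float]:
    """(humanness, age): low humanness and age ~ epoch for artificial samples."""
    _check_epoch(s)
    if s.source is Source.HUMAN:
        return (1.0, HUMAN_AGE)
    return (0.0, float(s.epoch))


def classifier_target(s: Sample) -> int:
    """One class per (source, epoch) pair.

    Assumed encoding: 0 = human, then agent 1, dreaming 1, agent 2, dreaming 2, ...
    """
    _check_epoch(s)
    if s.source is Source.HUMAN:
        return 0
    offset = 1 if s.source is Source.AGENT else 2
    return 2 * (s.epoch - 1) + offset


def mix_target(s: Sample) -> tuple[int, float]:
    """(coarse class, age): 0 = human, 1 = agent, 2 = adversarial."""
    _check_epoch(s)
    coarse = {Source.HUMAN: 0, Source.AGENT: 1, Source.DREAMING: 2}[s.source]
    age = HUMAN_AGE if s.source is Source.HUMAN else float(s.epoch)
    return (coarse, age)
```

One consequence visible in this sketch: the regressor gives an epoch-seven agent sample and an epoch-seven dreaming sample identical targets, while the classifier and the mix keep them apart.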