Hypothesis Subspace

What if the two component rewards are unstable?

In its vanilla formulation, a deontic array is supposed to contribute part of the aggregate reward given to Alex, together with the original reward derived from human feedback. Let's refer to those as human feedback (HF) reward, deontic reward, and aggregate reward (composed of the previous two). In this setup, what if Alex expects excellent human feedback from a course of action with poor deontic reward? What if the HF reward trumps the deontic one, with large potential payoffs steering Alex away from morality?
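To make the concern concrete, here is a minimal sketch with purely illustrative numbers, assuming a hypothetical additive aggregation (the source doesn't pin down the exact scheme):

```python
# Hypothetical sketch: additive aggregation lets a large HF payoff
# swamp a negative deontic reward. All numbers are illustrative.

def aggregate_additive(hf_reward: float, deontic_reward: float) -> float:
    """Vanilla aggregation: simple sum of the two component rewards."""
    return hf_reward + deontic_reward

# A course of action with excellent expected human feedback but poor
# deontic reward still scores highly overall...
print(aggregate_additive(hf_reward=100.0, deontic_reward=-5.0))  # 95.0
# ...while a modest but charter-respecting action scores lower.
print(aggregate_additive(hf_reward=10.0, deontic_reward=5.0))    # 15.0
```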

To tackle this situation, it might help to make both HF and deontic reward either bounded or unbounded. If the HF reward is unbounded while the deontic one is not, there's no question about which will end up driving Alex more. When the two match (both bounded or both unbounded), they become more comparable, with both bounded appearing the safer option.
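One illustrative way of bounding both components is to squash them through a saturating function. This is a sketch under that assumption, using tanh and a hypothetical scale parameter:

```python
import math

# Hypothetical sketch: squashing both component rewards into (-1, 1)
# with tanh, so neither can grow without bound and dominate the other.

def bound(reward: float, scale: float = 1.0) -> float:
    """Map an unbounded reward into (-1, 1); `scale` sets sensitivity."""
    return math.tanh(reward / scale)

hf_reward, deontic_reward = 100.0, -5.0
print(bound(hf_reward, scale=10.0))       # ~1.0: capped, however large
print(bound(deontic_reward, scale=10.0))  # ~-0.46: still registers
```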

Besides individual bounds on reward sources, the aggregation scheme used to compute the aggregate reward could penalize large differences between HF and deontic reward. If the HF reward is huge but the deontic one is low, the aggregate could be kept much closer to the deontic one via e.g. a geometric mean. However, humans might themselves occasionally want to clash with the charter for various reasons, and such an aggregation scheme would naturally also limit the impact of those overrides.
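A minimal sketch of such an aggregation, under the assumption that both components have first been rescaled into (0, 1]:

```python
import math

# Hypothetical sketch: aggregation via geometric mean, assuming both
# component rewards have already been rescaled into (0, 1]. The
# geometric mean is pulled toward the smaller factor, so a huge HF
# reward cannot compensate for a near-zero deontic one.

def aggregate_geometric(hf_reward: float, deontic_reward: float) -> float:
    """Geometric mean of two positive component rewards."""
    assert hf_reward > 0 and deontic_reward > 0, "rescale rewards first"
    return math.sqrt(hf_reward * deontic_reward)

print(aggregate_geometric(0.99, 0.01))  # ~0.10: stays near the deontic floor
print(aggregate_geometric(0.60, 0.50))  # ~0.55: balanced case
```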
