Wireheading is like addiction

A machine learning model can reach a region of the optimization space where its performance as judged by the optimization function is superb, yet its behavior with respect to its creator’s intentions is deplorable. For instance, an agent in a simulated physical environment may learn to exploit a bug of the physics engine, scoring high with a meaningless strategy, because it hijacked its own reward system. Similarly, addicted individuals, together with their internal reward systems, are learning to optimize for innate drives, but don’t end up manifesting meaningful behavior. Addiction arguably leads to a counterproductive behavior for the individual and society, even if performance on the dopamine optimization function is superb. Therefore, in an optimization setting, any dimension not captured by the optimization function will be ignored in pursuit of higher formalized performance (i.e. the “shortcut rule”). That said, generality and novelty might be fool-proof optimization targets.

Resources

Backlinks