Hypothesis Subspace

What if Alex spreads out its deception dynamics to avoid detection?

While our understanding of such dynamics is in its infancy (e.g. how crisp are they even?), one plausible failure mode of using latent resonators here would be that Alex spreads out its internal pattern of deception into distinct specialized skills. It might intentionally conceive of deception in question answering on a specific topic in a fundamentally different way than in question answering on a different topic.

While the original attempt to isolate a general deceptive dynamic might fail because of no "constructive interference" across situations, all is not lost. First, the very fact that Alex invests representational resources on different flavors of deception might be detected. The formalism underlying latent resonators has successfully been used to quantify the "available storage space" for novel dynamics in RNNs across a few toy problems. Similar techniques might help us keep tabs on Alex's investments in representational resources for its different patterns.

Additionally, if deception flavors are somehow successfully isolated despite there being multiple of them, Magma researchers can take the disjunction of those nasty variants and negate that.