Hypothesis Subspace

What if interlingua is an attack vector?

If humans would be familiar and exposed to the ML model's internal language, it might be the case that the ML model considers influencing humans through this channel, in an attempt to maneuver itself into a position which allows it to reliably obtain high reward. However, if the ML model would consider this, this very thought should be legible to people, as a tiny control lever.

Anyway, assuming oversight is not applied to that particular thought pattern due to limited human resources, can the model think manipulative thoughts to fool its mind-readers? Given the fact the entirety of the thought patterns would be required to pass through the bottleneck language identified with the interlingua, there is no other place from which the ML model would orchestrate this deceptive plan. However, it might decide to act in a given way, and then intentionally forget the reason for that so that humans wouldn't be able to pick up those intentions through the interlingua.

Additionally, it might be that even if humans are aware of human-incompatible plans put together by the ML model, they would be powerless in tackling them at that point. The model wouldn't really care about using the interlingua as an attack vector, because that wouldn't help push its agenda.

What if interlingua is an attack vector?