Hypothesis Subspace

Bridger Languages

Interpretability tools generally assume that humans receive no prior training before dissecting a model's internal representations. What if we instead allowed humans to be explicitly trained in a bridging language that the model itself employs? This might require applying principles of cognitive ergonomics to a bottleneck layer (e.g. sparsity, local structure, discreteness, Gestalt-aware symbols), so that humans could become fluent in the bridging language. A nested lattice of quantized "logograms" might be a fitting representation, and young children might be particularly well suited to bridging the two modes of thought. Pairs of artifacts in familiar modalities (e.g. text, images, videos) and their translations could form the basis of the learning process, together with any syntax discovered in the emerging language.
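To make the idea of a sparse, discrete bottleneck slightly more concrete, here is a minimal sketch of one way a "nested lattice" of quantized symbols could be read. It is only an illustration under stated assumptions, not a proposed design: the function names (`sparsify`, `nested_quantize`, `decode`), the grid steps, and the toy activation vector are all hypothetical. Each coordinate is first snapped to a coarse grid, and the leftover residual is then snapped to a finer grid, so every level yields a small list of integers that could in principle be rendered as one layer of a logogram.

```python
from typing import List


def sparsify(vec: List[float], k: int) -> List[float]:
    """Keep the k largest-magnitude entries and zero out the rest (sparsity)."""
    keep = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]


def nested_quantize(vec: List[float], coarse_step: float = 1.0,
                    levels: int = 2, refine: int = 4) -> List[List[int]]:
    """Quantize on a coarse grid, then quantize the residual on finer grids.

    Returns one list of integer codes per level (hypothetical "logogram layers").
    """
    codes = []
    residual = list(vec)
    step = coarse_step
    for _ in range(levels):
        level_code = [round(r / step) for r in residual]
        residual = [r - c * step for r, c in zip(residual, level_code)]
        codes.append(level_code)
        step /= refine  # each level uses a grid `refine` times finer
    return codes


def decode(codes: List[List[int]], coarse_step: float = 1.0,
           refine: int = 4) -> List[float]:
    """Reconstruct an approximate vector from the per-level integer codes."""
    out = [0.0] * len(codes[0])
    step = coarse_step
    for level_code in codes:
        out = [o + c * step for o, c in zip(out, level_code)]
        step /= refine
    return out


# Toy bottleneck activation (made-up numbers, purely illustrative).
activation = [0.9, -0.05, 2.3, 0.1]
sparse = sparsify(activation, k=2)
codes = nested_quantize(sparse)
print(codes)          # [[1, 0, 2, 0], [0, 0, 1, 0]]
print(decode(codes))  # [1.0, 0.0, 2.25, 0.0]
```

In this reading, the coarse level carries the gist of the representation and the finer level adds detail, which loosely mirrors the nested structure a human learner might pick up in stages. How such codes would be rendered as symbols a person can actually read remains open.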

[[how-do-bridger-languages-relate-to-concrete-challenges-in-alignment]]
[[how-does-the-best-case-scenario-for-bridger-languages-sound-like]]
