Hypothesis Subspace

How do abstraction inductors relate to concrete challenges in alignment?

Abstraction inductors aim to induce arbitrary representations in prosaic models by conditioning them to gradually encode certain ontologies over time. By forcing the concept "human values" to relate to other frozen concepts in specific ways (e.g. subsume happiness, justice, etc.), this technique might help force the model to form an accurate representation of a human normative framework as a prerequisite of projecting it onto the world. They aim to alleviate inner misalignment through a representational lens, as opposed to the dynamical one of latent resonators.

How do abstraction inductors relate to concrete challenges in alignment?