Multi-modal encoders enable cross-modal guidance in latent space

A multi-modal encoder embeds items from different modalities, such as texts and images, in a joint latent space. As a result, every primitive for latent space navigation can be enriched with cross-modal guidance. For instance, the semantic field of a text can yield images, and analogies that mix modalities can be resolved. When a specific manipulation is needed, each item can be stored in whichever representation suits it best.
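
To make the two examples concrete, here is a minimal sketch in Python with NumPy. It assumes the encoder has already been run: the 3-dimensional vectors below are hand-made stand-ins for the output of a joint text-image encoder, and the names `TEXT_EMBEDDINGS`, `IMAGE_EMBEDDINGS`, and `nearest_images` are hypothetical, introduced only for illustration. Cosine similarity is used as the notion of closeness, which is a common choice rather than something the note prescribes.

```python
import numpy as np

# Toy stand-ins for the output of a joint text-image encoder (assumption:
# a real encoder would produce much higher-dimensional vectors).
TEXT_EMBEDDINGS = {
    "dog": np.array([0.9, 0.1, 0.0]),
    "cat": np.array([0.1, 0.9, 0.0]),
}
IMAGE_EMBEDDINGS = {
    "dog.jpg": np.array([0.85, 0.15, 0.05]),
    "puppy.jpg": np.array([0.75, 0.15, 0.55]),
    "cat.jpg": np.array([0.15, 0.85, 0.05]),
    "kitten.jpg": np.array([0.15, 0.80, 0.55]),
}


def normalize(vector):
    """Scale a vector to unit length so dot products are cosine similarities."""
    return vector / np.linalg.norm(vector)


def nearest_images(query, k=1, exclude=()):
    """Return the names of the k images closest to `query` by cosine similarity."""
    q = normalize(query)
    scores = {
        name: float(normalize(vec) @ q)
        for name, vec in IMAGE_EMBEDDINGS.items()
        if name not in exclude
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Semantic field of a text, answered with images.
cat_field = nearest_images(TEXT_EMBEDDINGS["cat"], k=2)

# Multi-modal analogy: puppy.jpg is to "dog" as ? is to "cat".
analogy_query = (
    normalize(IMAGE_EMBEDDINGS["puppy.jpg"])
    - normalize(TEXT_EMBEDDINGS["dog"])
    + normalize(TEXT_EMBEDDINGS["cat"])
)
analogy_result = nearest_images(analogy_query, k=1, exclude=("puppy.jpg",))

print(cat_field, analogy_result)
```

With these toy vectors, the text query "cat" retrieves the cat and kitten images, and the analogy resolves to the kitten image. The query image is excluded from the analogy results, as is usual for vector analogies, so that the starting point cannot be returned as its own answer.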