diffusion magnetizes manifolds
In case you’ve been living under a rock for the past month, OpenAI’s DALL-E 2 can generate insanely high-quality images given a text description. In contrast to the system’s previous version, DALL-E 2 makes use of a process called diffusion. This entry is my attempt to develop some intuition as to why exactly diffusion appears so incredibly effective at producing meaningful images. I’ll describe a mental model of it which to me feels elegant, general, and quite explanatory. Along the way, we’ll encounter iron nails, the Grimm Brothers, and high-dimensional aircraft performing landing maneuvers.
Let’s lay the groundwork by introducing the space of all images with resolution 256x256. Assuming three color channels, this space has 256x256x3 dimensions. Each location in this high-dimensional space corresponds to exactly one image, one configuration of pixels. As it happens, vast swaths of this space correspond to utter trash: images which are not really meaningful to a human because they only contain noise. You can’t really make anything out of them, and they’re difficult to describe in words. It’s tempting to mentally group all those unintelligible images together and assume they form a single “noisy images” cluster in image space. In reality, noisy images make up the vast majority of the image space we defined. Let’s refer to non-noisy images – ones in which you can clearly discern a cat, a boat, a person, etc. – as meaningful images.
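To make the numbers concrete, here’s a tiny NumPy sketch: a 256x256 RGB image is just a point with 196,608 coordinates, and a point sampled at random from that space is almost surely pure noise.

```python
import numpy as np

# A 256x256 RGB image is a single point in a 196,608-dimensional space.
n_dims = 256 * 256 * 3
print(n_dims)  # 196608

# Sampling a point uniformly at random from this space almost surely
# yields an unintelligible image: no cat, no boat, just static.
rng = np.random.default_rng(0)
random_image = rng.uniform(0, 255, size=(256, 256, 3)).astype(np.uint8)
```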
Next, let’s invoke the manifold hypothesis. Just like a person’s body measurements are generally in proportion to each other and don’t span the entire space of 3D shapes, we assume that meaningful images in our 256x256x3-dimensional image space are clustered around a tiny region of lower dimensionality. In loose terms, we can say that meaningful images are located on a (relatively) low-dimensional surface in high-dimensional space, like a 2D sheet of paper underlying a 3D origami swan.
Next, let’s highlight the fact that there’s no hard cut-off between meaningful and noisy images in image space. You can take a picture of a cat, as you should always do, and sprinkle a bit of noise on it in Photoshop, as if you’d taken the picture with an older, grainier camera. The result is still a cute cat, but noisier. If you continue noising the picture, it’ll become hard to discern the cat among the noise at all. When exactly does it stop depicting a cat? Ask Sorites about his paradox. Point is, the distinction between meaningful and noisy images is fuzzy, gradual, progressive.
Okay, here is where diffusion comes in. We’ll describe it in dry terms before linking it to our image space by means of embodied metaphors. First, the people behind DALL-E collected approximately a gazillion images, probably by searching Common Crawl for image URLs paired with alt-text captions. Then, given an image, they added a bit of noise to obtain a slightly noisier, slightly corrupted version of the original. They kept adding a bit of noise at a time, obtaining a sequence of increasingly noisy versions of the original image. As you continue obliterating the original via multiple noising steps, you eventually lose any trace of what the image initially depicted, and you approach a distribution of pure noise in the limit.
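A minimal sketch of such a noising chain in NumPy – the noise scale and number of steps here are illustrative choices, not DALL-E 2’s actual schedule:

```python
import numpy as np

def noising_chain(image, n_steps=10, noise_scale=0.1, seed=0):
    """Produce a sequence of increasingly noisy copies of `image`.

    Each step shrinks the signal slightly and mixes in fresh Gaussian
    noise, so the chain drifts toward pure noise in the limit.
    """
    rng = np.random.default_rng(seed)
    x = image.astype(np.float64)
    chain = [x]
    for _ in range(n_steps):
        noise = rng.normal(0.0, 1.0, size=x.shape)
        x = np.sqrt(1 - noise_scale) * x + np.sqrt(noise_scale) * noise
        chain.append(x)
    return chain

cat = np.ones((4, 4, 3))   # stand-in for a cat photo
steps = noising_chain(cat)
print(len(steps))          # 11: the original plus 10 noisier versions
```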
Right, so what’s the actual benefit of incrementally corrupting a bunch of cat images? Part of the answer is that it appears possible to train a model to reverse the process. Given a somewhat noisy image of a cat, the model’s job is to produce the slightly less noisy version – a denoising step. The inputs and outputs are known because the authors generated them themselves as original + noise, so the model can be trained end-to-end in a supervised fashion (i.e. the less noisy target is known pixel by pixel). Each denoising step yields a slow but steady increase in the signal-to-noise ratio.
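Here’s a toy sketch of that supervised setup. A single linear layer stands in for the actual neural network, trained on (noisier, less noisy) pairs we generate ourselves – the sizes and noise levels are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
clean = rng.uniform(0, 1, size=(200, 12))            # 200 tiny "images"
less_noisy = clean + 0.1 * rng.normal(size=clean.shape)
noisier = less_noisy + 0.1 * rng.normal(size=clean.shape)

# Fit W so that noisier @ W approximates the less-noisy target:
# one denoising step, learned end-to-end from known input/output pairs.
W, *_ = np.linalg.lstsq(noisier, less_noisy, rcond=None)
pred = noisier @ W

# The prediction should sit closer to the target than the input does.
err_before = np.mean((noisier - less_noisy) ** 2)
err_after = np.mean((pred - less_noisy) ** 2)
print(err_after < err_before)  # True
```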
Even more interesting: because incrementally corrupting an image with noise eventually lands you in a known distribution of pure noise, you can actually start the denoising chain from that distribution. Generate an image where every pixel value is pure random noise, denoise it a dozen times, and you might reach quite a meaningful image in the process. We’ll leave for later the question of how you might guide the iterative denoising process towards a certain type of image (e.g. “an astronaut riding a horse in space”).
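The sampling loop itself can be sketched in a few lines. The `denoise_step` function here is a hypothetical stand-in for the trained model – in this toy version it just nudges every pixel towards a fixed gray target:

```python
import numpy as np

def sample(denoise_step, shape, n_steps=12, seed=0):
    """Start from pure noise and apply the denoiser repeatedly.

    `denoise_step` stands in for the trained model: it maps a noisy
    image to a slightly less noisy one. Run it enough times and the
    random starting point is pulled towards a meaningful image.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=shape)   # pure noise to begin with
    for _ in range(n_steps):
        x = denoise_step(x)
    return x

# Toy denoiser: nudge every pixel 20% of the way towards a gray "cat".
target = np.full((4, 4, 3), 0.5)
toy_step = lambda x: x + 0.2 * (target - x)

result = sample(toy_step, shape=(4, 4, 3))
```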
To recap, incrementally reversing the noising process allows us to obtain meaningful images from pure noise. This process happens to map quite nicely onto metaphors of image space, as follows. Say you start with a cat image, linked to a certain location in image space. When you add noise, you move a bit in that space; you take a random tiny step in some direction. After completing it, the original still shines through, but it’s a tiny bit different. As you repeat the noising process, you keep taking random drunkard’s steps around image space. At the end, you always end up somewhere, but that somewhere can be anywhere, because almost all of image space is noisy. However, on this incremental journey from the original image to utter trash, you pass through a sequence of intermediate locations denoting images in between meaningful and noisy.
The denoising process then translates to learning to follow those high-dimensional breadcrumbs back home to the manifold of meaningful images. As the breadcrumb trails can stretch towards any place in image space, you can find your way back from any starting point. In more mathy terms, diffusion is essentially used to scaffold a vector field which nudges you towards the meaningful manifold by means of flows and currents. That field is erected by randomly stumbling outwards from the manifold, and training a model to reverse the steps. The model being trained implements this field, nudging meaningful-images-in-the-making closer and closer to the meaningful surface. Its role is to turn the manifold into an attractor in the space of images, so that even pure noise is tugged towards the cute cats empire, as it naturally should.
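A toy 2D version makes the attractor idea tangible: take the unit circle as our “meaningful manifold” and hand-write the vector field that a diffusion model would normally have to learn from data.

```python
import numpy as np

def field_step(points, step=0.3):
    """One nudge of a hand-built vector field whose attractor is the
    unit circle -- a 1D 'manifold of meaningful images' inside 2D space.
    (A trained model learns such a field; here we write it by hand.)
    """
    radii = np.linalg.norm(points, axis=1, keepdims=True)
    directions = points / np.maximum(radii, 1e-9)
    # Pull each point towards the nearest spot at radius 1.
    return points + step * (directions - points)

rng = np.random.default_rng(2)
pts = rng.normal(0.0, 2.0, size=(5, 2))   # scattered "noisy images"
for _ in range(30):
    pts = field_step(pts)

# Every point ends up (approximately) on the circle.
print(np.linalg.norm(pts, axis=1))
```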
This systematic Hansel-and-Greteling out of the manifold and back in, by means of the vector field implemented by a model like DALL-E 2, feels a bit like temporarily magnetizing an iron nail. Initially, the nail just sits there in space, doing what all good iron nails do, before you influence it in a way which turns it into an attractor. Apparently, there are a few ways of doing that, one of which is rubbing it with a permanent magnet. This doesn’t map neatly onto the expanding breadcrumbs, but it still highlights the attractor-ification of an object in space, which afterwards attracts every ferromagnetic item in its vicinity. The iron nail takes the role of the meaningful manifold, while diffusion is related in outcome to the magnetization process itself.
This is growing into a satisfying framing of what’s behind those insanely high-quality images, but we’re still missing a crucial element in our metaphor. In our magnetized iron nail + breadcrumb following model, we lack a way of guiding movement towards certain types of images. It has been conjectured that there exist non-cat images on the Internet, so we’d like to be able to condition this image generation process (more like image search, really) towards images which fit a certain textual description. That’s an important feature of those models: you don’t just get an arbitrary meaningful image; you get one which fits the caption.
To tweak this magnets + breadcrumbs mental model to account for guided diffusion and conditional image generation, it’s helpful to think for a second about airplanes. You’re casually cruising at I don’t know how many kilometers altitude in I don’t know what level of the atmosphere, when the captain suddenly announces that you’ll soon be starting your descent towards Schiphol in Amsterdam. You slide the window thing up, buckle up, and reflect on what your ancestors might have deemed god-like technology just a few centuries ago. The aircraft is maneuvering through the 3D atmosphere as it approaches the curved(!) 2D manifold of the Earth’s surface. Assuming the captain is sane, she’ll want to maneuver her way towards a certain location on this surface, instead of just coming in contact with it anywhere. It would be nice to land at the airport, though Earth also exerts a more general pull through gravity.
Sketch of the three complementary framings introduced in this article. The bottom graph depicts the magnetization of the manifold via the inward black arrows. The top-right zooms in on the breadcrumb expansion. The top-left one depicts the manifold coordinates.
What does the captain therefore do? She has (at least) a set of 2D coordinates for the destination airport, together with advanced systems which help her navigate towards it, plot courses, adjust velocities, etc. So she uses those to maneuver in 3D space and course-correct as she approaches the destination. Okay, now over on the image generation side, the caption – the text description of the desired image – is converted into a set of coordinates using an auxiliary model termed CLIP (plus the prior). Those coordinates specify roughly where on the manifold of meaningful images we want to land. Importantly, the CLIP coordinates differ from our initial crude 256x256x3 ones. The original ones describe the high-dimensional image space, while the new CLIP ones roughly describe the relatively low-dimensional manifold itself. In this framing, the iterative denoising process conditioned on caption text is a bit like following a course towards a certain location on the meaningful manifold, where tiny errors can be course-corrected and adjustments can be made incrementally as we approach the destination.
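As a toy illustration of conditioning: suppose captions get mapped to coordinates on a unit circle (a stand-in for the manifold). A hand-written conditional field then pulls the same starting noise towards different destinations depending on the caption. This is a sketch of the idea only, not DALL-E 2’s actual mechanism – the angles and captions below are made up:

```python
import numpy as np

def guided_step(point, angle, step=0.3):
    """One step of a hand-built *conditional* vector field: the caption
    has already been turned into manifold coordinates (here, an angle
    on the unit circle), and the field pulls towards that specific spot
    rather than just the nearest point on the manifold.
    """
    destination = np.array([np.cos(angle), np.sin(angle)])
    return point + step * (destination - point)

# Hypothetical captions mapped to different manifold coordinates.
cat_angle, boat_angle = 0.5, 2.5

x = np.array([3.0, -2.0])   # the same pure-noise starting point...
y = x.copy()
for _ in range(30):
    x = guided_step(x, cat_angle)
    y = guided_step(y, boat_angle)
# ...lands at different spots on the manifold for different captions.
```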
Back to mathy terms, we could argue that not only is a vector field erected to turn the manifold into an attractor, but that field itself is parametrized by the caption text. It reconfigures its dynamics so as to point to one region of the manifold or another. When considering text destinations which have not appeared in the training data, having information about the manifold as a whole remains crucial, as the field needs to point to a novel region that is still on the manifold. In an extreme case of overfitting where you have only two training samples, a dog and a cat, and you want to generate a picture containing both pets at once, the model will have no idea where that might be, because it has no information about the manifold’s structure. This framing also explains how you can get multiple variants of the same image: approach the same destination from different starting locations while riding the same vector field, and you’ll inevitably reach slightly different outcomes.
Those complementary mental models might lead to the following ideas. Could we apply diffusion to text by scaling the denoising of BERT’s masked language modeling objective to multiple steps? Can we similarly “magnetize” a region of language model dynamics which is human-compatible? What happens if you add momentum to the meaningful-images-in-the-making, extending the repertoire of physical forces exerted on it? If you denoise multiple images together, does it help to incentivize them to coordinate in boid-like movements? And so on…
Together, I think the breadcrumbs, the nail, and the aircraft metaphors each capture a tiny bit of the inner workings of diffusion. Or, at the very least, they illustrate our desperate grasping for intuitive explanations of intelligent systems.