21.35 YRS

the bittersweet lesson

The bitter lesson, stated by Richard Sutton in his article of the same name, refers to the observation that simple models trained on a lot of data often outperform complex models trained on limited data. This teaching is particularly bitter for researchers who spend years refining ML architectures on theoretical grounds, because even a modest increase in the amount of training data can bring the simpler initial approach back to parity. There’s simply not much return on investment in the majority of algorithmic tweaks, given that you can predictably improve performance by scaling up datasets alone.
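To make the tradeoff concrete, here’s a minimal toy sketch in scikit-learn. The dataset, models, and data budgets are my own illustration, not anything from Sutton’s article, and whether the simple model actually wins depends on the task; the point is just the shape of the comparison the lesson is about.

```python
# Toy illustration, not a rigorous experiment: compare a "complex" model
# on a small data budget against a "simple" model on 10x the data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=60_000, n_features=40,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=10_000, random_state=0)

# "Complex" model, limited data.
fancy = MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=50,
                      random_state=0)
fancy.fit(X_train[:2_000], y_train[:2_000])

# "Simple" model, 10x the data.
simple = LogisticRegression(max_iter=1_000)
simple.fit(X_train[:20_000], y_train[:20_000])

print("fancy,  2k samples:", fancy.score(X_test, y_test))
print("simple, 20k samples:", simple.score(X_test, y_test))
```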

Before putting a spin on Sutton’s teaching, it’s important to keep in mind that the algorithms and architectures used today throughout ML are still way better across most metrics than they were a few decades ago. Even if the popular rhetoric of “cheaper compute and more data enabled the latest advancements in AI” is true, you simply wouldn’t be able to bring an SVM to GPT-3 performance, or outperform AlphaGo with alpha-beta pruning alone, regardless of your compute budget and dataset size. It’s been argued that the particular algorithmic improvements which have grown into industry standards are precisely those which extend the effective compute regime of models to larger and larger datasets. They enable models to continue learning even after accumulating mountains of experience, rather than making better use of a fixed amount of data.

Coming back to Sutton’s lesson, I’d argue that the same reasons which make it bitter also give it a sweet aftertaste. The fact that simple algorithms plus generous amounts of data often yield state-of-the-art models means that it’s quite straightforward to produce meaningful work. I’ve often read claims along the lines of “our contribution advances the state of the art on N different benchmarks while also being conceptually simpler.” Funnily enough, “transformers go brr…” seems to go a long way, despite being no better theoretically grounded than older models.

For instance, Sutton himself is a huge name in RL, a field dominated for many years by careful theory-building around Markov decision processes, game theory, multi-agent systems, and so on. Then you suddenly see work like the decision transformer, which casually predicts a suitable next action in a game the same way GPT-3 predicts a suitable next word, after being trained only on past experience. Wait, what? Of course it’s bitter for the avid GOFAI believer, but it’s also sweet for people whose personal worth is not built on staying true to the axioms of past intellectual traditions. It seems that you can do a lot with a handful of simple ideas and an ingenious use of data.
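To see how little machinery this takes, here’s a heavily simplified sketch of the decision transformer idea in PyTorch. The dimensions, hyperparameters, and toy data are mine, and I’m leaving out details like timestep embeddings; the point is just that action prediction reduces to next-token prediction over (return-to-go, state, action) sequences.

```python
import torch
import torch.nn as nn

class TinyDecisionTransformer(nn.Module):
    def __init__(self, state_dim, n_actions, d_model=64, n_layers=2):
        super().__init__()
        # Each timestep becomes three tokens: return-to-go, state, action.
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Embedding(n_actions, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.to_action = nn.Linear(d_model, n_actions)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T)
        B, T = actions.shape
        tokens = torch.stack([self.embed_rtg(rtg),
                              self.embed_state(states),
                              self.embed_action(actions)],
                             dim=2).reshape(B, 3 * T, -1)  # ...r_t, s_t, a_t...
        # Causal mask: each token only attends to tokens before it.
        mask = torch.triu(torch.full((3 * T, 3 * T), float("-inf")), diagonal=1)
        h = self.backbone(tokens, mask=mask)
        # Predict each action from the hidden state of its *state* token.
        return self.to_action(h[:, 1::3, :])  # (B, T, n_actions)

# Training is plain next-token prediction over logged experience.
model = TinyDecisionTransformer(state_dim=4, n_actions=2)
rtg, states = torch.randn(8, 10, 1), torch.randn(8, 10, 4)
actions = torch.randint(0, 2, (8, 10))
logits = model(rtg, states, actions)
loss = nn.functional.cross_entropy(logits.reshape(-1, 2), actions.reshape(-1))
```

At inference time you condition on a high target return and roll the model forward, which is exactly the “prompting” trick transplanted from language modeling.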

Speaking of ingenious use of data, consider this other surprising example, which almost feels unfair. The task is now to predict the commonsense consequence of an action (e.g. X starts running, so X gets in shape). Researchers hand-wrote a handful of samples and then used prompt engineering to get GPT-3 to generate new ones from scratch: orders of magnitude more commonsense inferences. Then they trained a separate model on this synthetic data, so that it learned to solve the task. After a few extra tricks, the resulting model outperformed a baseline trained directly on a large human-written dataset. Wait, what? You don’t even need a lot of human-written data? You can just casually generate it using a general-purpose language model? What is this sorcery? I’m not sure whether I should be annoyed or excited: the bittersweet lesson at play.
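Mechanically, the generation step is just a few-shot prompt in a loop. A sketch of its shape, where the prompt wording and the `generate` stub are my own placeholders (the actual paper used GPT-3 with carefully engineered prompts, at much larger scale):

```python
import random

# Hand-written seed examples of the "event -> consequence" pattern.
SEED_EXAMPLES = [
    ("X starts running", "X gets in shape"),
    ("X skips breakfast", "X gets hungry before noon"),
    ("X studies all night", "X feels tired the next day"),
]

def build_prompt(shots, new_event):
    lines = [f"Event: {e}\nConsequence: {c}\n" for e, c in shots]
    lines.append(f"Event: {new_event}\nConsequence:")
    return "\n".join(lines)

def generate(prompt: str) -> str:
    """Placeholder for a completion call to a large language model
    (GPT-3 in the paper). Swap in your provider's API here."""
    raise NotImplementedError

def distill(events, n_shots=3):
    corpus = []
    for event in events:
        shots = random.sample(SEED_EXAMPLES, n_shots)
        consequence = generate(build_prompt(shots, event)).strip()
        corpus.append({"event": event, "consequence": consequence})
    return corpus  # synthetic training data for a smaller student model
```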

But let me expand on the few extra tricks used in that paper. The authors noticed that training on the raw synthetic data generated through prompt engineering didn’t immediately outperform the baseline trained entirely on human-written data. What then? They trained a “critic” model on human judgements of whether a given commonsense inference makes sense. People didn’t write new data points; they simply recognized the desired pattern and encouraged it, echoing the generator-discriminator asymmetry. After training this critic to judge samples, they used it to filter the previously-generated data, leaving out nonsensical inferences and increasing its signal-to-noise ratio. This curated synthetic data turned out to cut it: it yielded a state-of-the-art model on the commonsense inference task. Creative use (and generation) of data means the bar for building meaningful models is lower than ever, an exciting Wild West.
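In code, the curation step has roughly this shape. This is a toy stand-in: the actual critic was a fine-tuned pretrained model rather than a bag-of-words classifier, and the filter strength was tuned, not picked arbitrarily.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Human judgements: does this generated inference make sense? (1 = yes)
judged = [
    ("X starts running -> X gets in shape", 1),
    ("X skips breakfast -> X gets hungry before noon", 1),
    ("X starts running -> X becomes a fish", 0),
    ("X skips breakfast -> X wins the lottery", 0),
]
texts, labels = zip(*judged)

# The critic only has to *recognize* good inferences, not write them.
critic = make_pipeline(TfidfVectorizer(), LogisticRegression())
critic.fit(texts, labels)

def curate(synthetic_texts, threshold=0.5):
    """Keep only the synthetic samples the critic judges as sensible."""
    probs = critic.predict_proba(synthetic_texts)[:, 1]
    return [t for t, p in zip(synthetic_texts, probs) if p >= threshold]
```

Raising the threshold trades dataset size for purity, which is the signal-to-noise knob described above.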

However, one might ask: is it really sane to go all-in on opaque models built on top of a mish-mash of artisanal tricks? Scott argues that even if “wise women” making use of traditional herbs were probably more effective than early Hippocratic doctors who relied on questionable theories of the four humors, the systematic pursuit of knowledge outperforms intuition in the long run. Similarly, we’ve managed to bake highly effective intuitions into recent ML models, yet we don’t really understand how they work, and therefore have a hard time building on top of their internalized understanding. We might achieve great feats by deploying those “hunch machines” into production in the short term, but theory-building over decades and centuries might incrementally get us further in the end. The bitter lesson might again leave a sweet aftertaste.