
21.31 YRS

wielding language models

I’m excited to share my most detailed plans yet for upskilling in ML engineering and AI safety during my gap year. I put together a self-guided curriculum organized by tracks and modules, designed to integrate structure into the learning process. At this point, I can’t wait to wrap up my bachelor’s degree so I can finally get on with it in full force. It’ll be a heck of a journey, and I’ll try to share both ups and downs with you through the blog.


Last week, I attended a recurring session of an ML reading group which focused on OpenAI’s InstructGPT paper (and accompanying blog post). If I remember correctly, this custom version of GPT-3 has been available for inference via their API for quite a while now, but the paper detailing the process behind it was only published recently. InstructGPT is arguably based on three different approaches to guiding language models in their generation of text, working in tandem. This led to quite a bit of confusion, both on my side and among other members of the reading group. In this entry, I want to synthesize the main methods for guiding language models I’m aware of, iron out some distinctions, and hint at the potential of using several of them at the same time.

This sketch of a sketch of a survey focuses on techniques for controlling text generation in an open-ended way. For instance, prompt engineering allows you to specify a virtually infinite number of behaviors, while tinkering with the softmax temperature only gives you a narrow linear slider ranging from “predictable and calculated speech” to “chaotic and crazy speech.” I’ll only list approaches similar in scope to the former – complete control panels which can individually define an endless number of qualitatively unique configurations. Not a set of knobs, but a set of entire dashboards. I’m struggling to find a better analogy, because things like levers, knobs, and sliders all imply unidimensional control, and therefore fail to capture the individual richness of a full-blown control panel. Feel free to ping me if you have a better analogy, though.

As a final observation before diving into the different approaches: in my experience, the first two account for perhaps 95% of industry deployments. The approaches in the second half are extremely underrated, especially because they can often complement the first two through juicy interactions – each method listed below attacks language modeling from a different angle, resulting in low interference between them. As I move towards publishable research (a polished version of the oneironomicon report being on its way to a conference, a first for me), I expect language models and unlikely ways of steering them to continue being central to my work. Onto the listicle!

Prompt engineering is the most straightforward control technique, and its simplicity is arguably the reason behind its widespread adoption. Because language models have been trained to complete a given prompt in a likely (or rather not unlikely) way, tweaking the prompt directly influences the completion. This makes it incredibly easy to momentarily teach a language model a certain behavior, to nudge it into following a certain pattern – just specify a few examples or describe the task, and it’ll do its best to come up with a plausible completion. It doesn’t affect the weights in any way, as it’s only an inference-level or “runtime” control technique. The incredibly low barrier to entry makes it a promising choice for customizable assistants, as seen in dual.
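To make this concrete, here’s a minimal sketch with a small open model (GPT-2 standing in for GPT-3) and a made-up few-shot antonym task – the tiny model won’t be nearly as reliable, but the mechanics are identical:

```python
# A minimal sketch of prompt engineering, assuming a small open model (GPT-2)
# standing in for GPT-3. The few-shot antonym task below is made up; the tiny
# model won't be nearly as reliable, but the mechanics are identical.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# The entire "control" lives in the prompt: a few examples of the pattern
# we want the model to continue. The weights are never touched.
prompt = (
    "Give the opposite of each word.\n"
    "hot -> cold\n"
    "big -> small\n"
    "happy ->"
)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=3,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```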

I like to think of it as the practice of ghosting in drawing. Before making a mark, you move your hand a few times just above the paper in the intended direction. In a sense, you build a fleeting representation of movement in your “short-term muscle memory,” which increases the odds of making a fluid and accurate mark. You’re more fluent in following the specific pattern for a few moments, but this local increase in skill doesn’t last long. What’s more, there’s no free lunch involved – a short-lived spike in your ability to make a mark at that angle doesn’t generalize to all other possible patterns; you have to patiently familiarize yourself with each mark before going for it.

Fine-tuning usually refers to training the language model a bit more on a smaller dataset which exhibits a more specific pattern than the extremely general “predict what’s written next on this web page” one. The hope is that the language model will retain most of its ability to write grammatical and fluent text, but will significantly improve its performance on the patterns captured in the fine-tuning dataset (e.g. question answering, summarization, etc.). The way people avoid breaking the pretrained model’s fluency during fine-tuning is by making the training more gentle or subtle. You could, for instance, have a warm-up phase which scales the effective learning rate according to a schedule, so that the first, more specific training batches won’t cause any major disruptions across weights. Additionally, fine-tuning for far fewer epochs than the initial training phase is helpful both because (1) it limits the new specific patterns to only producing slightly new flavors of the initial model, and (2) the new dataset is often way smaller than the initial one, and hence prone to overfitting.
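For concreteness, here’s a rough sketch of what such a gentle fine-tuning run could look like using the Hugging Face Trainer – the toy question-answering examples and the hyperparameters below are placeholders rather than a recipe:

```python
# A rough sketch of "gentle" fine-tuning: low learning rate, a warm-up phase,
# and very few epochs. The two question-answering examples are a stand-in for
# a real fine-tuning dataset.
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical instances of the specific pattern we want to strengthen.
texts = ["Q: What is the capital of France?\nA: Paris",
         "Q: What is the capital of Japan?\nA: Tokyo"]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda row: tokenizer(row["text"], truncation=True,
                          padding="max_length", max_length=32))

args = TrainingArguments(
    output_dir="finetuned-gpt2",
    num_train_epochs=3,               # far fewer passes than pretraining
    learning_rate=5e-5,               # small steps to avoid wrecking fluency
    warmup_steps=10,                  # scale the learning rate up gradually
    per_device_train_batch_size=2,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```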

An important constraint of this approach to guiding language models is that you need to have many instances of the desired behavior at your disposal. You teach the model solely by example, a bit like taking a graduate (i.e. pretrained?) student through a course which consists only of reproducing proofs or arguments line by line. If you can only provide sparser feedback (e.g. “I gave you an 8/10 for that essay.”) which is disconnected from a body of example instances, fine-tuning is not an option – you simply don’t have enough information to run the training procedure in the same way. Fortunately, the toolkit of control techniques we have at our disposal is more diverse than most think.

Before moving on, note that any approach which changes the model’s weights after a pretraining phase can in theory be called fine-tuning (including the next one). However, I’m pretty sure that in industry, fine-tuning is currently almost synonymous with training a bit more on a custom dataset, so that’s the narrower sense I’m sticking with in this entry. In papers, however, I’ve often seen “supervised” fine-tuning used as a more specific concept handle to help differentiate the approaches.

The third approach allows you to nudge a model into behaving in a certain way without requiring a custom dataset containing instances of the pattern. You only need to be able to recognize the behavior you’re looking for, and reward it accordingly, so that the model is incentivized to adapt itself to the implicit objective. It’s been argued that judging an artifact is often simpler than generating it. Heck, just compare the number of recent vaccines with the number of opinions about vaccines. For a less polarizing example, consider the difference between evaluating a pro player’s performance in a game (e.g. they captured all the opponent’s pieces, they took down all the opponent’s pylons, they won, they did all that in record time) and gaining a pro player’s skill – seconds versus years.

The most popular instance of this approach in language modeling is arguably reinforcement learning from human feedback (RLHF). It’s as if the language model is playing a game: what action (i.e. token) should I choose next in this situation (i.e. prompt), so as to get many points (i.e. human valuation)? The probability distribution over available tokens specifies the “behavioral” policy implemented by the language-model-turned-RL-agent – quite a neat idea! Similar to how AlphaGo might be rewarded for a winning sequence of actions through reward which trickles back to critical decisions, GPT-3 can be rewarded for a sequence of actions which leads to a human annotator finding its results compelling. To emphasize the distinction from the previous approach, it’s not the case that the agent is trained to predict upcoming actions – those are unknown; there’s no ground truth available. Rather, a sequence of choices is rewarded post-hoc after being recognized as a good strategy by a discriminator. In RLHF, the discriminator is specifically a human annotator, but it need not be – it can be yet another ML model.
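To make the shape of this more tangible, here’s a toy REINFORCE-style sketch of the core idea, with a trivial hard-coded reward standing in for human feedback – real RLHF setups like the one behind InstructGPT use PPO, a learned reward model, and a KL penalty against the original model, none of which appear here:

```python
# A toy REINFORCE-style sketch of the idea behind RLHF. The reward function is
# a hard-coded stand-in for a human annotator or learned reward model; real
# setups (e.g. PPO in InstructGPT) add clipping, a KL penalty against the
# pretrained model, and a value baseline, none of which are shown here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def reward_fn(text: str) -> float:
    # Hypothetical discriminator: reward completions containing "great".
    return 1.0 if "great" in text.lower() else -1.0

prompt_ids = tokenizer("I think this movie is", return_tensors="pt").input_ids

# 1. Let the current policy (the language model) act: sample a completion.
sampled = model.generate(prompt_ids, do_sample=True, max_new_tokens=10,
                         pad_token_id=tokenizer.eos_token_id)
completion_ids = sampled[0][prompt_ids.shape[1]:]
reward = reward_fn(tokenizer.decode(completion_ids))

# 2. Recompute log-probabilities of the sampled tokens under the model.
logits = model(sampled).logits[0, prompt_ids.shape[1] - 1:-1]
log_probs = torch.log_softmax(logits, dim=-1)
chosen = log_probs[torch.arange(len(completion_ids)), completion_ids]

# 3. Policy gradient: buff the log-probs of rewarded actions, nerf the
#    log-probs of punished ones.
loss = -reward * chosen.sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```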

Coming back to InstructGPT, the researchers combine the three previous control techniques in an ingenious way. First, they compile a dataset consisting of instances of following instructions specified in text. If the prompt asks for an answer to a question, a completion which is true to the desired pattern would simply answer the question, rather than come up with a list of other similar questions to be answered – an awkward failure mode of “vanilla” GPT-3. Then, the researchers fine-tune the pretrained model on this custom dataset which captures the instruction-following behavior. However, they take it a step further and improve the model through RLHF. Human annotators receive tentative completions from the model and have to judge how well they follow the instructions specified in the prompt (i.e. how closely they stick to the desired pattern). The model’s weights are updated using policy gradients – actions (i.e. tokens) liked by human annotators are buffed via good-old backpropagation, while unpopular ones are nerfed. As an added twist, a reward model is trained to predict human feedback, so that the discriminator can scale beyond what annotators could rate by hand. The resulting InstructGPT is significantly better at following instructions specified at inference time through prompt engineering, tying back to the first control technique. It’s a bit like how evolution helped us learn how to learn within our lifetime, the deeper influence (i.e. evolution, fine-tuning/RLHF) making it easier to handle fleeting situations (i.e. day-to-day life, prompts).
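As a side note, the reward model part boils down to a surprisingly simple objective: given a pair of completions where the annotator preferred one over the other, push the scalar score of the preferred one above the score of the other. A minimal sketch, with a placeholder backbone and a made-up preference pair rather than the actual InstructGPT setup:

```python
# A minimal sketch of the reward model idea: a placeholder backbone topped with
# a scalar head, trained so that the completion preferred by the annotator
# scores higher than the rejected one. The preference pair below is made up.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
backbone = AutoModel.from_pretrained("gpt2")
reward_head = torch.nn.Linear(backbone.config.hidden_size, 1)

def score(text: str) -> torch.Tensor:
    ids = tokenizer(text, return_tensors="pt")
    hidden = backbone(**ids).last_hidden_state
    return reward_head(hidden[:, -1])   # scalar reward read off the last token

chosen = "Q: Name a fruit.\nA: An apple."
rejected = "Q: Name a fruit.\nA: Q: Name a vegetable? Q: Name a color?"

# Pairwise ranking loss: push the preferred completion's score above the other.
loss = -torch.nn.functional.logsigmoid(score(chosen) - score(rejected)).mean()
loss.backward()
```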

Going back to the AlphaGo analogy in terms of RL usage, there’s also a resemblance between the training procedures of AlphaGo and InstructGPT themselves. Both of them involved obtaining a baseline model by training on a dataset of human-written examples. For AlphaGo, those were records of professional (human) games, while for InstructGPT, those were instances of people following instructions. Then, both systems were taken a step further through the RL paradigm, because at some point you run out of pertinent data, and the best thing you can do then is recognize and reward. However, AlphaGo’s policy of what action to take when (i.e. where should I place a new stone now?) was improved through self-play thanks to the competitive nature of the task, while InstructGPT’s policy (i.e. what token should I write now?) was improved through human-in-the-loop feedback, albeit via the reward model as a proxy.

Another approach to controlling the outputs of models like GPT-3, GPT-J, GPT-Neo… is to limit the tokens they can output based on a certain rule. In the simplest case, you might restrict the model to only being able to answer “Yes” or “No” to a binary question, or to only being able to output the predefined labels of a classification problem (e.g. positive/negative, entailment/contradiction/neutral). A more involved rule for specifying available tokens might involve forcing the model to output the title of an existing Wikipedia page when it appears to be in the process of writing a Markdown link (i.e. after “[”, it has to pick one of those titles; on “](”, we force the relevant link in there). In dual, the slot filling of placeholders like *topic* or *concept* happens by forcing the model to quote from the user’s command – constraining a model to spans of relevant text is a simple way of completely eliminating hallucinations in extractive tasks. However, limiting a model to predefined text is a double-edged sword. Sure, it can’t possibly go off the rails, but its ability to generate novel meaningful text is essentially blunted. Still, the ease of attribution is compelling – what if you went for a middle ground of allowing the model to stitch together quotes from a corpus in its writing, creating a patchwork of excerpts which is more than the sum of its parts?

The way this control technique is implemented is by (1) selecting the probabilities predicted for the allowed tokens, (2) placing them in a new distribution of their own through a normalization step, and (3) sampling from that new distribution as if no other token existed, before moving on to generating the next token. It’s quite straightforward and infinitely cheaper than fine-tuning or RLHF, though arguably not that flexible.
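A bare-bones version of those three steps, assuming a binary yes/no question and GPT-2 as a stand-in model:

```python
# A bare-bones sketch of restricting generation to a fixed token pool: keep
# only the logits of the allowed tokens, renormalize, and sample from that
# reduced distribution. The binary question is just an illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Question: Is the sky blue? Answer:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# (1) Probabilities predicted for the allowed tokens only.
allowed_ids = [tokenizer.encode(" Yes")[0], tokenizer.encode(" No")[0]]
logits = model(input_ids).logits[0, -1]
allowed_logits = logits[allowed_ids]

# (2) A new distribution of their own, via softmax renormalization.
probs = torch.softmax(allowed_logits, dim=-1)

# (3) Sample as if no other token existed.
choice = torch.multinomial(probs, num_samples=1).item()
print(tokenizer.decode(allowed_ids[choice]))
```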

A final family of techniques attempts to apply post-hoc edits to the probability distribution predicted by the model so as to increase the likelihood of following certain patterns. Given the probabilities suggested for upcoming tokens, we might slightly boost the values of positive one-token completions (e.g. “This movie is amazing/awesome/fantastic”), in an attempt to nudge the whole text into depicting a more positive outlook. Alternatively, we might boost the probabilities (before renormalizing, of course) of micro-completions related in meaning to a given topic, in an attempt to steer the model towards talking about it. Similar to the RL approach, if you can recognize it, you can encourage it. However, this after-the-fact style of reshaping the probability distribution doesn’t touch the model’s weights, while the RL approach is all about actively training the model to further an objective.
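Here’s a crude sketch of such reshaping, with a made-up bag of ocean-themed tokens getting a flat logit boost before renormalization – crude on purpose, for reasons the next paragraph gets into:

```python
# A crude sketch of post-hoc reshaping: boost the logits of a made-up bag of
# topic tokens before renormalizing, then sample. Real methods are far gentler
# than this flat bump.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

topic_words = [" ocean", " wave", " ship", " sail"]    # hypothetical topic bag
topic_ids = [tokenizer.encode(w)[0] for w in topic_words]

input_ids = tokenizer("The weather today made me think about",
                      return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits[0, -1]
        logits[topic_ids] += 3.0                   # nudge the topic tokens up
        probs = torch.softmax(logits, dim=-1)      # renormalize, of course
        next_id = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_id.unsqueeze(0)], dim=-1)
print(tokenizer.decode(input_ids[0]))
```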

Unfortunately, it seems that directly mutating the probabilities predicted by the model is a bit too brutal, leading to a loss of coherence in the resulting text. The model might start frantically regurgitating positive terms or items from a bag of words centered around a topic, which limits the usefulness of the naive version. However, it turns out that if you momentarily follow the gradients of the discriminator’s judgment with respect to the latest key-value vectors, you can preserve fluency while still steering the model in a given direction. Unfortunately, this is quite costly at inference time, costlier than just running traditional inference with a static model obtained through fine-tuning or reinforcement learning. Still, a useful tool in our arsenal, especially with some future optimizations.
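To give a rough feel for the gradient-following trick, here’s a heavily simplified sketch which nudges the final hidden state instead of the past key-value vectors, using an untrained linear probe as a placeholder discriminator – it only illustrates the mechanics, not a working steering setup:

```python
# A heavily simplified sketch of gradient-based steering. The probe is an
# untrained placeholder standing in for a real attribute discriminator, and we
# nudge the final hidden state rather than the past key-value vectors that
# fancier methods actually perturb.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
probe = torch.nn.Linear(model.config.hidden_size, 1)  # placeholder discriminator

input_ids = tokenizer("The movie was", return_tensors="pt").input_ids
outputs = model(input_ids, output_hidden_states=True)
hidden = outputs.hidden_states[-1][:, -1]              # final layer, last position

# Follow the gradient of the discriminator's score with respect to the hidden
# state, take a small step in that direction, then recompute the next-token
# distribution from the nudged state.
score = probe(hidden).sum()
direction = torch.autograd.grad(score, hidden)[0]
steered_logits = model.lm_head(hidden + 0.02 * direction)
probs = torch.softmax(steered_logits, dim=-1)
print(tokenizer.decode(torch.multinomial(probs, num_samples=1)[0]))
```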

It’s possible to mix and match the above control techniques in funky ways. As mentioned before, InstructGPT arguably mixes the first three, while dual mixes prompt engineering (for skills), fine-tuning (on the user’s notes), and token pool limits (for slot filling). Could you perhaps conduct argument mining by quoting excerpts from a corpus which are nudged into contradicting a given claim? Could you perhaps use reinforcement learning to tweak a language model to help itself find solutions and answers, forcing a self-play paradigm with an oneironomicon vibe?

Of course, the “surveylet” above includes maybe less than half of all known paradigms for controlling those models, while disregarding maybe 90% of the nuance. “All known paradigms” likely only accounts for a fraction of all possible ways of doing this, if you start following Frank Drake’s pattern. Despite its limitations, I hope you find this synthesis useful in orienting yourself around this niche.