velma: value extraction via language model abduction (repo, demo)
|We attempt to automatically infer one’s beliefs from their writing in three different ways. Initial results based on Twitter data hint at embeddings and language models being particularly promising approaches.|
In this project, Youssef and I had a look at different means of recovering beliefs from someone’s writing. We identify beliefs with statements about the world (e.g. “White chocolate is best chocolate.”) paired with a measure of credence (i.e. how strong the belief is held). Additionally, we take belief systems to refer to a large set of such beliefs which are simultaneously held by an individual.
Being able to automatically infer one’s credence towards a statement based on their writing is useful in many ways:
- rapid sensemaking: Given a body of trusted sources (e.g. individual experts, media outlets, scientific literature), evaluating a new claim against their belief systems might help make quick educated guesses in complex and rapidly changing situations. While individual usage is appealing in itself, institutions might particularly benefit from being able to quickly get the “pulse” of a relevant body of experts.
- surfacing evidence: Instead of only making use of an average credence estimate, it might also be handy to quickly surface strong evidence for and against a given claim pooled from trusted sources. By having units of content (e.g. paragraphs, documents) individually “vote” for the claim in question, it then becomes possible to simply surface the extremes, perhaps after clustering to reduce redundancy.
- forecasting: Obtaining credence estimates from diverse and preferably independent sources might be convenient in refining forecasting models. The spread of “votes” by different sources for a given claim enables the use of confidence intervals. Weighting votes by recency, source track record, and open science practices might also be helpful.
- epistemic spectra: While obtaining credence estimates for a given claim from different people might be useful for updating one’s beliefs, it might also be worthwhile to place the separate sources themselves on a range between “totally disagree” and “totally agree.” This might help paint a picture of how different public figures are deemed to relate to various claims, potentially democratizing political intelligence for activists who otherwise lack resources, or helping promote diverse views in consequential discussions.
- language model alignment: Beliefs often approach the quality of values (e.g. “It’s important that we cater to the wellbeing of future generations.”). If we’re able to estimate the alignment between arbitrary text and certain such beliefs, we should be able to reward language models when we recognize desirable behavior (e.g. via PPO, obviating the need for curated datasets or crowdsourced preferences).
- truthful language models: Considering a hypothetical knowledge base of accurate beliefs, we might be able to condition language models (e.g. via PPLM) to speak in such a way so as to exhibit high credence for established facts. As opposed to the previous use case, this knowledge base might be tweaked more frequently, hence a focus on online ways of steering language models, as opposed to ones which involve training.
The breadth of downstream applications encouraged us to look into how different technical approaches fare in terms of accurately inferring beliefs from text. A very related task, however, is stance detection. Unfortunately, most datasets focused on this task – and hence most models fine-tuned on the them – only address a finite set of topics (e.g. climate change, abortion), rather than an open-ended set of arbitrary statements. We were precisely interested in being able to infer a stance with regards to a custom statement.
Another related task is natural language inference (NLI), and while it might seem exactly what we’re looking for (i.e. “Determine whether this writing entails or contradicts this claim.”), it traditionally focuses on deductive inference. We reasoned that beliefs are represented in writing in extremely subtle and implicit ways, prompting a need to read between the lines in indirect ways that conventional NLI training data doesn’t demand. We therefore moved towards abductive inference, the approach of picking the outcome which is most likely to follow from the premise. Concretely, we compare a claim and its negation.
When thinking about how to evaluate different approaches, we initially considered using a survey to pair the user’s online writing (e.g. blog posts, comments, tweets) with credences they specify on a Likert scale for a diverse set of predefined claims. Then we’d have tried to predict the credence for each claim given their writing. However, we first wanted some initial proof-of-concept results, so instead of jumping straight into running a survey, we first used what people already stated online as claims they have high credence in, and left the Likert scales for future work.
Concretely, we picked 10 Twitter accounts and fetched their most recent tweets via the Twitter API. We mostly opted for people related to AI alignment (e.g. Richard Ngo), rationality (e.g. Scott Alexander), and effective altruism (e.g. William MacAskill), in an attempt to make the results easy to understand for what we thought was the likely audience of this project. We then manually extracted ~200 bold claims from their writing, in what might have been the most productive two hours of scrolling through tweets. We then manually negated those claims with as few edits as possible. The task is then to determine which of the two claims was the original one (i.e. P or ¬P), based on the user’s other tweets. As this requires an understanding of the user’s view on the topic as an instrumental goal, we chose this as the means of operationalizing the ability to infer beliefs from text.
Overview of the three approaches we used for recovering the original claim based on the user’s other tweets (see below).
We used two baselines to provide a frame of reference in terms of performance. First, we check which of the two candidate claims is closer in meaning to the user’s other tweets by means of cosine similarity with embeddings. Second, we use pretrained NLI models and check which of the two candidate claims is most likely to be entailed by the user’s other tweets. Our main approach, termed velma, relies on pretrained language models. It consists of checking which of the two candidate claims is most likely to follow from the context as pure text, by using the user’s other tweets as a prompt.
|velma (GPT-Neo 1.3B)||67%|
When it comes to results, both baselines and our main approach fare better than chance. The embedding baseline is surprisingly competitive, outperforming every other approach considered, which is okay. However, our main approach, the one based on pretrained language models, appears to scale well with model size. The larger of the two such models we tried (i.e. GPT-Neo 1.3B), drastically improves in performance over the smaller one (i.e. DistilGPT2), which is barely above chance. This hints at the possibility that the more intricate models of the world internalized by larger language models enable more sensible predictions of likelihood – better reading between the lines. When scaling our infrastructure beyond a potato laptop, we plan on looking into whether this trend holds up.
Probabilities assigned by different methods to the original claims. Estimates above 50% are deemed correct classifications.
A convenient benefit of all three approaches considered is they are highly parallelizable. Each “vote” of an individual piece of content written by a user can be computed independently, on a different machine, before being pooled together at rendezvous. Additionally, the varied performance of language models of different sizes might mean that when it’s critical to gain credence estimates quickly, shallow models or embeddings can be employed, while if the priority is on accuracy, the deepest models can rather be put forward.
We also had a look at the relationship between semantic similarity of original claims to the other user’s tweets and performance in recovering the original version from the associated pairs. The reasoning behind this was that claims which exhibited broader “coverage” in the user’s writing might be easier to figure out, compared to obscure thoughts which are disconnected from the user’s recurring themes. However, we only noticed a weak correlation between coverage and performance, hinting at the fact that even seemingly unlikely divergences are counterintuitively typical of the selected users.
Correlations between semantic similarity of original claims to the user’s other tweets (X axis) and probability assigned for the original claims (Y axis).
One caveat of the task at hand is the questionable assumption of beliefs and belief systems being stable over a long period of time. This is clearly not true in a strong formulation, as beliefs get updated and change over time. In a weaker formulation, provided the precondition of a limited time window, it might be more sensible. It would also be interesting to plot a user’s credence in a claim over time, employing a sliding window of, say, a month. Considering the massive noise involved, it might only make sense for larger groups of people, approaching use case #3 in the opening. Now, for some cherry-picked samples:
- @hardmaru, P = “Bio-inspired AI is the best AI.”, ¬P = “Bio-inspired AI is the worst AI.”, successfully recovered across all methods.
|“Even a cockroach-level AI will be a major breakthrough 🪳”|
|“I’m super excited to see ideas from complex systems such as swarm intelligence, self-organization, and emergent behavior gain traction again in AI research. We wrote a survey of recent developments that combine ideas from deep learning and complex systems:”|
|“While the original goal of AI is to produce human-like machines, in the end, machine learning will just end up making humans behave more like machines instead.”|
- @willmacaskill, P = “Our current voting system is terrible at representing voters’ preferences.”, ¬P = “Our current voting system is great at representing voters’ preferences.”, successfully recovered by all approaches except embeddings.
|“Sometimes people think voting is ‘irrational’ because it can’t change the outcome. But as this article shows that’s not right at all.”|
|“Some good election-related news amidst the turmoil: St Louis, Missouri, overwhelmingly voted in favor of Approval Voting (vote for all candidates you like) - a smarter, fairer, less polarizing and more democratic voting system. The beginnings of a wave of voting system reform?”|
|“Great convo on voting theory & practical paths to reform w/ the Center for Election Science, which I suggested Open Phil offer a grant to:”|
- @ch402, P = “We can reverse engineer chunks of neural networks.”, ¬P = “We can’t reverse engineer chunks of neural networks.”, successfully recovered across all methods except for DistilGPT2-based velma.
|“I work on reverse engineering small subgraphs of neural networks. If you break things down enough, things are very understandable!”|
|“Aww, that’s very kind of you. Have you seen our work on Circuits? It’s sort of a vision for how to fully reverse engineer neural networks into human-understandable systems.”|
|“I think it’s possible to – with enough effort – reverse engineer trained neural networks into human understandable computer programs. As one tiny example, here’s how a car detector is built form wheels and windows.”|
- @RichardMCNgo, P = “Rational credences aren’t a fundamental component of the universe.”, ¬P = “Rational credences are a fundamental component of the universe.”, successfully recovered only by the two velma variants, and barely.
|“Seems like nebulosity and toolbox thinking are much more natural candidates for the core of meta-rationality, given the focus Chapman places on them.”|
|“But that doesn’t help, because the whole universe also contains isolated Boltzmann brains, as well as us.”|
|“I’d actually say that “the map is not the territory” is one core of rationalism itself - Yudkowsky talked about it a lot!”|
- @stuhlmueller, P = “Both process- and outcome-based evaluation are attractors to varying degrees.”, ¬P = “Both process- and outcome-based evaluation are repellers to varying degrees.”, recovered by all approaches except embeddings.
|“In the long term, process-based ML systems help avoid catastrophic outcomes from systems gaming outcome measures and are thus more aligned.”|
|“In the short term, process-based ML systems have better differential capabilities: They help us apply ML to tasks where we don’t have access to outcomes. These tasks include long-range forecasting, policy decisions, and theoretical research.”|
|“Whether the most powerful ML systems will primarily be process-based or outcome-based is up in the air.”|
While there’s still a lot to unpack, recent ML models appear quite capable of inferring beliefs from one’s writing. We expect such paradigms to eventually become building blocks of epistemic infrastructure, enabling both people and AI to gradually refine their beliefs at scale. Next month, it’ll probably be an exploration of velma + PPO (i.e. use case #5 in the opening), mostly because of the new technical paradigm to learn about.
We stand on the shoulders of giants, who in turn stand on the shoulders of other giants, to the point where attempting to attribute precise acknowledgements for this project becomes unwieldy. Still, we thank EleutherAI and Hugging Face for publishing the models we used, Patrick Mineault for indirectly helping us turn our poor software engineering practices into less poor ones, Maier Fenster for thoughts on abduction and the idea of use case #2, Bogdan Vidrașcu for feedback from a sociological angle, and Aligned AI for signaling the connection between inferring values and alignment. Paul would like to thank his GitHub Sponsors, especially Andreas Stuhlmueller.