
decontextualizer (stable release, online demo)

tl;dr
A web app which takes in an annotated PDF, edits the highlights so as to make more sense on their own, and outputs the tweaked excerpts as plain text.

As a second step in improving our content consumption workflows, I investigated a new approach to extracting fragments from a content item before saving them for subsequent surfacing. While the lexiscore deals with content items on a holistic level – evaluating entire books, articles, and papers – I speculated then that going granular is a natural next step in building tools which help us locate specific valuable ideas in long-form content. The decontextualizer is a stepping stone in that direction, consisting of a pipeline for making text excerpts compact and semantically self-contained. Concretely, the decontextualizer is a web app able to take in an annotated PDF and automatically tweak the highlighted excerpts so that they make more sense on their own, even out of context.

My initial motivation was that if the lexiscore were to be applied at the paragraph or sentence level, rather than holistically, something like the decontextualizer could step in and help make the snippets with high nutritional value easier to understand on their own, before providing the user with a personalized digest. However, the actual decontextualizer usable today as an independent tool works with the user’s own highlights, rather than with automatically identified ones, in an attempt to decouple the two investigations and better understand their individual effects.

Even on its own, without automatically highlighted text prepended to its pipeline, the decontextualizer has the potential to improve many existing workflows for content consumption. Many of us rely on highlights from books, articles, or papers we read as a quick and effective way of persisting insights in a personal knowledge base like Obsidian, Roam, or the conceptarium, in order to be able to later find them – or serendipitously stumble upon them. Alternatively, highlights can simply make subsequent reads more directed, as part of a multi-pass reading strategy which progressively narrows in on knowledge, gently moving from exploration to exploitation. I was surprised to even find tools for automatically saving physical highlights from paperbacks and hardcovers (remember those?) in a digital medium. Once the excerpt is digitized, a whole host of tools becomes accessible (e.g. Obsidian-Zotero integrations, IFTTT applets), making sure it gets to the right place, with the appropriate book-keeping(!) annotations, including bibliographic references or timestamps.

However, when later surfacing a text excerpt, be it as part of a Google Lens-style image-to-text search, or as part of a more specialized expand-node-into-memories usage pattern in mind mapping software, the ease of comprehension often depends on how rich in contextual information the highlight is. If it contains unresolved references to other concepts in the original text, then it might be difficult to understand on its own, especially if the initial read was a while ago. A popular workaround is to progressively write a note from scratch (i.e. from marginalia to atomic notes) and persist it in your knowledge base, or edit the excerpt yourself, making it more self-contained. That said, those auxiliary book-keeping tasks take time and effort away from gaining new knowledge, raising the bar for what ideas make it onto disk.

Two mainstream counters to the above ideas follow. It can be argued that spaced repetition obviates the very need for taking notes, or that relying on organic memory exclusively might be net superior to exosomatic memory. In my view, the perceived superiority of spaced repetition here is caused by our lack of experience with mechanics for artificial recall. Once you have the proper tooling in place, artificially injecting serendipity into your thought process becomes increasingly natural and, in turn, enabling in obvious ways. Systematically breaking my frame of mind by using the raw conceptarium as a brainstorming assistant has helped me uncover unlikely connections in research and beyond. For instance, the idea that 2D self-organizing maps can be used to map ideas onto physical places while preserving semantic distances was an association I never thought of on my own, and one which I’ll likely make use of in my upcoming theme of embodied knowledge work. The competitive initial results of the handful of interaction patterns I’ve sampled through my explorations hint at vast swaths of the space of primitives being fertile.

Another argument against lowering the bar for saving ideas in content consumption through friction reduction is a frustrating decrease in signal-to-noise ratio in the knowledge base. In my view, this is yet another unfortunate symptom of our nascent understanding of the latent representations and abilities internalized by artificial systems. When you discard the assumption that keyword search coupled with handwritten tags and links is the only viable way of navigating information, you open up a largely untapped universe of ways to deploy experts with thousands of years of experience at will. The ability to narrow in on the signal becomes a technical challenge tackled by engineers over time and gradually distributed to users, rather than an innate challenge of cognitive load faced directly by individuals. This paradigm shift unlocks ambitions of, for instance, perceiving large bodies of scientific literature through a mix of personalized digests and new senses tailored for the infosphere.

Augmenting humans with machine learning in the right way raises their effective experiential age. They possess greater "lived experience" than their biological age would suggest because they’ve acquired prosthetic experience, in digested, summary form. Each such digested bit attached to your brain could be measured in terms of the experienced time it represents. Here is where the training/inference cost asymmetry kicks in. You need vast pools of powerful computers to train the best, biggest models, but you mostly only need much cheaper personal-scale computers to do inference with those models. So the hundreds or thousands of years worth of experience are logged in expensive infrastructure living in superhistorical time, but are usable at human scale in historical time.

Those two valid concerns aside, what does the decontextualizer do more precisely? The tool is heavily based on a recent paper by the folks over at Google Research. They first created a new dataset through crowdsourced annotation on top of a Wikipedia dump. Each sample from the dataset contained the following elements: a context paragraph, a sentence highlighted within that paragraph, and a decontextualized version of the highlighted sentence.

The final element was written by the human contributors, tasked with editing the highlighted sentence so that it makes sense on its own, filling in pronouns pointing to previous entities, expanding determiners into ideas they reference, among others. Next, a T5 model was fine-tuned to take in a context paragraph and a highlighted sentence selected from said paragraph, and output a decontextualized version of the excerpt. Simple as that, data goes in, and a model learns to approximate the mapping from context and excerpt to polished excerpt.
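To make the shape of that mapping concrete, here is a minimal sketch of how one sample could be turned into a source/target pair for a sequence-to-sequence model. The class name, the field names, and the `[SEP]` delimiter are all assumptions made for illustration. They are not taken from the paper or from the published checkpoints, which may serialize their inputs differently.

```python
from dataclasses import dataclass

# Assumed delimiter between context and highlight (illustrative only).
SEPARATOR = " [SEP] "


@dataclass
class DecontextExample:
    """One hypothetical sample: context, highlight, and human-written rewrite."""
    context: str
    highlight: str
    decontextualized: str


def to_training_pair(example):
    """Serialize a sample into (source, target) strings for seq2seq fine-tuning."""
    if example.highlight not in example.context:
        raise ValueError("highlight must be selected from the context paragraph")
    source = example.context + SEPARATOR + example.highlight
    return source, example.decontextualized


# Usage with a made-up sample.
sample = DecontextExample(
    context="Ada Lovelace wrote notes on the Analytical Engine. "
            "She described an algorithm for it.",
    highlight="She described an algorithm for it.",
    decontextualized="Ada Lovelace described an algorithm for the Analytical Engine.",
)
source, target = to_training_pair(sample)
print(source)
print(target)
```

The model only ever sees the source string at inference time. The target is what the human contributors wrote, and what the fine-tuned model learns to approximate.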

The decontextualizer introduced in this write-up builds around the pretrained T5 model(s) published by the authors in the following way. For each highlight annotation in a PDF, the tool extracts the context which surrounds it, with a few complete neighboring sentences as padding, before piping the pair into the model. Besides preprocessing quirks, the decontextualizer mostly wraps around the pretrained model, which explains the mere half a dozen working days from idea to prototype and write-up. The highlight (hah!) of this project is rather the new application of decontextualization – as a processing step in content consumption and knowledge work more broadly. In contrast, the original paper investigated whether subsequent ML models (e.g. question-answering ones) could perform better with an additional decontextualization preprocessing step, compared to the raw input.
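The context-padding step can be sketched roughly as follows. This is not the tool's actual code: it assumes the page text and the highlight have already been pulled out of the PDF as plain strings, it uses a deliberately naive regex sentence splitter, and the function names and the `padding` parameter are hypothetical.

```python
import re


def sentence_spans(text):
    """Return (start, end) character spans of sentences (naive splitter)."""
    spans = []
    start = 0
    # Split on whitespace that follows sentence-final punctuation.
    for match in re.finditer(r"(?<=[.!?])\s+", text):
        spans.append((start, match.start()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def extract_context(page_text, highlight, padding=2):
    """Return the highlighted sentences plus `padding` complete neighbors on each side.

    Returns None if the highlight cannot be located in the page text.
    """
    # Collapse whitespace so PDF line breaks do not prevent matching.
    text = " ".join(page_text.split())
    needle = " ".join(highlight.split())
    pos = text.find(needle)
    if pos == -1:
        return None
    end = pos + len(needle)
    spans = sentence_spans(text)
    # Indices of every sentence that overlaps the highlighted range.
    hits = [i for i, (s, e) in enumerate(spans) if s < end and e > pos]
    first = max(hits[0] - padding, 0)
    last = min(hits[-1] + padding, len(spans) - 1)
    return text[spans[first][0]:spans[last][1]]


page = "Alpha one. Beta two. Gamma three. Delta four. Epsilon five."
print(extract_context(page, "Gamma three", padding=1))
```

Snapping to whole sentences matters because a highlight often starts or ends mid-sentence, and a context that is itself cut off mid-sentence gives the model less to work with. A real pipeline would likely want a sturdier sentence splitter than this regex, which trips over abbreviations and decimals.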

In terms of qualitative results, while the base model often makes low-hanging fruit edits successfully, it frequently finds it impossible to perform the mapping. Fortunately, it’s able to recognize when this happens, leaving the excerpt as is, without degrading it in any way. The larger versions in the family of fine-tuned models published by the team fare better according to their results, but their memory footprint is borderline intractable for consumer hardware at the time of writing. This might be solved by future attempts at distilling the knowledge internalized by the larger models into smaller ones, using internal states as training targets, a bit like what happened when turning GPT-2 into the smaller distilgpt2 while preserving most of its skill. Speaking of the GPT dynasty, why not frame decontextualization as a few-shot learning task encoded in a clever prompt? Besides the original T5 also being published by people working in the same research group, making for some easy extra citations, the fine-tuned model is specifically trained to realize whether it can do the job, and leave the excerpt intact if it can’t. With enough examples and compute, you might still manage to frame it as prompt engineering.
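For the curious, a few-shot framing might look something like the sketch below. Everything in it is an assumption made for illustration: the prompt layout, the field labels, and the two made-up demonstrations. One of them deliberately leaves the excerpt unchanged, as a nudge towards the leave-it-intact behavior that the fine-tuned model gets from training. I have not evaluated this against any language model.

```python
# Made-up demonstrations; the second one shows an excerpt that needs no edit.
FEW_SHOT_EXAMPLES = [
    {
        "context": "Ada Lovelace wrote notes on the Analytical Engine. "
                   "She described an algorithm for it.",
        "excerpt": "She described an algorithm for it.",
        "rewrite": "Ada Lovelace described an algorithm for the Analytical Engine.",
    },
    {
        "context": "Water boils at 100 degrees Celsius at sea level. "
                   "This varies with altitude.",
        "excerpt": "Water boils at 100 degrees Celsius at sea level.",
        "rewrite": "Water boils at 100 degrees Celsius at sea level.",
    },
]


def build_prompt(context, excerpt, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt that ends where the model should continue."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Context: {ex['context']}\n"
            f"Excerpt: {ex['excerpt']}\n"
            f"Standalone: {ex['rewrite']}"
        )
    # The final block is left open so the completion is the rewritten excerpt.
    blocks.append(f"Context: {context}\nExcerpt: {excerpt}\nStandalone:")
    return "\n\n".join(blocks)


print(build_prompt(
    "The tool wraps a pretrained model. It was built in about six days.",
    "It was built in about six days.",
))
```

The resulting string would be sent to a generative model as-is, and the text it produces up to the next blank line would be taken as the decontextualized excerpt.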

All in all, the decontextualization models of 2021 appear capable of slightly improving the quality of text highlights by computationally injecting contextual information into them. However, the limited benefits currently offered by this model lineup might not justify the compute resources necessary to deploy them, especially in self-hosted setups. Even without a resounding success in terms of performance, I think this tiny investigation was still fruitful in the way it informs the development of personal knowledge management tools in an age of information overload.