generative operating systems
The ideas below were largely seeded by a chat with David, Linus, and Luke at an Ought retreat.
Just to get this out of the way: autoregressive refers to a system whose outputs become part of its inputs at the next step. GPT-3 generates a token (roughly a word or word fragment), appends it to the prompt, and then continues its text completion. It’s recurrent at the text level, but that term is loaded with pre-transformer ideas of RNNs, LSTMs, and GRUs, so we stick with autoregressive.
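To make the loop concrete, here is a minimal sketch of autoregressive generation. The names `toy_next_token` and `generate` are made up for illustration, and the "model" is a hypothetical stand-in that just cycles through a tiny vocabulary; a real language model would go in its place.

```python
from typing import Callable, List


def toy_next_token(context: List[str]) -> str:
    """Hypothetical stand-in for a language model.

    It picks a token from a fixed vocabulary based only on context length.
    """
    vocab = ["the", "system", "writes", "itself"]
    return vocab[len(context) % len(vocab)]


def generate(
    prompt: List[str],
    next_token: Callable[[List[str]], str],
    steps: int,
) -> List[str]:
    """Run the autoregressive loop for a fixed number of steps."""
    context = list(prompt)  # copy so the caller's prompt is left untouched
    for _ in range(steps):
        token = next_token(context)  # the model sees everything produced so far
        context.append(token)        # its output becomes part of the next input
    return context


print(generate(["hello"], toy_next_token, 3))
# → ['hello', 'system', 'writes', 'itself']
```

The only state carried between steps is the growing context itself, which is what separates this from the hidden-state recurrence of older architectures.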
You might have seen those GPT-3 demos aiming to emulate traditional content via familiar interfaces. There’s this one pretending to be an ordinary search engine indexing the web, but all the search results are actually generated from scratch by the language model. Similarly, there’s the idea of a generative library of books. In the same vein, it’s easier to implement a call to GPT-3 that generates a Wikipedia-style summary for a term than it is to actually query Wikipedia for it. It works quite well for imaginary and rephrased terms, but requires orders of magnitude more compute.
If we blissfully ignore the compute requirements for a bit, what if the whole operating system running on a consumer device were “implemented” through a multi-modal autoregressive model like Gato? It would project visuals on some screen, but those would be generated from scratch based on context, user inputs, etc. There wouldn’t necessarily be apps backed by executables, in turn distilled from programming languages. The developer wouldn’t write code, but would only describe the desired behavior of the app. Referring to “apps” here is a symptom of being too familiar with the current paradigm, just like those window frames ported to VR. The natural primitives would radically evolve to fit the medium, but I can’t even start to fathom them now.
There are many angles to this, so let’s explore a few more. Memory at different scales would relate to context lengths and sparse attention of the multi-modal model over past user states. If I wanted to resume sculpting an object, I could just refer to it. The autoregressive model would constantly labor on the user-machine interaction history and hence have that accessible. Hot and cold storage paradigms translate to dense and sparse attention over interaction history.
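As a rough sketch of the hot/cold analogy, the function below picks which past interaction steps the current step would attend to: every step inside a recent "hot" window, and only every few steps of the older "cold" history. The function name, the fixed-stride pattern, and the parameters are all assumptions made for illustration; real sparse attention schemes vary a lot and are typically learned or content-based rather than a fixed stride.

```python
from typing import List


def attended_positions(history_len: int, hot_window: int, cold_stride: int) -> List[int]:
    """Return indices of past steps visible to the current step.

    Recent steps (the "hot" window) are attended densely; older steps
    (the "cold" history) are sampled once every `cold_stride` steps.
    """
    if cold_stride < 1:
        raise ValueError("cold_stride must be at least 1")
    hot_start = max(0, history_len - hot_window)
    cold = list(range(0, hot_start, cold_stride))  # sparse over old history
    hot = list(range(hot_start, history_len))      # dense over recent history
    return cold + hot


print(attended_positions(history_len=10, hot_window=3, cold_stride=4))
# → [0, 4, 7, 8, 9]
```

A larger `cold_stride` plays the role of colder storage: the old sculpting session is still reachable, just at a coarser resolution.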
How about interfaces and peripherals? The multi-modal autoregressive OS would probably first be fluent in text, visuals, and audio. The naive approach would be to type stuff in, but what if you just had a verbal conversation with the system? You’d gesticulate and sketch and explain what you want, and if prompted properly it would bring that to life: a building, an image, a story, a concept, etc. Most output signals coming out of the system would be synthetic, ranging from the background music accompanying a just-in-time game all the way to a detailed technical drawing.
But we can take multi-modal interfaces even further. What if we naturally extended this autoregressive OS with neuroimaging data, assuming a utopian non-invasive functional scanner with high temporal and spatial resolution? We might think the computer into behaving in a certain way. We might envision that 3D asset, and by mere autoregressive coherence bring it into existence. The visuals would close the human-machine feedback loop and converge on the desired design, or just wander around design space.
Pushing to extreme science fiction, you might think a piece of software into existence, by just imagining its functionality. By virtue of the extreme generality associated with hands-off end-to-end models, the main primitive would probably just be open-ended end-user programming. The current conception of “thoughtware” I’ve been throwing around over the past year might severely underestimate the possibilities.
I can also see neural activity as being a residual output signal from machine to human. I sometimes get this feeling of GitHub Copilot colliding with my thought pattern. I had something different in mind for this function, and this other thought pattern has thrown me off balance. If anything, conventional text/visual/audio outputs can be used to steer the neuroimaging readings towards a certain state, communicating a certain idea to the user through indirect feedback. This might have been mentioned in Peter Watts’s Blindsight at some point. However, that link might be too sparse and noisy, so perhaps neural stimulation as yet another modality could tighten the feedback loop.
Not sure how the model would be structured. Maybe a mixture of experts on some cloud serving users with modest hardware, a bit like Stadia. Maybe the frozen model weights would be “burned” to some hyperefficient FPGA chip. Maybe knowledge editing and online learning would play a role in “updating” the OS. Maybe cross-device communication would be based on models attending to each other’s activations across a network. What I’m vaguely sketching is mostly the functionality and user experience, rather than the implementation.
There’s a related recent trend of describing apps and having a language model like OpenAI’s Codex implement them. But that also feels like being stuck in the old ways which are still new. Alternatively, there’s the idea of using the user’s entire context and session as raw visual input in order to solve inter-app integrations with Flamingo-style models. An app would ask of a screenshot “What food is depicted there?” and help you order that. Or there are Adept’s models, which automatically script away interactions with legacy software.
This spectrum between conventional software engineering with code and autoregressive substrates feels like yet another instance of the neurosymbolic integration problem. Should you go for reliable links or fuzzy embeddings? Should you use traditional ML models for interpretability or double down on performance with black-box DL models? Should your software run on code or be autoregressive in nature? In a softer computational sense, Karpathy had closely related thoughts on the future of software engineering, but I think there’s more to it: multi-modal user interfaces, a mostly autoregressive OS, etc.