
oneironomicon (paper, repo)

tl;dr
I trained an AI “teacher” to help a dreamed-up user simulator answer questions (e.g. “What are the most important books ever written?”) by strategically challenging it with prompts (e.g. “How can this be split into parts?”). I then evaluated the teacher’s effectiveness in helping me find pertinent answers to arbitrary questions, as a test of sim2real transfer.

Most virtual assistants available today bring value to users by simply taking on some of their tasks. They find online facts for them, make reservations on their behalf, or even automate parts of the research process. Those are obviously useful in helping users save time and draw on more experienced entities. That said, I feel there’s something missing here, the same thing I’ve been musing over in the humane transhumanism rambling essay. It’s the fact that those tools largely delegate the user’s task to an AI – they outsource the user’s decision-making and problem-solving instead of nurturing those skills.

In contrast, tutoring systems focus precisely on cultivating specific skills in their student users. Among other things, they provide timely feedback and match the difficulty of the problems being tackled to the user’s skill level. Unfortunately, implementing a tutoring system often requires domain-specific expertise, making it difficult to put together versatile tools which can help across many disciplines.

However, virtual assistants can seamlessly answer questions about both movie trivia and weather forecasts, among myriad other domains. One approach to developing dialogue systems which fare well across contexts despite limited user data is to make use of user simulators. Those are systems external to the virtual assistant whose sole task is to impersonate plausible users and exhibit procedural intents – they might inquire about the weather or attempt to make an online reservation. The virtual assistant has the opportunity to practice helping the user simulator before eventually being deployed to assist an actual human. This conversational sim2real pattern is analogous to training agents in 3D game engines before deploying them to the real world.

In this context, I was curious whether a virtual tutor could be trained to help a user simulator reach conclusions and identify appropriate actions itself, before deploying it to assist a human user. Concretely, I explored what I call the question answering assistance (QAA) task. In this task, the tutor agent has the objective of helping the user simulator reach pertinent and diverse answers to a random overarching question which drives the dialogue. To reach that goal, the virtual tutor can make use of a deck of ~200 static prompts, which are in fact exactly the prompts from the knowledge probes dataset, published around a year ago and meant back then to simply be sampled randomly. Over the course of 20,000 10-turn dialogues, the tutor agent is rewarded for helping a GPT-Neo user simulator reach what look like pertinent answers to the overarching questions, while being penalized for eliciting repetitive answers. The user simulator’s replies form the state of the environment in which the tutor agent acts.
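To make the setup more concrete, here is a minimal sketch of that training loop. Everything in it is an illustrative stand-in rather than the actual implementation: three hard-coded prompts stand in for the ~200-card deck, bag-of-words overlap stands in for the real pertinence and repetition measures, a bandit-style tutor stands in for the trained agent, and a canned function stands in for the GPT-Neo user simulator.

```python
"""Minimal, runnable sketch of the QAA loop described above (all components are stand-ins)."""
import random
from collections import defaultdict

PROMPTS = [
    "How can this be split into parts?",
    "What would an expert notice here?",
    "What is a concrete example of this?",
]  # stand-ins for the ~200 static prompts

def overlap(a, b):
    """Crude bag-of-words similarity, reused for both pertinence and redundancy."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def reward(reply, question, history):
    """Reward apparent pertinence, penalize similarity to earlier replies."""
    pertinence = overlap(reply, question)
    redundancy = max((overlap(reply, past) for past in history), default=0.0)
    return pertinence - redundancy

class ToyTutor:
    """Epsilon-greedy bandit over the prompt deck (ignores the state, for brevity)."""
    def __init__(self, eps=0.1):
        self.eps = eps
        self.value = defaultdict(float)
        self.count = defaultdict(int)

    def pick(self, state):
        if random.random() < self.eps:
            return random.choice(PROMPTS)
        return max(PROMPTS, key=lambda p: self.value[p])

    def update(self, prompt, r):
        self.count[prompt] += 1
        self.value[prompt] += (r - self.value[prompt]) / self.count[prompt]

def user_simulator(question, intervention, history):
    """Placeholder for the GPT-Neo user simulator."""
    return f"Considering '{intervention}', here is thought #{len(history)} on {question}"

def run_dialogue(tutor, question, turns=10):
    state, history, total = question, [], 0.0
    for _ in range(turns):
        prompt = tutor.pick(state)
        reply = user_simulator(question, prompt, history)
        r = reward(reply, question, history)
        tutor.update(prompt, r)
        history.append(reply)
        state = reply  # the simulator's latest reply becomes the next state
        total += r
    return total

tutor = ToyTutor()
returns = [run_dialogue(tutor, "What are the most important books ever written?")
           for _ in range(1000)]  # the real run used 20,000 dialogues
```

Even with these simplifications, the structural point survives: the tutor only ever observes the simulator’s latest reply as its state, which is exactly the limitation discussed further down.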

After training was complete and the virtual tutor had accumulated quite some experience in guiding the user simulator, I set about interacting with it myself. I had five 5-turn dialogues with both the trained tutor agent and an untrained one which simply picked interventions at random. The order was randomized, and the overarching questions were kept the same. In the end, the trained tutor appeared to elicit more pertinent and diverse answers from me on average, as measured by the same reward formula used during training, although the difference turned out not to be statistically significant.

Comparison of transfer performance before and after training (Source)
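For reference, here is a hedged sketch of how such a head-to-head comparison can be scored: each dialogue receives a cumulative reward under the training-time formula, and the matched per-question scores of the two tutors are compared. The paired t-test is just one plausible choice of significance check; the write-up above doesn’t specify which test was actually used.

```python
from scipy.stats import ttest_rel

def compare_tutors(trained_scores, random_scores):
    """Compare matched per-question cumulative rewards for the two tutors."""
    mean_gap = (sum(trained_scores) - sum(random_scores)) / len(trained_scores)
    _, p_value = ttest_rel(trained_scores, random_scores)  # paired test over matched questions
    return mean_gap, p_value
```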

Looking back at the experiment, I see a number of potential issues. First, dialogue lacks the Markov property – the tutor agent might need more than the user’s last reply in order to identify a suitable intervention. Second, the objective function could have been designed better, as repetitive answers beneath a threshold of pertinence still yielded reward, despite being undesirable behavior. Third, the predefined interventions were overall too formal to suit more casual questions. I describe those issues in more detail and suggest fixes for each in the paper.
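To make the second issue concrete, here is a toy numerical illustration under the additive shaping assumed in the training sketch above – the paper’s actual objective and proposed fixes may well look different, and the thresholded variant below is purely hypothetical.

```python
def additive_reward(pertinence, redundancy):
    return pertinence - redundancy

def thresholded_reward(pertinence, redundancy, tau=0.5):
    # Hypothetical gate: only reward novelty once the answer clears a minimum pertinence bar.
    return 0.0 if pertinence < tau else pertinence - redundancy

# A mediocre, lightly reworded repeat of an earlier answer:
print(additive_reward(0.4, 0.25))     # ~0.15 > 0, so the undesired behavior still pays off
print(thresholded_reward(0.4, 0.25))  # 0.0, the gated variant stops rewarding it
```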

Looking forward, I see quite a few dimensions in which this conversational sandbox can be extended. First, the task doesn’t need to be question answering assistance for the sim2real paradigm to work – a reading comprehension assistance task could train the teacher to help reduce the student’s perplexity on an overarching concept, for instance. Second, the tutor agent could incorporate planning by running multiple conversational futures “in its head” – dreaming up different ways the dialogue might go on given various interventions, and picking the most rewarding ones for a specific user. Third, it might be possible to repurpose the language model’s knowledge into other domains, and help users reach desired states in visual tasks, for instance. Those and many other directions are explored in much more depth in the paper.
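As a rough illustration of the planning direction, the sketch below reuses the stand-in user_simulator, reward, and PROMPTS from the training sketch: before committing to an intervention for a real user, the tutor rolls out a few imagined continuations per candidate prompt and picks the one with the best simulated return. This is plain rollout-based lookahead, not something the current system implements.

```python
def plan_intervention(question, history, prompts=PROMPTS, rollouts=3, horizon=2):
    """Pick the prompt whose simulated continuations look most rewarding."""
    best_prompt, best_value = None, float("-inf")
    for prompt in prompts:
        value = 0.0
        for _ in range(rollouts):
            sim_history = list(history)
            current, ret = prompt, 0.0
            for _ in range(horizon):
                reply = user_simulator(question, current, sim_history)
                ret += reward(reply, question, sim_history)
                sim_history.append(reply)
                current = random.choice(prompts)  # continue the imagined dialogue
            value += ret / rollouts
        if value > best_value:
            best_prompt, best_value = prompt, value
    return best_prompt
```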

Finally, this has been a really fun project. Seeing some of the loosely-connected ideas scattered around my past projects and articles come together in a coherent paper has been surprisingly rewarding. For instance, section 4.3.2 is entirely based on the piece on conversational multiverses, 4.3.3 echoes the idea of skills used in dual, while 4.4 reflects thoughts around AI safety and thinking in public. Still having the audacity to believe that it’s possible to follow your own research agenda, I’m excited about the more ambitious projects which might come within my technical reach with practice.