At the Language Team in Sony CSL, we are very interested in how people can use artificial intelligence (AI) in their creative processes. There have been many recent advances in the field of artificial intelligence that we believe will help to turbo-charge creativity in writing. Unlike grammar-checkers or spell-checkers, our goal is not to correct writing mistakes, but to help with content creation.
To this end, we created a software suite called Poiesis Studio, to understand how existing state-of-the-art, AI-driven tools can be used in an interactive creation context. Poiesis Studio is a test-bed for our existing work and a statement about our belief in a future of AI-driven writing assistance. We will get into what the ‘poiesis’ in Poiesis Studio means exactly in part II, but for now you can just think of it as a creative mode of thinking. We have already showcased Poiesis Studio in many contexts: it has been used by various musicians to write lyrics, notably by Whim Therapy in his entry for the AI Song Contest 2021, and it has been shown at the Maker Faire in Rome in front of hundreds of attendees.
In this four-part series on poiesthetic writing assistance, we will bring you up to speed on what we have been doing in the language team at Sony CSL. In part I, we will provide an intuitive description of the core technology that has made robust text generation possible: neural language models. We will then present the Poiesis Studio interface in part II before demonstrating interactive text generation in part III. Finally, in part IV, we will talk about our vision of poiesthetic writing in general and what the near future might have in store for writers.
Language Models
Imagine force-feeding an artificial brain with copious amounts of text. That artificial brain has to consume the text, digest it, and use it in some task. The task we ask it to perform is to reproduce what it is given. So, if we give it “This is some text” as input, it has to reproduce “This is some text” as output. Sounds easy, right? If all it had to do was reproduce what it was given as input, the task would be easy. However, we want it to do more than that. We want it to learn a rich representation of the language it consumes.
To do that, we need to make sure it doesn’t just repeat what it is fed, but that it learns something more — about the structure and meaning of language as a whole. Instead of serving up the text as is, we deviously hide parts of the text by ‘masking’ those parts. We then ask the artificial brain to reproduce the original unmasked sentence, hidden parts and all. You can see this in the example below, where we mask the words ‘example’ and ‘text’ and ask the language model to reconstruct the masked text.
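The masking step itself can be sketched in a few lines of Python. This is only a toy illustration and not the Poiesis Studio code: the sentence, the `[MASK]` placeholder and the `mask_words` helper are all assumptions made up for this example.

```python
MASK = "[MASK]"  # placeholder symbol; the exact string is an assumption


def mask_words(sentence, words_to_hide):
    """Replace each listed word with MASK and remember the hidden originals."""
    hidden = set(words_to_hide)
    masked, targets = [], []
    for position, word in enumerate(sentence.split()):
        if word in hidden:
            masked.append(MASK)
            targets.append((position, word))  # what the model must reconstruct
        else:
            masked.append(word)
    return " ".join(masked), targets


# Hypothetical training sentence containing the two words we want to hide.
masked_sentence, targets = mask_words(
    "This is an example of some text", ["example", "text"]
)
print(masked_sentence)  # This is an [MASK] of some [MASK]
print(targets)          # [(3, 'example'), (6, 'text')]
```

During training, the model would only see `masked_sentence` and would be scored on how well it recovers the words stored in `targets`.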
Our fledgling artificial brain isn’t in for an easy time! The poor thing must perform the task we set out gluttonously, line by line, for millions to billions of lines. Eventually, it gets better at this task, learning the relationships between visible words (unmasked), to predict hidden words (masked). After doing this for millions to billions of lines, it learns to understand something about the fundamental textual structures that it encountered. After this ‘training’ procedure, it is now an artificial neural network that has learned to model the inherent structures of language — a language model.
Exactly what is learned by language models is an active area of research. There are researchers who investigate these artificial brains in much the same way that neuroscientists study the brain. Unlike the human brain, every part of a neural network can easily be measured, but just as with the human brain, deciphering what these measurements really capture about language is challenging. We can speculate that the language structures it learns about are both syntactic and semantic – knowing that nouns follow adjectives (syntactic) and that the adjective ‘noisy’ probably goes with the word ‘car’ more than it does with ‘idea’ (semantic). Also, as a lot of the text that it sees is factual, from sources such as Wikipedia, it ‘knows’ something about the world, such as which cities are in which countries, what foods are popular in those cities etc.
Being able to understand language automatically is the holy grail of natural language processing. This is the area of artificial intelligence concerned with processing – you guessed it – natural language. The sort that we all use in our everyday lives. While we are still a long, long way from being able to automatically understand language anywhere near as well as humans do, language models have provided a way to extract computational representations from text that have been shown to be useful in a wide variety of tasks, including: sentiment analysis (e.g. is a Tweet positive or negative), machine translation (e.g. This is some text -> ceci est du texte) and question-answering (e.g. What is the capital of Japan -> Tokyo). Their utility in such a wide variety of tasks indicates that something complex and fundamental has been learned.
While language models are a good foundation for improving performance in many tasks (see Q&A for more on my use of the word ‘foundation’), they are also very good at generating text. What do I mean by ‘generating’ text? I mean starting with some text and extending it with new text automatically, based on what has already been written. We are all familiar with auto-complete in messaging applications like Facebook Messenger and WhatsApp. However, unlike those technologies, modern language models are complex enough to compose long coherent text on demand, rather than simply suggesting what word might come next when typing a message. The usefulness of language models in a wide variety of tasks is really a side effect of the task they perform. Remember the task of reconstructing text that is partially hidden or masked? In learning how to improve in this task, what language models have really learned how to do is ‘predict’ what is masked from what is not. The fact that this is possible using a language model might not seem like a revelation at first. Why would you want to do that anyway? To understand why this is significant in the context of text generation requires a slight shift in thinking. We have to ask, what does the act of masking text mean?
During what is called the ‘training’ phase of a language model, the act of masking text is used to define the problem that a neural network is being ‘trained’ to solve. That is the process we have already described. However, once the model is trained, this masking becomes more of a query. A query that asks “what could possibly fill in the masked parts?”. It’s no longer a question with a single answer (reconstruct the original sentence), but a question with many answers, where each answer is a plausible sentence. These replacements could be any part of a piece of text: individual words, phrases, sentences, paragraphs etc. The language model has seen millions to billions of sentences — more than a human could ever read. Every question we ask in terms of a masked representation corresponds to perhaps hundreds of plausible answers.
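To make the idea of ‘masking as a query’ a little more concrete, here is a minimal sketch. It is emphatically not a neural language model: it simply counts which words appear between a given left and right neighbour in a tiny made-up corpus. The corpus, the `train` and `fill_mask` helpers and the `<s>`/`</s>` boundary markers are all assumptions for illustration. What it does share with a real masked language model is the shape of the question and the fact that one query has several plausible answers.

```python
from collections import Counter, defaultdict

MASK = "[MASK]"

# Tiny made-up corpus standing in for the millions of lines a real model sees.
CORPUS = [
    "the man walked quickly",
    "the man walked quickly",
    "the woman walked quickly",
    "the robot walked slowly",
    "the guy walked home",
]


def train(corpus):
    """Count which words occur between each (left, right) pair of neighbours."""
    counts = defaultdict(Counter)
    for line in corpus:
        words = ["<s>"] + line.split() + ["</s>"]  # sentence boundary markers
        for left, middle, right in zip(words, words[1:], words[2:]):
            counts[(left, right)][middle] += 1
    return counts


def fill_mask(counts, sentence, top_k=5):
    """Return (word, probability) candidates for the single MASK in the sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    position = words.index(MASK)
    context = (words[position - 1], words[position + 1])
    candidates = counts.get(context, Counter())
    total = sum(candidates.values())
    return [(word, count / total) for word, count in candidates.most_common(top_k)]


model = train(CORPUS)
answers = fill_mask(model, "the [MASK] walked quickly")
for word, probability in answers:
    print(f"{word}: {probability:.2f}")
```

Here the single query comes back with four candidate words, with ‘man’ ranked first because it was seen most often in that context. A real masked language model looks at the whole sentence rather than just the two neighbouring words, and can fill in several masks at once.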
Alright, I can sense some language modellers twitching in their seats right now! Let me just talk to them for a second. I’ve brushed over a lot of details in writing this article so far. At this stage, I should point out that the sort of language model I’ve been talking about is a “masked language model”. If you have been paying attention so far, you can understand what the ‘masked’ part of that name refers to. There is another sort of language model called an “auto-regressive” language model. This latter type of model only generates the next word given previous words, just like the auto-complete feature on your phone. By generating one word at a time and basing what comes next on what has already been generated iteratively (auto-regressively), it can generate arbitrarily long pieces of text. Auto-regressive models and masked language models are conceptually solving the same problem, except that auto-regressive models only ever try to predict the word at the end of a piece of text and not any number of missing words within the text. Auto-regressive models are better at predicting the next word than masked language models, as they are specialised to perform this task alone. However, as they can only generate text from left to right, we feel that they are much less versatile in an interactive writing assistance context.
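The auto-regressive loop described above can be sketched as follows. Again, this is a toy under stated assumptions: the corpus is made up, the helpers `train_next_word` and `generate` are hypothetical, and the ‘model’ is a simple count of which word follows which, conditioned on only the single previous word. A real auto-regressive language model conditions on the whole preceding text.

```python
from collections import Counter, defaultdict

END = "</s>"  # end-of-sentence marker (an assumption of this sketch)

# Tiny made-up corpus.
CORPUS = [
    "we write songs together",
    "we write songs together",
    "we write poems",
]


def train_next_word(corpus):
    """Count which word follows each word in the corpus."""
    counts = defaultdict(Counter)
    for line in corpus:
        words = line.split() + [END]
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def generate(counts, prompt, max_new_words=10):
    """Extend the prompt one word at a time, always taking the most frequent next word."""
    words = prompt.split()
    for _ in range(max_new_words):
        options = counts.get(words[-1])
        if not options:  # the last word was never seen during training
            break
        next_word = options.most_common(1)[0][0]
        if next_word == END:
            break
        words.append(next_word)  # the new word becomes context for the next step
    return " ".join(words)


next_word_counts = train_next_word(CORPUS)
print(generate(next_word_counts, "we"))  # we write songs together
```

Note that text only ever grows at the right-hand end: there is no way to ask this loop to rewrite a word in the middle of what has already been written.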
Though there is a lot of additional complexity we could get into concerning language modelling, for the purposes of this article, we only really care about the fact that using a masked language model, we can regenerate arbitrary parts of a piece of text, including future and past tokens. It’s more than enough to make a useful and versatile writing tool.
Conclusion
In part I of our article on poiesthetic writing assistance, we explained a bit about the brains of our system: language models. In part II we will show the user interface of Poiesis Studio and how you interact with language models.
Our take-home messages are as follows:
– Language models are neural networks that have learnt something about the syntactic and semantic structure of language, as well as some knowledge about the world
– They can be used to produce novel and reasonably coherent text
– At the language team in Sony CSL, we are exploring masked language models in Poiesis Studio, where any word in a text can be regenerated
We are excited to make progress on improving Poiesis Studio, both technically and conceptually by better understanding and supporting poiesthetic writing experiences. Feel free to contact me at [email protected] if you want to get in touch. We are very happy to collaborate with writers, language researchers and techies!
Q&A
Eagle-eyed readers may have accumulated questions about various elements of this article while reading it. I’ll try to anticipate and answer them here.
Question: How do you do next word prediction if you didn’t train a next word prediction model?
This is conceptually similar to adding a mask to the end of a sentence and asking the model to predict what word that mask should be. In an “auto-regressive” model, all of its internal neural architecture is focused on learning what is necessary to be *understood* about the previous words, to predict the next word. In a masked language model, its neural architecture is trained to predict every masked part of a sentence, from every unmasked part. Though conceptually similar, these models have a lot of subtle and not so subtle differences in the way they are constructed that I will not go into.
Question: I thought language models were just used to measure probabilities of sentences. You didn’t talk about that in this article?
Strict definitions of language models typically include the requirement to be able to measure the probability of sentences. We haven’t talked about that technicality here; simply knowing that language models ‘know’ something about the structures of language and can generate reasonable sentences is enough. The models we talk about work by improving their predictions of words.
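For readers curious about what ‘measuring the probability of a sentence’ means, here is a rough sketch of the classic idea: multiply together the probability of each word given what came before it. The corpus and helper names are made up, and the model again looks only one word back, which is a strong simplification.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus.
CORPUS = [
    "this is some text",
    "this is some text",
    "this is an example",
]


def train(corpus):
    """Count which word follows each word, including sentence boundaries."""
    counts = defaultdict(Counter)
    for line in corpus:
        words = ["<s>"] + line.split() + ["</s>"]
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def sentence_probability(counts, sentence):
    """Multiply P(word | previous word) over the whole sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    probability = 1.0
    for previous, following in zip(words, words[1:]):
        options = counts.get(previous, Counter())
        total = sum(options.values())
        probability *= options[following] / total if total else 0.0
    return probability


model = train(CORPUS)
likely = sentence_probability(model, "this is some text")
unlikely = sentence_probability(model, "text some is this")
print(likely > unlikely)  # True
```

A sentence that follows patterns seen in the corpus gets a higher probability than the same words in a scrambled order, which is the sense in which such a model ‘knows’ something about the structure of language.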
Question: You keep talking about words but in other articles people talk about tokens. What’s going on?
I decided not to go into the notion of modelling sentences as ‘tokens’ rather than ‘words’. That would require explaining why there are issues with training models to have broad coverage, which adds a layer of additional complexity for the reader. The language model I demonstrate is a word-piece model and many of the tokens are actually just words.
Question: Why not next sequence prediction?
There is no fundamental reason not to include a next-sequence-prediction-only model. In fact, we may well do that in future plugins. However, in our experience, generating texts longer than a phrase or sentence changes the relationship between writers and the text they create. While some might find it compelling to see a large piece of text write itself, writers might beg to differ. The more steps that are autonomously generated, the more it feels like the writer is irrelevant in the generation of text. Making writers irrelevant is not what we are trying to do.
Question: Is this a Thesaurus?
Game on!
First serve:
So this is just an over-the-top thesaurus?
Not really. A thesaurus is used for finding semantically similar words. This works at the phrase level.
15 – love
Second serve:
So, this is some kind of phrase thesaurus?
No. Nice try. You can request replacements for arbitrary parts of the text and even ask to extend parts of the text.
30 – love
Third serve:
So is this an arbitrary pattern thesaurus?
No, but now I’m really starting to sweat. Getting back to the definition of a thesaurus, our masked language model doesn’t generate synonymous words by default. It will generate a somewhat plausible replacement for tokens in the text.
40 – love
Game point:
So it isn’t a thesaurus?
Well… ehehehe. It could potentially provide synonyms if the context suggested synonymous words. If we were to take the sentence: “The [MASK] walked quickly.” We might find ‘man’, ‘guy’ and ‘bloke’ as fairly probable replacements. However, we might also find ‘Bob’, ‘woman’ and ‘robot’. It’s clearly not just trying to find synonyms, but one would expect synonyms to be probable.
40 – 15
I’ll give you that point. Let’s continue this another time…
Question: Are language models really foundational models?
There is a current debate about whether language models should really be called ‘foundational’ models. I think this debate is perhaps based on a misunderstanding of what a foundational model means in deep learning. It simply means a model that can be built upon to improve performance on many different tasks. It’s a foundation in the same sense of the word as the foundations of a building. It’s a useful starting point, and though we take language models for granted now, it’s somewhat surprising that the simple procedures for training these models (and access to A LOT of data) really allow one model to perform well in many different contexts.
It’s very important to point out that, though language models are exciting technological accomplishments, they contain neither all of the knowledge about the structure of language nor all of the knowledge about the world that would be needed to serve as a good foundation for any language-related problem. It is important not to naively assume that they already do, or that the current types of language models will ever serve as such a foundation without significant technological advances in AI and significant advances in linguistics.