In parts II and III of this series on poiesthetic writing assistance, we explained what poiesthetic writing assistance is and showcased Poiesis Studio, our prototype for providing it. We also shared our experiences of working with users as they tried out our system, gaining insights from artists and attendees at the Maker Faire. In this part, we will share some of the ways we have learned to conceptualise poiesthetic writing assistance based on those insights. We will then point out what is fundamentally missing from text generation tools and offer our vision of the future.
When we think of modern text editors, we have a fairly good conceptualisation of the important features that should be present and how they should be logically laid out on the page. However, there is no real design guide for developing interactive text generation tools. So, we tried to identify our own design philosophy and implement it in Poiesis Studio.
The first design goal is the least tangible, but perhaps the most important. If we allow an AI writing assistant to ‘write’ by itself, how much text would it write before the writer feels irrelevant? We want to aid writers in their creative process and not replace them. In building new language models, there has been a lot of focus on writing long texts with models such as GPT-2 and GPT-3. While such models are very impressive, perhaps generating LONG pieces of text automatically isn’t necessary for interactive text generation. In fact, perhaps it isn’t really desirable?
Going against the grain, we decided to use a masked language model as the first plugin for Poiesis Studio. This can be used to regenerate arbitrary parts of a piece of text, including previous and future text. Typically, when regenerating small extensions to existing text, the suggestions from the AI correspond to minor “what-if”-like interactions. This might be introducing a new individual, elaborating on an existing person or place, or a new verb indicating some action taking place. These are often fun and interesting extensions that correspond to a unit of thought for consideration by a writer. Moreover, as the user chooses the specific number of words they want regenerated, they naturally define ‘queries’ that express their own cognitive needs to the system. The system won’t run away and generate large amounts of text, and it will be directed towards the specific writing problems the writer wants to solve.
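To make the masking interaction concrete, here is a minimal sketch of span regeneration in Python. The function names (`mask_span`, `regenerate`) and the stubbed fill function are illustrative assumptions, not Poiesis Studio’s actual code; in a real system the stub would be replaced by a masked language model, such as one exposed through a fill-mask interface.

```python
# A minimal sketch of span regeneration with a masked language model.
# The model call is stubbed out; in practice a real fill-mask model
# would propose the words for each mask token.

MASK = "<mask>"

def mask_span(words, start, n):
    """The writer's 'query': replace n words from `start` with mask tokens."""
    return words[:start] + [MASK] * n + words[start + n:]

def regenerate(text, start, n, fill_fn):
    """Mask a span chosen by the writer and let `fill_fn` propose new words."""
    masked = mask_span(text.split(), start, n)
    fills = iter(fill_fn(masked))  # one proposed word per mask token
    return " ".join(next(fills) if w == MASK else w for w in masked)

# Stub standing in for the language model:
proposals = iter(["moon"])
stub_fill = lambda masked: [next(proposals) for w in masked if w == MASK]

print(regenerate("the sun rose over the hills", 1, 1, stub_fill))
# -> the moon rose over the hills
```

The key point the sketch captures is that the writer, not the model, decides how many words are regenerated and where, so each interaction stays a small, directed “what-if”.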
Though writing is textual, what we want to commit to the page is as complex, multi-modal and nuanced as our mind’s eye can muster. If we really want to co-create with an AI writing assistant, it needs to know this intimate part of our creative mind. We use the analogy of painting to capture this requirement as a design philosophy. The idea is that a user annotates what is in their mind on the page, like applying brushstrokes to a canvas while painting. Of course, there isn’t a brush and the page isn’t a canvas, so this analogy can only be taken so far. However, the vision of annotation as a visual, simple and enjoyable activity is a fundamental design consideration.
There are many parallels between writing and painting. In both cases, compositions are created incrementally, where different parts of a composition are realised at different levels of detail. The broad strokes in a painting, which might capture the tone and colour profile of a portion of the canvas, correspond to the setting, tone, vague story beats and key lines of dialogue in a textual composition. The difference is that, in text, we often don’t layer these ideas on the page in the same way that we layer paints on a canvas. However, when writing with an AI assistant, a similar layering of concepts would emerge on the page, where ideas are annotated at different levels of granularity: from the word level, through to the phrase, sentence and paragraph level and beyond. The implicit conceptual structures we form internally would need to be expressed on the page.
Our support for this design consideration is still in its infancy. We have implemented it by representing the process of masking, and the specification of restrictions, as visual elements anchored to clear regions of text. This is perhaps not enough to fully demonstrate the concept of capturing the varying levels and types of thought involved in incrementally elucidating an idea, but it is a start.
We are not really sure what it is that people do in their minds when they create. Where do the ideas come from that chart a possible creative trajectory? In which modality do these ideas present themselves to the writer? As text, images, sounds, smells, vague feelings? All of the above together? It’s certainly different to what we do in Poiesis Studio for now, but getting at this creative mental representation and complementing it in our interface is one of our fundamental goals.
As described previously, specifying masks is similar to defining a query, and the text that replaces these masks is like an answer. These queries are different from typical search queries you might make on Google. When searching on Google, the assumption is that you already know what you are looking for, and you want the most relevant matches to that search. In our case, the user knows that they want something, and they can formulate that by interacting with our system. The link between what a user asks for and what they want to find isn’t always clear. Supporting this journey of discovery is important. It’s hard to say exactly how this should be supported in design, but it is an important concept to keep in mind.
For now, we simply keep track of the history of generations and indicate which parts of each generation were masked, and whether the user restricted them to particular word classes. The user can see where they have been and what they asked for.
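One minimal way such a history could be represented is sketched below. The `Query` and `History` names are invented for illustration and are not Poiesis Studio’s actual internals.

```python
# A sketch of recording a generation history, so the user can see
# "where they have been and what they asked for". All names here are
# illustrative, not the real system's data model.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Query:
    masked_positions: tuple  # word indices the user masked
    restrictions: dict       # position -> required word class, e.g. {1: "ADJ"}
    result: str              # the text the model generated in response

@dataclass
class History:
    entries: list = field(default_factory=list)

    def record(self, positions, restrictions, result):
        self.entries.append(Query(tuple(positions), dict(restrictions), result))

    def replay(self):
        """Yield a human-readable trail of past queries and answers."""
        for i, q in enumerate(self.entries, 1):
            yield f"{i}. masked {q.masked_positions}, restrictions {q.restrictions} -> {q.result!r}"

h = History()
h.record([1, 2], {1: "ADJ"}, "the decrepit old man")
for line in h.replay():
    print(line)
```

Keeping each entry immutable means the trail of queries and answers can be replayed or displayed without risk of later edits rewriting the record.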
We started this article with the notion of masking as a sort of query and the generations as answers to this query. Every time a user changes the masking or the context, they are directing their search in some way. Every time the user clicks on the generate button, they are getting an answer. Their search process is refined as they receive feedback from the system in the form of generated text. Of course, the explicit search they are performing with Poiesis Studio also runs in parallel with their own internal exploration of ideas. An interesting idea sparks new ideas that they might want to incorporate back into their composition.
It’s very interesting that this form of exploration is laid out on the page rather than remaining a purely mental process. Perhaps externalising this process may help exploration in the same way that bouncing ideas off another writer can be helpful at times. The way in which a writer interrogates the AI model is also perhaps different to the way in which they would traverse possible ideas in their own mind. Practically speaking, a user can define a space of queries by adding and removing masks, and look at the corresponding sentences that are generated as a kind of answer.
It’s unclear what our internal creative processes are and whether these should be mimicked in AI poiesis tools. As the step of poiesis can be performed by an AI tool, the writer doesn’t really have to do this, and they can jump straight to judging whether what has been created is aesthetically beautiful. We’ve focused on text as a point of inspiration, but writing is really about ideas and ideas aren’t fundamentally textual. One can imagine various ways of inspiring writers.
The final element of design comes from the recognition that most people are already experts in writing. A lot of the time, we know what we want to write and poiesis isn’t a problem, so imposing a lot of bloated interface features just clutters the experience. Interacting with the AI writing assistant should be like entering a different mode of thinking, in which different cognitive operations become prominent. When you enter Poiesis Studio, you take a timeout to enter a creative interaction with our AI writing assistant. We achieved this in our work by having a ‘poiesis’ mode which a user engages and disengages as they need. Importantly, the generated text is no different from user-written text: when exiting poiesis mode, it sits on the page like all other content.
There is, I believe, a subtle but fundamental shift in our writing process that takes place when using Poiesis Studio. It all has to do with the internal creative process going on in the mind of a writer when they interact with a text generation tool; in our case, a masked language model.
The most basic use case we considered consisted of forming an intention about what should be written, writing it, and simply asking Poiesis Studio to replace a part of that intention. This could correspond to, for example, trying to find nicer adjectives, or a more interesting noun to avoid repetition or to use more complex and varied vocabulary; for example, starting with “the man” and generating “the decrepit old man” as a possible extension.
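The “the man” to “the decrepit old man” move can be sketched as inserting mask tokens between existing words, rather than replacing them; the names below are illustrative assumptions, not the system’s actual code.

```python
# Sketch: extending a phrase by inserting mask tokens between words.
# A masked language model would then fill each slot; here we only show
# the query construction, which is what the writer manipulates.

MASK = "<mask>"

def insert_masks(words, at, n):
    """Insert n mask tokens before word index `at`, extending the phrase."""
    return words[:at] + [MASK] * n + words[at:]

query = insert_masks("the man".split(), 1, 2)
print(" ".join(query))
# -> the <mask> <mask> man
```

Filling both slots with a masked language model could then yield an extension such as “the decrepit old man”, with the writer free to add or remove slots until something sparks.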
An alternative to this is to consider the masked representation, and operations on it, as defining a creative space of possibilities. Each operation that is performed moves the user through this space. The user can perform these operations regardless of whether they have formulated any particular intention about what their writing objective is. In other words, the tool itself defines mental operations which the user is reasoning about in some way, rather than the tool simply supporting conventional writing needs. This was the case when asking for ‘some words’ to start a song in part III. There was no particular concrete idea about what should be generated.
The difference between the two cases is that, in the first, the writing assistance provides support for when conventional writing falters in some way; the writer uses the tool to express a conventional writing need. In the second case, the writer fundamentally changes the way in which they write as a consequence of the writing process that the tool enables. In the latter case, the most fundamental step in writing, formulating something that needs to be expressed, doesn’t even need to be there.
Interestingly, both of these cases are handled identically by the system (it’s all just a masked representation), and so it may not seem like a big leap, but I think a fundamental shift happens in the mind of a user when they start to see writing as playing with a text generation tool. This is exactly the kind of insight that makes building an interface to play with a model invaluable.
So far, we have only looked at text generation using masked language models. There are still various problems to iron out before the masked language model plugin is truly robust in an interactive writing context, but it is already good enough to be useful. So where next?
If we try to glimpse into the not-so-distant future of writing with an AI writing assistant, is there any undeniable component of such a system that we would need that we don’t currently have? I think the answer is, yes. While the masked language model allows you to generate plausible sentences and can be directed towards generating text for particular writing contexts, what it doesn’t know about is the intention of the writer. Imagine how complex and rich the internal world is that we are trying to put onto the page: the characters, events, places, emotions, relationships, sensations and many more aspects. I’m calling all of this rich internal world the ‘intentions’ of the writer, in so far as they are the internal state that the author intends to express in the text. It is what is intended to be conveyed to the reader.
The question then is, what is the shape of your intention? How can we provide it in a practical way to writing assistants such as Poiesis Studio, so that you can work together? Understanding this is an open question, combining literary studies, linguistics and artificial intelligence. From a linguistics perspective, we need to understand the fundamental abstract building blocks of language that would support the expression of intentions. From literary studies, we need to understand the building blocks of the written form as an art or craft: writing style, composing stories, world building, writing good dialogue and so on. Finally, in artificial intelligence, we need to understand how these insights, which form the fundamental building blocks of what we might want to express in text, can be used to engineer practical systems for use by writers.
I imagine that this will be a long process of to and fro between those identifying and engineering intentions and those using them in an intention rendering system. Users will discover deficiencies in whatever intentions the system provides and realises, and will identify the need for additional intentions. These small pieces of the puzzle will come together through the interaction between intention engineers and writers to effectively realise practical writing goals. Over time, what works will stick and what doesn’t will fade away.
Our writing process will go through radical changes in the coming years. In many ways, though, writing is late to the game. The ability to automatically create and transform content has been available in music production for decades: MIDI sequencers, drum machines, multi-tracking, audio plugins (such as reverb and delay) and complex ways of arranging these in pipelines. More recently, AI researchers in the music team at Sony CSL Paris have also created tools to generate new hybrid sounds and to aid in song composition. This is all realised in software packages called Digital Audio Workstations.
Likewise, what we need in the future is a workstation for text: a way to engineer text as the output of a set of content creation and transformation processes, orchestrated by writers. I refer to this as an ‘Intention Rendering Engine’ for now. The idea is that writing becomes more about specifying ‘what’ needs to be realised in textual form (the writer’s intention), using various multimodal inputs and direction from writers. This is then ‘rendered’ into a final piece of text.
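Purely as a speculative illustration of the idea, such an engine might compose text transformations the way a Digital Audio Workstation chains audio effects. Everything below (the stage names, the `render` function) is invented for illustration and does not describe any existing system.

```python
# A speculative sketch of an 'Intention Rendering Engine': the writer's
# intention is threaded through a chain of transformations, much like an
# audio signal through a DAW effect chain. All names are invented.

def set_tone(tone):
    """A stage that tags the text with an intended tone."""
    return lambda text: f"[{tone}] {text}"

def expand_beat(detail):
    """A stage that elaborates the current story beat."""
    return lambda text: f"{text} {detail}"

def render(intention, stages):
    """Thread an initial intention through a chain of transformations."""
    text = intention
    for stage in stages:
        text = stage(text)
    return text

draft = render(
    "A stranger arrives at the harbour.",
    [set_tone("melancholy"), expand_beat("Gulls wheel over the grey water.")],
)
print(draft)
# -> [melancholy] A stranger arrives at the harbour. Gulls wheel over the grey water.
```

The point of the analogy is the orchestration: the writer arranges and re-orders the stages, and the engine takes care of realising the result as text.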
Text is a much more complex ‘signal’ than audio is, and so it has taken much longer to have tools capable of manipulating text meaningfully to the extent that we can start to imagine an Intention Rendering Engine of the sort we envision. But it is coming.
Phew, you made it to the end! Feel free to put your brain back in after having your mind blown repeatedly. I think by now you have a good idea about poiesthetic writing assistance and what we are trying to achieve in the language team at Sony CSL.
Our take-home messages are as follows:
We are excited to make progress on improving Poiesis Studio, both technically and conceptually by better understanding and supporting poiesthetic writing experiences. Feel free to contact me at firstname.lastname@example.org if you want to get in touch. We are very happy to collaborate with writers, language researchers and techies!