You cannot read a news story today that does not somehow mention the implications of AI. Everywhere we turn, it’s part of the conversation. Some of that conversation is exciting and some of it is scary. There’s certainly a lot of hype and there’s definitely a lot of fear. I’ve recently gotten a fair bit of feedback on how we’ve used AI in the apps we’ve built. Some of that feedback has been good. Some…well, not so much. We’ve now built two solid, well-designed iOS applications that are in the market and getting traction. Check them out. The first is Talk to Me, Goose! and the second is Fable’s Adventures. What follows is a little overview of what we’re doing, what I’ve heard, and what I think.
The Use Case
Talk to Me, Goose! is an app designed for people living with ALS, or anyone living with a speech disability who requires support to communicate. By connecting to an AI-enabled clone of a user’s voice, it allows a person who can no longer speak with their own voice to use a digital replica, giving them their voice back. Talk to Me, Goose! also incorporates an AI agent we call Merlin. Merlin performs three functions: Text Prediction, Text Generation, and Story Building. Fable’s Adventures is the Story Building component of Talk to Me, Goose! lifted out into its own app and targeted to a broader audience.
Text Prediction
Merlin predicts logical blocks of text based on a user’s input to accelerate the pace at which the user can participate in a conversation. If someone can type at only 10 words per minute because of motor deficits, their ability to participate in a conversation is severely constrained. By the time that person has finished typing a basic sentence, the conversation has moved on. Merlin is meant to accelerate that effort, generating in seconds blocks of text that would take minutes to type. It’s a perfect use case for generative AI.
Text Generation
In addition to predicting text in real time, Merlin also allows the user, with a few button clicks, to generate specific text blocks in a chosen tone of voice based on simple inputs. “I’m hungry” can quickly become “Now that it’s lunch time, have you thought about what we might eat?” What was initially a demand becomes a pleasant-sounding, conversational question. Again, by typing less, the user can say more. It takes seconds and a limited amount of input to generate a lot of output for conversion to speech, and the output can be tailored to the user’s mood in the moment, from formal and professional to sassy or sarcastic. Merlin even allows users to tell jokes.
Story Building
Telling stories is one of the most genuine ways we build memories with the children in our lives. Sitting with kids in our laps and reading together builds bonds between adults and children in ways that last a lifetime. ALS takes away so much already, and we didn’t want to let it take this away too, so we gave Merlin the ability to build short stories for kids of all ages that can be read together in the user’s voice clone. Story Builder lets us create fantastical and magical stories with just a few button clicks, and each story can be personalized with character names and traits, making those stories even more fun and unique. In less than five minutes, we can create a story and start reading it together in our own voice. We think this is unique and critically important for our community to be able to do, and we think that without generative AI it would be impossible.
Is it Hype, or is it Reality?
So, let’s break down this use case and unpack it: is it hype, or is it reality?
First, we have ElevenLabs, the technology provider we’ve chosen to partner with. From a limited set of audio recordings, they provide a high-quality, richly expressive digital representation of a user’s voice that can render speech in 31 languages in real time. We send them a text string; we get a streaming audio file in response. That audio file includes the user’s voice clone speaking those words with amazing intonation, cadence, rhythm, and sometimes even improvisational “ums and ahs.” Soon, we’ll even be able to get whispers, sighs, gasps, and laughter. And, for someone who has lost the ability to speak, that’s the difference between being locked in or not, between being able to communicate or not, between using a robotic-sounding speech-generation device and using your own voice. That’s not hype, that’s reality.
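As a rough sketch of the flow just described (text in, streamed audio of the voice clone out), here is roughly what assembling such a request looks like. The endpoint path, header name, payload fields, and model id below are illustrative assumptions modeled on typical hosted text-to-speech APIs, not a guaranteed match for ElevenLabs’ current API:

```python
# Hedged sketch: build (but do not send) a text-to-speech streaming request.
# All field names here are assumptions for illustration, not ElevenLabs' exact API.

def build_tts_request(text: str, voice_id: str, api_key: str) -> dict:
    """Return the pieces of an HTTP request that turns text into streamed audio."""
    return {
        "method": "POST",
        "url": f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream",
        "headers": {
            "xi-api-key": api_key,  # per-user credential (assumed header name)
            "Content-Type": "application/json",
        },
        "json": {
            "text": text,  # the string the user typed or accepted
            "model_id": "eleven_multilingual_v2",  # assumed multilingual model id
        },
    }

request = build_tts_request("Hello, how are you?", voice_id="demo-voice", api_key="demo-key")
print(request["url"])
```

In the real app, the response body would be consumed as a stream so playback can begin before the full audio file has arrived, which is what keeps the conversation feeling live.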
Second, we have Claude Haiku. In milliseconds, based on the initial words you type (in pretty much any language) and the context in which you are operating, the model responds with a context-specific block of text suggesting the rest of the paragraph. This isn’t next-word prediction. It’s “conversation prediction.” If you type at 10 words per minute, that is approximately 6 seconds per word. Playing that out: I type “Hello” in about 6 seconds, and in milliseconds Merlin expands it to “Hello, how are you doing today? I hope that you’re having a good day so far.” If I accept it, that’s 16 words. Something that would take 96 seconds to type took just over 6: a 16x improvement in speed. That’s the difference between me participating in a conversation and the other person disengaging. That’s the difference between loneliness and human connection. That’s not hype, that’s reality.
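The arithmetic above checks out, and it can be written down in a few lines using the numbers from the paragraph (10 words per minute, a one-word input, a 16-word accepted prediction):

```python
# Verify the speed-up math from the paragraph above.
typing_rate_wpm = 10                     # motor-impaired typing speed
seconds_per_word = 60 / typing_rate_wpm  # 6 seconds per word

typed_words = 1       # the user types just "Hello"
accepted_words = 16   # words in the accepted prediction

time_without_prediction = accepted_words * seconds_per_word  # 96 seconds
time_with_prediction = typed_words * seconds_per_word        # 6 seconds, plus milliseconds of inference

speedup = time_without_prediction / time_with_prediction
print(speedup)  # 16.0
```

The model’s own latency is in the milliseconds, so it barely moves the denominator; nearly all of the gain comes from the words the user no longer has to type.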
Third, we have Claude Sonnet. In a matter of seconds, given the choice of a topic area, a tone of voice, and a little bit of input, I can generate a set of text suggestions on a range of topics that can, in the words of one user, “Make me a smooth talker.” No longer is a user limited to utilitarian communication. “Need a blanket” can become “When you have a moment, could you get me a blanket, please?” Not only does this sound better to the caregiver who is responsible for helping with a user’s every need, but it is also far less fatiguing for the user. I have three hypotheses that we want to test empirically: (1) that this can lead to lower caregiver stress and turnover; (2) that this can lead to lower user fatigue and greater adoption; and (3) that this leads to higher quality of life. If we can be warm and friendly, occasionally humorous, sometimes sassy, I think it gives us our humanity back. That’s not hype, that’s reality.
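A minimal sketch of how a tone-shifting request like this might be framed: the message structure below follows the general shape of Anthropic’s Messages API, but the system prompt wording is mine and the model id is a placeholder, not the app’s actual prompts or configuration:

```python
# Hedged sketch: turn a terse utterance plus a chosen tone into a model request.
# The prompt text and model id are illustrative assumptions, not the app's real ones.

def build_tone_request(utterance: str, tone: str) -> dict:
    """Build a request payload asking the model to rephrase an utterance in a tone."""
    system = (
        f"Rewrite the user's short message as one or two complete sentences "
        f"in a {tone} tone, suitable for text-to-speech. Keep the meaning intact."
    )
    return {
        "model": "claude-sonnet-placeholder",  # placeholder model id
        "max_tokens": 200,
        "system": system,
        "messages": [{"role": "user", "content": utterance}],
    }

req = build_tone_request("Need a blanket", tone="warm and polite")
print(req["system"])
```

Because the tone is just a parameter, the same pipeline can swap “warm and polite” for “sassy” or “formal and professional” without changing anything else, which is how a single feature covers the whole range of moods described above.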
Finally, we have Claude Opus, with a little help from Google’s Gemini. In just a few minutes, we can craft a story with interesting characters, rich settings, and engaging plot twists. We even get an illustration to boot. Imagine sitting with a child, answering a few questions, clicking a few buttons, and engaging in the process of creating a story which, in a matter of moments, appears on the screen with a fun illustration that can then be read in a digital replica of your own voice. This is making memories for someone who has limited time to do so. We think this is priceless. That’s not hype, that’s reality.
But Now: The Fear
We’ve received a lot of feedback on the work we’ve done, in particular on the story builder, as we’ve launched Fable’s Adventures, a separate app that takes the story-building component of Talk to Me, Goose! to a broader market. Apparently, some folks find abhorrent the idea that we’re using AI to create and illustrate stories in real time. Many have pushed back that it simply should not be done, that something like this should not be created in favor of…what?
The reality is that the ability of a large language model and an illustration engine to create a story and a relevant picture in a few moments, from a series of inputs driven by carefully crafted prompting in the background, is technically remarkable. I can understand and engage in the debate about the ethics of AI, but this particular use case does not strike me as overly ethically challenging. The primary pushback appears to come on three fronts:
- The use of existing works by artists and writers in training the large language models that underpin this work,
- The purported infringement on the creative opportunity for artists or creators, and
- The potential environmental impact of the energy use required for the inference necessary to generate the output.
So, let’s unpack these arguments.
But wasn’t this trained on existing work?
Yes. But so were we. A number of years ago, I had the opportunity to speak to a group of innovators at a conference at the Art Institute of Chicago, and I used the occasion to connect the dots between the work of Georges Seurat and the cave paintings at Lascaux. I did this to show that as innovators, we are all standing on the shoulders of those who came before. Nothing is, essentially, new. We are all reshaping the existing set of techniques, capabilities, and materials into new works to serve new purposes, all the time. Our job as innovators is not to make something entirely new; it is to assemble the existing in novel ways so as to make something new and establish a new baseline for others to build from.
The same is true of artists. Each was building upon existing techniques and capabilities, the ground that had previously been broken.
Seurat’s A Sunday on La Grande Jatte, in its pointillist style, did not just appear out of nowhere. It emerged from a long history of innovation, of artists assembling existing techniques and capabilities in novel ways to make something new. It was built upon the breakthroughs of all the artists who came before, tracing back, ultimately, to those early people 17,000–22,000 years ago dabbing or blowing pigment onto the walls of caves. I would argue we are just in the next generation of that evolution, and it may be painful for some because it is happening so quickly. But that doesn’t mean we shouldn’t do it.
But, what about the artists and other creators?
I’ve had a fair bit of feedback that using AI-generated content somehow puts at risk the livelihoods of working artists and creators. I can certainly appreciate the concern; this topic is in the news a lot lately. If you read the news, AI is coming for everyone’s job. As someone who holds a Master of Fine Arts degree, it’s also a topic about which I’m fairly sensitive in this instance.
So, first, here’s the bottom line. Well, there’s no bottom line here, actually. Mundell Designs, LLC, the company behind Talk to Me, Goose! and Fable’s Adventures, does not have a profit motive, and there are no employees. The work we’ve created is being done purely to get Talk to Me, Goose! into the hands of users who need it. We seek to disrupt the market for Augmentative and Alternative Communication (AAC) solutions, not to make a gazillion dollars, and we’ve got some exciting announcements ahead. The revenue we seek to generate is committed to benefiting people living with ALS, and while we hope to have the means in the near future to engage a larger marketing team, along with artists and creatives, to help with the externally facing imagery, we are indeed relying on the assistance of AI. Trust me when I say it’s not putting anyone out of a job, because without AI in this case, there would be no imagery.
As for the imagery and content in the app, this would simply not be possible without artificial intelligence. We create, in an instant, text-based content and illustrations that can be served to hundreds, if not thousands, of users simultaneously. One thing we’d love to do in the future is engage a cadre of artists to fine-tune the models we’re using, honing the style of the illustrations so that Fable’s Adventures, in particular, always has a consistent look and feel. However, I would refer you back to the bottom line above: that is simply not possible right now.
But, what about the environment?
This is a fair criticism. In thinking through the balance of equities here, I weigh the benefits against the costs, and I would encourage you to do the same. I’ve laid out what I believe to be the benefits above. The costliest component of this is the generation of speech. Generating the text and the illustration is not free, but the models have already been built and trained; the largest environmental impact has already been made and is a sunk cost.
Is it beneficial to give someone their voice back? What would you trade off to have your voice back? What would you trade off to reengage in conversation? What would you trade off to have time with your children or grandchildren that you otherwise may not have? Does that benefit outweigh the marginal cost of the energy required to perform the inference that generates the predictions? I think it does. The alternative is a world of silence. The alternative is a world of loneliness. The alternative is a future that, frankly, I don’t want to consider, and the trade-off seems wholly worth it. I will trade a little higher energy use for this use case in a heartbeat; based on what I’ve seen in the market, there are far more wasteful use cases for AI out there that I would trade off differently.
Where does that leave us?
We live in an interesting time when technological innovation is driving exponential change, and we’re right to embrace it. Eschewing it, in my opinion, would ask people living with ALS to continue to settle for less than what is possible, and I refuse to do that. We have enough barriers already. Let’s not accept additional artificial ones that we don’t have to.