M.G. Siegler

OpenAI Changes the Vocal Computing Game!

No sarcasm, just enthusiasm for GPT-4o

There were several points during OpenAI's demonstration of their new 'GPT-4o' model yesterday where I had to laugh. Not necessarily a "that's funny" laugh but more a "that's amazing" laugh. A profound laugh. A laugh to myself. A "this is it" laugh.

I've been on the voice train for a while. Thirteen years ago, I broke the news that Apple would be integrating Siri directly into the iPhone,1 which it did that fall. And it has been a part of iOS ever since. But while Siri may have been first, it was Amazon a few years later that was able to scale vocal computing to new heights, thanks to their strategy of putting cheap Alexa devices everywhere. Google then entered the fray with their own voice assistant and devices. Microsoft too. Samsung. And so on. But while there was intense consumer curiosity about such devices and functionality, once the novelty boiled off, we were left with a rather simplistic system for playing music, asking about the weather, setting timers, and maybe a game or two.

The gadgets have been here, but the technology powering them was not ready. With yesterday's OpenAI announcement, it finally feels ready.

I'm someone who has interacted with computers via voice since I was a kid, making my dad buy some early dictation software. But with smartphones, I started to listen to nearly everything I could as soon as I could. This used to require accessibility hacks,2 but these days it's the norm across a number of apps and services. Over that span, the voice technology for both understanding spoken words and speaking them back has gotten a lot better. But it has always been missing some truly human elements.

That's what OpenAI showed off yesterday. It sounds nuanced because it is nuanced. But the difference truly is in the details. Voice inflection. The ability to do sarcasm. Humor. The ability to interrupt. And perhaps most important: doing all of this fast. At the speed of thought. All of these are things we take for granted because we learned them as children through mimicry and repetition. Computers, up until now, were giving it to us straight, as it were. That's good for certain types of information, but it was never going to take us to a truly vocal computing paradigm. It was far too robotic for most people.

Granted, some of what we heard in these demos was too over-the-top. But those are tweaks;3 the system is working. In a number of instances, the GPT-4o voice sounded more authentic and human-like than some of the presenters!4

Said another way, while this is undoubtedly a series of large breakthroughs in technology, it's just as big of a breakthrough in presentation. And this matters because 99% of the world is not made up of technologists. They don't care how impressive and complicated the technology powering all this stuff may be, they just care that Siri can't understand what they're actually looking for and keeps telling them that in the most robotic, cold way possible. Which is perhaps even more infuriating.

One side of that equation: the actual "smarts" of these assistants have been getting better by leaps and bounds over the past many months. The rise of LLMs has made the corpus of data that Siri, Alexa, and the like were drawing from feel like my daughter's bookshelf compared to the entirety of the world wide web. But again, that doesn't matter without an interface to match. And ChatGPT gave us that for the first time 18 months ago. But at the end of the day, it's still just a chatbot. Something you interact with via a textbox. That's fine but it's not the end state of this.

We need to be able to communicate with computers just like we do with people. And that means voice.5

To be clear, I'm not saying that voice has to be the only way you interact with a computer. It may not even be the primary way much of the time – it really depends on what you're doing. But that's also the key: voice, truly reliable voice, needs to be a part of the computing interaction paradigm. It's going to be a key part of the concert of computing, across many services and many devices in many places.

For what's next, it also means visuals, something else OpenAI dropped into the announcement yesterday – putting the 'o', "omni", in GPT-4o – which is undoubtedly worthy of its own deep dive. I mean, with the new Mac app, it can see your screen. And I suspect using a phone's camera will be a key part of whatever Apple is going to announce AI-wise with the iPhone as well. Perhaps with OpenAI in tow...

Everyone yesterday was focused on the film Her – from Sam Altman on down to myself. It's a natural way to frame all of this – without trying to reframe it as some sort of dystopian nonsense. But it's actually a quote from another movie that initially popped into my mind upon seeing the GPT-4o demos. If I say the name "Dr. Ian Malcolm" you undoubtedly think I'm going to say his most famous quote:

"Your scientists were so preoccupied with whether they could, they didn't stop to think if they should."

A great quote from Jurassic Park, no doubt. But perhaps the opposite of the one I'm looking for here. So instead, let's go with:

"You did it. You crazy son of a bitch, you did it."

I'm not sure all of this is the equivalent of seeing a dinosaur brought back to life, but it certainly just put the vocal computing platforms of yesteryear on notice that an asteroid was inbound...


1 They had acquired Siri the year prior, and allowed it to keep running as a stand-alone app. But the key was baking it into the OS, which again, didn't happen until the fall of 2011.

2 You will, of course, note the image in this post, and the tweet embedded...

3 "Bring it on down to 75, will ya?"

4 This is less a joke about tech folks sounding robotic and more about the fact that humans get nervous presenting on stage and trying to remember lines, so that makes them sound more... well, how AI voices used to sound.

5 A vector which also may open a lane for new devices, of course.