On Tuesday, Amazon debuted a new generative AI model, Nova Sonic, capable of natively processing voice and generating natural-sounding speech. Amazon claims that Sonic’s performance is competitive with frontier voice models from OpenAI and Google on benchmarks measuring speed, speech recognition, and conversational quality.
Nova Sonic is Amazon’s answer to newer AI voice models such as the model powering ChatGPT’s Voice Mode, which feel more natural to speak with than the more rigid models from Amazon Alexa’s early days. Recent technological breakthroughs have made legacy models and the digital assistants they underpin, such as Alexa and Apple’s Siri, seem incredibly stilted by comparison.
Nova Sonic is available via Bedrock, Amazon’s developer platform for building enterprise AI applications, through a new bi-directional streaming API. In a press release, Amazon called Nova Sonic “the most cost-efficient” AI voice model on the market, and around 80% cheaper than OpenAI’s GPT-4o.
Components of Nova Sonic are already powering Alexa+, Amazon’s upgraded digital voice assistant, according to Amazon SVP and Head Scientist of AGI Rohit Prasad.
In an interview, Prasad told TechCrunch that Nova Sonic builds on Amazon’s expertise in “large orchestration systems,” the technical scaffolding that makes up Alexa. Compared to rival AI voice models, Nova Sonic excels at routing user requests to different APIs, said Prasad. This capability helps Nova Sonic “know” when it needs to fetch real-time information from the internet, parse a proprietary data source, or take action in an external application, and then use the right tool to do it.
During a two-way dialogue, Nova Sonic waits to speak “at the appropriate time,” taking into account a speaker’s pauses and interruptions, says Amazon. It also generates a text transcript of the user’s speech, which developers can use for various applications.
Nova Sonic is less prone to speech recognition errors than other AI voice models, according to Prasad, meaning the model is relatively good at understanding a user’s intent even if they mumble, misspeak, or are in a noisy setting. On a benchmark measuring speech recognition across languages and dialects, Multilingual LibriSpeech, Amazon says Nova Sonic achieved a word error rate (WER) of just 4.2% when averaged across English, French, Italian, German, and Spanish. That means roughly four out of every 100 words from the model differed from a human transcription in those languages.
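For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, deletions, and insertions) between a human reference transcript and the model's output, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. The function name `word_error_rate` and the sample sentences are hypothetical, the lowercase-and-split normalization is an assumption, and nothing here reflects how Amazon actually scored Nova Sonic.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative only: normalization here is just lowercasing and
    whitespace splitting, which is an assumption of this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of five reference words.
print(word_error_rate("turn on the kitchen lights",
                      "turn on the chicken lights"))  # 0.2
```

By the same arithmetic, a WER of 4.2% corresponds to about 4.2 word-level errors for every 100 words in the reference transcript.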
On another benchmark measuring loud interactions with multiple participants, Augmented Multi-Party Interaction, Amazon says Nova Sonic was 46.7% more accurate in terms of WER than OpenAI’s GPT-4o-transcribe model. Nova Sonic also has industry-leading speed, with an average perceived latency of 1.09 seconds, according to Amazon. That makes it faster than the GPT-4o model powering OpenAI’s Realtime API, which responds in 1.18 seconds, per benchmarking by Artificial Analysis.
Prasad says Nova Sonic is part of Amazon’s broader strategy to build AGI (artificial general intelligence), which the company defines as “AI systems that can do anything a human can do on a computer.” Moving forward, Prasad says Amazon plans to release more AI models that can understand different modalities, including image, video, and voice, as well as “other sensory data that are relevant if you bring things into the physical world.”
Amazon’s AGI division, which Prasad oversees, seems to be playing a larger role in the company’s product strategy these days. Just last week, Amazon launched a preview of Nova Act, a browser-using AI model that appears to be powering elements of Alexa+ and Amazon’s Buy for Me feature. Starting with Nova Sonic, Prasad says the company wants to offer more of its internal AI models for developers to build with.