Sunday, June 15, 2025

OpenAI upgrades its transcription and voice-generating AI models

OpenAI is bringing new transcription and voice-generating AI models to its API that the company claims improve upon its previous releases.

For OpenAI, the models fit into its broader “agentic” vision: building automated systems that can independently accomplish tasks on behalf of users. The definition of “agent” might be in dispute, but OpenAI Head of Product Olivier Godement described one interpretation as a chatbot that can talk with a business’s customers.

“We’re going to see more and more agents pop up in the coming months,” Godement told iinfoai during a briefing. “And so the general theme is helping customers and developers leverage agents that are useful, available, and accurate.”

OpenAI claims that its new text-to-speech model, “gpt-4o-mini-tts,” not only delivers more nuanced and realistic-sounding speech but is also more “steerable” than its previous-gen speech-synthesizing models. Developers can instruct gpt-4o-mini-tts on how to say things in natural language, for example, “speak like a mad scientist” or “use a serene voice, like a mindfulness teacher.”
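As a rough illustration of what that steering could look like in practice, here is a minimal sketch that builds (but does not send) a text-to-speech request carrying a natural-language style instruction. The endpoint URL, the `voice` value, and the `instructions` field name are assumptions drawn from OpenAI's public audio API documentation rather than from this article, and `build_speech_request` is a hypothetical helper; check the current API reference before relying on any of it.

```python
import json
import urllib.request

# Assumed endpoint; not stated in the article.
SPEECH_URL = "https://api.openai.com/v1/audio/speech"


def build_speech_request(text, instructions, api_key, voice="alloy"):
    """Build, without sending, a text-to-speech request with a style instruction.

    Hypothetical helper: the field names below are assumptions based on
    OpenAI's public audio API documentation.
    """
    payload = {
        "model": "gpt-4o-mini-tts",    # model name taken from the article
        "voice": voice,                # assumed voice identifier
        "input": text,                 # what is spoken
        "instructions": instructions,  # how it is spoken
    }
    return urllib.request.Request(
        SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_speech_request(
    text="Sorry about that. Let me fix your order right away.",
    instructions="Use a serene voice, like a mindfulness teacher.",
    api_key="YOUR_API_KEY",  # placeholder, not a real key
)
```

Keeping the spoken text and the style instruction in separate fields mirrors the distinction Harris draws later in the piece between what is spoken and how it is spoken. Sending the request (for instance with `urllib.request.urlopen`) is left to the caller.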

Here’s a “true crime-style,” weathered voice:

And here’s a sample of a female “professional” voice:

Jeff Harris, a member of the product staff at OpenAI, told iinfoai that the goal is to let developers tailor both the voice “experience” and “context.”

“In different contexts, you don’t just want a flat, monotonous voice,” Harris continued. “If you’re in a customer support experience and you want the voice to be apologetic because it’s made a mistake, you can actually have the voice have that emotion in it […] Our big belief, here, is that developers and users want to really control not just what is spoken, but how things are spoken.”

As for OpenAI’s new speech-to-text models, “gpt-4o-transcribe” and “gpt-4o-mini-transcribe,” they effectively replace the company’s long-in-the-tooth Whisper transcription model. Trained on “diverse, high-quality audio datasets,” the new models can better capture accented and varied speech, OpenAI claims, even in chaotic environments.

They’re also less likely to hallucinate, Harris added. Whisper notoriously tended to fabricate words, and even whole passages, in conversations, introducing everything from racial commentary to imagined medical treatments into transcripts.

“[T]hese models are much improved versus Whisper on that front,” Harris said. “Making sure the models are accurate is completely essential to getting a reliable voice experience, and accurate [in this context] means that the models are hearing the words precisely [and] aren’t filling in details that they didn’t hear.”

Your mileage may vary depending on the language being transcribed, however.

According to OpenAI’s internal benchmarks, gpt-4o-transcribe, the more accurate of the two transcription models, has a “word error rate” approaching 30% for Indic and Dravidian languages like Tamil, Telugu, Malayalam, and Kannada. That means that the model misses around three out of every 10 words in these languages.
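For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into the reference transcript, divided by the number of words in the reference. The sketch below shows that standard calculation on an invented English sentence pair; it is not OpenAI's benchmark code, and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one substitution, one deletion, one insertion over 10 words.
reference = "the quick brown fox jumps over the lazy dog today"
hypothesis = "the quick brown box jumps over lazy dog today please"
wer = word_error_rate(reference, hypothesis)
print(f"{wer:.0%}")  # prints "30%"
```

Because inserted words count as errors too, the rate is not strictly a share of "missed" words and can exceed 100% on very poor transcripts.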

In a break from tradition, OpenAI doesn’t plan to make its new transcription models openly available. The company historically released new versions of Whisper for commercial use under an MIT license.

Harris said that gpt-4o-transcribe and gpt-4o-mini-transcribe are “much bigger than Whisper” and thus not good candidates for an open release.

“[T]hey’re not the kind of model that you can just run locally on your laptop, like Whisper,” he continued. “[W]e want to make sure that if we’re releasing things in open source, we’re doing it thoughtfully, and we have a model that’s really honed for that specific need. And we think that end-user devices are one of the most interesting cases for open-source models.”
