Paul Krill
Editor at Large

OpenAI previews Realtime API for speech-to-speech apps

news
Oct 2, 2024 · 3 mins

Realtime API supports multimodal text and speech experiences, including natural speech-to-speech conversations using preset voices already supported in the API.

Credit: Tero Vesalainen/Shutterstock

OpenAI has introduced a public beta of the Realtime API, an API that allows paid developers to build low-latency, multi-modal experiences including text and speech in apps.

Introduced October 1, the Realtime API, like OpenAI's ChatGPT Advanced Voice Mode, supports natural speech-to-speech conversations using preset voices that the API already supports. OpenAI is also introducing audio input and output in the Chat Completions API to support use cases that do not need the low-latency benefits of the Realtime API. Developers can pass text or audio inputs into GPT-4o and have the model respond with text, audio, or both.
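As an illustration of what an audio-enabled Chat Completions call might look like, the sketch below builds a request payload of that shape. The model name, voice, and field names here are assumptions for illustration only, not values confirmed by OpenAI's announcement.

```python
import json

# Hypothetical Chat Completions request asking GPT-4o for both text and
# audio output. The model name, "modalities" field, and audio options are
# illustrative assumptions, not confirmed API parameters.
def build_audio_chat_request(user_text: str) -> dict:
    return {
        "model": "gpt-4o-audio-preview",          # assumed audio-capable model name
        "modalities": ["text", "audio"],          # ask for both text and audio back
        "audio": {"voice": "alloy", "format": "wav"},  # assumed preset-voice options
        "messages": [{"role": "user", "content": user_text}],
    }

request = build_audio_chat_request("Summarize today's weather in one sentence.")
print(json.dumps(request, indent=2))
```

The point of the single payload is that one call carries the audio preferences alongside the conversation, rather than chaining separate transcription and synthesis requests.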

With the Realtime API and the audio support in the Chat Completions API, developers do not have to link together multiple models to power voice experiences. They can build natural conversational experiences with just one API call, OpenAI said. Previously, creating a similar voice experience had developers transcribe audio with an automatic speech recognition model such as Whisper, pass the text to a text model for inference or reasoning, and play the model’s output using a text-to-speech model. This approach often resulted in loss of emotion, emphasis, and accents, plus latency.

With the Chat Completions API, developers can deal with the entire process with one API call, though it remains slower than human conversation. The Realtime API improves latency by streaming audio inputs and outputs directly, enabling more natural conversational experiences, OpenAI said. The Realtime API also can handle interruptions automatically, like ChatGPT’s advanced voice mode.

The Realtime API lets developers establish a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling, which makes it possible for voice assistants to respond to user requests by pulling in new context or triggering actions. The Realtime API also leverages multiple layers of safety protections to mitigate the risk of API abuse, including automated monitoring and human review of flagged model inputs and outputs.
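A minimal sketch of that WebSocket exchange might look like the following. The endpoint URL, beta header, and event names here are assumptions based on how streaming APIs of this kind are typically shaped, and actually connecting would require a third-party package such as `websockets`; the code below only constructs the messages.

```python
import json
import os

# Assumed Realtime endpoint; the model query parameter is illustrative.
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"

def auth_headers() -> dict:
    # Bearer auth plus a beta opt-in flag; header names are assumptions.
    return {
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
        "OpenAI-Beta": "realtime=v1",
    }

def response_create_event(instructions: str) -> str:
    # A client event asking the model to start generating a spoken + text
    # response over the open socket; the event shape is an assumption.
    return json.dumps({
        "type": "response.create",
        "response": {
            "modalities": ["audio", "text"],
            "instructions": instructions,
        },
    })

# Connecting for real would look roughly like (pip install websockets):
#   async with websockets.connect(REALTIME_URL, extra_headers=auth_headers()) as ws:
#       await ws.send(response_create_event("Greet the user."))
#       async for message in ws:
#           handle(json.loads(message))   # handle() is a placeholder
print(response_create_event("Greet the user."))
```

Because the socket stays open for the whole conversation, audio can stream in both directions and the server can cut a response short when the user interrupts, which is what enables the low-latency, interruptible behavior described above.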

The Realtime API uses text tokens and audio tokens. Text input costs $5 per 1M tokens and text output costs $20 per 1M tokens. Audio input costs $100 per 1M tokens and audio output costs $200 per 1M tokens.
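To put those rates in perspective, here is a quick back-of-the-envelope cost calculation using only the per-million-token prices quoted above; the session sizes are made-up illustrative numbers.

```python
# Per-million-token prices quoted for the Realtime API, in US dollars.
PRICES = {
    "text_in": 5.00,
    "text_out": 20.00,
    "audio_in": 100.00,
    "audio_out": 200.00,
}

def cost(kind: str, tokens: int) -> float:
    """Dollar cost for `tokens` tokens of the given kind."""
    return PRICES[kind] * tokens / 1_000_000

# Example: a hypothetical session consuming 10,000 audio input tokens and
# producing 20,000 audio output tokens.
session_cost = cost("audio_in", 10_000) + cost("audio_out", 20_000)
print(f"${session_cost:.2f}")  # $1.00 in + $4.00 out = $5.00
```

The 20x gap between text and audio pricing means audio-heavy sessions dominate the bill, so trimming audio output length matters far more than trimming prompts.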

OpenAI said plans for improving the Realtime API include adding support for vision and video, increasing rate limits, adding support for prompt caching, and expanding model support to GPT-4o mini. The company said it would also integrate support for the Realtime API into the OpenAI Python and Node.js SDKs.

Paul Krill

Paul Krill is editor at large at InfoWorld. Paul has been covering computer technology as a news and feature reporter for more than 35 years, including 30 years at InfoWorld. He has specialized in coverage of software development tools and technologies since the 1990s, and he continues to lead InfoWorld’s news coverage of software development platforms including Java and .NET and programming languages including JavaScript, TypeScript, PHP, Python, Ruby, Rust, and Go. Long trusted as a reporter who prioritizes accuracy, integrity, and the best interests of readers, Paul is sought out by technology companies and industry organizations who want to reach InfoWorld’s audience of software developers and other information technology professionals. Paul has won a “Best Technology News Coverage” award from IDG.
