
OpenAI recently announced that its Realtime API has officially exited beta and is now available for production use. Aimed at businesses and developers, the API is powered by gpt-realtime, a conversational speech model with an end-to-end speech-to-speech architecture: it processes and generates audio directly, eliminating the separate speech-to-text and text-to-speech conversion steps of a traditional voice pipeline. Compared with its predecessor, it responds faster, sounds more natural, and handles complex instructions significantly better, making it suitable for scenarios such as customer support, education, and personal productivity tools.
The model is more sensitive to emotional and nonverbal cues such as laughter, and it can switch languages seamlessly in the middle of a conversation. Developers can also customize the voice style with instructions such as "friendly with a French accent" or "fast-paced professional voice." On benchmarks, gpt-realtime improved on its predecessor: Big Bench Audio accuracy rose from 65.6% to 82.8%, and the ComplexFuncBench score rose from 49.7% to 66.5%.
The upgrade also improves tool integration, allowing the model to select and call external tools more accurately. It adds image input as well: users can send screenshots or photos, and the model will respond based on the image content, for example by reading text in the image or answering questions about it. To ease cost concerns, the API price has been cut by 20%, with audio input and output tokens now priced at $32 and $64 per million tokens, respectively. Developers can also now set a cap on token usage.
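As a rough illustration of what these prices mean in practice, the short Python sketch below estimates the audio cost of a session from its token counts. Only the two per-million rates and the 20% figure come from the announcement. The function names are hypothetical, and the sketch assumes the 20% cut applied uniformly to both audio rates, which the article does not state explicitly. It ignores any non-audio charges.

```python
# Audio prices quoted in the article, in USD per one million tokens.
AUDIO_INPUT_USD_PER_MILLION = 32.0
AUDIO_OUTPUT_USD_PER_MILLION = 64.0

# Announced price reduction (assumed here to apply equally to both rates).
PRICE_CUT = 0.20


def audio_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the audio cost of a session from its token counts."""
    total = (
        input_tokens * AUDIO_INPUT_USD_PER_MILLION
        + output_tokens * AUDIO_OUTPUT_USD_PER_MILLION
    )
    return total / 1_000_000


def implied_previous_price(current_price: float, cut: float = PRICE_CUT) -> float:
    """Back out the pre-reduction price implied by the current price and the cut."""
    return current_price / (1.0 - cut)


if __name__ == "__main__":
    # Hypothetical session: 50,000 audio input tokens, 20,000 audio output tokens.
    print(f"Estimated cost: ${audio_cost_usd(50_000, 20_000):.2f}")
    print(f"Implied old input price: ${implied_previous_price(32.0):.2f} per million")
    print(f"Implied old output price: ${implied_previous_price(64.0):.2f} per million")
```

Under that assumption, the implied earlier rates work out to roughly $40 and $80 per million audio tokens. Because output tokens cost twice as much as input tokens, sessions in which the model does most of the talking are the ones where a token cap matters most.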
In terms of safety, the API automatically detects inappropriate content and can terminate the offending session, but OpenAI emphasizes that developers must still implement their own safeguards. For EU users, data localization options and special privacy rules have been implemented to comply with GDPR requirements.