OpenAI launches three new voice models for developers

Fri, 8th May 2026

OpenAI has introduced three new voice models in its API, expanding its realtime audio offering for developers.

The new products are GPT-Realtime-2, GPT-Realtime-Translate and GPT-Realtime-Whisper. They are designed for live voice interactions, translation and speech-to-text transcription, and are positioned as tools for software that can respond as people speak.

GPT-Realtime-2 is OpenAI's latest model for spoken interaction. It is the company's first voice model with GPT-5-class reasoning, aimed at handling more complex requests while maintaining the flow of a conversation.

The model includes features meant to make voice assistants less brittle during live exchanges. These include short spoken preambles such as "let me check that", the ability to call multiple tools at once, clearer verbal cues about those actions, and responses that acknowledge problems rather than falling silent when a task fails.

OpenAI has increased the model's context window from 32K to 128K. Developers can also set different reasoning levels, from minimal to xhigh, depending on whether they want lower latency for simple interactions or a more deliberate response for harder tasks.

In performance tests cited by OpenAI, GPT-Realtime-2 at the high setting scored 15.2% above GPT-Realtime-1.5 on Big Bench Audio, which the company uses to measure audio intelligence. At the xhigh setting, it scored 13.8% above GPT-Realtime-1.5 on Audio MultiChallenge for instruction following.

Translation model

A second model, GPT-Realtime-Translate, focuses on live multilingual conversations. It can translate speech from more than 70 input languages into 13 output languages while keeping pace with the speaker.

The model is intended for uses such as customer support, cross-border sales, education, media and creator services. It can also provide realtime transcriptions alongside translated speech.

According to OpenAI, Deutsche Telekom is testing the model for multilingual voice interactions. Vimeo is also using it in a demonstration that translates a product education video live as it plays, allowing viewers to hear updates in another language without waiting for a separate version.

Transcription model

The third model, GPT-Realtime-Whisper, is a streaming speech-to-text system for low-latency transcription. It transcribes audio as people speak, with potential uses ranging from live captions to meeting notes generated during a conversation.

The model can be used in workflows where spoken language needs to be processed immediately. OpenAI pointed to uses in meetings, classrooms, broadcasts, customer support, healthcare, sales and recruiting.

OpenAI framed the broader launch around what it sees as three patterns emerging in voice software. One is voice-to-action, where a user speaks a request and the system completes a task. Another is systems-to-voice, where software turns live context into spoken guidance. The third is voice-to-voice, where AI helps people continue a conversation across languages or changing contexts.

Among the examples cited, Zillow is building an assistant that can respond to spoken property search requests and arrange tours. Priceline is working on voice-led trip management, including handling travel changes and translation during a journey.

Safety and pricing

OpenAI said the Realtime API includes safeguards intended to prevent misuse. Active classifiers monitor sessions, and some conversations can be halted if they are found to violate harmful content rules. Developers can also add their own guardrails through the company's Agents SDK.

Its policies prohibit the use of outputs for spam, deception and other harmful purposes. Developers must also make clear to end users when they are interacting with AI, unless that is already obvious from the context.

For customers in Europe, the Realtime API fully supports EU data residency for EU-based applications and falls under OpenAI's privacy commitments.

Pricing varies across the three products. GPT-Realtime-2 costs USD $32 per 1 million audio input tokens and USD $64 per 1 million audio output tokens, with cached input tokens priced at USD $0.40 per 1 million. GPT-Realtime-Translate is priced at USD $0.034 per minute, while GPT-Realtime-Whisper costs USD $0.017 per minute.
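To make those figures concrete, the short Python sketch below estimates spend from the quoted prices. It is illustrative only: the function names are hypothetical, and it assumes that cached input tokens are billed at the cached rate in place of the standard input rate, which the announcement does not spell out. Text-token pricing, if any, is not covered because it is not quoted here.

```python
from decimal import Decimal

# Prices in USD, as quoted in the article.
REALTIME_2_INPUT_PER_M = Decimal("32")          # per 1M audio input tokens
REALTIME_2_OUTPUT_PER_M = Decimal("64")         # per 1M audio output tokens
REALTIME_2_CACHED_INPUT_PER_M = Decimal("0.40")  # per 1M cached input tokens
TRANSLATE_PER_MINUTE = Decimal("0.034")         # GPT-Realtime-Translate
WHISPER_PER_MINUTE = Decimal("0.017")           # GPT-Realtime-Whisper

MILLION = Decimal(1_000_000)


def realtime_2_cost(input_tokens, output_tokens, cached_input_tokens=0):
    """Estimate GPT-Realtime-2 audio cost in USD.

    Assumption (not confirmed by the source): `input_tokens` counts only
    uncached tokens, and cached tokens are billed separately at the
    cached rate.
    """
    return (
        Decimal(input_tokens) * REALTIME_2_INPUT_PER_M
        + Decimal(output_tokens) * REALTIME_2_OUTPUT_PER_M
        + Decimal(cached_input_tokens) * REALTIME_2_CACHED_INPUT_PER_M
    ) / MILLION


def per_minute_cost(minutes, rate_per_minute):
    """Estimate cost in USD for a per-minute priced model."""
    # Convert via str so float inputs such as 1.5 stay exact.
    return Decimal(str(minutes)) * rate_per_minute


if __name__ == "__main__":
    # 1M uncached input tokens plus 500K output tokens.
    print(realtime_2_cost(1_000_000, 500_000))
    # One hour of live translation and one hour of transcription.
    print(per_minute_cost(60, TRANSLATE_PER_MINUTE))
    print(per_minute_cost(60, WHISPER_PER_MINUTE))
```

On these prices, an hour of translation works out at roughly USD $2.04 and an hour of transcription at roughly USD $1.02, while GPT-Realtime-2 costs depend on how many audio tokens a session consumes, a ratio the announcement does not state.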

The launch underlines growing competition around voice interfaces as AI companies try to move beyond basic transcription and scripted responses toward systems that can manage longer spoken interactions, work across languages and carry out tasks during a conversation.
