
Known Issue: Transcription Artifacts Causing Unexpected Low Statement Scores in AI Role Play

Unexpected phrases in transcripts may cause low statement scores — here's what's happening, what we're doing to reduce occurrences, and how to avoid it.

 

⚠️ Monitoring — Mitigation deployed March 3, 2026. Known artifact patterns are now filtered. We are continuing to monitor for additional hallucination patterns and for any remaining impact of Whisper hallucinations on role play statement scoring.

Summary

Some users may see unexpected low statement scores in their AI role play sessions caused by transcription artifacts — not by anything they actually said. We've deployed a mitigation for the most commonly seen artifact and will continue to monitor for additional patterns.

Who Is Affected

  • Learners using the AI-powered practice (OpenAI Realtime) feature
  • More likely to occur when users are in a noisy environment, using device speakers instead of headphones, or experiencing audio feedback or echo during their session
  • Not affected: users on the Deepgram STT path

What You Might Notice

  • A statement score of 40 or lower appears with feedback indicating the response was "completely irrelevant to the conversation context"
  • The flagged statement does not match anything the user said
  • Unusual characters or phrases appear in the transcript, such as Korean characters or short out-of-context phrases
  • The user's actual speech is transcribed correctly — only the artifact statement appears out of place, often as a separate "statement" in Brevity with its own feedback
  • The rest of the session's score and feedback look normal
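For administrators reviewing session data in bulk, the symptoms above can be checked programmatically. This is a minimal, hypothetical sketch — the field names (`score`, `feedback`) and the 40-point threshold simply mirror the symptoms described in this article, not any actual internal schema:

```python
def looks_like_artifact(statement: dict) -> bool:
    """Flag a statement that matches the known artifact symptoms:
    a score of 40 or lower combined with 'irrelevant to the
    conversation context' feedback. Field names are illustrative."""
    low_score = statement.get("score", 100) <= 40
    irrelevant = (
        "irrelevant to the conversation context"
        in statement.get("feedback", "").lower()
    )
    return low_score and irrelevant
```

A statement flagged by a heuristic like this is only a candidate for review — the final call still requires comparing the transcript against what the learner actually said.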

What's Happening (High-Level)

The AI role play feature uses OpenAI's Realtime API, which is powered by a Whisper-based speech-to-text model. Whisper is designed to be highly confident in its transcriptions — but as a side effect, when it receives brief, degraded, or ambiguous audio (such as echo from device speakers, mic feedback, or background noise), it can generate a short, plausible-sounding phrase rather than returning silence. This is a known characteristic of Whisper-based models and is not unique to our platform.

These phantom phrases were being treated as real user statements and scored, which could result in an unfairly low score for that turn.

What We've Done

We've implemented a hallucination denylist that filters known artifact phrases before they reach the transcript, scoring, or backend. The most commonly seen artifact — a Korean news anchor phrase introduced by the Whisper model — is now filtered automatically. We are actively monitoring session data for new patterns and will expand the denylist as needed.
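In general terms, a denylist filter of this kind normalizes each transcribed statement and drops exact matches against known artifact phrases before scoring. The sketch below is an illustrative assumption, not the platform's actual implementation — the phrases, names, and matching rules are placeholders:

```python
import unicodedata

# Placeholder denylist — NOT the real filtered phrases. Known Whisper
# artifacts are often short sign-off or broadcast phrases.
ARTIFACT_DENYLIST = {
    "thanks for watching",        # placeholder English artifact
    "구독과 좋아요 부탁드립니다",   # placeholder Korean phrase
}

def normalize(text: str) -> str:
    """Case-fold and strip punctuation so minor transcription
    variations still match the denylist."""
    cleaned = "".join(
        ch
        for ch in unicodedata.normalize("NFKC", text)
        if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(cleaned.lower().split())

_NORMALIZED_DENYLIST = {normalize(p) for p in ARTIFACT_DENYLIST}

def filter_artifacts(statements: list[str]) -> list[str]:
    """Drop statements that exactly match a known artifact phrase,
    leaving genuine user speech untouched."""
    return [s for s in statements if normalize(s) not in _NORMALIZED_DENYLIST]
```

Exact matching against normalized phrases keeps the filter conservative: a genuine user statement that merely resembles an artifact is never dropped, at the cost of needing to expand the list as new hallucination patterns appear.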

What To Do

To reduce the likelihood of audio quality issues during sessions:

  • Use headphones or earbuds whenever possible — this is the most effective way to prevent the AI's audio from bleeding into your microphone
  • Use the mute button when not speaking, particularly in shared or noisy spaces
  • Avoid using device speakers during a session

If a session has been affected and scores appear unfairly low, please contact your administrator or support so the session can be reviewed manually.

Current Status

⚠️ Monitoring — Mitigation deployed March 3, 2026. Known artifact patterns are now filtered. We are continuing to monitor for additional hallucination patterns.

When to Contact Support

If you or a learner receives a flagged low score that does not match what was said during the session, please reach out to support with the session details so it can be reviewed manually.

Last Updated: March 3, 2026