Executive Summary

We evaluated Android’s Speech-to-Text (STT) and Text-to-Speech (TTS) capabilities for building a voice-based learning experience. The core finding: live transcription of short, turn-based speech works well on Android. However, continuous listening, long-form speech, and full-duplex conversation are not reliably achievable due to platform-level constraints.

This document covers what works, what doesn’t, the trade-offs of each approach, and our recommended product direction.

Bottom line: Design for a turn-based speaking practice app. Do not pursue real-time conversational AI on the current Android STT stack.
1. Why Live Transcription Matters

Live transcription is the single most important capability for a natural-feeling speech experience. When users speak and see their words appear instantly via partial results, the interaction feels real-time and responsive. Without it, the experience feels delayed and broken: users won’t wait for a full sentence to be processed before expecting feedback. The system must therefore rely on partial (interim) results from the speech recognizer rather than waiting for final transcripts.
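2. What Works Well

Android STT for Short Speech

Android’s `SpeechRecognizer` reports results through two `RecognitionListener` callbacks:
- `onPartialResults()`: delivers live, in-progress transcription
- `onResults()`: delivers the final transcript after the user stops speaking

Partial results are off by default; the recognition intent must set `RecognizerIntent.EXTRA_PARTIAL_RESULTS` to `true` to receive them.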
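To show how the two callbacks combine into live feedback, here is a minimal sketch in plain Java (so it runs off-device) of a hypothetical `LiveTranscript` helper. It assumes the app passes in the first string of the `SpeechRecognizer.RESULTS_RECOGNITION` list from each callback's `Bundle`, and that, as is typical, each partial result carries the whole in-progress hypothesis rather than a delta, so the partial is replaced instead of appended.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper that merges partial and final recognizer output into one
 * display string. On Android, its two "on..." methods would be called from the
 * matching RecognitionListener callbacks with the top hypothesis.
 */
final class LiveTranscript {
    // Finalized segments, one per completed recognition session.
    private final List<String> committed = new ArrayList<>();
    // Latest in-progress hypothesis; replaced (not appended) on every update.
    private String partial = "";

    /** Top hypothesis from onPartialResults(). */
    void onPartialResults(String hypothesis) {
        partial = hypothesis == null ? "" : hypothesis.trim();
    }

    /** Top hypothesis from onResults(): commit it and clear the partial. */
    void onResults(String hypothesis) {
        String text = hypothesis == null ? "" : hypothesis.trim();
        if (!text.isEmpty()) {
            committed.add(text);
        }
        partial = "";
    }

    /** What the UI renders: committed segments followed by the current partial. */
    String display() {
        List<String> parts = new ArrayList<>(committed);
        if (!partial.isEmpty()) {
            parts.add(partial);
        }
        return String.join(" ", parts);
    }

    /** Finalized text only, e.g. what gets sent on for feedback. */
    String finalText() {
        return String.join(" ", committed);
    }
}
```

The committed text is what would be sent on for feedback; the partial exists only to keep the screen moving while the user is still talking.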
3. What Doesn’t Work Well

Long Speech + Wait-for-Final Pattern

If the product requires the user to speak for extended periods and then waits for a final transcript, the result is high latency, poor UX, and reduced accuracy. This pattern should be avoided.

Continuous Listening

Android’s `SpeechRecognizer` automatically stops listening after a brief silence (typically 2–5 seconds). This is a hard platform constraint:
- Android exposes silence-related configuration extras (`EXTRA_SPEECH_INPUT_COMPLETE_SILENCE_LENGTH_MILLIS`, etc.), but these are not guaranteed to be honored by the recognizer implementation.
- The recognizer decides when speech has ended and then calls `onEndOfSpeech()`; the app is notified but cannot prevent or postpone the end of the session.
- This behavior is confirmed across native Android, React Native, and Flutter. It is a platform limitation, not a bug.
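Since the extras may be ignored, a per-device measurement is more trustworthy than the requested value. One way to get it is sketched below as a hypothetical `SilenceProbe` in plain Java: record when the last partial result arrived and when `onEndOfSpeech()` fired (on Android, for example with `SystemClock.elapsedRealtime()` inside the listener callbacks), then compare the gap with the requested silence length. This assumes `onEndOfSpeech()` fires once the recognizer has decided the input is complete; because partial results lag the audio slightly, the gap is only an approximation, hence a tolerance parameter whose value is an illustrative assumption.

```java
/**
 * Hypothetical diagnostic: estimates how long the recognizer waited after the
 * last partial result before ending the session. Assumes onEndOfSpeech() is
 * reported after the recognizer's silence wait; the result is approximate.
 */
final class SilenceProbe {
    private long lastPartialAtMs = -1;
    private long endOfSpeechAtMs = -1;

    /** Call from onPartialResults() with a monotonic timestamp. */
    void onPartial(long nowMs) {
        lastPartialAtMs = nowMs;
    }

    /** Call from onEndOfSpeech() with a monotonic timestamp. */
    void onEndOfSpeech(long nowMs) {
        endOfSpeechAtMs = nowMs;
    }

    /** Observed gap in ms, or -1 if either event is missing or out of order. */
    long observedSilenceMs() {
        if (lastPartialAtMs < 0 || endOfSpeechAtMs < lastPartialAtMs) {
            return -1;
        }
        return endOfSpeechAtMs - lastPartialAtMs;
    }

    /**
     * True if the observed gap reaches the requested silence length, allowing
     * toleranceMs for the lag between the audio and the last partial result.
     */
    boolean looksHonored(long requestedSilenceMs, long toleranceMs) {
        long observed = observedSilenceMs();
        return observed >= 0 && observed + toleranceMs >= requestedSilenceMs;
    }
}
```

Logging requested versus observed values across test devices would show which recognizer implementations honor the extra at all.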
4. The Restart Loop Workaround: Trade-offs

The most common workaround for the auto-stop behavior is to detect `onEndOfSpeech()` or `onResults()` and immediately restart the recognizer. This creates a pseudo-continuous listening experience.
What it solves:
- Enables longer listening sessions
- Avoids hard session cut-offs

What it introduces:
- Audible beep sound: Android triggers a system sound on every mic restart. This is device-dependent (e.g., prominent on OnePlus devices), not fully controllable via APIs, and creates a disruptive beep loop during conversation. Mitigations such as DND mode or audio permission resets are unreliable across devices.
- Word loss during restart: If the user is still speaking when the mic restarts, words are cut off or missed, creating transcription gaps and confusion in fast-paced conversation.
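The restart loop can be sketched as a small controller around the recognizer. To keep the example runnable off-device, the recognizer sits behind a hypothetical `Recognizer` interface whose single `startListening()` method would wrap `SpeechRecognizer.startListening(intent)` on Android. The sketch restarts from `onResults()` and from silence-related `onError()` codes rather than from `onEndOfSpeech()`, on the assumption that restarting before `onResults()` risks losing the session's final result, and it caps the number of restarts so the loop cannot run forever. It does nothing about the beep or word-loss problems above; those remain inherent to the approach.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over SpeechRecognizer so the loop runs off-device. */
interface Recognizer {
    void startListening();
}

/**
 * Sketch of the restart-loop workaround: whenever a session ends with a result
 * or a silence-related error, start a new one until stopped or out of budget.
 */
final class RestartLoop {
    // Mirror SpeechRecognizer.ERROR_SPEECH_TIMEOUT (6) and ERROR_NO_MATCH (7).
    static final int ERROR_SPEECH_TIMEOUT = 6;
    static final int ERROR_NO_MATCH = 7;

    private final Recognizer recognizer;
    private final int maxRestarts;
    private final List<String> segments = new ArrayList<>();
    private int restarts = 0;
    private boolean active = false;

    RestartLoop(Recognizer recognizer, int maxRestarts) {
        this.recognizer = recognizer;
        this.maxRestarts = maxRestarts;
    }

    void start() {
        active = true;
        restarts = 0;
        recognizer.startListening();
    }

    void stop() {
        active = false;
    }

    /** Call from onResults() with the top hypothesis of the finished session. */
    void onResults(String text) {
        if (text != null && !text.isEmpty()) {
            segments.add(text);
        }
        restartIfAllowed();
    }

    /** Call from onError(); only silence-related errors keep the loop alive. */
    void onError(int error) {
        if (error == ERROR_SPEECH_TIMEOUT || error == ERROR_NO_MATCH) {
            restartIfAllowed();
        } else {
            stop();
        }
    }

    private void restartIfAllowed() {
        if (active && restarts < maxRestarts) {
            restarts++;
            // On Android this is where the device-dependent start beep can play.
            recognizer.startListening();
        } else {
            active = false;
        }
    }

    boolean isActive() { return active; }
    int restartCount() { return restarts; }
    String transcript() { return String.join(" ", segments); }
}
```

Restarting from `onResults()` keeps each final result intact but leaves a short dead zone between sessions; that dead zone is where the word loss described above occurs.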
5. Whisper as an Alternative

Current Whisper Flow: Record → Process → Transcript → Send to LLM

Whisper offers higher accuracy than the native recognizer and better handling of noise and accents, and it is better suited to longer speech segments. However, in its current setup it provides no live transcription: it is a batch process that adds delay.

Whisper Streaming (Future Exploration)

Whisper streaming is not yet implemented in our system. It could potentially enable near-real-time transcription with Whisper-level accuracy.

Recommendation: Build a small proof-of-concept to test Whisper streaming on Android. Evaluate latency, stability, and whether it can replace or complement the native STT for longer interactions.
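For the streaming POC, one common way to approximate streaming with a batch model is to transcribe overlapping fixed-length windows of audio as it arrives and reconcile the overlapping text afterwards. This is a generic sliding-window technique, not a Whisper API. The hypothetical `StreamWindows` helper below only computes the window boundaries; the 16 kHz rate matches the input Whisper models expect, while the window and hop lengths are illustrative tuning parameters, not values from our system.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical windowing helper for a Whisper streaming POC: splits the audio
 * received so far into overlapping windows that can each be transcribed.
 */
final class StreamWindows {
    static final int SAMPLE_RATE_HZ = 16_000; // Whisper models expect 16 kHz mono input

    /** Half-open sample range [start, end). */
    static final class Window {
        final long start;
        final long end;

        Window(long start, long end) {
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Returns the complete windows available in the first totalSamples samples.
     * Windows start hopMs apart and overlap by (windowMs - hopMs), so a word cut
     * at one boundary can appear whole in the neighbouring window if it is
     * shorter than the overlap.
     */
    static List<Window> windows(long totalSamples, int windowMs, int hopMs) {
        if (windowMs <= 0 || hopMs <= 0 || hopMs > windowMs) {
            throw new IllegalArgumentException("need 0 < hopMs <= windowMs");
        }
        long windowLen = (long) SAMPLE_RATE_HZ * windowMs / 1000;
        long hopLen = (long) SAMPLE_RATE_HZ * hopMs / 1000;
        List<Window> out = new ArrayList<>();
        for (long start = 0; start + windowLen <= totalSamples; start += hopLen) {
            out.add(new Window(start, start + windowLen));
        }
        return out;
    }
}
```

With 5 s windows and a 3 s hop, a 10 s clip yields windows at 0–5 s and 3–8 s; the last 2 s wait for more audio, which is one source of the latency the POC should measure.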
6. Decision Framework
| Use Case | Recommended Approach |
|---|---|
| Short live conversation (1–5 sec) | Android STT with onPartialResults |
| Real-time UI feedback | Partial/interim results |
| Longer speech processing | Whisper (batch, no live feedback) |
| True conversational AI | Not supported on current stack |
| Future improvement | Whisper streaming POC |
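Read as a routing rule, the table reduces to a few checks. The sketch below encodes it as a hypothetical `chooseEngine` function; the enum names are illustrative, and the 5-second boundary comes from the "Short live conversation (1–5 sec)" row rather than from a measured threshold.

```java
/** Hypothetical labels for the approaches in the decision table. */
enum Engine { ANDROID_STT_PARTIALS, WHISPER_BATCH, WHISPER_STREAMING_POC, UNSUPPORTED }

/** Hypothetical encoding of the decision table as a routing function. */
final class DecisionFramework {
    // Upper bound of the "short live conversation" row.
    static final double SHORT_UTTERANCE_MAX_SEC = 5.0;

    static Engine chooseEngine(double expectedUtteranceSec, boolean needsLiveFeedback, boolean fullDuplex) {
        if (fullDuplex) {
            return Engine.UNSUPPORTED;            // true conversational AI
        }
        if (expectedUtteranceSec <= SHORT_UTTERANCE_MAX_SEC) {
            return Engine.ANDROID_STT_PARTIALS;   // short turns with live partial results
        }
        if (!needsLiveFeedback) {
            return Engine.WHISPER_BATCH;          // longer speech, no live feedback
        }
        return Engine.WHISPER_STREAMING_POC;      // longer speech with live feedback: future work only
    }
}
```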
7. What This Means for Product
| Product Direction | Feasibility |
|---|---|
| Speaking practice app (short sentences, turn-based) | Fully supported — ship it |
| Real-time conversational AI (continuous, full-duplex) | Not supported with current stack |
SpeechRecognizer limitations. This is a potential future path but introduces latency, cost, and connectivity dependencies.
8. Recommendation

Proceed with:
- Android STT using `onPartialResults` + restart logic for short, turn-based interactions
- Android default TTS as the primary speech output
- Whisper for batch processing of longer speech where live feedback is not required
- Whisper streaming POC as a future exploration for near-real-time accuracy improvements
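The turn-based design can be pictured as a small state machine in which the microphone and TTS are never active at the same time, which is what sidesteps the full-duplex problem. A minimal sketch, assuming the hypothetical `TurnController` methods are called from `onResults()` and from `UtteranceProgressListener.onDone()`; the state and method names are illustrative.

```java
/**
 * Hypothetical half-duplex turn controller: the mic and TTS are never active
 * at the same time, so the recognizer never has to listen while the app talks.
 */
final class TurnController {
    enum State { IDLE, LISTENING, THINKING, SPEAKING }

    private State state = State.IDLE;

    State state() { return state; }

    /** User starts a turn; on Android the app would call SpeechRecognizer.startListening here. */
    boolean beginUserTurn() {
        return move(State.IDLE, State.LISTENING);
    }

    /** Final transcript received (from onResults); the mic is now off. */
    boolean onUserTranscript() {
        return move(State.LISTENING, State.THINKING);
    }

    /** Feedback text is ready; the app would call TextToSpeech.speak here. */
    boolean onFeedbackReady() {
        return move(State.THINKING, State.SPEAKING);
    }

    /** TTS finished (e.g. UtteranceProgressListener.onDone); ready for the next turn. */
    boolean onSpeechDone() {
        return move(State.SPEAKING, State.IDLE);
    }

    // Transitions only from the expected state; any other event is ignored.
    private boolean move(State from, State to) {
        if (state != from) {
            return false;
        }
        state = to;
        return true;
    }
}
```

Whether a new turn starts automatically after TTS finishes or only on a tap is a product choice; the sketch assumes a tap.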
Appendix: Evidence Base

Official Documentation
- RecognizerIntent — https://developer.android.com/reference/android/speech/RecognizerIntent
- RecognitionListener — https://developer.android.com/reference/android/speech/RecognitionListener
- Android Source (RecognizerIntent.java) — https://android.googlesource.com/platform/frameworks/base.git/+/9066cfe9886ac131c34d59ed0e2d287b0e3c0087/core/java/android/speech/RecognizerIntent.java
Issue Tracker and Community Reports
- Issue #486536250 — https://issuetracker.google.com/issues/486536250#comment2 (confirms the silence timeout is a known issue with no reliable fix)
- Recognizer stops automatically — https://stackoverflow.com/questions/57673683
- Silence length extras not working — https://stackoverflow.com/questions/36519804
- Listen forever — https://stackoverflow.com/questions/62129117
- Very short timeout on pause — https://stackoverflow.com/questions/76623084
- Increase listening time — https://stackoverflow.com/questions/66319334
- Speech recognition extras not working — https://stackoverflow.com/questions/15660805
- Capture timeout — https://stackoverflow.com/questions/56648188
- Speech timeout — https://stackoverflow.com/questions/54196738
- Related discussion — https://stackoverflow.com/q/38933196
- react-native-voice #402 — https://github.com/react-native-voice/voice/issues/402
- react-native-voice #251 — https://github.com/react-native-voice/voice/issues/251
- speech_to_text (Flutter) #114 — https://github.com/csdcorp/speech_to_text/issues/114
- Flutter speech_to_text auto-stop — https://stackoverflow.com/questions/75692142