TL;DR: Modern browsers (especially Chrome and Edge) have a built-in speech recognition engine you can access with JavaScript. Set continuous: true and interimResults: true, and you get real-time transcription as you speak. It is not perfect -- it needs internet on Chrome, Firefox does not support it yet, and accuracy drops with background noise -- but for free, browser-only transcription, it is surprisingly good.
Remember when voice typing felt like science fiction? You would talk to your computer, it would transcribe something hilariously wrong, and you would go back to typing. Those days are (mostly) behind us.
Here is the thing most people do not realize: your browser already has a speech recognition engine built in. No app to download. No API key to sign up for. No monthly bill. Just JavaScript and your microphone. Let's explore how it works, what you can build with it, and where it falls short.
Getting Started: Your First Transcription
The SpeechRecognition interface, part of the Web Speech API, converts spoken audio from your microphone into text. Here is the minimal setup:
const recognition = new (window.SpeechRecognition
  || window.webkitSpeechRecognition)();
recognition.lang = 'en-US';        // Which language to listen for
recognition.continuous = true;     // Keep listening (don't stop after one phrase)
recognition.interimResults = true; // Show real-time guesses as you speak
recognition.onresult = (event) => {
  let transcript = '';
  for (let i = event.resultIndex; i < event.results.length; i++) {
    transcript += event.results[i][0].transcript;
  }
  console.log(transcript);
};
recognition.start();
That is it. Call recognition.start(), grant microphone access when the browser asks, and start talking. Your words appear in the console in real time.
Interim vs. Final Results (This Is the Key Concept)
When you speak, the API does not wait until you finish your entire sentence to give you results. It starts guessing immediately. These guesses come in two flavors:
- Interim results: The engine's current best guess, updated rapidly as you speak. You might see "I want two" briefly before it corrects to "I want to" as more context arrives.
- Final results: The engine's committed answer after it has heard enough to be confident. These do not change.
recognition.onresult = (event) => {
  let interimTranscript = '';
  let finalTranscript = '';
  // Start at 0, not event.resultIndex: the output is overwritten on every
  // event, so earlier final results have to be rebuilt each time
  for (let i = 0; i < event.results.length; i++) {
    const text = event.results[i][0].transcript;
    if (event.results[i].isFinal) {
      finalTranscript += text;
    } else {
      interimTranscript += text;
    }
  }
  // Show final text in black, interim text in gray
  document.getElementById('output').innerHTML =
    finalTranscript + '<span style="color: gray">' + interimTranscript + '</span>';
};
This is what makes browser transcription feel magical. You see the text appear and self-correct in real time, almost like the computer is thinking out loud.
The Speech Recognition API's interim results are like autocomplete on steroids. It will confidently show "I scream" before realizing you said "ice cream." Context is everything.
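Because the handler above writes transcript text through innerHTML, it can help to escape that text first. The following is a minimal sketch, not part of the API: escapeHtml and renderTranscript are hypothetical helper names, and the gray inline style is an assumption based on the "interim text in gray" comment.

```javascript
// Escape characters that have special meaning in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Final text keeps the default color; interim text is wrapped in a gray span.
function renderTranscript(finalText, interimText) {
  const interimHtml = interimText
    ? '<span style="color: gray">' + escapeHtml(interimText) + '</span>'
    : '';
  return escapeHtml(finalText) + interimHtml;
}

console.log(renderTranscript('I want to ', 'order ice'));
// prints: I want to <span style="color: gray">order ice</span>
```

Inside onresult you would assign the returned string to the output element's innerHTML instead of concatenating the raw transcripts.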
Confidence Scores: How Sure Is It?
Each result comes with a confidence score between 0 and 1. In clear conditions with a standard accent, you will typically see scores above 0.90. Background noise, strong accents, or unusual words can tank this number.
recognition.onresult = (event) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    if (event.results[i].isFinal) {
      const text = event.results[i][0].transcript;
      const confidence = event.results[i][0].confidence;
      console.log(`"${text}" (${(confidence * 100).toFixed(1)}% confident)`);
    }
  }
};
You can use confidence scores to flag uncertain transcriptions for manual review -- super useful if accuracy matters (meeting notes, medical dictation, etc.).
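One way to do that flagging is sketched below. It assumes you have already collected final results into plain { text, confidence } objects. The helper name flagUncertain, the 0.8 threshold, and the sample data are all illustrative assumptions, so tune the threshold against your own recordings.

```javascript
// Hypothetical helper: marks every segment whose confidence falls below
// the chosen threshold so it can be reviewed by a human later.
function flagUncertain(segments, threshold = 0.8) {
  return segments.map((segment) => ({
    ...segment,
    needsReview: segment.confidence < threshold,
  }));
}

// Illustrative data, not real recognizer output
const segments = [
  { text: 'schedule the follow-up', confidence: 0.94 },
  { text: 'for metoprolol', confidence: 0.61 },
];

const reviewed = flagUncertain(segments);
const toReview = reviewed.filter((s) => s.needsReview).map((s) => s.text);
console.log(toReview); // logs an array containing only 'for metoprolol'
```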
Building a Robust Transcription Tool
The basic example works, but real-world usage needs error handling and auto-restart logic. Here is a more production-ready version:
class Transcriber {
  constructor(outputElement) {
    const SR = window.SpeechRecognition || window.webkitSpeechRecognition;
    if (!SR) throw new Error('Speech recognition not supported');
    this.recognition = new SR();
    this.recognition.continuous = true;
    this.recognition.interimResults = true;
    this.recognition.lang = 'en-US';
    this.output = outputElement;
    this.finalText = '';
    this.isListening = false;
    this.recognition.onresult = (e) => this.handleResult(e);
    this.recognition.onerror = (e) => this.handleError(e);
    this.recognition.onend = () => {
      // Auto-restart if we're supposed to still be listening
      // Chrome sometimes stops unexpectedly during pauses
      if (this.isListening) this.recognition.start();
    };
  }

  // Set the flag before calling into the recognizer so onend sees the right state
  start() { this.isListening = true; this.recognition.start(); }
  stop() { this.isListening = false; this.recognition.stop(); }

  handleResult(event) {
    let interim = '';
    for (let i = event.resultIndex; i < event.results.length; i++) {
      const text = event.results[i][0].transcript;
      if (event.results[i].isFinal) {
        this.finalText += text;
      } else {
        interim += text;
      }
    }
    // Final text in the default color, interim text in gray
    this.output.innerHTML = this.finalText +
      '<span style="color: gray">' + interim + '</span>';
  }

  handleError(event) {
    const messages = {
      'no-speech': 'No speech detected -- try speaking louder',
      'audio-capture': 'No microphone found',
      'not-allowed': 'Microphone permission denied'
    };
    console.error(messages[event.error] || 'Error: ' + event.error);
    // Permission and hardware errors will not fix themselves, so stop
    // the onend handler from restarting in an endless loop
    if (event.error === 'not-allowed' || event.error === 'audio-capture') {
      this.isListening = false;
    }
  }
}
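To put the Transcriber class to work you still need something that calls start() and stop(). Here is a minimal sketch of that wiring. The attachControls helper and the element IDs in the usage comment are hypothetical, and the helper accepts any object with start() and stop() methods.

```javascript
// Hypothetical helper: connects an object exposing start()/stop()
// (such as the Transcriber above) to two button-like elements.
function attachControls(transcriber, startButton, stopButton) {
  const setRunning = (running) => {
    startButton.disabled = running;
    stopButton.disabled = !running;
  };

  startButton.addEventListener('click', () => {
    try {
      transcriber.start();
      setRunning(true);
    } catch (err) {
      // recognition.start() throws if recognition is already running
      console.error('Could not start recognition:', err);
    }
  });

  stopButton.addEventListener('click', () => {
    transcriber.stop();
    setRunning(false);
  });

  setRunning(false); // Initial state: ready to start
}

// Browser usage (the element IDs are assumptions):
// const transcriber = new Transcriber(document.getElementById('output'));
// attachControls(
//   transcriber,
//   document.getElementById('start'),
//   document.getElementById('stop')
// );
```

Disabling the start button while listening is a simple way to avoid calling start() twice in a row.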
Tips for Better Accuracy
Use a decent microphone. Built-in laptop mics pick up fan noise, keyboard clicks, and that coworker eating chips three desks away. A USB microphone or headset makes a massive difference.
Speak naturally. Too fast and words run together. Too slow and the engine gets confused by the unnatural pauses. Just talk like you are explaining something to a friend.
Set the right language. If you are speaking British English, set recognition.lang to "en-GB" instead of "en-US". The models differ in vocabulary, spelling, and pronunciation. "Aluminium" vs. "aluminum" matters.
Dictate punctuation. You can say "period," "comma," and "new line" and many engines will insert the actual punctuation. Support varies by browser and language, but it is worth trying.
Pro tip: Minimize background noise. Close windows, mute other apps, and turn off music. Do not count on the speech recognition engine to filter the noise out for you.
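The language tip can be automated instead of hard-coding "en-US". The sketch below picks a value for recognition.lang from the user's preferred locales. The helper name pickRecognitionLang and the supported list are assumptions: you would maintain that list yourself based on what you have tested.

```javascript
// Hypothetical helper: choose a recognition.lang value from the user's
// preferred locales, falling back to a default when nothing matches.
function pickRecognitionLang(preferred, supported, fallback = 'en-US') {
  const lower = (tag) => tag.toLowerCase();
  const base = (tag) => lower(tag).split('-')[0];

  // Pass 1: exact locale match, e.g. 'en-GB'
  for (const tag of preferred) {
    const exact = supported.find((s) => lower(s) === lower(tag));
    if (exact) return exact;
  }
  // Pass 2: same base language, e.g. 'es-MX' falls back to 'es-ES'
  for (const tag of preferred) {
    const sameLanguage = supported.find((s) => base(s) === base(tag));
    if (sameLanguage) return sameLanguage;
  }
  return fallback;
}

// Assumed list of locales you have tested
const supported = ['en-US', 'en-GB', 'es-ES'];
console.log(pickRecognitionLang(['en-GB', 'en'], supported)); // prints en-GB
console.log(pickRecognitionLang(['fr-FR'], supported));       // prints en-US

// Browser usage:
// recognition.lang = pickRecognitionLang(navigator.languages, supported);
```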
Browser Support (The Honest Truth)
Here is where we need to be upfront. Speech recognition support is more limited than speech synthesis:
- Chrome: Full support. Uses Google's cloud-based recognition, so it needs internet.
- Edge: Full support. Uses Microsoft's recognition service.
- Safari: Supported since version 14.1 on macOS.
- Firefox: Not supported as of early 2026. Mozilla has expressed intent to ship it, but there is no release date yet.
Privacy note: Chrome sends your audio to Google's servers for processing. This means an internet connection is required, and your speech data is transmitted externally. For sensitive content, consider local alternatives like running Whisper via WebAssembly.
Always check for support before using the API:
if (!('SpeechRecognition' in window) && !('webkitSpeechRecognition' in window)) {
  alert('Your browser does not support speech recognition. Try Chrome or Edge.');
}
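The same check can be wrapped in a small helper that returns the constructor, so detection and setup share one code path. This is a sketch: getSpeechRecognition is a hypothetical name, and the scope parameter is only there so the function can be exercised outside a browser.

```javascript
// Returns the SpeechRecognition constructor, or null when it is unavailable.
// In a browser, scope defaults to the global object (window).
function getSpeechRecognition(scope = globalThis) {
  return scope.SpeechRecognition || scope.webkitSpeechRecognition || null;
}

const SpeechRecognitionCtor = getSpeechRecognition();
if (SpeechRecognitionCtor) {
  console.log('Speech recognition is available');
  // const recognition = new SpeechRecognitionCtor();
} else {
  console.log('Speech recognition is not available: hide the microphone button');
}
```

Hiding or disabling the microphone control is usually friendlier than an alert, because users on unsupported browsers can still use the rest of the page.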
Try It Yourself
Use our Speech to Text tool to transcribe your voice in real time. Just click the microphone button and start speaking. Your transcription stays in your browser and is never sent to our servers.