Question 1

What languages does Scribe support?

Accepted Answer

Excellent Accuracy (≤5% Word Error Rate, WER)

Bulgarian, Catalan, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Indonesian, Italian, Japanese, Kannada, Malay, Malayalam, Macedonian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Spanish, Swedish, Turkish, Ukrainian, Vietnamese

High Accuracy (>5% to ≤10% WER)

Bengali, Belarusian, Bosnian, Cantonese, Estonian, Filipino, Gujarati, Hungarian, Kazakh, Latvian, Lithuanian, Mandarin, Marathi, Nepali, Odia, Persian, Slovenian, Tamil, Telugu

Good Accuracy (>10% to ≤25% WER)

Afrikaans, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Burmese, Cebuano, Croatian, Georgian, Hausa, Hebrew, Icelandic, Javanese, Kabuverdianu, Korean, Kyrgyz, Lingala, Maltese, Mongolian, Māori, Occitan, Punjabi, Sindhi, Swahili, Tajik, Thai, Urdu, Uzbek, Welsh

Moderate Accuracy (>25% to ≤50% WER)

Amharic, Chichewa, Fulah, Ganda, Igbo, Irish, Khmer, Kurdish, Lao, Luxembourgish, Luo, Northern Sotho, Pashto, Shona, Somali, Umbundu, Wolof, Xhosa, Zulu
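
For readers unfamiliar with the metric, WER is the word-level edit distance between a reference transcript and the model's output, divided by the number of words in the reference. The sketch below is a generic illustration of that standard definition and of the tier thresholds listed above; it is not ElevenLabs' benchmarking code, and the function names `word_error_rate` and `accuracy_tier` are made up for this example. It also skips the text normalization (casing, punctuation) that real benchmarks usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_tier(wer: float) -> str:
    """Map a WER (as a fraction, e.g. 0.07 for 7%) to the tiers listed above."""
    if wer <= 0.05:
        return "Excellent"
    if wer <= 0.10:
        return "High"
    if wer <= 0.25:
        return "Good"
    if wer <= 0.50:
        return "Moderate"
    return "Outside the listed tiers"


wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"{wer:.1%}", accuracy_tier(wer))
```

One substituted word out of six reference words gives a WER of about 16.7%, which falls in the Good tier.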

Question 2

What is Speech to Text and how does it work?

Accepted Answer

Speech-to-text (STT) is a technology that converts spoken language into written text using automatic speech recognition (ASR). It processes audio signals, identifies speech patterns, and transcribes them into text with high accuracy.

ElevenLabs' AI-powered speech-to-text software transcribes audio and video content with human-like precision, which makes it well suited to audio transcription and real-time speech recognition.

Speech-to-text technology is used for:
✔ Transcription of podcasts, meetings, and interviews.
✔ Captions and subtitles for video content.
✔ Hands-free typing and accessibility tools.

ElevenLabs ASR offers fast, reliable, and highly accurate speech-to-text conversion for multiple languages and accents.

Question 3

How do I transcribe video to text?

Accepted Answer

ElevenLabs provides video transcription to convert spoken dialogue into text format, making it easy to create subtitles, captions, and searchable transcripts.

Steps to transcribe video to text:
1. Upload your video file to ElevenLabs ASR.
2. Speech recognition technology processes the audio.
3. A transcript is generated automatically, with timestamps.
4. Download the text file or export subtitles for editing.

This AI-powered video transcription model helps content creators, businesses, and educators quickly convert video speech into accurate text for accessibility and content repurposing.
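
If you are working with a timestamped transcript in code rather than through the export button, step 4 can be sketched as below. This is an illustration only: it assumes the transcript is available as a list of word entries with `text`, `start`, and `end` fields (times in seconds), which is an assumed shape for this example rather than a documented one, and `srt_timestamp` and `words_to_srt` are hypothetical helper names. The output follows the common SubRip (.srt) layout of a cue number, a time range, and the cue text.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in .srt files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=8):
    """Group timestamped words into numbered subtitle cues.

    `words` is assumed to be a list of dicts with "text", "start" and
    "end" keys; this shape is an assumption made for the example.
    """
    cues = []
    for number, i in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[i:i + max_words]
        start = srt_timestamp(chunk[0]["start"])   # first word's start time
        end = srt_timestamp(chunk[-1]["end"])      # last word's end time
        text = " ".join(w["text"] for w in chunk)
        cues.append(f"{number}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)  # blank line between cues


sample = [
    {"text": "Hello", "start": 0.0, "end": 0.4},
    {"text": "world", "start": 0.5, "end": 0.9},
    {"text": "again", "start": 1.0, "end": 1.5},
]
print(words_to_srt(sample, max_words=2))
```

Splitting on a fixed word count keeps the example short; a production caption tool would more likely break cues on pauses, punctuation, or a maximum line length.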

Question 4

How much does Scribe cost?

Accepted Answer

Scribe starts at $0.40 per hour of transcribed audio, and Enterprise plans bring the rate well below that at scale.
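
As a rough worked example, the starting rate can be turned into a quick estimate. The sketch below assumes cost scales linearly with audio duration at the quoted $0.40 per hour; actual billing granularity, plan allowances, and Enterprise discounts are not covered here, and `estimate_cost` is a name invented for this example.

```python
from decimal import Decimal

RATE_PER_HOUR = Decimal("0.40")  # starting rate quoted above, in USD


def estimate_cost(audio_minutes) -> Decimal:
    """Estimate cost in USD, assuming simple linear pricing by duration."""
    hours = Decimal(str(audio_minutes)) / Decimal(60)
    return (hours * RATE_PER_HOUR).quantize(Decimal("0.01"))


print(estimate_cost(90))   # 1.5 hours of audio
print(estimate_cost(600))  # 10 hours of audio
```

At the starting rate, 90 minutes of audio comes to $0.60 and 10 hours comes to $4.00.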

Question 5

Can I generate captions for social media videos?

Accepted Answer

Yes. Scribe can auto-generate captions and subtitles for YouTube, TikTok, Instagram, and more, with support for multiple languages to improve accessibility and reach.

Question 6

What is the most accurate Speech to Text model?

Accepted Answer

The most accurate Speech to Text models use deep neural networks trained on large, multilingual datasets. Scribe achieves industry-leading accuracy across 99 languages, outperforming models like Whisper, Deepgram, and Gemini in benchmark tests.

Question 7

Can Speech to Text work in real time?

Accepted Answer

Yes. Real-time Speech to Text converts spoken words into text as they’re being spoken. With Scribe v2 Realtime, transcription occurs in under 150 milliseconds, making it ideal for live conversations, meetings, and AI agents.

Question 8

What can I use Speech to Text for?

Accepted Answer

Speech to Text can be used for meeting notes, podcasts, accessibility captions, customer service calls, and any task that requires converting spoken content into readable text. It also powers real-time AI assistants and automated workflows.

Question 9

How secure is Speech to Text transcription?

Accepted Answer

All Speech to Text data is processed with enterprise-grade security. Transcriptions can be handled through encrypted APIs, and sensitive information can be processed locally or with restricted access to meet compliance standards.

Question 10

Does Speech to Text work offline?

Accepted Answer

Speech to Text technology can work offline if models are deployed locally. Scribe supports cloud and on-premise configurations, allowing enterprises to control data handling while maintaining low latency and high accuracy.

Question 11

Can Speech to Text detect different speakers?

Accepted Answer

Yes. Advanced Speech to Text systems use speaker diarization to distinguish and label multiple speakers automatically, even in overlapping conversations.
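
Once a transcript carries speaker labels, turning it into a readable dialogue is mostly a grouping step. The sketch below shows only that post-processing, not diarization itself. It assumes each word entry has a `speaker_id` and a `text` field, which is an assumed shape for this illustration, and `speaker_turns` is a hypothetical helper name.

```python
from itertools import groupby


def speaker_turns(words):
    """Merge consecutive words from the same speaker into one turn.

    `words` is assumed to be a list of dicts with "speaker_id" and "text"
    keys, in spoken order; this shape is an assumption for the example.
    """
    turns = []
    # groupby only merges adjacent items, so a speaker who talks again
    # later in the conversation starts a new turn.
    for speaker, group in groupby(words, key=lambda w: w["speaker_id"]):
        turns.append((speaker, " ".join(w["text"] for w in group)))
    return turns


labelled = [
    {"speaker_id": "speaker_0", "text": "Hi"},
    {"speaker_id": "speaker_0", "text": "there"},
    {"speaker_id": "speaker_1", "text": "Hello"},
    {"speaker_id": "speaker_0", "text": "How"},
    {"speaker_id": "speaker_0", "text": "are"},
    {"speaker_id": "speaker_0", "text": "you"},
]
for speaker, text in speaker_turns(labelled):
    print(f"{speaker}: {text}")
```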

Question 12

What is the difference between Speech to Text and transcription software?

Accepted Answer

Speech to Text refers to the automatic process of converting spoken language into text using AI, while transcription software may include editing tools, formatting, and collaboration features built around that core technology.

Speech to Text

The most accurate Speech to Text models

Real-time Speech to Text in under 150 ms with Scribe v2 Realtime

Transcribe live speech

High accuracy and ultra-low latency

Voice Activity Detection

Transcribe in 90 languages

Live in the API

Convert speech to text, caption, and edit audio and video with Scribe v1

Transcribe audio and video

Over 95% transcription accuracy

Powerful transcription tools

Dynamic audio tagging

Smart speaker diarization

Enterprise-grade security and infrastructure at scale

Enterprise-level data protection

Granular team permissions

Elevated support and custom deployments

Built for every workflow, from API to agents

Speech to Text APIs and SDKs

ElevenLabs Agents

ElevenLabs Studio