Text to Music
Turn a simple text prompt into music, sound effects, or ambient audio in seconds. Describe the genre, instruments, mood, tempo, or scene you want, and Stable Audio 3 quickly generates ready-to-use audio for videos, ads, games, podcasts, social content, and creative projects.
Core Mode
Text-to-audio is the core generation mode in Stable Audio 3: you write a prompt describing the audio you want, and the model synthesizes it from scratch. No reference clip, no upload, no DAW required. It is the same workflow most creators reach for first when they want to turn a written idea into sound.
What makes Stable Audio 3's text-to-audio different is the architecture. The model uses a semantic-acoustic autoencoder to project audio into a compact latent space, then a latent diffusion transformer conditioned on T5Gemma text embeddings to generate variable-length output — anywhere from a one-second SFX hit to a 6:20 full song, with per-second granularity. Adversarial post-training cuts inference to roughly 2 seconds on an H200 GPU.
"Text-to-audio is not constrained to fixed clip lengths. Stable Audio 3 generates exactly the duration you ask for, at per-second granularity."
Generate full music tracks from prompts that specify genre, instruments, tempo, mood, and key. Pop, electronic, rock, hip-hop, jazz, and ambient genres are all supported, at up to 6 minutes 20 seconds per generation.
Produce short SFX from one-line prompts: door slams, crowd ambience, sci-fi UI hits, weather, footsteps. The Small-SFX variant runs on consumer devices and is built for game audio and post-production.
Generate long-form ambient pads, drones, and background beds for podcasts, videos, and live streams. Variable-length output means you size the bed to the cut, not the other way around.
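As a rough illustration of per-second duration selection, the sketch below checks a requested length against the 1 to 380 second range quoted on this page and formats it as minutes and seconds. The names `validate_duration` and `format_duration`, and the choice to reject out-of-range values, are assumptions made for illustration; they are not part of any Stable Audio API.

```python
MIN_SECONDS = 1    # shortest output mentioned on this page
MAX_SECONDS = 380  # 6 minutes 20 seconds


def validate_duration(seconds: int) -> int:
    """Return `seconds` unchanged if it is a whole number in the quoted range."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise TypeError("duration must be a whole number of seconds")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError(
            f"duration must be between {MIN_SECONDS} and {MAX_SECONDS} seconds"
        )
    return seconds


def format_duration(seconds: int) -> str:
    """Format a validated duration as m:ss."""
    minutes, rest = divmod(validate_duration(seconds), 60)
    return f"{minutes}:{rest:02d}"


print(format_duration(380))  # 6:20
print(format_duration(30))   # 0:30
```

Sizing a bed to a cut then amounts to passing the cut length in whole seconds, for example `format_duration(47)` for a 47-second edit.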
Prompts that read like compact production briefs outperform vague descriptions. Stable Audio 3's text-to-audio mode rewards structure: state the genre first, then layer instruments, mood, tempo, and a production style cue. Generic prompts produce generic audio; specific prompts produce intentional audio.
[Genre] + [Instruments] + [Mood] + [Tempo / BPM] + [Production style]
Prompt
> A lo-fi hip hop beat with mellow piano and brushed drums, 80 BPM, late-night studio mood, vinyl crackle texture.
Why it works: Genre (`lo-fi hip hop`), instruments (`mellow piano, brushed drums`), tempo (`80 BPM`), mood (`late-night`), production cue (`vinyl crackle texture`).
Prompt
> Heavy wooden door slamming shut in a stone hallway, with a brief reverb tail.
Why it works: Object (`heavy wooden door`), action (`slamming shut`), environment (`stone hallway`), production cue (`reverb tail`).
Prompt
> Slow evolving synth pad in C minor, warm analog texture, no drums, 60 BPM, podcast intro bed, 30 seconds.
Why it works: Texture-first (`slow evolving synth pad`), key (`C minor`), production cue (`warm analog`), tempo (`60 BPM`), use case (`podcast intro bed`), duration intent (`30 seconds`).
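The music formula above can be sketched as a small helper that assembles a prompt from its parts. `MusicBrief` and `build_prompt` are hypothetical names introduced for illustration, and the comma-joined output is a simplification of the more natural phrasing in the examples above, not an official prompt format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MusicBrief:
    genre: str
    instruments: List[str]
    mood: str
    bpm: Optional[int] = None
    production_style: Optional[str] = None


def build_prompt(brief: MusicBrief) -> str:
    """Join fields in formula order: genre, instruments, mood, tempo, production style."""
    parts = [
        brief.genre,
        ", ".join(brief.instruments),
        brief.mood,
        f"{brief.bpm} BPM" if brief.bpm else "",
        brief.production_style or "",
    ]
    # Skip any field left empty so the prompt has no stray commas.
    return ", ".join(p.strip() for p in parts if p and p.strip()) + "."


brief = MusicBrief(
    genre="lo-fi hip hop beat",
    instruments=["mellow piano", "brushed drums"],
    mood="late-night studio mood",
    bpm=80,
    production_style="vinyl crackle texture",
)
print(build_prompt(brief))
# lo-fi hip hop beat, mellow piano, brushed drums, late-night studio mood, 80 BPM, vinyl crackle texture.
```

Keeping the fields separate makes it easy to vary one element, such as tempo, while holding the rest of the brief fixed.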
Want the full prompt formula and 30+ examples? Read the Stable Audio 3 prompt guide →
Cinematic orchestral build-up with strings, taiko drums, and brass swell, 100 BPM, trailer-ready dynamics.
House music at 124 BPM with festival energy, layered synths, and a four-on-the-floor kick.
Crackling campfire with distant owl calls and gentle wind through pine trees.
Slow evolving synth pad in C minor, warm analog texture, no drums, 60 BPM, podcast intro bed.
Variable-length output: generate from 1 second to 6 minutes 20 seconds. The model only renders the latents needed for your requested duration.
T5Gemma text encoder: a T5-style encoder, the same text encoder family used in Stable Diffusion 3.x, giving strong prompt adherence with minimal phrasing tax.
Semantic-acoustic autoencoder: a unified latent space captures both style and timbre, so genre prompts and texture prompts work in the same pass.
Adversarial post-training: cuts diffusion steps without quality loss, giving roughly 2-second generation on H200 GPUs and a few seconds on an M4 MacBook Pro.
Training data: 1,278,902 audio recordings, of which 806,284 come from AudioSparx and 472,618 from Freesound (CC-0, CC-BY, CC Sampling+). The training set is described as fully licensed and Creative Commons-sourced.
Open weights: the Small and Medium variants are downloadable under the Stability AI Community License, free for commercial use by organizations under $1M in annual revenue.
Source: Stable Audio 3 technical report, arXiv:2605.17991 (Evans et al., 2026).
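As a quick sanity check on the training-data figures quoted above, the snippet below confirms that the two per-source counts add up to the stated total and shows each source's share. The variable names are illustrative; the numbers are the ones given on this page.

```python
# Per-source recording counts as quoted on this page.
SOURCES = {"AudioSparx": 806_284, "Freesound": 472_618}
STATED_TOTAL = 1_278_902

total = sum(SOURCES.values())
shares = {name: count / total for name, count in SOURCES.items()}

print(total == STATED_TOTAL)  # True
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
# AudioSparx: 63.0%
# Freesound: 37.0%
```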
Generate background music for TikTok, Reels, and YouTube Shorts — match duration to the cut.
Branded intro music and bed loops with full licensing safety for monetization.
Sketch UI sounds, ambience, and combat hits from text before commissioning final audio.
Cinematic orchestral beds and tension cues for animatics, sizzle reels, and pitch decks.
Hook stings, product launch beds, and short-form ad music with commercial-safe outputs.
Long ambient loops for Twitch, focus streams, and lo-fi channels, with commercially usable outputs.
Stable Audio 3 is one of several models offering text-to-audio generation. The differences come down to licensing, duration, openness, and editability.
| Capability | Stable Audio 3 | Suno v4 | Udio | MusicGen |
|---|---|---|---|---|
| Open weights | Small + Medium | No | No | Yes |
| Max duration | 6:20 | ~4 min | ~2:10 | ~30s (default) |
| Training data | Fully licensed / documented | Closed / limited transparency | Closed / limited transparency | Mixed |
| Text-to-SFX | Yes | No | No | Limited |
| Variable-length output | Yes (per-second) | No | No | No |
| Audio inpainting | Yes | No | Limited | No |
| Commercial use under $1M ARR | Free | Paid tier | Paid tier | Free |
| Self-host | Yes | No | No | Yes |
Comparison reflects publicly documented features as of 2026. See each platform's terms for current limits.
Upload an existing clip and describe how it should change — shift genre, swap instrumentation, polish a rough sketch.
Explore the mode →
Select a region of an uploaded clip and let Stable Audio 3 regenerate just that part: fix a section, remove a sound, extend a loop.
Explore the mode →
New users get 100 free credits on sign-up, enough to generate roughly a dozen text-to-audio outputs across music, SFX, and ambient modes. No credit card. When you need more, credit packs start at the entry tier with credits that never expire and commercial-use licensing built in.
See Pricing →
Stable Audio 3 text-to-audio is the generation mode where you input a written prompt and the model synthesizes audio from scratch: music, sound effects, or ambient beds. No reference clip is required. Outputs range from one-second SFX to 6:20 full tracks.
Yes. New users receive 100 free credits on sign-up with no credit card required. The Small and Medium model weights are also available for free download under the Stability AI Community License, which permits commercial use for organizations under $1M in annual revenue.
Stable Audio 3 uses the T5Gemma text encoder, which has strong English coverage and partial multilingual support. For production use, write prompts in English describing genre, instruments, mood, tempo, and production style.
Stable Audio 3 Medium and Large generate up to 380 seconds (6 minutes 20 seconds) per generation. Duration is set at per-second granularity, so a 5-second SFX does not consume the compute budget of a 6-minute track.
Yes, under the Stability AI Community License, organizations with less than $1M in annual revenue can distribute and commercialize outputs freely. Larger organizations need the Enterprise license. The training data is fully licensed from AudioSparx and Freesound CC-licensed sources.
Stable Audio 3 offers open weights for its Small and Medium variants, is trained on fully licensed data, and supports variable-length output up to 6:20 plus audio inpainting. Suno and Udio are closed, web-only services with shorter maximum durations, no open weights, and limited transparency about their training data.
Structured prompts that name genre, instruments, mood, tempo, and a production style cue produce far better results than generic descriptions. Specify BPM and key when you need a clip to sync to a video cut or sit under voiceover.
The Small and Small-SFX variants run on consumer hardware including Apple Silicon MacBooks and CPU-only laptops. The Medium variant runs on a MacBook Pro M4 in a few seconds per generation. Large is available via the Stability AI API.
100 free credits, no credit card. Generate your first text-to-audio clip in under 30 seconds.