VAE

A VAE (Variational Autoencoder) is the component that converts images between full-resolution pixels and the compressed latent space a diffusion model works in - encoding on the way in and decoding on the way out.

Most modern image models do not run diffusion directly on pixels, because that would be far too slow at high resolution. Instead they operate in a compact latent space. The VAE is the translator that makes this possible: its encoder compresses an image into latents, and its decoder turns finished latents back into a full-resolution picture.
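To get a feel for how much smaller the latent space is, here is a minimal arithmetic sketch. The downsample factor of 8 and the 4 latent channels are assumptions, typical of Stable Diffusion-style VAEs; other models use different values, and the function names are purely illustrative.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the latent for an RGB image.

    downsample and latent_channels are assumed values, not taken from
    any specific checkpoint.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


def compression_ratio(height, width, downsample=8, latent_channels=4):
    """How many times fewer values the latent holds than the RGB image."""
    c, h, w = latent_shape(height, width, downsample, latent_channels)
    return (3 * height * width) / (c * h * w)


print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

Under these assumptions a 512x512 RGB image shrinks from 786,432 values to 16,384, which is why denoising in latents is so much cheaper than denoising pixels.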

How it fits in

Text-to-image: the diffusion model denoises in latent space, then the VAE decoder renders the final pixels.
Image-to-image: the VAE encoder turns your input photo into latents so the model can transform it.
Inpainting: the masked region is processed in latents and decoded back through the VAE.
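The three workflows above differ mainly in where the encoder and decoder are called. The sketch below shows that wiring with a deliberately toy "VAE" (block averaging and block repetition on a grayscale grid) and a placeholder denoiser. None of these functions come from a real library; a real VAE is a learned neural network and a real denoiser is the diffusion model itself.

```python
FACTOR = 2  # toy downsample factor (assumption); real VAEs are learned networks


def encode(pixels):
    """Toy encoder: average each FACTOR x FACTOR pixel block into one latent value."""
    h, w = len(pixels), len(pixels[0])
    return [
        [
            sum(pixels[y + dy][x + dx] for dy in range(FACTOR) for dx in range(FACTOR))
            / (FACTOR * FACTOR)
            for x in range(0, w, FACTOR)
        ]
        for y in range(0, h, FACTOR)
    ]


def decode(latents):
    """Toy decoder: expand each latent value back into a FACTOR x FACTOR pixel block."""
    out = []
    for row in latents:
        wide = [v for v in row for _ in range(FACTOR)]
        out.extend([list(wide) for _ in range(FACTOR)])
    return out


def denoise(latents):
    """Placeholder for the diffusion model: just clamps values to [0, 1]."""
    return [[min(1.0, max(0.0, v)) for v in row] for row in latents]


def text_to_image(initial_latents):
    # No encoder involved: denoise in latent space, then decode to pixels.
    return decode(denoise(initial_latents))


def image_to_image(pixels):
    # Encoder on the way in, decoder on the way out.
    return decode(denoise(encode(pixels)))


def inpaint(pixels, latent_mask, fill_latents):
    # Swap in new values only where the latent mask is True, then decode.
    original = encode(pixels)
    mixed = [
        [f if m else o for o, m, f in zip(o_row, m_row, f_row)]
        for o_row, m_row, f_row in zip(original, latent_mask, fill_latents)
    ]
    return decode(denoise(mixed))


photo = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]
print(encode(photo))                                   # [[0.0, 1.0]]
print(image_to_image(photo) == photo)                  # True
print(inpaint(photo, [[False, True]], [[9.0, 0.25]]))  # right half becomes 0.25
```

In the inpainting call only the masked latent cell is replaced, so the left half of the decoded picture is unchanged while the right half takes the new value.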

Why it matters

The VAE affects color accuracy and fine detail. A poor or mismatched VAE can produce washed-out colors, slightly blurry output or odd artifacts; the right one gives crisp, vivid results. Most checkpoints ship with a baked-in VAE, so you rarely have to think about it - but it is what lets a "latent" model show you actual pixels at all.


