Imagen - unprecedented photorealism × deep level of language understanding by Google Research, Brain Team
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. It builds on the power of large transformer language models for understanding text and on the strength of diffusion models for high-fidelity image generation. Our key finding is that generic large language models (e.g., T5) pretrained on text-only corpora are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen improves both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model.
Without ever training on the COCO dataset, Imagen achieves a new state-of-the-art FID score of 7.27, and human raters judge Imagen samples to be on par with the COCO data itself in image-text alignment. To evaluate text-to-image models more thoroughly, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. We use DrawBench to compare Imagen with recent methods such as VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over the other models in both sample quality and image-text alignment.
Imagen, which was published barely a month ago, can produce high-quality, high-resolution images from a text description of a scene, however implausible that scene would be in reality. Several example images are provided below.
- Large Pretrained Language Model × Cascaded Diffusion Model
- deep textual understanding → photorealistic generation
Imagen research highlights
- We show that large pretrained frozen text encoders are very effective for the text-to-image task.
- We show that scaling the pretrained text encoder size is more important than scaling the diffusion model size.
- We introduce a new thresholding diffusion sampler, which enables the use of very large classifier-free guidance weights.
- We introduce a new Efficient U-Net architecture, which is more compute efficient, more memory efficient, and converges faster.
- We achieve a new state-of-the-art COCO FID of 7.27, and human raters find Imagen samples to be on par with reference images in terms of image-text alignment.
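The highlights above name classifier-free guidance and a thresholding sampler without showing either. As a rough illustration only, the sketch below combines a conditional and an unconditional noise prediction with a guidance weight `w`, then applies a percentile-based ("dynamic") threshold to the predicted clean image so that large guidance weights do not push pixel values far outside [-1, 1]. All function names, the flat-list image representation, and the percentile `p=99.5` are assumptions made for this example, not the authors' code.

```python
import math

def classifier_free_guidance(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions.

    w = 1 leaves the conditional prediction unchanged; w > 1 pushes the
    result further toward the text-conditioned direction.
    """
    return [w * c + (1.0 - w) * u for c, u in zip(eps_cond, eps_uncond)]

def percentile(values, p):
    """Linear-interpolation percentile of a non-empty list, p in [0, 100]."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    frac = rank - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac

def static_threshold(x0_pred):
    """Baseline: clip every predicted pixel to [-1, 1]."""
    return [max(-1.0, min(1.0, v)) for v in x0_pred]

def dynamic_threshold(x0_pred, p=99.5):
    """Clip to [-s, s] and rescale by s, where s is the p-th percentile of
    absolute pixel values (never below 1, so in-range images are untouched).
    The value of p here is illustrative.
    """
    s = max(1.0, percentile([abs(v) for v in x0_pred], p))
    return [max(-s, min(s, v)) / s for v in x0_pred]

# Hypothetical 4-pixel "image" whose prediction has saturated under strong guidance.
x0_pred = [4.0, -2.0, 1.0, 0.5]
clipped = static_threshold(x0_pred)            # saturates two pixels at +/-1
rescaled = dynamic_threshold(x0_pred, p=100)   # keeps relative contrast
```

In this toy case static clipping maps both 4.0 and 1.0 to the same value, whereas the dynamic variant rescales the whole image, which is the intuition for why it tolerates larger guidance weights.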
DrawBench: a new comprehensive and challenging benchmark
- Side-by-side human evaluation.
- Systematically test for: compositionality, cardinality, spatial relations, long-form text, rare words, and challenging prompts.
- Human raters strongly prefer Imagen over other methods, in both image-text alignment and image fidelity.
Diffusion Models
- Diffusion models are a technique for generating data that resembles a set of training data. They are trained by corrupting the training data with noise, then learning to recover the data by reversing the noising process. Given an input image, a diffusion model progressively corrupts it with Gaussian noise over a sequence of timesteps, eventually leaving pure Gaussian noise, or "TV static." The model then works backwards, learning to isolate and remove the noise at each timestep, undoing the earlier destructive process.
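The noising and denoising described above can be made concrete with a small sketch. It assumes the standard closed-form forward process used in DDPM-style models, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε, with a linear noise schedule; the schedule values, the function names, and the flat-list image representation are illustrative assumptions, and nothing here is taken from Imagen's actual configuration. A trained network would predict ε from x_t; here the true ε stands in for that prediction to show how the clean image is recovered.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, increasing linearly (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_i) up to t: the fraction of signal left."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def add_noise(x0, alpha_bar_t, rng):
    """Forward process: jump straight from the clean image to timestep t."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)], eps

def predict_x0(xt, eps_pred, alpha_bar_t):
    """Reverse direction: estimate the clean image from a noise prediction."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [(x - b * e) / a for x, e in zip(xt, eps_pred)]

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x0 = [0.8, -0.3, 0.1, -0.9]                  # hypothetical pixels in [-1, 1]
xt, eps = add_noise(x0, alpha_bars[500], rng)
x0_rec = predict_x0(xt, eps, alpha_bars[500])  # true eps stands in for the model
```

Because `alpha_bars` shrinks toward zero, the last timesteps carry almost no signal, which corresponds to the "TV static" end state; the quality of the reverse process depends entirely on how well the network estimates `eps`.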