We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment.