Kuaishou and Tsinghua University Present an Innovative SVG Diffusion Model, Achieving a 62-Fold Jump in Training Efficiency
Published: 2025-10-29 18:19:44 | Editor: 吴昊
The VAE (Variational Autoencoder) has recently found itself being edged out of the generative-modeling toolchain. A collaboration between Tsinghua University and Kuaishou's Ling team has introduced SVG, a VAE-free latent diffusion model. The new approach reports a 62-fold (6200%) improvement in training efficiency and a 35-fold (3500%) speed-up in generation.
The decline of the VAE in image generation stems largely from "semantic entanglement": when we try to change just one feature of an image (say, the color of a cat), other features (such as body size or expression) shift along with it, making edits imprecise. To address this, the SVG model from Tsinghua University and Kuaishou takes a different route, deliberately constructing a feature space that combines semantics with fine detail.
In designing SVG, the team first adopted the pre-trained DINOv3 model as a semantic extractor. Trained via large-scale self-supervised learning, DINOv3 can effectively identify and separate features of different categories, resolving the semantic confusion seen in traditional VAE latents. To supply the missing detail, the team added a lightweight residual encoder, designed so that detail information does not conflict with the semantic features. A distribution alignment mechanism then fuses the two kinds of features, keeping the resulting latent space well-behaved for generation.
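The pipeline above can be sketched at a high level: a frozen semantic branch, a lightweight residual branch for detail, and an alignment step that rescales the residual statistics to match the semantic branch before the two are concatenated into a VAE-free latent. This is a minimal toy illustration, not the paper's implementation: the random projections stand in for DINOv3 and the residual network, and the moment-matching `align` function is an assumed, simplified form of the distribution alignment described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def semantic_encoder(image):
    # Stand-in for the frozen DINOv3 backbone: a fixed linear projection
    # of the flattened image to a 16-dim "semantic" vector.
    flat = image.reshape(-1)
    W = np.linspace(-1.0, 1.0, flat.size * 16).reshape(flat.size, 16)
    return flat @ W

def residual_encoder(image):
    # Stand-in for the lightweight residual branch: a second, different
    # fixed projection meant to capture detail the semantic branch misses.
    flat = image.reshape(-1)
    W = np.cos(np.arange(flat.size * 16, dtype=float)).reshape(flat.size, 16)
    return flat @ W

def align(residual, semantic):
    # Distribution alignment (simplified): rescale the residual branch so
    # its mean and std match the semantic branch, so the two halves of the
    # latent live on a comparable scale.
    r = (residual - residual.mean()) / (residual.std() + 1e-6)
    return r * semantic.std() + semantic.mean()

def encode(image):
    s = semantic_encoder(image)          # frozen semantic features
    r = align(residual_encoder(image), s)  # aligned detail features
    return np.concatenate([s, r])        # VAE-free latent for diffusion

latent = encode(rng.standard_normal((8, 8)))
print(latent.shape)  # (32,)
```

A diffusion model would then be trained directly in this latent space; because the semantic encoder is frozen, no VAE-style reconstruction objective is needed.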
Experimental results show that SVG comprehensively surpasses traditional VAE-based approaches in generation quality and multi-task generalizability. On ImageNet, SVG reached an FID of 6.57 (Fréchet Inception Distance, a metric measuring the similarity between generated and real images; lower is better) after only 80 training epochs, far exceeding VAE-based models of similar scale. In inference efficiency, SVG also performs well, producing clear images with fewer sampling steps. Moreover, its feature space can be used directly for visual tasks such as image classification and semantic segmentation without additional fine-tuning, greatly improving application flexibility.
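The "no fine-tuning" claim is usually evaluated with a linear or nearest-centroid probe: a trivial classifier is fit on top of frozen features, and good accuracy indicates the features themselves carry class information. The sketch below illustrates that protocol on synthetic stand-in latents (the class structure is fabricated for the toy; real evaluation would use actual SVG latents and labels).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for frozen SVG latents of 100 images from 2 classes:
# each class clusters around its own 32-dim centroid.
centroids = rng.standard_normal((2, 32)) * 3.0
labels = rng.integers(0, 2, size=100)
latents = centroids[labels] + rng.standard_normal((100, 32))

# Nearest-centroid probe: estimate class means from the frozen features,
# then classify each latent by its closest estimated centroid. The
# encoder itself is never updated.
est = np.stack([latents[labels == c].mean(axis=0) for c in (0, 1)])
dists = np.linalg.norm(latents[:, None, :] - est[None, :, :], axis=-1)
pred = dists.argmin(axis=1)

accuracy = (pred == labels).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

High probe accuracy with such a simple classifier is the standard evidence that a frozen feature space transfers to downstream tasks.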
The new technology developed by Tsinghua University and Kuaishou not only brings revolutionary changes to the field of image generation but also shows great potential in multimodal generation tasks.
Paper link: https://arxiv.org/pdf/2510.15301
Welcome to the [AI Daily] column! This is your daily guide to exploring the world of artificial intelligence. Every day, we bring you the hot topics in AI with a focus on developers, helping you track technical trends and learn about innovative AI product applications.
