BiDM: Pushing Diffusion Model Quantization to the 1-Bit Limit, Achieving New State-of-the-Art Results

Diffusion models (DMs) have garnered significant attention for their remarkable ability to generate high-quality and diverse data across various domains, including images, speech, and video. These models achieve this by iteratively refining a random noise input through a denoising process that can involve thousands of steps. While advancements in faster sampling techniques have reduced the number of steps needed, the computationally expensive floating-point operations at each step remain a significant bottleneck, limiting the widespread application of DMs in resource-constrained environments. Consequently, compressing diffusion models has emerged as a crucial step towards broader adoption.

Existing compression methods primarily focus on quantization, distillation, and pruning, aiming to reduce storage and computational costs while preserving accuracy. Quantization, in particular, stands out as a highly effective technique, achieving compact storage and efficient computation during inference by representing weights and/or activations as low-bit integers or even binary values. Several studies have applied quantization to diffusion models, successfully compressing and accelerating them while maintaining reasonable generative quality. 1-bit quantization, or binarization, offers the most significant reduction in model size, and it has proven particularly effective in discriminative models like Convolutional Neural Networks (CNNs). Furthermore, when both weights and activations are quantized to 1 bit (full binarization), matrix multiplications can be replaced by efficient bitwise operations such as XNOR and bitcount, yielding maximum speedup. While some existing works have explored 1-bit quantization in diffusion models, their efforts have largely concentrated on quantizing weights alone, leaving the challenge of full binarization largely unaddressed.

Full binarization of both weights and activations in diffusion models, however, presents significant challenges. The rich intermediate representations crucial for the generative capabilities of DMs are highly time-step dependent, and the highly dynamic activation ranges are severely restricted when weights and activations are binarized. Additionally, generating complete images, a hallmark of diffusion models, becomes problematic due to the highly discrete parameter and feature spaces, which make it difficult to match the real-valued targets during training. The difficulty of optimization in this discrete space, coupled with the insufficient representational capacity of time-step-dependent features, often leads to poor convergence or even training failure in binarized diffusion models.

Introducing BiDM: Full Binarization of Weights and Activations

To address these limitations, researchers from Beihang University, ETH Zurich, and other institutions have introduced BiDM, a novel method that pushes the boundaries of diffusion model compression by achieving full binarization of both weights and activations. BiDM is designed to tackle the unique requirements posed by the activation characteristics, model architecture, and generative nature of diffusion models, overcoming the challenges associated with full binarization. BiDM incorporates two key innovations:

1. Time-Step-Friendly Binary Structure (TBS): Recognizing the strong time-step dependency of activation features in diffusion models, TBS employs a learnable activation binarizer to match the dynamic activation ranges of the diffusion model. It also incorporates cross-time-step feature connections, leveraging the similarity between features in adjacent time steps to enhance the representational capacity of the binarized model.

2. Spatial Patch Distillation (SPD): Acknowledging the spatial locality inherent in the convolutional U-Net architecture commonly used in diffusion models, and the nature of image generation tasks, SPD introduces a full-precision model as supervision. By mimicking self-attention on patches, SPD focuses on local features, effectively guiding the optimization of the binarized diffusion model.
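
To make the patch-wise idea concrete, below is a minimal PyTorch sketch of a spatial patch distillation loss. The function name, the patch size, the average-pooled patch tokens, and the attention-style similarity over patches are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def patch_attention_distill(student_feat, teacher_feat, patch_size=4):
    """Illustrative patch-wise distillation loss (not the official BiDM code).

    student_feat, teacher_feat: (B, C, H, W) feature maps from the binarized
    and full-precision U-Nets at the same layer and time step.
    """
    def patch_tokens(x):
        # Split the feature map into non-overlapping patches and average-pool
        # each patch into a single C-dimensional token: (B, N_patches, C).
        b, c, h, w = x.shape
        x = F.unfold(x, kernel_size=patch_size, stride=patch_size)   # (B, C*p*p, N)
        x = x.view(b, c, patch_size * patch_size, -1).mean(dim=2)    # (B, C, N)
        return x.transpose(1, 2)                                     # (B, N, C)

    def patch_attention(tokens):
        # Self-attention-style similarity between patch tokens.
        sim = torch.bmm(tokens, tokens.transpose(1, 2)) / tokens.shape[-1] ** 0.5
        return sim.softmax(dim=-1)                                   # (B, N, N)

    a_s = patch_attention(patch_tokens(student_feat))
    a_t = patch_attention(patch_tokens(teacher_feat))
    return F.mse_loss(a_s, a_t)
```

In quantization-aware training, such a term would be added to the standard denoising objective so that the binarized model mimics the local feature relations of its full-precision teacher.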

Extensive experiments demonstrate BiDM's superior performance, exceeding all existing baselines across various evaluation metrics while maintaining comparable inference efficiency. Specifically, in pixel-space diffusion models, BiDM achieves an Inception Score (IS) of 5.18, approaching the performance of full-precision models and outperforming the best baseline by 0.95. In Latent Diffusion Models (LDMs), BiDM achieves a Fréchet Inception Distance (FID) score of 22.74 on LSUN-Bedrooms, a significant improvement over the state-of-the-art of 59.44, while simultaneously achieving 28.0x storage savings and 52.7x operational efficiency gains. As the first method capable of full binarization for diffusion models, BiDM generates visually acceptable images, enabling efficient deployment of DMs in resource-constrained scenarios.

Implementation Details: Binarized Diffusion Models

Baseline Diffusion Models: Given a data distribution p(x₀), the forward diffusion process generates a sequence of random variables x₁, ..., x_T using a transition kernel q(xₜ|xₜ₋₁), typically involving Gaussian perturbations:

q(xₜ|xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)

where βₜ ∈ (0, 1) is a noise schedule. The Gaussian transition kernel allows the joint distribution to be marginalized, so that xₜ can be sampled directly from x₀: letting αₜ = 1 - βₜ and ᾱₜ = ∏ₛ₌₁ᵗ αₛ, one draws a Gaussian vector ε ~ N(0, I) and applies the transformation xₜ = √ᾱₜ·x₀ + √(1-ᾱₜ)·ε.
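
This closed-form sampling takes only a few lines of code. The sketch below assumes a simple linear βₜ schedule and is meant only to illustrate how xₜ is drawn directly from x₀.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # ᾱₜ = ∏ₛ αₛ

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    if noise is None:
        noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over (B, C, H, W)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```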

The reverse process aims to generate samples by removing noise, approximating the unavailable conditional distribution q(xₜ₋₁|xₜ) with a learnable transition kernel pθ(xₜ₋₁|xₜ):

pθ(xₜ₋₁|xₜ) = N(xₜ₋₁; μθ(xₜ, t), Σθ(xₜ, t)) ≈ q(xₜ₋₁|xₜ)
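
In practice this kernel is commonly realized by predicting the added noise and converting it into the posterior mean. The sketch below assumes the standard DDPM noise-prediction parameterization and reuses the schedule tensors from the forward-process sketch above; it is an illustration, not BiDM-specific code.

```python
@torch.no_grad()
def p_sample(model, x_t, t):
    """One reverse step x_t -> x_{t-1} with
    μθ(x_t, t) = (x_t - βₜ/√(1-ᾱₜ) · εθ(x_t, t)) / √αₜ."""
    eps = model(x_t, t)                                   # predicted noise εθ(x_t, t)
    beta, alpha, a_bar = betas[t], alphas[t], alpha_bars[t]
    mean = (x_t - beta / (1.0 - a_bar).sqrt() * eps) / alpha.sqrt()
    if t == 0:
        return mean
    return mean + beta.sqrt() * torch.randn_like(x_t)     # fixed variance Σ = βₜ·I
```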

The mean and variance can be obtained using the reparameterization trick. The training of diffusion models typically uses a simplified variant of the variational lower bound as a loss function to improve sample quality:

L = Eₓ₀∼p(x₀), ε∼N(0,I), t∼U(1,T) [||ε - εθ(xₜ, t)||²]
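
A minimal training step under this objective might look like the following, reusing q_sample and the schedule from the sketches above; the model and optimizer names are placeholders.

```python
def training_step(model, optimizer, x0):
    """One optimization step on the simplified denoising objective."""
    t = torch.randint(0, T, (x0.shape[0],))   # t ~ U(1, T), zero-indexed here
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)
    loss = torch.nn.functional.mse_loss(model(x_t, t), noise)  # ||ε - εθ(x_t, t)||²
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```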

U-Nets are widely used as backbones in diffusion models due to their ability to fuse low-level and high-level features. The input and output features of the m-th stage can be denoted xₘ and yₘ, where smaller m corresponds to lower levels, and Dₘ(·) and Uₘ(·) denote the m-th down-sampling and up-sampling blocks. Skip connections propagate low-level information from Dₘ(·) to Uₘ(·), so the input to Uₘ becomes the concatenation:

Uₘ(·) = [Dₘ(·), xₘ]
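
In code, such a skip connection is just a channel-wise concatenation. The sketch below is a generic illustration of one encoder/decoder stage pair, not the specific BiDM U-Net.

```python
import torch
import torch.nn as nn

class SkipStage(nn.Module):
    """One down/up stage pair with a U-Net style skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.down = nn.Conv2d(channels, channels, 3, padding=1)    # Dₘ
        self.up = nn.Conv2d(2 * channels, channels, 3, padding=1)  # Uₘ takes the concatenation

    def forward(self, x_m, deeper_feature):
        d_m = self.down(x_m)
        # Concatenate the skip feature with the feature from the deeper stage.
        return self.up(torch.cat([d_m, deeper_feature], dim=1))
```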

Binarization: Quantization compresses and accelerates the noise estimation model by discretizing weights and activations to low bit-widths. In a baseline binarized diffusion model, weights W are binarized to 1 bit:

W̃ = α · s(W)

where s(·) is the sign function, restricting each entry to +1 or -1 with a threshold of 0, and α is a floating-point scalar initialized to ||W||₁/n (n denoting the number of weights) and learned during training. Activations are typically quantized using a simple BNN quantizer:

Ã = s(A)
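
A minimal PyTorch sketch of these two baseline binarizers, applied inside a convolution, is given below; the straight-through estimator (STE) used for gradients and the class name are assumptions for illustration, while α is initialized to ||W||₁/n as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizedConv2d(nn.Conv2d):
    """Convolution with 1-bit weights (W̃ = α·s(W)) and 1-bit activations (Ã = s(A))."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Learnable scale α, initialized to ||W||₁ / n (mean absolute weight value).
        self.alpha = nn.Parameter(self.weight.abs().mean())

    @staticmethod
    def ste_sign(x):
        # sign(x) in the forward pass; identity gradient in the backward pass (STE).
        return x + (torch.sign(x) - x).detach()

    def forward(self, a):
        w_bin = self.alpha * self.ste_sign(self.weight)
        a_bin = self.ste_sign(a)
        return F.conv2d(a_bin, w_bin, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)
```

Swapping the U-Net's convolutions for such layers yields the kind of baseline binarized diffusion model that BiDM then improves with TBS and SPD.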

When both weights and activations are quantized to 1 bit, the computation of the denoising model can be replaced by XNOR and bitcount operations, achieving significant compression and acceleration.
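
The speedup comes from the fact that the dot product of two ±1 vectors can be computed with XNOR and bitcount on their bit-packed representations. The small sketch below demonstrates this equivalence on plain Python integers; it illustrates the arithmetic only, not an optimized kernel.

```python
def xnor_popcount_dot(a_bits, b_bits, n):
    """Dot product of two ±1 vectors packed as n-bit integers (1 ↦ +1, 0 ↦ -1).

    XNOR is 1 wherever the signs agree, so the dot product equals
    (#agreements) - (#disagreements) = 2 * popcount(XNOR) - n.
    """
    mask = (1 << n) - 1
    xnor = ~(a_bits ^ b_bits) & mask
    return 2 * bin(xnor).count("1") - n

# Example: a = [+1, -1, +1, +1], b = [+1, +1, -1, +1]  ->  dot = 1 - 1 - 1 + 1 = 0
a_bits, b_bits = 0b1011, 0b1101
assert xnor_popcount_dot(a_bits, b_bits, 4) == 0
```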

Time-Step-Friendly Binary Structure (TBS)

Before detailing the proposed method, the authors summarize their observations on Diffusion Model (DM) properties:

Observation 1: While activation ranges change significantly across long time steps, activation features exhibit similarity between short, adjacent time steps. Previous work, such as TDQ and Q-DM, has shown that the activation distribution of DMs is highly time-step dependent during the denoising process, exhibiting similarity between adjacent time steps but significant differences between distant time steps. Applying a fixed scaling factor across all time steps results in severe distortion of the activation range. This observation motivates a re-examination of existing binarization structures. Binarization, especially full binarization of weights and activations, leads to greater loss of activation range and accuracy compared to low-bit quantization like 4-bit. This makes generating rich activation features more difficult. The insufficiency of activation range and output features severely impairs generative models with rich representations like DMs. Therefore, employing a more flexible activation range binarizer and enhancing the overall expressive power of the model by leveraging its feature output are crucial strategies to improve generative capability after full binarization.

The authors first address the differences between long time steps. Most existing activation quantizers, such as BNN and Bi-Real, directly quantize activations to {+1, -1}. This method severely disrupts activation features and negatively impacts the expressive power of the generative model. Improved activation binarizers, like XNOR++, utilize a trainable scaling factor k:

à = s(kA)

While this partially recovers the activation feature representation, a single learned factor does not adapt to the strongly time-step-dependent activations and can still lead to significant performance loss. The authors therefore turn to the original XNOR-Net binarizer, which uses a dynamically computed mean of the activations to construct the scaling factor. This naturally preserves the range of the activation features and adjusts to the input range at different time steps. However, because DM features carry such rich representations, local activations exhibit inconsistent ranges before and after passing through a module, so a predetermined scaling rule alone cannot effectively recover the activation representation. Therefore, the authors make k learnable, allowing it to adapt during training to the activation ranges at different time steps.
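
One possible reading of this design is sketched below: the dynamically computed mean absolute activation (as in XNOR-Net) supplies the range, a learnable per-channel factor k refines it, and a straight-through estimator handles the sign. This is a simplified sketch under those assumptions, not BiDM's exact TBS binarizer.

```python
import torch
import torch.nn as nn

class LearnableActivationBinarizer(nn.Module):
    """Ã = k · mean(|A|) · s(A), with k learned and mean(|A|) computed per input."""
    def __init__(self, num_channels):
        super().__init__()
        self.k = nn.Parameter(torch.ones(1, num_channels, 1, 1))

    def forward(self, a):
        # Dynamic scale: per-channel mean absolute activation of the current input,
        # so the scale follows the time-step-dependent activation range.
        scale = a.abs().mean(dim=(2, 3), keepdim=True)
        a_bin = torch.sign(a)
        a_bin = a + (a_bin - a).detach()   # straight-through estimator
        return self.k * scale * a_bin
```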
