SIGGRAPH Asia 2025

We introduce a learning-based guidance scheduler for denoising diffusion models that adaptively adjusts the guidance scale during generation.

By leveraging the difference between conditional and unconditional predictions of a diffusion model, our scheduler provides sample-specific, trajectory-aware guidance that achieves a superior balance between image quality and prompt alignment, outperforming existing methods on CLIP, FID, and FD-DINO at no added computational cost.

Abstract

Denoising diffusion models excel at generating high-quality images conditioned on text prompts, yet their effectiveness heavily relies on careful guidance during the sampling process. Classifier-Free Guidance (CFG) provides a widely used mechanism for steering generation by setting the guidance scale, which balances image quality and prompt alignment. However, the choice of the guidance scale has a critical impact on the convergence toward a visually appealing and prompt-adherent image. In this work, we propose an annealing guidance scheduler which dynamically adjusts the guidance scale over time based on the conditional noisy signal. By learning a scheduling policy, our method addresses the temperamental behavior of CFG. Empirical results demonstrate that our guidance scheduler significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation. Notably, our novel scheduler requires no additional activations or memory consumption, and can seamlessly replace the common classifier-free guidance, offering an improved trade-off between prompt alignment and quality.

Annealing Guidance Scale

Our method replaces the fixed guidance scale used in CFG (left) with a learned scheduler that adaptively adjusts the guidance strength at every denoising step (highlighted in red), as shown in our algorithm (right). Instead of using a constant guidance scale, the scheduler computes a sample-specific value from the current timestep, the difference between the model's conditional and unconditional predictions, and a user-defined prompt-alignment trade-off parameter λ, which offers better control than the traditional guidance scale w. As in CFG++, we use the null-condition prediction at the renoising step (highlighted in red).
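To make the difference between the sampling steps concrete, here is a minimal NumPy sketch of one DDIM-style update. It is an illustration under stated assumptions, not our released implementation: `toy_guidance_scale` is a hypothetical, hand-written stand-in for the learned scheduler (it only mimics the inputs: timestep, prediction difference, and λ), and the names `ddim_step`, `alpha_t`, `alpha_prev`, and `w_max` are introduced here for the example.

```python
import numpy as np


def toy_guidance_scale(t, delta, lam, w_max=10.0):
    """Hypothetical placeholder for the learned scheduler (NOT the trained policy).

    t     : normalized timestep in [0, 1]
    delta : conditional minus unconditional noise prediction
    lam   : user-chosen trade-off parameter in [0, 1]
    """
    rms = float(np.sqrt(np.mean(delta ** 2)))
    # Arbitrary rule for illustration: more guidance when the two
    # predictions disagree more, scaled by the timestep and by lam.
    return w_max * lam * t * (1.0 - np.exp(-rms))


def ddim_step(x_t, eps_uncond, eps_cond, alpha_t, alpha_prev, w,
              renoise_with_uncond=True):
    """One deterministic DDIM-style step with guidance scale w.

    alpha_t and alpha_prev are cumulative signal coefficients (alpha-bar).
    renoise_with_uncond=True renoises with the null-condition prediction,
    as in CFG++; False renoises with the guided prediction, as in plain CFG.
    """
    eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
    # Estimate the clean sample from the guided noise prediction.
    x0_hat = (x_t - np.sqrt(1.0 - alpha_t) * eps_guided) / np.sqrt(alpha_t)
    eps_renoise = eps_uncond if renoise_with_uncond else eps_guided
    return np.sqrt(alpha_prev) * x0_hat + np.sqrt(1.0 - alpha_prev) * eps_renoise


# Usage with random arrays standing in for a real model's predictions.
rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 4))
eps_u = rng.standard_normal((4, 4))
eps_c = rng.standard_normal((4, 4))
w_t = toy_guidance_scale(t=0.5, delta=eps_c - eps_u, lam=0.8)
x_prev = ddim_step(x_t, eps_u, eps_c, alpha_t=0.5, alpha_prev=0.7, w=w_t)
```

In this sketch, the only change relative to a fixed-scale sampler is that `w` is recomputed at every step from quantities the sampler already has, which is why no extra model evaluations are needed.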

Here, we show how our annealing guidance scheduler adapts the guidance scale over time for two example prompts, A and B. On the right, the plot shows that while CFG++ uses a constant guidance value, our method dynamically adjusts the scale throughout the denoising process, with trajectories that vary across prompts and even between different random seeds. CFG is omitted from the plot for clarity; it uses a fixed scale of w = 10. On the left, we compare outputs from CFG (left column), CFG++ (middle), and our method (right) for each prompt. Our approach improves both visual quality and prompt alignment, correcting distorted hands in A and fixing object count errors in B.

Results

We evaluate our method on the COCO evaluation set and visualize the trade-off between image quality and prompt alignment. We report FID and FD-DINO for image quality, and CLIP for prompt alignment.
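FID and FD-DINO are both Fréchet distances between Gaussian fits to feature statistics of real and generated images; they differ in the feature extractor (Inception features for FID, DINOv2 features for FD-DINO). The sketch below shows how such a distance is typically computed from two feature arrays. It is an illustrative assumption, not the evaluation code used for the reported numbers: the function name `frechet_distance` is hypothetical, and feature extraction is omitted.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b : arrays of shape (num_samples, feature_dim),
                       with feature_dim >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices (clipped here to absorb numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * trace_sqrt)
```

Lower values indicate that the generated-image features are statistically closer to the real-image features.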