Navigating with Annealing Guidance Scale in Diffusion Space

Anonymous Authors


We introduce a learning-based guidance scheduler for denoising diffusion models that adaptively adjusts the guidance scale during generation.

By leveraging the difference between conditional and unconditional predictions of a diffusion model, our scheduler provides sample-specific, trajectory-aware guidance that achieves a superior balance between image quality and prompt alignment, outperforming existing methods on CLIP, FID, and FD-DINO — without added computational cost.

Abstract

Denoising diffusion models excel at generating high-quality images conditioned on text prompts, yet their effectiveness heavily relies on careful guidance during the sampling process. Classifier-Free Guidance (CFG) provides a widely used mechanism for steering generation by setting the guidance scale, which balances image quality and prompt alignment. However, the choice of the guidance scale has a critical impact on the convergence toward a visually appealing and prompt-adherent image. In this work, we propose an annealing guidance scheduler which dynamically adjusts the guidance scale over time based on the conditional noisy signal. By learning a scheduling policy, our method addresses the temperamental behavior of CFG. Empirical results demonstrate that our guidance scheduler significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation. Notably, our novel scheduler requires no additional activations or memory consumption, and can seamlessly replace the common classifier-free guidance, offering an improved trade-off between prompt alignment and quality.

Annealing Guidance Scale

[Figure: CFG inference (left) and the Annealing Scheduler, ours (right)]

Our method replaces the fixed guidance scale used in CFG (left) with a learned scheduler that adaptively adjusts the guidance strength at every denoising step (highlighted in red), as shown in our algorithm (right). Instead of using a constant guidance scale, the scheduler computes a sample-specific value from three inputs: the current timestep, the difference between the model's conditional and unconditional predictions, and a user-defined prompt-alignment trade-off parameter λ, which offers better control than the traditional guidance scale w. As in CFG++, we use the null-condition prediction at the renoising step (highlighted in red).
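The two inference loops described above differ in only two places per step, which a minimal sketch can make concrete. The code below assumes a deterministic DDIM-style update with cumulative noise-schedule values `abar_t` and `abar_prev`; the sampler and these names are assumptions, since the page does not specify them. `toy_scheduler` is a hypothetical hand-written stand-in for the learned scheduler, not the trained policy: it only demonstrates the interface (timestep, prediction difference, λ in, scalar guidance scale out).

```python
import numpy as np


def toy_scheduler(t_frac, delta, lam):
    """Hypothetical stand-in for the learned scheduler (NOT the trained policy).

    t_frac: fraction of the trajectory remaining, in [0, 1] (1 = pure noise).
    delta:  conditional minus unconditional noise prediction.
    lam:    user-chosen prompt-alignment trade-off parameter.
    """
    delta_rms = float(np.sqrt(np.mean(delta ** 2)))
    return lam * (1.0 + delta_rms) * t_frac


def ddim_step(x_t, eps_cond, eps_uncond, abar_t, abar_prev, w, renoise_with_uncond):
    """One deterministic DDIM-style update with guidance scale w (assumed sampler)."""
    eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
    # Denoise: estimate the clean sample from the guided noise prediction.
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_guided) / np.sqrt(abar_t)
    # Renoise: CFG reuses the guided prediction; the CFG++-style variant
    # uses the null-condition (unconditional) prediction instead.
    eps_renoise = eps_uncond if renoise_with_uncond else eps_guided
    return np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps_renoise


def cfg_step(x_t, eps_cond, eps_uncond, abar_t, abar_prev, w):
    """Standard CFG: fixed scale w, guided prediction used for renoising."""
    return ddim_step(x_t, eps_cond, eps_uncond, abar_t, abar_prev, w, False)


def annealed_step(x_t, eps_cond, eps_uncond, abar_t, abar_prev, t_frac, lam,
                  scheduler=toy_scheduler):
    """Per-step scale from a scheduler, null-condition prediction for renoising."""
    w_t = scheduler(t_frac, eps_cond - eps_uncond, lam)
    x_prev = ddim_step(x_t, eps_cond, eps_uncond, abar_t, abar_prev, w_t, True)
    return x_prev, w_t


# Random arrays stand in for a latent and the two model predictions.
rng = np.random.default_rng(0)
x_t = rng.standard_normal((4, 8, 8))
eps_c = rng.standard_normal(x_t.shape)
eps_u = rng.standard_normal(x_t.shape)
x_prev, w_t = annealed_step(x_t, eps_c, eps_u, abar_t=0.5, abar_prev=0.6,
                            t_frac=0.7, lam=0.4)
print(x_prev.shape, round(w_t, 3))
```

The scheduler consumes only the two predictions that CFG already computes at each step, so in this sketch the extra work is a single scalar evaluation and no additional model call.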

[Figure: sample outputs (left) and guidance-scale plot (right)]

Here, we show how our annealing guidance scheduler adapts the guidance scale over time for two example prompts, A and B. On the right, the plot shows that while CFG++ uses a constant guidance value, our method adjusts the scale throughout the denoising process, with trajectories that vary across prompts and even across random seeds. CFG is omitted from the plot for clarity; it uses a fixed scale of w = 10. On the left, we compare outputs from CFG (left column), CFG++ (middle), and our method (right) for each prompt. Our approach improves both visual quality and prompt alignment, correcting distorted hands in A and fixing object-count errors in B.

Results

We evaluate our method on the COCO evaluation set and visualize the trade-off between image quality and prompt alignment. We report FID and FD-DINO for image quality, and CLIP for prompt alignment.

[Plots: FID vs. CLIP and FD-DINO vs. CLIP]

As shown in the plots, our method achieves a more favorable trade-off between image quality and prompt alignment than existing guidance methods.
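For readers unfamiliar with the metrics, FID and FD-DINO are both Fréchet distances between Gaussian fits of image features (they differ in the feature extractor), while CLIP-based alignment rests on the cosine similarity between image and text embeddings. The sketch below illustrates only these two formulas, assuming features have already been extracted into arrays; the function names are illustrative, no feature extractor is included, and this is not the evaluation code behind the reported numbers.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussian fits of two feature sets (rows = samples)."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative up to round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Synthetic features: shifting every sample changes the mean but not the covariance.
rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))
shifted = real + np.array([1.0, 0.0, 0.0, 0.0])
print(frechet_distance(real, real), frechet_distance(real, shifted))
```

Lower Fréchet distance indicates generated features closer to the reference distribution, and higher cosine similarity indicates closer image-text agreement, which is why the trade-off plots pair one against the other.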

CFG
Ours

"A dog running with a stick in its mouth, eiffel tower in the background"

CFG++
Ours

"Photo of a bear wearing colorful glasses: left glass is red, right is blue"

CFG
Ours

"A statue of Abraham Lincoln wearing an opaque and shiny astronaut's helmet. The statue sits on the moon..."

CFG++
Ours

"A gloved hand holding a strawberry milkshake in a cup, complete with a straw and umbrella, with Earth visible in the distance on the moon horizon"

CFG
Ours

"A bride and groom cutting their wedding cake"

CFG++
Ours

"Three Young Foxes by Kain Shannon"

CFG
Ours

"Five red balls on a table"

CFG++
Ours

"A dog chasing a cat in a desert, at high speed"

CFG
Ours

"A demonic looking chucky like doll standing next to a white clock"

CFG++
Ours

"A ghost sitting on a living room chair"

CFG
Ours

"A girl riding a giant bird over a futuristic city"

CFG++
Ours

"A yellow diamond-shaped sign with a deer silhouette"

CFG
Ours

"A small boy trying to fly a small kite"

CFG++
Ours

"There is a stop sign outside of a window"

CFG
Ours

"A portrait of a statue of a pharaoh wearing steampunk glasses, white t-shirt and leather jacket. dslr photograph"

CFG++
Ours

"An old man lifts a barbell above his head"

CFG
Ours

"A baby sitting on a female's lap staring into the camera"

CFG++
Ours

"A white bear in glasses, wearing tuxedo, glowing hat, and with cigare at the British queen reception"

CFG
Ours

"A tropical bird"

CFG++
Ours

"A photo of a ram and a polar bear walking in London."