Denoising diffusion models excel at generating high-quality images conditioned on text prompts, yet their effectiveness heavily relies on careful guidance during the sampling process. Classifier-Free Guidance (CFG) provides a widely used mechanism for steering generation by setting the guidance scale, which balances image quality and prompt alignment. However, the choice of the guidance scale has a critical impact on the convergence toward a visually appealing and prompt-adherent image. In this work, we propose an annealing guidance scheduler which dynamically adjusts the guidance scale over time based on the conditional noisy signal. By learning a scheduling policy, our method addresses the temperamental behavior of CFG. Empirical results demonstrate that our guidance scheduler significantly enhances image quality and alignment with the text prompt, advancing the performance of text-to-image generation. Notably, our novel scheduler requires no additional activations or memory consumption, and can seamlessly replace the common classifier-free guidance, offering an improved trade-off between prompt alignment and quality.
Our method replaces the fixed guidance scale used in CFG (left) with a learned scheduler that adaptively adjusts the guidance strength at every denoising step (highlighted in red), as shown in our algorithm (right). Instead of using a constant guidance scale, the scheduler computes a sample-specific value based on the current timestep, the difference between the model's conditional and unconditional predictions, and a user-defined prompt-alignment trade-off parameter λ, which offers better control than the traditional guidance scale w. Similarly to CFG++, we use the null-condition prediction at the renoising step (highlighted in red).
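To make the difference between the two update rules concrete, here is a minimal NumPy sketch of one deterministic DDIM-style denoising step. It is an illustration under stated assumptions, not the released implementation: `toy_scheduler` is a hypothetical hand-written stand-in for the learned scheduler (the real one is a trained network whose architecture is not described here), and the names `ddim_step`, `abar_t`, and `abar_prev` are chosen for this example only.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Standard CFG combination of the two noise predictions."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def toy_scheduler(t, delta, lam):
    """HYPOTHETICAL stand-in for the learned scheduler.

    It takes the same three inputs the text describes (timestep t, the
    conditional/unconditional difference delta, and the trade-off
    parameter lam), but the formula below is arbitrary and only shows
    that the returned scale varies per sample and per step.
    """
    delta_norm = float(np.linalg.norm(delta))
    return lam * (1.0 + t) / (1.0 + delta_norm)

def ddim_step(x_t, eps_uncond, eps_cond, w, abar_t, abar_prev,
              renoise_with_uncond=True):
    """One deterministic DDIM step with guidance scale w.

    abar_t and abar_prev are the cumulative alpha products at the
    current and next timestep. With renoise_with_uncond=True the
    unconditional (null-condition) prediction is used at the renoising
    step, as in CFG++; with False the guided prediction is reused, as
    in plain CFG.
    """
    eps_guided = cfg_combine(eps_uncond, eps_cond, w)
    # Estimate the clean sample from the guided noise prediction.
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps_guided) / np.sqrt(abar_t)
    # Re-add noise at the lower noise level.
    eps_renoise = eps_uncond if renoise_with_uncond else eps_guided
    return np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps_renoise
```

In a sampling loop, a fixed-scale method would pass the same `w` at every step, whereas a scheduler-based method would recompute it each step, for example `w = toy_scheduler(t, eps_cond - eps_uncond, lam)`, before calling `ddim_step`.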
Here, we show how our annealing guidance scheduler adapts the guidance scale over time for two example prompts: A and B. On the right, the plot shows that while CFG++ uses a constant guidance value, our method dynamically adjusts the scale throughout the denoising process, with trajectories that vary across prompts and even between different random seeds. CFG is not shown in the plot for clarity, but it uses a fixed scale of w = 10. On the left, we compare outputs from CFG (left column), CFG++ (middle), and our method (right) for each prompt. Our approach improves both visual quality and prompt alignment, correcting distorted hands in A and fixing object count errors in B.
We evaluate our method on the COCO evaluation set and visualize the trade-off between image quality and prompt alignment. We report FID and FD-DINO for image quality, and CLIP for prompt alignment.
As shown in the plots, our method achieves a more favorable trade-off curve compared to existing guidance methods.
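For readers unfamiliar with the image-quality metrics, both FID and FD-DINO are Fréchet distances between Gaussian fits of feature embeddings from real and generated images; they differ in the feature extractor (Inception features for FID, DINOv2 features for FD-DINO). The sketch below computes only the distance itself from two feature arrays that are assumed to have been extracted already. The function name `frechet_distance` is chosen for this example, and this is not the evaluation code behind the plots.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussian fits of two feature sets.

    feats_a, feats_b: arrays of shape (num_images, feature_dim).
    Computes ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^(1/2)).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((C_a C_b)^(1/2)) equals the sum of square roots of the
    # eigenvalues of C_a C_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Lower values indicate that the generated-image features are statistically closer to the real-image features. Sweeping the guidance parameter and plotting this distance against the CLIP score yields a trade-off curve of the kind described above.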
"A dog running with a stick in its mouth, eiffel tower in the background"
"Photo of a bear wearing colorful glasses: left glass is red, right is blue"
"A statue of Abraham Lincoln wearing an opaque and shiny astronaut's helmet. The statue sits on the moon..."
"A gloved hand holding a strawberry milkshake in a cup, complete with a straw and umbrella, with Earth visible in the distance on the moon horizon"
"A bride and groom cutting their wedding cake"
"Three Young Foxes by Kain Shannon"
"Five red balls on a table"
"A dog chasing a cat in a desert, at high speed"
"A demonic looking chucky like doll standing next to a white clock"
"A ghost sitting on a living room chair"
"A girl riding a giant bird over a futuristic city"
"A yellow diamond-shaped sign with a deer silhouette"
"A small boy trying to fly a small kite"
"There is a stop sign outside of a window"
"A portrait of a statue of a pharaoh wearing steampunk glasses, white t-shirt and leather jacket. dslr photograph"
"An old man lifts a barbell above his head"
"A baby sitting on a female's lap staring into the camera"
"A white bear in glasses, wearing tuxedo, glowing hat, and with cigare at the British queen reception"
"A tropical bird"
"A photo of a ram and a polar bear walking in London."