DreamWalk: Style Space Exploration using Diffusion Guidance

Michelle Shu, Charles Herrmann, Richard Strong Bowen, Forrester Cole, Ramin Zabih
Cornell Tech, Geomagical Labs, Google Research

Abstract

Text-conditioned diffusion models can generate impressive images, but fall short when it comes to fine-grained control. Unlike direct-editing tools like Photoshop, text conditioned models require the artist to perform “prompt engineering,” constructing special text sentences to control the style or amount of a particular subject present in the output image. Our goal is to provide fine-grained control over the style and substance specified by the prompt, for example to adjust the intensity of styles in different regions of the image. Our approach is to decompose the text prompt into conceptual elements, and apply a separate guidance term for each element in a single diffusion process. We introduce guidance scale functions to control when in the diffusion process and where in the image to intervene. Since the method is based solely on adjusting diffusion guidance, it does not require fine-tuning or manipulating the internal layers of the diffusion model's neural network, and can be used in conjunction with LoRA- or DreamBooth-trained models.
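The combined update can be sketched in a few lines. The function below is a minimal illustration of one denoising step with a separate classifier-free guidance term per prompt element, assuming an epsilon-prediction denoiser eps_model(x_t, t, cond); the names and signatures here are ours, not the paper's implementation.

def guided_epsilon(eps_model, x_t, t, uncond, conds, scale_fns):
    # conds: one conditioning embedding per conceptual prompt element.
    # scale_fns: one function s_i(t) per element, returning a scalar (or a
    # per-pixel map) that controls when and where that element is applied.
    eps_uncond = eps_model(x_t, t, uncond)
    eps = eps_uncond
    for cond, scale_fn in zip(conds, scale_fns):
        eps_cond = eps_model(x_t, t, cond)
        # Each element contributes its own guidance direction.
        eps = eps + scale_fn(t) * (eps_cond - eps_uncond)
    return eps

With a single condition and a constant scale function, this reduces to standard classifier-free guidance.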

Video Presentation

Single Style Application

DreamWalk allows fine-grained control of style in text-to-image generation. We start with a base generated image (left) and apply style guidance terms with guidance scale functions to create different stylized images. Our style application works on any diffusion model, regardless of architecture.
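As a hedged usage sketch of the guided_epsilon function above, one natural split is a content element plus a style element, each with its own scale function. Here embed, eps_model, x_t, and t stand in for a text encoder, denoiser, current latent, and timestep, and the style phrase is only an example.

content = embed("a dog running on a beach")      # content element
style = embed("in the style of a watercolor")    # style element (example)
eps = guided_epsilon(
    eps_model, x_t, t,
    uncond=embed(""),
    conds=[content, style],
    scale_fns=[lambda t: 7.5,    # typical classifier-free guidance weight
               lambda t: 3.0],   # style intensity knob
)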

Image

Changing the magnitude of the guidance scale has different effects depending on the temporal component of the guidance scale function. Increasing the magnitude of a guidance scale without time dependence (bottom row) can change the layout of the image. Our default temporal dependence preserves the rough layout of the base generation while still increasing the amount of style added (top row).
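One way to realize such time dependence is a schedule that stays at zero during the early, high-noise steps (while the layout forms) and ramps up afterwards. The ramp below is an assumed illustration, not the paper's exact function; timesteps are taken to run from T (pure noise) down to 0.

def layout_preserving_scale(t, T=1000, peak=4.0, delay=0.3):
    # Fraction of the denoising trajectory completed so far.
    progress = 1.0 - t / T
    if progress < delay:
        return 0.0  # no style guidance while the layout is forming
    # Linear ramp from 0 up to peak over the remaining steps.
    return peak * (progress - delay) / (1.0 - delay)

Raising peak then increases style intensity without disturbing the early steps that fix the layout.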

Image

"A peaceful riverside village with charming old cottages".

Image

"a dog running on a beach".

Image

"A serene garden with a serene pond and arched bridges".

Interpolation between Styles

Our method also allows users to interpolate between multiple styles. We show smooth interpolation between two different styles for models with very different architectures, SD1.5 and SDXL.
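Under the multi-term guidance sketched earlier, interpolation can be expressed by splitting a total style weight between two style conditions; sweeping alpha from 0 to 1 then walks smoothly from one style to the other. The blending rule below is an assumption for illustration, not the paper's exact parameterization.

def interpolated_scale_fns(alpha, total=4.0):
    # alpha = 0 -> pure first style; alpha = 1 -> pure second style.
    first_fn = lambda t: (1.0 - alpha) * total
    second_fn = lambda t: alpha * total
    return [first_fn, second_fn]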

Image

"Campsite with a fire at night (SDXL: Monet -> Picasso)"

Image

"A dog running on a beach (SDXL: Monet -> Hokusai)"

Spatially Varying Guidance Function

Style guidance can be applied to different regions of the base image. Users can define their own masks manually or derive them from computed signals such as bounding boxes.
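Spatial control fits the same interface if a guidance scale function returns a per-pixel map instead of a scalar, confining the style term to a masked region. The sketch below assumes a user-supplied mask at the latent resolution; the interface is ours, not the paper's.

import torch

def spatial_style_scale(mask: torch.Tensor, strength: float = 4.0):
    # mask: (1, 1, H, W) tensor with values in [0, 1], e.g. rasterized
    # from a hand-drawn region or a bounding box.
    def scale_fn(t):
        return strength * mask  # broadcasts against the noise prediction
    return scale_fn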

Image

"Fish swimming down a stream (SDXL: Picasso)"

Image

"Dog running on a beach (SDXL: Monet)"

Image

"Bird sitting on a tree branch (SDXL: Hokusai)"

Image

"Campsite with a fire at night (SDXL: Hokusai)"

Controllable Subject vs. Prompt Emphasis

Our formulation lets users explore the trade-off between adherence to a DreamBooth subject and adherence to the text prompt.
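As a rough illustration in the same framework, this can be read as two guidance terms, one conditioned on the DreamBooth prompt containing the subject token and one on the plain text prompt, with their weights traded off by a slider. The split below is assumed, not the paper's parameterization.

def subject_vs_prompt_scale_fns(beta, total=7.5):
    # beta = 1 -> emphasize the DreamBooth subject; beta = 0 -> the prompt.
    subject_fn = lambda t: beta * total
    prompt_fn = lambda t: (1.0 - beta) * total
    return [subject_fn, prompt_fn]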

Image

DreamWalk on Real Images

DreamWalk also works on real-world images. We apply DDIM inversion to the original image (left) to obtain a starting latent for the diffusion process, then apply style guidance terms with guidance scale functions to create different stylized images.
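For reference, the inversion step can be sketched as the standard deterministic DDIM update run forward in time, from the clean latent toward noise. The code below assumes an epsilon-prediction model, a cumulative alpha schedule, and conditioning on a source-image prompt; the exact setup used in the paper may differ.

import torch

@torch.no_grad()
def ddim_invert(eps_model, x0, cond, alphas_bar, timesteps):
    # timesteps: increasing, e.g. [0, 20, ..., T]; alphas_bar[t] is the
    # cumulative product of (1 - beta_s) for s <= t.
    x = x0
    for t_prev, t in zip(timesteps[:-1], timesteps[1:]):
        a_prev, a_t = alphas_bar[t_prev], alphas_bar[t]
        eps = eps_model(x, t_prev, cond)
        # Predict the clean latent at t_prev, then step forward to t
        # along the deterministic DDIM trajectory.
        x0_pred = (x - (1 - a_prev).sqrt() * eps) / a_prev.sqrt()
        x = a_t.sqrt() * x0_pred + (1 - a_t).sqrt() * eps
    return x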

Image

"a black cat walking in a park"

BibTeX

@misc{shu2024dreamwalk,
  title={DreamWalk: Style Space Exploration using Diffusion Guidance},
  author={Michelle Shu and Charles Herrmann and Richard Strong Bowen and Forrester Cole and Ramin Zabih},
  year={2024},
  eprint={2404.03145},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}