Project 5: FUN WITH DIFFUSION MODELS!

Part A: The Power of Diffusion Models!

Setup

First, three images were sampled using seed 45, with the text prompts given in the captions. They were generated first with 20 inference steps and then with 50. In my opinion the image quality was not noticeably affected; perhaps the second image, of a man wearing a hat, is slightly more realistic with 50 steps.

Sampling loops

In this part, sampling loops were written using the pre-trained DeepFloyd denoisers.

Implementing the Forward process

The forward process adds noise to a clean image x_0 to produce a noisy image x_t at timestep t:

$$x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1 - \bar\alpha_t}\epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

The amount of noise increases with t, as controlled by \( \bar\alpha_t \). Results are shown for t = [250, 500, 750]: the original image is displayed below, followed by the noised images.
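As a sketch, the forward process takes only a few lines; here alphas_cumprod is assumed to hold the scheduler's precomputed cumulative products \( \bar\alpha_t \):

    import torch

    def forward(im, t, alphas_cumprod):
        # Noise a clean image im to timestep t.
        abar = alphas_cumprod[t]
        eps = torch.randn_like(im)  # epsilon ~ N(0, I)
        return torch.sqrt(abar) * im + torch.sqrt(1 - abar) * eps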

Classical Denoising

Classical denoising was implemented by applying Gaussian blur to the noisy images. The images were processed for t = [250, 500, 750], and the results are displayed below.
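For reference, a minimal version using torchvision; the kernel size and sigma below are illustrative, not necessarily the exact values used:

    import torchvision.transforms.functional as TF

    # Classical "denoising": blur away the high-frequency noise
    # (at the cost of high-frequency image detail).
    blurred = TF.gaussian_blur(noisy_im, kernel_size=5, sigma=2.0)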

One-Step Denoising

I used a pretrained diffusion model to denoise the images. The denoiser, stage_1.unet, is a UNet trained on a large dataset of image pairs. It estimates the Gaussian noise in the noisy image; removing that noise yields an approximation of the original image.
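In outline (a sketch; DeepFloyd's stage-1 UNet returns six channels, of which the first three are the noise estimate):

    with torch.no_grad():
        # Predict the noise in x_t, conditioned on the prompt embedding.
        eps_hat = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample[:, :3]

    # Invert the forward equation to estimate the clean image.
    abar = alphas_cumprod[t]
    x0_hat = (x_t - torch.sqrt(1 - abar) * eps_hat) / torch.sqrt(abar)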

Iterative Denoising

I created strided_timesteps, a list of monotonically decreasing timesteps starting at 990, with a stride of 30, and ending at 0. The timesteps were initialized using the function stage_1.scheduler.set_timesteps(timesteps=strided_timesteps).

The noisy image was displayed every 5th iteration of the denoising process, showing a gradual reduction in noise. The final predicted clean image was obtained using iterative denoising and displayed alongside comparisons to the results of single-step denoising, which appeared much worse, and Gaussian blurring from part 1.2.
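Concretely, each step moves from timestep t to the next, less-noisy timestep t' with the update

$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\beta_t}{1 - \bar\alpha_t}x_0 + \frac{\sqrt{\alpha_t}(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}x_t + v_\sigma,$$

where \( \alpha_t = \bar\alpha_t / \bar\alpha_{t'} \), \( \beta_t = 1 - \alpha_t \), \( x_0 \) is the current estimate of the clean image, and \( v_\sigma \) is the predicted variance noise.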

Diffusion Model Sampling

Images were generated from scratch using a diffusion model by denoising random noise. Random noise tensors were created with torch.randn and processed through the iterative_denoise function. The prompt embedding "a high quality photo" guided the generation, producing five different images.
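In code this amounts to something like the following (i_start is the name used here for the starting index; 0 means denoising begins at the noisiest timestep):

    # Generate from scratch: pure noise in, image out.
    x = torch.randn(1, 3, 64, 64).to(device)
    sample = iterative_denoise(x, i_start=0)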

Classifier-Free Guidance (CFG)

Classifier-Free Guidance (CFG) was used to improve image quality by combining conditional and unconditional noise estimates with a scaling factor, γ, set to 7. The iterative_denoise_cfg function was implemented to denoise images using both a prompt embedding for "a high quality photo" and an empty prompt embedding for unconditional guidance. At each timestep, the UNet model was run twice to compute the conditional and unconditional noise estimates, which were combined using the CFG formula. This process was repeated five times, producing images with significantly enhanced quality compared to the previous part.
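The combined estimate follows

$$\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u),$$

where \( \epsilon_u \) and \( \epsilon_c \) are the unconditional and conditional noise estimates; with \( \gamma > 1 \) the estimate is pushed beyond the conditional one, which empirically improves image quality.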

Image-to-image Translation

I applied the iterative_denoise_cfg function to the test image, adding noise and then denoising it from starting indices [1, 3, 5, 7, 10, 20]. The denoising process was guided by the prompt "a high quality photo" with a CFG scale of 7. To explore the method further, I repeated this process on two additional test images, demonstrating its versatility. Notably, the model tended to generate images featuring women, even when the original content was unrelated.

Editing Hand-Drawn and Web Images

The same procedure was applied to non-realistic images: the first is from the internet, and the other two were drawn by me.

Inpainting

This function uses a binary mask to keep the original image in unmasked areas while filling in masked regions with new content through iterative denoising. I tested it by inpainting the top of the Campanile in the test image and also worked on two custom images with different masks.
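After every denoising step, the pixels outside the masked region are forced back to the (appropriately noised) original image:

$$x_t \leftarrow \mathbf{m}\,x_t + (1 - \mathbf{m})\,\text{forward}(x_{orig}, t),$$

where \( \mathbf{m} \) is the binary mask, so new content is only ever generated where \( \mathbf{m} = 1 \).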

Text-Conditional Image-to-image Translation

For this step, I used the text prompt "a rocket ship" to shape how the image changed during denoising. By starting at various noise levels [1, 3, 5, 7, 10, 20], the process mixed rocket-themed details into the original image. This was also done for two other images.

Visual Anagrams

In this part the UNet model was run twice at each step: once to denoise the image normally, and once to denoise it flipped upside down with a second prompt. This allowed me to create an image that shows one scene when viewed right side up and a different scene when flipped. Using the prompts in the captions below, visual anagrams were created.
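Concretely, with prompts \( p_1 \) and \( p_2 \), the two noise estimates are averaged:

$$\epsilon_1 = \text{UNet}(x_t, t, p_1), \quad \epsilon_2 = \text{flip}(\text{UNet}(\text{flip}(x_t), t, p_2)), \quad \epsilon = \frac{\epsilon_1 + \epsilon_2}{2}.$$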

Hybrid Images

To create hybrid images, I adapted the previous approach: instead of averaging the normal and flipped noise estimates, I combined the low-pass-filtered noise estimate from one prompt with the high-pass-filtered noise estimate from another. This makes the low-frequency content visible from a distance and the high-frequency details visible up close. Using the following prompt pairs yielded three different images: "a lithograph of waterfalls" with "a lithograph of a skull"; "a lithograph of a skull" with "a photo of a dog"; and "a rocket ship" with "an oil painting of people around a campfire".
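Concretely, the composite noise estimate is

$$\epsilon = f_{\text{lowpass}}(\epsilon_1) + f_{\text{highpass}}(\epsilon_2),$$

where \( \epsilon_1 \) and \( \epsilon_2 \) are the noise estimates for the two prompts, and the filters are a Gaussian blur and its complement.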

Part B: Diffusion Models from Scratch!

Training a Single-Step Denoising UNet

First, the unconditional UNet was implemented in accordance with the specifications presented below.

Implementing the UNet

Using the UNet to Train a Denoiser

To train the model on MNIST, I first added varying levels of noise to the images for the model to learn from. The noise levels used were \( \sigma = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0] \). To illustrate the effect of the noise, I plotted five digits at each noise level. The results are shown below.
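The noising operation itself is just:

    import torch

    def add_noise(x, sigma):
        # z = x + sigma * eps, with eps ~ N(0, I)
        return x + sigma * torch.randn_like(x)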

Training

For training, I used image-class pairs with a batch size of 256, 5 epochs, \( \sigma = 0.5 \), and the Adam optimizer set to a learning rate of 1e-3. The training loss is shown below.

The denoising results are shown below, first after one epoch and then after five.
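For reference, a minimal sketch of the training loop described above (UnconditionalUNet and train_loader are stand-ins for the project's UNet and a standard MNIST DataLoader):

    import torch
    import torch.nn.functional as F

    model = UnconditionalUNet(in_channels=1, num_hiddens=128).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    sigma = 0.5

    for epoch in range(5):
        for x, _ in train_loader:  # class labels are unused here
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)  # noisy input
            loss = F.mse_loss(model(z), x)       # L2 denoising loss
            opt.zero_grad()
            loss.backward()
            opt.step()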

Out-of-Distribution Testing

The model was trained on digits noised with \( \sigma = 0.5 \). Below are the results when the denoiser is applied to digits with other noise levels.

Training a Diffusion Model

In this section, a UNet model was trained to iteratively denoise images by predicting the noise added during the diffusion process. Instead of mapping a noisy image \( z \) back to a clean image \( x \), the UNet predicted the noise \( \epsilon \), as given by:

\[ L = \mathbb{E}_{\epsilon, z} \| \epsilon_\theta(z) - \epsilon \|^2, \]

where \( z = x + \sigma \epsilon \), and \( \epsilon \sim \mathcal{N}(0, I) \).

Adding Time Conditioning to UNet

Time conditioning was introduced by embedding a normalized timestep \( t \) into the UNet using fully connected (FC) blocks. It was implemented in accordance with the schematic below.
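A sketch of such a block, and of how the embedded timestep enters the network (the exact layer sizes follow the schematic and are approximated here):

    import torch.nn as nn

    class FCBlock(nn.Module):
        # Embeds a (normalized) scalar conditioning signal into a feature vector.
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_ch, out_ch),
                nn.GELU(),
                nn.Linear(out_ch, out_ch),
            )

        def forward(self, t):
            return self.net(t)

    # Inside the UNet's forward pass, the embedding is broadcast over the
    # spatial dimensions and added to intermediate feature maps, e.g.:
    # unflat = self.unflatten(flat) + self.t_fc(t)[:, :, None, None]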

Training the UNet

The UNet was trained with the algorithm presented below.
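In outline, one training step looks roughly as follows (a sketch: abar is the precomputed \( \bar\alpha \) schedule, T the number of timesteps, and model, opt, and train_loader are as in the earlier sketch):

    # One training step for the time-conditioned UNet.
    for x, _ in train_loader:
        x = x.to(device)
        t = torch.randint(0, T, (x.shape[0],), device=device)  # random timesteps
        eps = torch.randn_like(x)
        ab = abar[t][:, None, None, None]
        x_t = torch.sqrt(ab) * x + torch.sqrt(1 - ab) * eps    # forward process
        loss = F.mse_loss(model(x_t, t.float() / T), eps)      # predict the noise
        opt.zero_grad()
        loss.backward()
        opt.step()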

The training loss curve for the UNet is displayed below.

Sampling from the UNet

The results from the time-conditioned UNet are displayed below, after both 5 and 20 epochs.
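For completeness, sampling follows the standard DDPM loop (a sketch; alpha, beta, and abar are the usual schedules, and the names are illustrative):

    # Start from pure noise and walk back through the timesteps.
    x = torch.randn(n, 1, 28, 28, device=device)
    for t in range(T - 1, -1, -1):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        tt = torch.full((n,), t, device=device)
        eps = model(x, tt.float() / T)
        x = (x - (1 - alpha[t]) / torch.sqrt(1 - abar[t]) * eps) / torch.sqrt(alpha[t]) \
            + torch.sqrt(beta[t]) * z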

Adding Class-Conditioning to UNet

Lastly, the class-conditioned UNet was implemented, conditioning the model on the class of the digit to be generated. This was achieved by adding two fully connected blocks to the UNet, using a one-hot encoded class label \( c \) as input. The training algorithm is presented below.
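A sketch of the conditioning step inside training; the label is zeroed out a fraction of the time (0.1 in this sketch) so the model also learns an unconditional estimate, which enables classifier-free guidance at sampling time:

    import torch.nn.functional as F

    # One-hot encode the digit class and randomly drop the conditioning.
    c = F.one_hot(labels, num_classes=10).float().to(device)
    drop = torch.rand(c.shape[0], device=device) < 0.1
    c[drop] = 0.0
    eps_pred = model(x_t, t.float() / T, c)  # class- and time-conditioned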

The training loss curve for this UNet is displayed below.

Sampling from the Class-Conditioned UNet

The results from this last UNet are presented below, after both 5 and 20 epochs.
