Diffusion Model

Generative model that gradually adds noise to data and learns to reverse the process for high-quality data generation.

What is a Diffusion Model?

A diffusion model is a generative model that works by gradually adding noise to data through a forward diffusion process, then learning to reverse this process to generate new data samples. The model learns to denoise data at various noise levels, enabling it to generate high-quality samples from random noise.

Key Characteristics

  • Gradual Noise Addition: Forward process adds noise progressively
  • Denoising Learning: Reverse process learns to remove noise
  • High-Quality Generation: Produces photorealistic images
  • Stable Training: More stable than GANs
  • Probabilistic Framework: Based on stochastic processes
  • Multi-Step Generation: Requires multiple denoising steps
  • Score-Based: Learns score functions of data distributions
  • Flexible Architecture: Can use various neural network architectures

Architecture Overview

graph LR
    A[Data x₀] --> B[Forward Process]
    B --> C[Noisy Data xₜ]
    C --> D[Reverse Process]
    D --> E[Generated Data x₀]
    F[Random Noise] --> D
    B -->|t steps| C
    D -->|t steps| E

Mathematical Representation

The forward diffusion process is defined as:

q(xₜ|xₜ₋₁) = N(xₜ; √(1-βₜ)xₜ₋₁, βₜI)

The reverse process is learned by a neural network:

p_θ(xₜ₋₁|xₜ) = N(xₜ₋₁; μ_θ(xₜ, t), Σ_θ(xₜ, t))

Where:

  • x₀ is the original data
  • xₜ is the noisy data at timestep t
  • βₜ is the noise schedule (the variance of the noise added at step t)
  • μ_θ and Σ_θ are learned by the neural network
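
Iterating the forward process yields a closed-form expression for noising x₀ directly to any timestep, which is what implementations use in practice:

q(xₜ|x₀) = N(xₜ; √(ᾱₜ)x₀, (1-ᾱₜ)I),  where ᾱₜ = ∏ₛ₌₁ᵗ (1-βₛ)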

Core Components

Forward Diffusion Process

  • Gradually adds Gaussian noise to data
  • Defined by a noise schedule β₁, ..., βₜ
  • Typically uses 100-1000 timesteps
  • Transforms data distribution to standard normal
# Forward diffusion process
import numpy as np

def forward_diffusion(x0, t, noise_schedule):
    """Add noise to data according to the forward diffusion process (closed form)"""
    alpha_bar = np.prod(1 - noise_schedule[:t])               # cumulative product of (1 - beta) up to step t
    sqrt_alpha_bar = np.sqrt(alpha_bar)
    sqrt_one_minus_alpha_bar = np.sqrt(1 - alpha_bar)
    noise = np.random.normal(0, 1, x0.shape)
    xt = sqrt_alpha_bar * x0 + sqrt_one_minus_alpha_bar * noise
    return xt
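
As a quick illustration, the helper above can be applied at a few timesteps; the schedule, shapes, and values below are placeholders rather than settings from any particular model.

# Illustrative usage of forward_diffusion (placeholder data and schedule)
betas = np.linspace(1e-4, 0.02, 1000)        # linear noise schedule
x0 = np.random.rand(8, 32, 32, 3) * 2 - 1    # placeholder batch scaled to [-1, 1]
for t in (10, 250, 999):
    xt = forward_diffusion(x0, t, betas)     # progressively noisier as t grows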

Reverse Diffusion Process

  • Learns to reverse the forward process
  • Uses neural network to predict noise
  • Typically implemented with U-Net architecture
  • Generates data from random noise
# Reverse diffusion process (conceptual)
class DiffusionModel:
    def __init__(self, model, timesteps):
        self.model = model  # Neural network (typically U-Net)
        self.timesteps = timesteps
        self.betas = self._cosine_noise_schedule()

    def _cosine_noise_schedule(self):
        """Cosine noise schedule"""
        steps = self.timesteps + 1
        s = 0.008
        t = np.linspace(0, self.timesteps, steps)
        alpha_bar = np.cos((t / self.timesteps + s) / (1 + s) * np.pi / 2) ** 2
        betas = []
        for i in range(1, steps):  # one beta per timestep
            betas.append(min(1 - alpha_bar[i] / alpha_bar[i-1], 0.999))
        return np.array(betas)

    def sample(self, n_samples):
        """Generate samples using reverse diffusion"""
        # Start with random noise
        x = np.random.normal(0, 1, (n_samples, *self.model.input_shape[1:]))

        # Reverse diffusion process
        for t in range(self.timesteps-1, -1, -1):
            z = np.random.normal(0, 1, x.shape) if t > 0 else np.zeros_like(x)
            alpha = 1 - self.betas[t]
            alpha_bar = np.prod(1 - self.betas[:t+1])
            sigma = np.sqrt(self.betas[t])

            # Predict noise
            pred_noise = self.model.predict([x, np.ones(n_samples) * t])

            # Update x
            x = (1 / np.sqrt(alpha)) * (x - ((1 - alpha) / np.sqrt(1 - alpha_bar)) * pred_noise) + sigma * z

        return x

Noise Schedule

  • Controls how much noise is added at each step
  • Common schedules:
    • Linear: βₜ increases linearly
    • Cosine: βₜ follows cosine curve
    • Sigmoid: βₜ follows a sigmoid curve (a sketch is given after the snippet below)
# Different noise schedules
def linear_noise_schedule(timesteps):
    """Linear noise schedule"""
    return np.linspace(1e-4, 0.02, timesteps)

def cosine_noise_schedule(timesteps):
    """Cosine noise schedule"""
    steps = timesteps + 1
    s = 0.008
    t = np.linspace(0, timesteps, steps)
    alpha_bar = np.cos((t / timesteps + s) / (1 + s) * np.pi / 2) ** 2
    betas = []
    for i in range(1, steps):  # one beta per timestep
        betas.append(min(1 - alpha_bar[i] / alpha_bar[i-1], 0.999))
    return np.array(betas)
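
The sigmoid schedule listed above is not shown in the snippet; a minimal sketch could look like the following, where the ramp range and beta bounds are illustrative choices rather than standard values.

def sigmoid_noise_schedule(timesteps, beta_min=1e-4, beta_max=0.02):
    """Sigmoid noise schedule (illustrative parameterization)"""
    t = np.linspace(-6, 6, timesteps)                 # ramp location/steepness chosen for illustration
    sig = 1 / (1 + np.exp(-t))                        # sigmoid rising from ~0 to ~1
    return beta_min + (beta_max - beta_min) * sig     # rescale into [beta_min, beta_max]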

Diffusion Model Variants

Denoising Diffusion Probabilistic Model (DDPM)

  • Original diffusion model formulation
  • Uses fixed variance schedule
  • Simple and effective
  • Requires many sampling steps
# DDPM implementation (conceptual)
class DDPM:
    def __init__(self, model, timesteps):
        self.model = model
        self.timesteps = timesteps
        self.betas = linear_noise_schedule(timesteps)

    def train_step(self, x0):
        """Training step for DDPM"""
        # Sample random timestep
        t = np.random.randint(0, self.timesteps, (x0.shape[0],))

        # Sample noise
        noise = np.random.normal(0, 1, x0.shape)

        # Compute noisy data via the closed-form forward process
        alpha_bar = np.array([np.prod(1 - self.betas[:t_i + 1]) for t_i in t])
        sqrt_alpha_bar = np.sqrt(alpha_bar)
        sqrt_one_minus_alpha_bar = np.sqrt(1 - alpha_bar)
        xt = sqrt_alpha_bar[:, None, None, None] * x0 + sqrt_one_minus_alpha_bar[:, None, None, None] * noise

        # Train model to predict noise
        loss = self.model.train_on_batch([xt, t], noise)
        return loss

Latent Diffusion Model

  • Operates in latent space instead of pixel space
  • More efficient computation
  • Used in Stable Diffusion
  • Combines diffusion with autoencoders
# Latent diffusion model (conceptual)
class LatentDiffusionModel:
    def __init__(self, autoencoder, diffusion_model):
        self.autoencoder = autoencoder  # Encoder + Decoder
        self.diffusion_model = diffusion_model  # Diffusion in latent space

    def train(self, images):
        """Train latent diffusion model"""
        # Encode images to latent space
        latents = self.autoencoder.encoder.predict(images)

        # Train diffusion model on latents
        loss = self.diffusion_model.train_step(latents)
        return loss

    def generate(self, n_samples):
        """Generate images using latent diffusion"""
        # Generate latents
        latents = self.diffusion_model.sample(n_samples)

        # Decode latents to images
        images = self.autoencoder.decoder.predict(latents)
        return images

Training Process

Forward Process

  1. Start with clean data x₀
  2. For each timestep t:
    • Sample noise ε ~ N(0, I)
    • Compute xₜ = √(αₜ)xₜ₋₁ + √(1-αₜ)ε, where αₜ = 1 - βₜ
  3. After T steps, x_T is approximately standard normal noise

Reverse Process

  1. Start with random noise x_T ~ N(0, I)
  2. For each timestep t from T to 1:
    • Predict noise ε_θ(xₜ, t)
    • Compute xₜ₋₁ using predicted noise
  3. Repeat until x₀ is generated
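
Concretely, step 2 applies the DDPM update used in the sample() code above, where one common choice is σₜ = √(βₜ):

xₜ₋₁ = (1/√(αₜ)) (xₜ - ((1-αₜ)/√(1-ᾱₜ)) ε_θ(xₜ, t)) + σₜ z,  with z ~ N(0, I)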

Loss Function

The diffusion model is trained to minimize:

L = E[||ε - ε_θ(√(ᾱₜ)x₀ + √(1-ᾱₜ)ε, t)||²]

Where:

  • ε is the actual noise
  • ε_θ is the predicted noise
  • x₀ is the original data
  • ᾱₜ = ∏ₛ₌₁ᵗ αₛ is the cumulative product of αₛ = 1 - βₛ, derived from the noise schedule
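
Putting the pieces together, a minimal sketch of this loss in the style of the earlier snippets might look as follows; `model` is assumed to be a noise-prediction network that takes the noisy batch and the sampled timesteps.

# Simplified diffusion training loss (sketch; `model` is an assumed noise-prediction network)
def diffusion_loss(model, x0, betas):
    t = np.random.randint(0, len(betas), x0.shape[0])                  # random timestep per sample
    alpha_bar = np.array([np.prod(1 - betas[:ti + 1]) for ti in t])
    noise = np.random.normal(0, 1, x0.shape)
    xt = (np.sqrt(alpha_bar)[:, None, None, None] * x0
          + np.sqrt(1 - alpha_bar)[:, None, None, None] * noise)       # closed-form forward process
    pred_noise = model.predict([xt, t])
    return np.mean((noise - pred_noise) ** 2)                          # MSE between true and predicted noise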

Diffusion Models vs Other Generative Models

Feature            | Diffusion Models               | GANs                        | VAEs
Training Stability | Very stable                    | Unstable                    | Stable
Generation Quality | High                           | Very high                   | Medium
Mode Coverage      | Good                           | Poor (mode collapse)        | Good
Sampling Speed     | Slow (many steps)              | Fast                        | Fast
Architecture       | Flexible (typically U-Net)     | Generator + Discriminator   | Encoder + Decoder
Theoretical Basis  | Strong (stochastic processes)  | Weak                        | Strong (variational inference)
Latent Space       | Implicit                       | Implicit                    | Explicit
Applications       | Images, audio, video           | Images, some other domains  | Various

Applications

Image Generation

# Image generation with diffusion model
import matplotlib.pyplot as plt

# Create and train diffusion model
# (unet_model and train_images are placeholders; `train` is assumed to wrap a
#  DDPM-style train_step loop such as the one shown earlier)
diffusion_model = DiffusionModel(unet_model, timesteps=1000)
diffusion_model.train(train_images, epochs=100)

# Generate images
n = 10  # Number of images to generate
generated_images = diffusion_model.sample(n)

# Display generated images
plt.figure(figsize=(20, 4))
for i in range(n):
    ax = plt.subplot(2, n//2, i + 1)
    plt.imshow(generated_images[i])
    plt.axis('off')
plt.suptitle('Generated Images')
plt.show()

Image Inpainting

# Image inpainting with diffusion model
def inpaint_image(model, image, mask):
    """Inpaint missing regions of an image (conceptual sketch; mask == 1 marks the region to fill)"""
    # Encode image to latent space
    latent = model.autoencoder.encoder.predict(image[np.newaxis, ...])

    # Noise the latent up to an intermediate timestep with the forward process
    noisy_latent = forward_diffusion(latent, 500, model.diffusion_model.betas)

    # Reverse diffusion with masked guidance
    for t in range(500, 0, -1):
        # Predict noise
        pred_noise = model.diffusion_model.model.predict([noisy_latent, np.array([t])])

        # Only update masked regions
        update = (1 / np.sqrt(1 - model.diffusion_model.betas[t])) * (
            noisy_latent - (model.diffusion_model.betas[t] / np.sqrt(1 - np.prod(1 - model.diffusion_model.betas[:t]))) * pred_noise)

        # Apply mask
        update = update * mask + noisy_latent * (1 - mask)

        # Add noise for the next step (skip on the final step)
        if t > 1:
            update = update + np.sqrt(model.diffusion_model.betas[t]) * np.random.normal(0, 1, update.shape)

        noisy_latent = update

    # Decode to image space
    inpainted = model.autoencoder.decoder.predict(noisy_latent)
    return inpainted[0]

Text-to-Image Generation

# Text-to-image generation with diffusion model (conceptual)
class TextToImageDiffusion:
    def __init__(self, diffusion_model, text_encoder):
        self.diffusion_model = diffusion_model
        self.text_encoder = text_encoder

    def generate(self, text_prompt, n_samples=1):
        """Generate images from text prompt"""
        # Encode text
        text_embedding = self.text_encoder.encode(text_prompt)

        # Generate latents conditioned on text
        latents = self.diffusion_model.sample(n_samples, text_embedding)

        # Decode to images
        images = self.diffusion_model.autoencoder.decoder.predict(latents)
        return images

Super-Resolution

# Super-resolution with diffusion model (conceptual)
from skimage.transform import resize  # any image-resizing utility works here

def super_resolve(model, low_res_image, scale_factor=4):
    """Upscale an image using a diffusion model guided by the low-resolution input"""
    # Resize the low-res image to the target dimensions as an initial guide
    target_shape = (low_res_image.shape[0] * scale_factor,
                    low_res_image.shape[1] * scale_factor,
                    low_res_image.shape[2])
    resized = resize(low_res_image, target_shape)

    # Noise the resized image up to an intermediate timestep with the forward process
    noisy = forward_diffusion(resized[np.newaxis, ...], 500, model.diffusion_model.betas)

    # Reverse diffusion with low-res guidance
    for t in range(500, 0, -1):
        # Predict noise
        pred_noise = model.diffusion_model.model.predict([noisy, np.array([t])])

        # Update with guidance
        update = (1 / np.sqrt(1 - model.diffusion_model.betas[t])) * (
            noisy - (model.diffusion_model.betas[t] / np.sqrt(1 - np.prod(1 - model.diffusion_model.betas[:t]))) * pred_noise)

        # Crude guidance term pulling the sample toward the low-res input (0.1 is an illustrative weight)
        update = update + 0.1 * (resized[np.newaxis, ...] - model.autoencoder.encoder.predict(update))

        # Add noise for the next step (skip on the final step)
        if t > 1:
            update = update + np.sqrt(model.diffusion_model.betas[t]) * np.random.normal(0, 1, update.shape)

        noisy = update

    # Decode to image
    high_res = model.autoencoder.decoder.predict(noisy)
    return high_res[0]

Research Directions

Key Papers

  1. "Denoising Diffusion Probabilistic Models" (Ho et al., 2020)
    • Introduced DDPM framework
    • Demonstrated high-quality image generation
    • Foundation for modern diffusion models
  2. "Improved Denoising Diffusion Probabilistic Models" (Nichol & Dhariwal, 2021)
    • Introduced improvements to DDPM
    • Demonstrated better sampling efficiency
    • Foundation for many subsequent works
  3. "Diffusion Models Beat GANs on Image Synthesis" (Dhariwal & Nichol, 2021)
    • Demonstrated superior performance to GANs
    • Introduced classifier guidance
    • Foundation for high-quality diffusion models
  4. "High-Resolution Image Synthesis with Latent Diffusion Models" (Rombach et al., 2022)
    • Introduced latent diffusion models
    • Demonstrated efficient high-resolution generation
    • Foundation for Stable Diffusion

Emerging Research

  • Faster Sampling: Methods to reduce sampling steps
  • Higher Resolution: Generating very high-resolution images
  • 3D Generation: Diffusion models for 3D objects
  • Video Generation: Generating realistic videos
  • Audio Generation: High-quality audio synthesis
  • Multimodal Models: Combining multiple modalities
  • Efficient Architectures: More efficient neural networks
  • Theoretical Foundations: Better understanding of diffusion models
  • Conditional Generation: Better control over generation
  • Few-Shot Learning: Learning from few examples
  • Neuromorphic Diffusion: Brain-inspired architectures
  • Quantum Diffusion: Diffusion models for quantum computing
  • Explainable Diffusion: More interpretable models

Best Practices

Implementation Guidelines

Aspect            | Recommendation                   | Notes
Timesteps         | Start with 1000 steps            | Can reduce with advanced techniques
Noise Schedule    | Use cosine schedule              | Better than linear for many cases
Architecture      | Use U-Net with attention         | Standard for diffusion models
Learning Rate     | Start with 1e-4 to 1e-3          | Use learning rate scheduling
Batch Size        | 32-128 depending on GPU memory   | Larger batches improve stability
Normalization     | Normalize data to [-1, 1]        | Works well with tanh activation
EMA               | Use exponential moving average   | Improves model stability
Mixed Precision   | Use mixed precision training     | Faster training with less memory
Gradient Clipping | Clip gradients to 1.0            | Prevents exploding gradients
Monitoring        | Track loss and sample quality    | Loss should decrease over time
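
The EMA recommendation above can be implemented as a shadow copy of the weights that is blended toward the current weights after every update; the 0.999 decay below is a typical but illustrative value.

# Exponential moving average of model weights (sketch; decay is an illustrative value)
class EMA:
    def __init__(self, weights, decay=0.999):
        self.decay = decay
        self.shadow = [np.array(w, dtype=float) for w in weights]   # shadow copy used for sampling/evaluation

    def update(self, weights):
        for s, w in zip(self.shadow, weights):
            s *= self.decay
            s += (1 - self.decay) * np.asarray(w)                    # blend toward the current weights in place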

Common Pitfalls and Solutions

Pitfall                 | Solution                               | Example
Slow Sampling           | Use DDIM or other fast samplers        | Reduce from 1000 to 50 steps
Poor Generation Quality | Increase model capacity, train longer  | Add more layers to U-Net
Mode Collapse           | Use better noise schedule              | Switch to cosine schedule
Training Instability    | Use gradient clipping, EMA             | Clip gradients to 1.0
Blurry Outputs          | Increase model capacity, use guidance  | Add attention layers
Slow Convergence        | Adjust learning rate, use warmup       | Use learning rate warmup
Memory Issues           | Use gradient checkpointing             | Enable gradient checkpointing
Overfitting             | Use data augmentation, regularization  | Add dropout to U-Net
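
For the slow-sampling pitfall, a DDIM-style sampler skips most timesteps and updates deterministically; the sketch below assumes `model` predicts noise and that `alpha_bars[t]` holds the cumulative product of (1 - βₛ) up to step t.

# DDIM-style sampler (deterministic, eta = 0; a sketch, not a full implementation)
def ddim_sample(model, alpha_bars, shape, n_steps=50):
    ts = np.linspace(len(alpha_bars) - 1, 0, n_steps).round().astype(int)   # e.g. 1000 -> 50 steps
    x = np.random.normal(0, 1, shape)
    for i, t in enumerate(ts):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[ts[i + 1]] if i + 1 < len(ts) else 1.0          # alpha_bar at the next (earlier) step
        eps = model.predict([x, np.full(shape[0], t)])
        x0_pred = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)               # estimate of the clean sample
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1 - a_prev) * eps           # jump directly to the earlier step
    return x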

Future Directions

  • Real-Time Generation: Faster sampling methods
  • Higher Fidelity: More realistic and detailed outputs
  • 3D and Video: Better 3D object and video generation
  • Multimodal Models: Combining vision, language, and audio
  • Efficient Architectures: More compute-efficient models
  • Theoretical Understanding: Better mathematical foundations
  • Neuromorphic Diffusion: Brain-inspired architectures
  • Quantum Diffusion: Diffusion models for quantum computing
  • Explainable Diffusion: More interpretable models
  • Ethical Diffusion: Models with built-in ethical constraints
  • Few-Shot Diffusion: Learning from few examples
  • Continual Learning: Models that learn continuously
  • Self-Supervised Diffusion: Better self-supervised learning

External Resources