Capsule Network

Neural network architecture that preserves hierarchical spatial relationships between features using capsules instead of traditional neurons.

What is a Capsule Network?

A capsule network (CapsNet) is a type of neural network architecture designed to better preserve hierarchical spatial relationships between features in data. Unlike traditional convolutional neural networks (CNNs) that use scalar activations, capsule networks use vector-valued capsules that encode both the presence and pose (orientation, position, scale) of features.

Key Characteristics

  • Vector-Valued Capsules: Encode both feature presence and pose
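  • Dynamic Routing: Iteratively sends each lower-level capsule's output to the higher-level capsules whose predictions it agrees with
  • Equivariance: Pose vectors change in step with transformations of the input instead of discarding them
  • Viewpoint Invariance: Vector length (the presence estimate) is intended to stay stable as the viewpoint changes
  • Hierarchical Representation: Higher-level capsules represent wholes assembled from lower-level parts
  • Robust to Transformations: Designed to handle affine transformations of objects better than pooling-based CNNs
  • Interpretability: Individual capsule dimensions can correspond to properties such as stroke thickness or skew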
  • Efficient Representation: Compact feature encoding

Architecture Overview

graph TD
    A[Input Image] --> B[Primary Capsules]
    B -->|Pose Vectors| C[Dynamic Routing]
    C -->|Pose Vectors| D[Digit Capsules]
    D -->|Length| E[Output Probabilities]

Core Components

Capsule Structure

A capsule is a group of neurons that outputs a vector:

u_i = [u_{i,1}, u_{i,2}, ..., u_{i,n}]

Where:

  • u_i is the output vector of capsule i, and n is the capsule dimension
  • Each component u_{i,k} represents a different property (pose, orientation, etc.)
  • The length of the vector, ||u_i||, represents the probability that the feature exists
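
As a small worked example of reading a capsule's output, the sketch below applies the squashing non-linearity from Sabour et al. (2017) to two raw 2-D vectors in plain Python. The helper names `squash_vector` and `length` and the sample values are illustrative assumptions, not part of any library.

```python
import math

def squash_vector(s, eps=1e-8):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    norm = math.sqrt(sum(x * x for x in s))
    scale = norm ** 2 / (1.0 + norm ** 2)
    return [scale * x / (norm + eps) for x in s]

def length(v):
    """Euclidean length of a vector given as a list."""
    return math.sqrt(sum(x * x for x in v))

strong = squash_vector([3.0, 4.0])    # raw length 5
weak = squash_vector([0.03, 0.04])    # raw length 0.05

print(round(length(strong), 4))  # 0.9615: feature very likely present
print(round(length(weak), 4))    # 0.0025: feature very likely absent
```

Both outputs keep the 3:4 ratio between their components, so the direction (the pose) is unchanged and only the length is rescaled.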

Dynamic Routing Algorithm

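# Dynamic routing between capsules
import torch
import torch.nn.functional as F

def dynamic_routing(u_hat, b=None, iterations=3):
    """
    u_hat: Predicted output vectors from lower-level capsules,
           shape (batch, num_in, num_out, dim_out)
    b: Optional initial routing logits, shape (batch, num_in, num_out);
       defaults to zeros
    iterations: Number of routing iterations
    Returns the output capsules, shape (batch, num_out, dim_out)
    """
    # Initialize routing logits
    if b is None:
        b = torch.zeros(u_hat.shape[:-1], device=u_hat.device, dtype=u_hat.dtype)

    for iteration in range(iterations):
        # Softmax over output capsules: each input capsule splits its
        # vote across the capsules in the layer above
        c = F.softmax(b, dim=2)

        # Compute weighted sum of predictions over input capsules
        s = torch.sum(c.unsqueeze(-1) * u_hat, dim=1)

        # Apply squash function
        v = squash(s)

        # Raise the logits where a prediction agrees with the output
        if iteration < iterations - 1:
            b = b + torch.sum(u_hat * v.unsqueeze(1), dim=-1)

    return v

def squash(s, dim=-1, eps=1e-8):
    """Non-linear activation for capsules: keeps direction, maps length into [0, 1)"""
    s_norm = torch.norm(s, p=2, dim=dim, keepdim=True)
    scale = s_norm**2 / (1 + s_norm**2)
    return scale * s / (s_norm + eps)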
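
To make the agreement step concrete, here is a toy version of the routing loop in plain Python with two input capsules and two output capsules. It is a sketch for illustration only: the function `route` and the hand-picked prediction vectors are assumptions, chosen so that both inputs agree about output 0 and contradict each other about output 1.

```python
import math

def squash(s, eps=1e-8):
    norm = math.sqrt(sum(x * x for x in s))
    scale = norm ** 2 / (1.0 + norm ** 2)
    return [scale * x / (norm + eps) for x in s]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, iterations=3):
    """u_hat[i][j] is input capsule i's prediction for output capsule j."""
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_out for _ in range(num_in)]
    for it in range(iterations):
        # Each input capsule splits its vote across the output capsules
        c = [softmax(row) for row in b]
        v = []
        for j in range(num_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(num_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        if it < iterations - 1:
            # Agreement (dot product) raises the matching logit
            for i in range(num_in):
                for j in range(num_out):
                    b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return c, v

u_hat = [
    [[1.0, 0.0], [0.0, 1.0]],   # input 0: predictions for outputs 0 and 1
    [[1.0, 0.0], [0.0, -1.0]],  # input 1: same for output 0, opposite for output 1
]
c, v = route(u_hat)
print([round(row[0], 3) for row in c])  # coupling toward output 0 rises above 0.5
print([round(x, 3) for x in v[1]])      # output 1 stays at the zero vector
```

Because the two predictions for output 1 cancel, that capsule never gains length, and the coupling coefficients drift toward output 0 with each iteration.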

CapsNet Architecture

# Capsule Network implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrimaryCapsules(nn.Module):
    def __init__(self, in_channels, out_capsules, dim_capsule, kernel_size=9):
        super(PrimaryCapsules, self).__init__()
        self.dim_capsule = dim_capsule
        self.capsules = nn.ModuleList([
            nn.Conv2d(in_channels, dim_capsule, kernel_size=kernel_size, stride=2)
            for _ in range(out_capsules)
        ])

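    def forward(self, x):
        # Apply each capsule convolution: each output is (batch, dim_capsule, H, W)
        outputs = [capsule(x) for capsule in self.capsules]
        outputs = torch.stack(outputs, dim=1)  # (batch, out_capsules, dim_capsule, H, W)

        # Move the capsule dimension last so each row holds one capsule's vector
        outputs = outputs.permute(0, 1, 3, 4, 2).contiguous()

        # Reshape to (batch, out_capsules * H * W, dim_capsule) and squash
        batch_size = outputs.size(0)
        outputs = outputs.view(batch_size, -1, self.dim_capsule)
        return squash(outputs)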

class DigitCapsules(nn.Module):
    def __init__(self, num_capsules, dim_capsule, num_routing=3):
        super(DigitCapsules, self).__init__()
        self.num_capsules = num_capsules
        self.dim_capsule = dim_capsule
        self.num_routing = num_routing
        # One (dim_capsule x 8) transform per (input capsule, output capsule) pair.
        # 32 * 6 * 6 input capsules of dimension 8 correspond to a 28x28 input.
        self.W = nn.Parameter(0.01 * torch.randn(1, 32 * 6 * 6, num_capsules, dim_capsule, 8))

    def forward(self, x):
        # Expand input: (batch, 1152, 8) -> (batch, 1152, 1, 8, 1)
        x = x.unsqueeze(2).unsqueeze(4)

        # Compute predicted vectors: W @ x -> (batch, 1152, num_capsules, dim_capsule)
        u_hat = torch.matmul(self.W, x).squeeze(-1)

        # Dynamic routing, starting from zero routing logits
        b = torch.zeros(x.size(0), 32 * 6 * 6, self.num_capsules, device=x.device)
        v = dynamic_routing(u_hat, b, self.num_routing)

        return v

class CapsuleNetwork(nn.Module):
    def __init__(self, num_classes=10):
        super(CapsuleNetwork, self).__init__()
        # Initial convolution
        self.conv1 = nn.Conv2d(1, 256, kernel_size=9)

        # Primary capsules
        self.primary_caps = PrimaryCapsules(256, 32, 8)

        # Digit capsules
        self.digit_caps = DigitCapsules(num_classes, 16)

        # Decoder for reconstruction
        self.decoder = nn.Sequential(
            nn.Linear(16 * num_classes, 512),
            nn.ReLU(),
            nn.Linear(512, 1024),
            nn.ReLU(),
            nn.Linear(1024, 784),
            nn.Sigmoid()
        )

    def forward(self, x, y=None):
        # Initial convolution
        x = F.relu(self.conv1(x))

        # Primary capsules
        x = self.primary_caps(x)

        # Digit capsules
        x = self.digit_caps(x)

        # Get class probabilities: capsule lengths, shape (batch, num_classes)
        classes = torch.norm(x, dim=-1)

        # Reconstruction: keep one capsule per example and zero out the rest.
        # Use the true class index y when given, else the capsule with max length.
        if y is None:
            y = classes.argmax(dim=1)
        mask = F.one_hot(y, num_classes=x.size(1)).to(x.dtype)
        reconstructions = self.decoder((x * mask.unsqueeze(-1)).view(x.size(0), -1))

        return classes, reconstructions
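
The hard-coded 32 * 6 * 6 in DigitCapsules follows from the layer sizes above if the input is 28x28, which is an assumption here (the source never names the dataset, but the 784-unit decoder output matches 28 x 28). The arithmetic can be checked with a few lines of Python; `conv_out` is an illustrative helper that applies the usual convolution output-size formula.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution (floor division, as in PyTorch)."""
    return (size + 2 * padding - kernel) // stride + 1

side = conv_out(28, kernel=9)               # conv1: 28 -> 20
side = conv_out(side, kernel=9, stride=2)   # primary capsules: 20 -> 6
num_primary = 32 * side * side              # 32 capsule types per grid cell

print(side, num_primary)  # 6 1152
print(28 * 28)            # 784, the decoder's output size
```

Any other input resolution changes `side`, so the routing weight tensor would need a matching number of input capsules.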

Capsule Networks vs CNNs

Feature | Capsule Networks | Convolutional Neural Networks
Feature Representation | Vector-valued capsules | Scalar activations
Spatial Relationships | Preserves hierarchical relationships | Loses spatial hierarchies
Viewpoint Invariance | Yes (equivariance) | Limited (requires data augmentation)
Pose Information | Encoded in capsule vectors | Not explicitly encoded
Dynamic Routing | Yes (learns part-whole relationships) | No (fixed connections)
Interpretability | High (pose information) | Low (abstract features)
Robustness | High (handles transformations well) | Moderate (sensitive to transformations)
Computational Cost | Higher (dynamic routing) | Lower (fixed operations)
Training Stability | Challenging (routing convergence) | Stable
Architecture | More complex | Simpler
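
The "Pose Information" row can be illustrated with a loose analogy in plain Python. Max pooling returns the same value when a feature shifts inside the pooling window, so the shift is lost. Keeping the offset alongside the value is a crude stand-in for a capsule's pose. The two helpers below are illustrative assumptions and are not capsules themselves.

```python
def max_pool_1d(values, window):
    """Non-overlapping max pooling over a list."""
    return [max(values[i:i + window]) for i in range(0, len(values), window)]

def argmax_pool_1d(values, window):
    """Like max_pool_1d, but also keeps where the maximum sat inside each window."""
    out = []
    for i in range(0, len(values), window):
        chunk = values[i:i + window]
        peak = max(chunk)
        out.append((peak, chunk.index(peak)))
    return out

left = [9, 0, 0, 0]
right = [0, 9, 0, 0]  # same feature, shifted by one position

print(max_pool_1d(left, 2) == max_pool_1d(right, 2))            # True: the shift is invisible
print(argmax_pool_1d(left, 2)[0], argmax_pool_1d(right, 2)[0])  # (9, 0) (9, 1)
```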

Training Capsule Networks

Margin Loss

# Margin loss for capsule networks
def margin_loss(v_c, y, m_plus=0.9, m_minus=0.1, lambda_=0.5):
    """
    v_c: Output vector lengths (class probabilities)
    y: Ground truth labels
    m_plus: Upper margin
    m_minus: Lower margin
    lambda_: Down-weighting factor
    """
    # Convert labels to one-hot
    y_onehot = torch.zeros(v_c.size()).to(v_c.device)
    y_onehot.scatter_(1, y.unsqueeze(1), 1.)

    # Calculate losses
    L_c = y_onehot * F.relu(m_plus - v_c)**2 + \
          lambda_ * (1 - y_onehot) * F.relu(v_c - m_minus)**2

    return torch.mean(torch.sum(L_c, dim=1))
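
A worked example for a single sample may help show how the two margins behave. The sketch below re-implements the same formula on plain Python lists; `margin_loss_single` and the capsule lengths are made up for illustration.

```python
def margin_loss_single(lengths, target, m_plus=0.9, m_minus=0.1, lambda_=0.5):
    """Margin loss for one example, given the capsule lengths and the true class index."""
    total = 0.0
    for k, length in enumerate(lengths):
        if k == target:
            # Penalize the true class only if its capsule is shorter than m_plus
            total += max(0.0, m_plus - length) ** 2
        else:
            # Penalize other classes only if their capsules are longer than m_minus
            total += lambda_ * max(0.0, length - m_minus) ** 2
    return total

confident = margin_loss_single([0.95, 0.05, 0.02], target=0)
confused = margin_loss_single([0.40, 0.60, 0.02], target=0)

print(confident)           # 0.0: every capsule is inside its margin
print(round(confused, 4))  # 0.375 = (0.9 - 0.4)^2 + 0.5 * (0.6 - 0.1)^2
```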

Reconstruction Loss

# Reconstruction loss for capsule networks
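def reconstruction_loss(x, reconstructions):
    """Sum of squared pixel errors per example, averaged over the batch"""
    # Sabour et al. (2017) use squared error here, which is what the
    # small 0.0005 weight on this term is scaled for
    return F.mse_loss(reconstructions, x.view(x.size(0), -1), reduction='sum') / x.size(0)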

Training Loop

# Training loop for capsule network
def train_capsnet(model, train_loader, optimizer, epochs, device):
    model.train()

    for epoch in range(epochs):
        total_loss = 0
        correct = 0

        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)

            # Forward pass
            optimizer.zero_grad()
            output, reconstructions = model(data, target)

            # Calculate losses
            loss = margin_loss(output, target) + 0.0005 * reconstruction_loss(data, reconstructions)

            # Backward pass
            loss.backward()
            optimizer.step()

            # Calculate accuracy
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
            total_loss += loss.item()

        accuracy = 100. * correct / len(train_loader.dataset)
        avg_loss = total_loss / len(train_loader)

        print(f'Epoch {epoch+1}, Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')
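
The 0.0005 weight in the combined loss is easier to judge with numbers. The sketch below assumes the reconstruction term is a per-example sum of squared pixel errors, as in Sabour et al. (2017). The pixel error and margin value are hypothetical, and `combined_loss` is an illustrative helper.

```python
def combined_loss(margin, squared_errors, recon_weight=0.0005):
    """Margin loss plus a down-weighted sum of squared pixel errors."""
    return margin + recon_weight * sum(squared_errors)

# Hypothetical single example: 784 pixels, each off by 0.2 in intensity
squared_errors = [0.2 ** 2] * 784
margin = 0.375

unweighted = margin + sum(squared_errors)
weighted = combined_loss(margin, squared_errors)

print(round(unweighted, 3))  # 31.735: reconstruction swamps the margin term
print(round(weighted, 4))    # 0.3907: the margin term still dominates
```

With the small weight, reconstruction acts as a regularizer and does not compete with classification.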

Applications

Image Classification

# Image classification with capsule networks
class ImageClassifier:
    def __init__(self, num_classes=10):
        self.model = CapsuleNetwork(num_classes)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)

    def train(self, train_loader, epochs=10, lr=0.001):
        """Train the capsule network"""
        optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        train_capsnet(self.model, train_loader, optimizer, epochs, self.device)

    def predict(self, image):
        """Predict class for an image"""
        self.model.eval()
        with torch.no_grad():
            image = image.unsqueeze(0).to(self.device)
            output, _ = self.model(image)
            return output.argmax(dim=1).item()

Object Detection

# Object detection with capsule networks (conceptual)
class CapsuleObjectDetector(nn.Module):
    def __init__(self, num_classes):
        super(CapsuleObjectDetector, self).__init__()
        # Feature extraction
        self.conv1 = nn.Conv2d(3, 256, kernel_size=9)

        # Primary capsules
        self.primary_caps = PrimaryCapsules(256, 32, 8)

        # Detection capsules (one per class)
        self.detection_caps = DigitCapsules(num_classes, 16)

        # Bounding box regression capsules
        self.bbox_caps = nn.ModuleList([
            nn.Sequential(
                nn.Linear(16, 32),
                nn.ReLU(),
                nn.Linear(32, 4)  # x, y, width, height
            ) for _ in range(num_classes)
        ])

    def forward(self, x):
        # Feature extraction
        x = F.relu(self.conv1(x))

        # Primary capsules
        x = self.primary_caps(x)

        # Detection capsules
        detections = self.detection_caps(x)

        # Bounding box predictions
        bbox_preds = []
        for i, bbox_layer in enumerate(self.bbox_caps):
            # Use capsule pose information for bounding box prediction
            bbox_preds.append(bbox_layer(detections[:, i, :]))

        return detections, torch.stack(bbox_preds, dim=1)

Medical Imaging

# Medical imaging with capsule networks
class MedicalCapsuleNetwork(nn.Module):
    def __init__(self, num_classes=2):
        super(MedicalCapsuleNetwork, self).__init__()
        # Input convolution for medical images
        self.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=5, stride=2)

        # Primary capsules
        self.primary_caps = PrimaryCapsules(128, 32, 8)

        # Diagnosis capsules
        self.diagnosis_caps = DigitCapsules(num_classes, 16)

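        # Segmentation decoder: takes the 8-dimensional primary capsule poses
        # arranged on their spatial grid
        self.segmentation_decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2),
            nn.Sigmoid()
        )

    def forward(self, x):
        # Feature extraction
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))

        # Primary capsules: (batch, 32 * H * W, 8)
        capsules = self.primary_caps(x)

        # Diagnosis capsules (DigitCapsules assumes H = W = 6,
        # e.g. a 1x88x88 input with this stem)
        diagnosis = self.diagnosis_caps(capsules)

        # Segmentation
        # Average the 32 capsule types at each grid position, then move the
        # pose dimension to the channel axis: (batch, 8, H, W)
        batch_size = capsules.size(0)
        side = int(round((capsules.size(1) / 32) ** 0.5))
        grid = capsules.view(batch_size, 32, side, side, 8).mean(dim=1)
        # Three stride-2 transposed convolutions give a coarse mask (62x62 for a 6x6 grid)
        segmentation = self.segmentation_decoder(grid.permute(0, 3, 1, 2))

        return diagnosis, segmentation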

Research Directions

Key Papers

  1. "Dynamic Routing Between Capsules" (Sabour et al., 2017)
    • Introduced capsule networks
    • Demonstrated dynamic routing algorithm
    • Foundation for capsule network research
  2. "Matrix Capsules with EM Routing" (Hinton et al., 2018)
    • Introduced matrix capsules
    • Demonstrated EM routing
    • Improved performance and stability
  3. "Path Capsule Networks" (Amer & Maul, 2019)
    • Introduced path routing
    • Demonstrated improved performance
    • Foundation for scalable capsule networks
  4. "Capsule Networks for Object Detection" (Lin et al., 2020)
    • Applied capsule networks to object detection
    • Demonstrated competitive performance
    • Foundation for detection applications
  5. "Capsule Networks: A Survey" (Kosiorek et al., 2021)
    • Comprehensive survey of capsule networks
    • Overview of applications and variants
    • Foundation for capsule network research

Emerging Research

  • Scalable Capsule Networks: Architectures for large-scale problems
  • 3D Capsule Networks: Capsules for 3D data and point clouds
  • Temporal Capsule Networks: Capsules for video and time series
  • Self-Supervised Capsules: Learning without labeled data
  • Explainable Capsules: More interpretable representations
  • Neuromorphic Capsules: Capsules for spiking neural networks
  • Quantum Capsules: Capsules for quantum computing
  • Efficient Capsules: More compute-efficient architectures
  • Multimodal Capsules: Capsules for multiple data modalities
  • Few-Shot Capsules: Learning from few examples
  • Adversarial Capsules: Robust capsule networks
  • Theoretical Foundations: Better understanding of capsules
  • Hardware Acceleration: Specialized hardware for capsules

Best Practices

Implementation Guidelines

Aspect | Recommendation | Notes
Capsule Dimension | 8-16 for primary, 16-32 for higher | Balance expressiveness and computation
Routing Iterations | 3-5 iterations | 3 is the common default; more iterations add cost and can overfit
Squash Function | Use stable implementation | Avoid numerical instability
Initialization | Small random weights | Prevents early saturation
Learning Rate | 0.001-0.01 | Use learning rate scheduling
Batch Size | 32-128 | Larger batches for stability
Loss Function | Margin loss + reconstruction loss | Reconstruction helps regularization
Regularization | Dropout, weight decay | Prevents overfitting
Normalization | Batch normalization | Improves training stability
Optimizer | Adam for most cases | Works well with capsule networks
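
The "stable implementation" advice for the squash function comes down to avoiding a division by zero when a capsule's raw vector is all zeros. The plain-Python sketch below shows the failure and the usual epsilon fix; both function names are illustrative. In PyTorch the naive form would produce NaN values instead of raising an exception.

```python
import math

def squash_naive(s):
    """Squash without protection: divides by the norm directly."""
    norm = math.sqrt(sum(x * x for x in s))
    return [(norm ** 2 / (1 + norm ** 2)) * x / norm for x in s]

def squash_stable(s, eps=1e-8):
    """Squash with a small epsilon added to the denominator."""
    norm = math.sqrt(sum(x * x for x in s))
    return [(norm ** 2 / (1 + norm ** 2)) * x / (norm + eps) for x in s]

zero = [0.0, 0.0]
try:
    squash_naive(zero)
    naive_ok = True
except ZeroDivisionError:
    naive_ok = False  # 0 / 0 for an all-zero capsule

print(naive_ok)             # False
print(squash_stable(zero))  # [0.0, 0.0]
```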

Common Pitfalls and Solutions

Pitfall | Solution | Example
Routing Instability | Use fewer iterations, better initialization | Start with 3 routing iterations
Slow Convergence | Use learning rate scheduling | Start with lr=0.01, decay to 0.0001
Overfitting | Use reconstruction loss, dropout | Add dropout with p=0.2
Memory Issues | Use gradient checkpointing | Enable gradient checkpointing
Numerical Instability | Use stable squash function | Add small epsilon (1e-8) to denominator
Class Imbalance | Use weighted loss | Weight classes by inverse frequency
Pose Ambiguity | Use appropriate capsule dimensions | Use 16D capsules for complex poses
Feature Collapse | Use skip connections | Add residual connections
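
For the "Slow Convergence" row, one simple way to go from lr=0.01 to 0.0001 is a geometric (exponential) decay per epoch. The sketch below computes such a schedule in plain Python. `exponential_schedule` is an illustrative helper and five epochs is an arbitrary choice. The same decay factor could be passed as `gamma` to `torch.optim.lr_scheduler.ExponentialLR`.

```python
def exponential_schedule(lr_start, lr_end, epochs):
    """Per-epoch learning rates decaying geometrically from lr_start to lr_end."""
    gamma = (lr_end / lr_start) ** (1.0 / (epochs - 1))
    return [lr_start * gamma ** epoch for epoch in range(epochs)]

rates = exponential_schedule(0.01, 0.0001, epochs=5)
print([round(r, 6) for r in rates])  # [0.01, 0.003162, 0.001, 0.000316, 0.0001]
```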

Future Directions

  • Foundation Capsule Models: Large pre-trained capsule networks
  • 3D Vision Capsules: Better 3D structure understanding
  • Video Capsules: Temporal capsule networks for video
  • Multimodal Capsules: Combining vision, language, and audio
  • Explainable Capsules: More interpretable representations
  • Neuromorphic Capsules: Brain-inspired architectures
  • Quantum Capsules: Capsules for quantum computing
  • Energy-Efficient Capsules: Ultra-low power implementations
  • Self-Supervised Capsules: Learning from unlabeled data
  • Few-Shot Capsules: Learning from few examples
  • Adversarial Capsules: Robust capsule networks
  • Theoretical Breakthroughs: Better understanding of capsules
  • Real-Time Capsules: Faster inference for edge devices

External Resources