Capsule Network
Neural network architecture that preserves hierarchical spatial relationships between features using capsules instead of traditional neurons.
What is a Capsule Network?
A capsule network (CapsNet) is a type of neural network architecture designed to better preserve hierarchical spatial relationships between features in data. Unlike traditional convolutional neural networks (CNNs) that use scalar activations, capsule networks use vector-valued capsules that encode both the presence and pose (orientation, position, scale) of features.
Key Characteristics
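- Vector-Valued Capsules: Each capsule outputs a vector whose length encodes feature presence and whose direction encodes pose
- Dynamic Routing: Iteratively adjusts coupling coefficients so that lower-level capsules send their output to the higher-level capsules that agree with their predictions
- Equivariance: Pose vectors change in step with transformations of the input instead of discarding them
- Viewpoint Invariance: Vector length (the class score) is intended to stay stable when the viewpoint changes
- Hierarchical Representation: Higher-level capsules represent wholes assembled from lower-level parts
- Robust to Transformations: Designed to generalize to affine transformations with less reliance on data augmentation
- Interpretability: Individual capsule dimensions often correspond to visible properties such as stroke thickness or skew
- Efficient Representation: Encodes pose in a few vector components instead of many separate feature detectors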
Architecture Overview
graph TD
    A[Input Image] --> B[Primary Capsules]
    B -->|Pose Vectors| C[Dynamic Routing]
    C -->|Pose Vectors| D[Digit Capsules]
    D -->|Length| E[Output Probabilities]
Core Components
Capsule Structure
A capsule is a group of neurons that outputs a vector:
u_i = [u_{i,1}, u_{i,2}, ..., u_{i,n}]
Where:
- u_i is the output vector of capsule i
- Each component represents a different property (position, orientation, scale, etc.)
- The length of the vector represents the probability that the feature exists
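Vector length can act as a probability because capsule layers pass their output through a squashing non-linearity that keeps the direction and maps the length into [0, 1). The following is a minimal, dependency-free sketch of that idea on plain Python lists; the helper names `squash_vector` and `length` and the sample vectors are illustrative assumptions, not part of any library.

```python
import math

def squash_vector(s, eps=1e-8):
    """Keep the direction of s and map its length into [0, 1)."""
    norm = math.sqrt(sum(x * x for x in s))
    scale = norm ** 2 / (1.0 + norm ** 2)
    # eps avoids division by zero for an all-zero input
    return [scale * x / (norm + eps) for x in s]

def length(v):
    """Euclidean length of a vector given as a list."""
    return math.sqrt(sum(x * x for x in v))

weak = squash_vector([0.1, 0.0])    # short input: length shrinks toward 0
strong = squash_vector([3.0, 4.0])  # long input (norm 5): length approaches 1

print(round(length(weak), 4))    # 0.0099
print(round(length(strong), 4))  # 0.9615
```

The direction is unchanged: `strong` still points along (0.6, 0.8), only its length is rescaled to 25/26.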
Dynamic Routing Algorithm
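# Dynamic routing between capsules
import torch
import torch.nn.functional as F

def dynamic_routing(u_hat, b=None, num_iterations=3):
    """
    u_hat: Predicted output vectors from lower-level capsules,
        shape [batch, num_in_capsules, num_out_capsules, dim_out]
    b: Optional initial routing logits, one per (input, output) capsule pair;
        defaults to zeros
    num_iterations: Number of routing iterations
    """
    # Initialize routing logits on the same device as the predictions
    if b is None:
        b = torch.zeros(u_hat.shape[:-1], device=u_hat.device, dtype=u_hat.dtype)
    else:
        b = b.reshape(u_hat.shape[:-1])
    for iteration in range(num_iterations):
        # Softmax over output capsules: each input capsule's couplings sum to 1
        c = F.softmax(b, dim=2)
        # Weighted sum of predictions over input capsules
        s = torch.sum(c.unsqueeze(-1) * u_hat, dim=1)
        # Apply squash function
        v = squash(s)
        # Update routing logits by agreement (skipped after the last iteration)
        if iteration < num_iterations - 1:
            b = b + torch.sum(u_hat * v.unsqueeze(1), dim=-1)
    return v

def squash(s, dim=-1):
    """Non-linear activation for capsules: keeps direction, maps length into [0, 1)."""
    s_norm = torch.norm(s, p=2, dim=dim, keepdim=True)
    scale = s_norm**2 / (1 + s_norm**2)
    return scale * s / (s_norm + 1e-8)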
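To make routing by agreement concrete, here is a small sketch that runs the same loop on plain Python lists, assuming three input capsules, two output capsules and 2-D predictions. The function name `route` and the toy predictions in `u_hat` are hypothetical; the values are chosen so that all inputs agree on output capsule 0 and disagree on output capsule 1.

```python
import math

def squash(s, eps=1e-8):
    norm = math.sqrt(sum(x * x for x in s))
    scale = norm ** 2 / (1.0 + norm ** 2)
    return [scale * x / (norm + eps) for x in s]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(u_hat, num_iterations=3):
    """u_hat[i][j] is the vector input capsule i predicts for output capsule j."""
    num_in, num_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_out for _ in range(num_in)]  # routing logits
    for it in range(num_iterations):
        # Each input capsule spreads its output over the output capsules
        c = [softmax(row) for row in b]
        # Weighted sum over input capsules, then squash
        v = [squash([sum(c[i][j] * u_hat[i][j][d] for i in range(num_in))
                     for d in range(dim)])
             for j in range(num_out)]
        if it < num_iterations - 1:
            # Raise the logit where a prediction agrees with the output
            for i in range(num_in):
                for j in range(num_out):
                    b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return c, v

# Toy predictions: consistent for output 0, conflicting for output 1
u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.9, 0.1], [-1.0, 0.0]],
    [[1.0, -0.1], [0.0, 1.0]],
]
c, v = route(u_hat)
for i, row in enumerate(c):
    print(f"input {i}: couplings {[round(x, 3) for x in row]}")
```

After three iterations every input capsule sends most of its output to capsule 0, and that capsule ends up with the longer output vector.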
CapsNet Architecture
# Capsule Network implementation
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrimaryCapsules(nn.Module):
    def __init__(self, in_channels, out_capsules, dim_capsule, kernel_size=9):
        super().__init__()
        self.dim_capsule = dim_capsule
        # One convolution per capsule type; each outputs dim_capsule channels
        self.capsules = nn.ModuleList([
            nn.Conv2d(in_channels, dim_capsule, kernel_size=kernel_size, stride=2)
            for _ in range(out_capsules)
        ])

    def forward(self, x):
        # Apply each capsule convolution: [batch, out_capsules, dim_capsule, H, W]
        outputs = torch.stack([capsule(x) for capsule in self.capsules], dim=1)
        # Move the capsule dimension last so that each vector holds the
        # dim_capsule channels of one spatial position
        outputs = outputs.permute(0, 1, 3, 4, 2).contiguous()
        # Flatten to [batch, out_capsules * H * W, dim_capsule] and squash
        batch_size = outputs.size(0)
        outputs = outputs.view(batch_size, -1, self.dim_capsule)
        return squash(outputs)
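
class DigitCapsules(nn.Module):
    def __init__(self, num_capsules, dim_capsule, num_routing=3,
                 num_in_capsules=32 * 6 * 6, dim_in=8):
        super().__init__()
        self.num_capsules = num_capsules
        self.dim_capsule = dim_capsule
        self.num_routing = num_routing
        # One (dim_capsule x dim_in) matrix per (input, output) capsule pair.
        # Defaults match 32 primary capsule types on a 6x6 grid with 8-D poses.
        # Small initial weights keep the first outputs away from saturation.
        self.W = nn.Parameter(
            0.01 * torch.randn(1, num_in_capsules, num_capsules, dim_capsule, dim_in))

    def forward(self, x):
        # Expand input to [batch, num_in, 1, dim_in, 1] to broadcast against W
        x = x.unsqueeze(2).unsqueeze(4)
        # Compute predicted vectors W @ x: [batch, num_in, num_capsules, dim_capsule]
        u_hat = torch.matmul(self.W, x).squeeze(-1)
        # Dynamic routing, starting from zero routing logits
        b = torch.zeros(u_hat.shape[:-1], device=x.device)
        v = dynamic_routing(u_hat, b, self.num_routing)
        return v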

class CapsuleNetwork(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Initial convolution (expects 1x28x28 inputs such as MNIST)
        self.conv1 = nn.Conv2d(1, 256, kernel_size=9)
        # Primary capsules
        self.primary_caps = PrimaryCapsules(256, 32, 8)
        # Digit capsules
        self.digit_caps = DigitCapsules(num_classes, 16)
        # Decoder for reconstruction
        self.decoder = nn.Sequential(
            nn.Linear(16 * num_classes, 512),
            nn.ReLU(),
            nn.Linear(512, 1024),
            nn.ReLU(),
            nn.Linear(1024, 784),
            nn.Sigmoid()
        )

    def forward(self, x, y=None):
        # Initial convolution
        x = F.relu(self.conv1(x))
        # Primary capsules
        x = self.primary_caps(x)
        # Digit capsules: [batch, num_classes, 16]
        x = self.digit_caps(x)
        # Capsule lengths act as class probabilities
        classes = torch.norm(x, dim=-1)
        # Reconstruct from the target capsule when integer labels y are given,
        # otherwise from the capsule with max length
        if y is None:
            y = classes.argmax(dim=1)
        mask = F.one_hot(y, num_classes=x.size(1)).to(x.dtype)
        # Zero out every other capsule, then flatten to [batch, 16 * num_classes]
        reconstructions = self.decoder((x * mask.unsqueeze(-1)).flatten(1))
        return classes, reconstructions
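The constant 32 * 6 * 6 used by the digit capsule layer follows from the convolution sizes. The sketch below works through that arithmetic, assuming a 28x28 input (as for MNIST), no padding, and the kernel sizes and strides used in the code above; `conv_output_size` is an illustrative helper, not a library function.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial size after a convolution without dilation."""
    return (size + 2 * padding - kernel_size) // stride + 1

image_size = 28                                      # assumed input resolution
after_conv1 = conv_output_size(image_size, 9)        # conv1: kernel 9, stride 1
grid = conv_output_size(after_conv1, 9, stride=2)    # primary capsules: kernel 9, stride 2
num_primary_capsules = 32 * grid * grid              # 32 capsule types per grid cell
routing_weights = num_primary_capsules * 10 * 16 * 8 # one 16x8 matrix per capsule pair

print(after_conv1, grid, num_primary_capsules, routing_weights)
# 20 6 1152 1474560
```

A different input resolution changes `grid`, so the number of input capsules expected by the digit capsule layer has to change with it.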
Capsule Networks vs CNNs
| Feature | Capsule Networks | Convolutional Neural Networks |
|---|---|---|
| Feature Representation | Vector-valued capsules | Scalar activations |
| Spatial Relationships | Encodes part-whole pose relationships explicitly | Pooling discards precise spatial relationships |
| Viewpoint Invariance | Vector length intended to be invariant, pose vector equivariant | Limited (typically relies on data augmentation) |
| Pose Information | Encoded in capsule vectors | Not explicitly encoded |
| Dynamic Routing | Yes (learns part-whole relationships) | No (fixed connections) |
| Interpretability | High (pose information) | Low (abstract features) |
| Robustness | High (handles transformations well) | Moderate (sensitive to transformations) |
| Computational Cost | Higher (dynamic routing) | Lower (fixed operations) |
| Training Stability | Challenging (routing convergence) | Stable |
| Architecture | More complex | Simpler |
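The difference between invariance and equivariance in the table can be shown with a toy 2-D pose vector. This sketch assumes an idealized capsule whose output rotates exactly with the input; trained capsules are at best approximately equivariant, and the names `rotate` and `pose` are hypothetical.

```python
import math

def rotate(v, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

pose = [0.9, 0.0]                    # hypothetical capsule output for an upright object
rotated = rotate(pose, math.pi / 6)  # same object seen rotated by 30 degrees

presence_before = math.hypot(pose[0], pose[1])
presence_after = math.hypot(rotated[0], rotated[1])
angle_after = math.atan2(rotated[1], rotated[0])

print(round(presence_before, 3), round(presence_after, 3))  # 0.9 0.9
print(round(math.degrees(angle_after), 1))                  # 30.0
```

The length (presence) is invariant to the rotation, while the direction (pose) tracks it. A scalar activation can only offer the first of these.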
Training Capsule Networks
Margin Loss
# Margin loss for capsule networks
def margin_loss(v_c, y, m_plus=0.9, m_minus=0.1, lambda_=0.5):
    """
    v_c: Output vector lengths (class probabilities)
    y: Ground truth labels
    m_plus: Upper margin
    m_minus: Lower margin
    lambda_: Down-weighting factor
    """
    # Convert labels to one-hot
    y_onehot = torch.zeros(v_c.size()).to(v_c.device)
    y_onehot.scatter_(1, y.unsqueeze(1), 1.)
    # Calculate losses
    L_c = y_onehot * F.relu(m_plus - v_c)**2 + \
        lambda_ * (1 - y_onehot) * F.relu(v_c - m_minus)**2
    return torch.mean(torch.sum(L_c, dim=1))
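A worked example may help when reading the margin loss. The sketch below evaluates the same formula for a single sample in plain Python; `margin_loss_single` and the capsule lengths are illustrative assumptions.

```python
def margin_loss_single(lengths, target, m_plus=0.9, m_minus=0.1, lambda_=0.5):
    """Margin loss for one sample, given the length of each class capsule."""
    loss = 0.0
    for k, length in enumerate(lengths):
        if k == target:
            # Present class: penalize lengths below the upper margin
            loss += max(0.0, m_plus - length) ** 2
        else:
            # Absent classes: penalize lengths above the lower margin, down-weighted
            loss += lambda_ * max(0.0, length - m_minus) ** 2
    return loss

# Confident and correct: only the 0.30 distractor contributes
print(round(margin_loss_single([0.95, 0.30, 0.05], target=0), 4))  # 0.02
# Target capsule too short: (0.9 - 0.4)^2 is added
print(round(margin_loss_single([0.40, 0.30, 0.05], target=0), 4))  # 0.27
```

Lengths already beyond the margins contribute nothing, so the loss stops pushing capsules that are confident enough.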
Reconstruction Loss
# Reconstruction loss for capsule networks
def reconstruction_loss(x, reconstructions):
    """Binary cross-entropy loss for reconstructions"""
    return F.binary_cross_entropy(reconstructions, x.view(x.size(0), -1))
Training Loop
# Training loop for capsule network
def train_capsnet(model, train_loader, optimizer, epochs, device):
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        correct = 0
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)
            # Forward pass
            optimizer.zero_grad()
            output, reconstructions = model(data, target)
            # Calculate losses
            loss = margin_loss(output, target) + 0.0005 * reconstruction_loss(data, reconstructions)
            # Backward pass
            loss.backward()
            optimizer.step()
            # Calculate accuracy
            pred = output.argmax(dim=1, keepdim=True)
            correct += pred.eq(target.view_as(pred)).sum().item()
            total_loss += loss.item()
        accuracy = 100. * correct / len(train_loader.dataset)
        avg_loss = total_loss / len(train_loader)
        print(f'Epoch {epoch+1}, Loss: {avg_loss:.4f}, Accuracy: {accuracy:.2f}%')
Applications
Image Classification
# Image classification with capsule networks
class ImageClassifier:
    def __init__(self, num_classes=10):
        self.model = CapsuleNetwork(num_classes)
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.model.to(self.device)

    def train(self, train_loader, epochs=10, lr=0.001):
        """Train the capsule network"""
        optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        train_capsnet(self.model, train_loader, optimizer, epochs, self.device)

    def predict(self, image):
        """Predict class for an image"""
        self.model.eval()
        with torch.no_grad():
            image = image.unsqueeze(0).to(self.device)
            output, _ = self.model(image)
            return output.argmax(dim=1).item()
Object Detection
# Object detection with capsule networks (conceptual)
class CapsuleObjectDetector(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Feature extraction
        self.conv1 = nn.Conv2d(3, 256, kernel_size=9)
        # Primary capsules
        self.primary_caps = PrimaryCapsules(256, 32, 8)
        # Detection capsules (one per class). DigitCapsules expects 32 * 6 * 6
        # primary capsules, which these kernel sizes produce for 28x28 inputs.
        self.detection_caps = DigitCapsules(num_classes, 16)
        # Bounding box regression heads, one per class
        self.bbox_caps = nn.ModuleList([
            nn.Sequential(
                nn.Linear(16, 32),
                nn.ReLU(),
                nn.Linear(32, 4)  # x, y, width, height
            ) for _ in range(num_classes)
        ])

    def forward(self, x):
        # Feature extraction
        x = F.relu(self.conv1(x))
        # Primary capsules
        x = self.primary_caps(x)
        # Detection capsules: [batch, num_classes, 16]
        detections = self.detection_caps(x)
        # Bounding box predictions
        bbox_preds = []
        for i, bbox_layer in enumerate(self.bbox_caps):
            # Use capsule pose information for bounding box prediction
            bbox_preds.append(bbox_layer(detections[:, i, :]))
        # Stack to [batch, num_classes, 4]
        return detections, torch.stack(bbox_preds, dim=1)
Medical Imaging
# Medical imaging with capsule networks
class MedicalCapsuleNetwork(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        # Input convolution for medical images
        self.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2)
        self.conv2 = nn.Conv2d(64, 128, kernel_size=5, stride=2)
        # Primary capsules: 32 capsule types with 8-D poses
        self.primary_caps = PrimaryCapsules(128, 32, 8)
        # Diagnosis capsules. DigitCapsules expects 32 * 6 * 6 primary capsules,
        # which these strides produce for inputs such as 88x88 pixels.
        self.diagnosis_caps = DigitCapsules(num_classes, 16)
        # Segmentation decoder; takes the 8-channel primary capsule pose map
        self.segmentation_decoder = nn.Sequential(
            nn.ConvTranspose2d(8, 64, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2),
            nn.Sigmoid()
        )

    def forward(self, x):
        # Feature extraction
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        # Primary capsules: [batch, 32 * grid * grid, 8]
        capsules = self.primary_caps(x)
        # Diagnosis capsules
        diagnosis = self.diagnosis_caps(capsules)
        # Segmentation from primary capsule poses. Assumes PrimaryCapsules
        # flattens its output in (capsule type, row, column) order.
        batch_size, num_caps, dim = capsules.shape
        grid = int((num_caps // 32) ** 0.5)
        # Average over the 32 capsule types: [batch, grid, grid, 8]
        pose_map = capsules.reshape(batch_size, 32, grid, grid, dim).mean(dim=1)
        # Channels first for the decoder: [batch, 8, grid, grid]
        segmentation = self.segmentation_decoder(pose_map.permute(0, 3, 1, 2))
        return diagnosis, segmentation
Research Directions
Key Papers
- "Dynamic Routing Between Capsules" (Sabour et al., 2017)
  - Introduced capsule networks
  - Demonstrated dynamic routing algorithm
  - Foundation for capsule network research
- "Matrix Capsules with EM Routing" (Hinton et al., 2018)
  - Introduced matrix capsules
  - Demonstrated EM routing
  - Improved performance and stability
- "Path Capsule Networks" (Amer & Maul, 2019)
  - Introduced path routing
  - Demonstrated improved performance
  - Foundation for scalable capsule networks
- "Capsule Networks for Object Detection" (Lin et al., 2020)
  - Applied capsule networks to object detection
  - Demonstrated competitive performance
  - Foundation for detection applications
- "Capsule Networks: A Survey" (Kosiorek et al., 2021)
  - Comprehensive survey of capsule networks
  - Overview of applications and variants
  - Foundation for capsule network research
Emerging Research
- Scalable Capsule Networks: Architectures for large-scale problems
- 3D Capsule Networks: Capsules for 3D data and point clouds
- Temporal Capsule Networks: Capsules for video and time series
- Self-Supervised Capsules: Learning without labeled data
- Explainable Capsules: More interpretable representations
- Neuromorphic Capsules: Capsules for spiking neural networks
- Quantum Capsules: Capsules for quantum computing
- Efficient Capsules: More compute-efficient architectures
- Multimodal Capsules: Capsules for multiple data modalities
- Few-Shot Capsules: Learning from few examples
- Adversarial Capsules: Robust capsule networks
- Theoretical Foundations: Better understanding of capsules
- Hardware Acceleration: Specialized hardware for capsules
Best Practices
Implementation Guidelines
| Aspect | Recommendation | Notes |
|---|---|---|
| Capsule Dimension | 8-16 for primary, 16-32 for higher | Balance expressiveness and computation |
| Routing Iterations | 3-5 iterations | Sabour et al. used 3; more iterations add compute and can overfit |
| Squash Function | Use stable implementation | Avoid numerical instability |
| Initialization | Small random weights | Prevents early saturation |
| Learning Rate | 0.001-0.01 | Use learning rate scheduling |
| Batch Size | 32-128 | Larger batches for stability |
| Loss Function | Margin loss + reconstruction loss | Reconstruction helps regularization |
| Regularization | Dropout, weight decay | Prevents overfitting |
| Normalization | Batch normalization | Improves training stability |
| Optimizer | Adam for most cases | Works well with capsule networks |
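The learning rate scheduling recommended in the table can be as simple as a geometric decay between two values. The sketch below computes such a schedule in plain Python; `exponential_schedule`, the end points 0.01 and 0.0001, and the epoch count are illustrative assumptions. In PyTorch the same per-epoch factor could be passed as `gamma` to `torch.optim.lr_scheduler.ExponentialLR`.

```python
def exponential_schedule(lr_start, lr_end, num_epochs):
    """Learning rate per epoch, decaying geometrically from lr_start to lr_end."""
    if num_epochs < 2:
        return [lr_start] * num_epochs
    # Constant factor applied after every epoch
    gamma = (lr_end / lr_start) ** (1.0 / (num_epochs - 1))
    return [lr_start * gamma ** epoch for epoch in range(num_epochs)]

schedule = exponential_schedule(0.01, 0.0001, num_epochs=5)
print([f"{lr:.5f}" for lr in schedule])
# ['0.01000', '0.00316', '0.00100', '0.00032', '0.00010']
```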
Common Pitfalls and Solutions
| Pitfall | Solution | Example |
|---|---|---|
| Routing Instability | Use fewer iterations, better initialization | Start with 3 routing iterations |
| Slow Convergence | Use learning rate scheduling | Start with lr=0.01, decay to 0.0001 |
| Overfitting | Use reconstruction loss, dropout | Add dropout with p=0.2 |
| Memory Issues | Use gradient checkpointing | Enable gradient checkpointing |
| Numerical Instability | Use stable squash function | Add small epsilon (1e-8) to denominator |
| Class Imbalance | Use weighted loss | Weight classes by inverse frequency |
| Pose Ambiguity | Use appropriate capsule dimensions | Use 16D capsules for complex poses |
| Feature Collapse | Use skip connections | Add residual connections |
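For the class imbalance row, inverse-frequency weights can be derived directly from the label counts. The following sketch assumes a hypothetical binary dataset with a 90/10 split; `inverse_frequency_weights` is an illustrative helper, and the resulting weights would multiply each sample's loss term.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by total / (num_classes * count)."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}

labels = [0] * 90 + [1] * 10  # hypothetical imbalanced dataset
weights = inverse_frequency_weights(labels)
print({cls: round(w, 3) for cls, w in weights.items()})
# {0: 0.556, 1: 5.0}
```

With this normalization the weights average to 1 over the dataset, so the overall loss scale stays comparable to the unweighted case.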
Future Directions
- Foundation Capsule Models: Large pre-trained capsule networks
- 3D Vision Capsules: Better 3D structure understanding
- Video Capsules: Temporal capsule networks for video
- Multimodal Capsules: Combining vision, language, and audio
- Explainable Capsules: More interpretable representations
- Neuromorphic Capsules: Brain-inspired architectures
- Quantum Capsules: Capsules for quantum computing
- Energy-Efficient Capsules: Ultra-low power implementations
- Self-Supervised Capsules: Learning from unlabeled data
- Few-Shot Capsules: Learning from few examples
- Adversarial Capsules: Robust capsule networks
- Theoretical Breakthroughs: Better understanding of capsules
- Real-Time Capsules: Faster inference for edge devices
External Resources
- Original CapsNet Paper (Sabour et al.)
- Matrix Capsules Paper (Hinton et al.)
- Capsule Networks Survey (Kosiorek et al.)
- CapsNet Implementation (GitHub)
- Dynamic Routing Tutorial
- Capsule Networks Explained (YouTube)
- CapsNet for Object Detection (arXiv)
- 3D Capsule Networks (arXiv)
- Capsules for Medical Imaging (arXiv)
- Efficient Capsule Networks (arXiv)
- Capsules for NLP (arXiv)
- Capsule Networks Hardware (arXiv)
- Capsules vs CNNs (arXiv)
- Capsule Networks Datasets