Multi-Head Attention
What is Multi-Head Attention?
Multi-head attention extends the basic attention mechanism by running multiple attention heads in parallel, each capturing different relationships and patterns in the input data. Instead of computing a single attention function, multi-head attention computes several attention functions in parallel, allowing the model to jointly attend to information from different representation subspaces at different positions.
Key Characteristics
- Parallel Attention Heads: Multiple attention mechanisms working simultaneously
- Diverse Representations: Captures different types of relationships
- Efficient Computation: Parallel processing of attention heads
- Feature Learning: Learns rich representations through multiple perspectives
- Scalability: Works well with large models and datasets
- Interpretability: Provides multiple attention patterns
- Modularity: Can be easily integrated into various architectures
How Multi-Head Attention Works
Basic Architecture
- Input Projection: Project input to multiple query, key, value spaces
- Parallel Attention: Compute attention for each head independently
- Concatenation: Combine results from all attention heads
- Output Projection: Project concatenated results to final output space
Multi-Head Attention Diagram
Input → Linear Projections → Multiple Attention Heads → Concatenation → Output Projection → Output
Mathematical Foundations
Single Head Attention
For a single attention head: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
Multi-Head Attention
For $h$ attention heads: $$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O $$
where each head is computed as: $$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$
with projection matrices: $$ W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}} $$
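To make the shapes concrete, the following is a minimal NumPy sketch of these formulas for self-attention with $d_{\text{model}} = 512$, $h = 8$, and $d_k = d_v = 64$ (random weights, purely a shape check):

import numpy as np

d_model, h, seq_len = 512, 8, 10
d_k = d_v = d_model // h

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len)
    return softmax(scores) @ V        # (seq_len, d_v)

X = np.random.randn(seq_len, d_model)     # one sequence (no batch dimension)
W_Q = np.random.randn(h, d_model, d_k)
W_K = np.random.randn(h, d_model, d_k)
W_V = np.random.randn(h, d_model, d_v)
W_O = np.random.randn(h * d_v, d_model)

heads = [attention(X @ W_Q[i], X @ W_K[i], X @ W_V[i]) for i in range(h)]
output = np.concatenate(heads, axis=-1) @ W_O
print(output.shape)  # (10, 512)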
Multi-Head Attention Implementation
Basic Implementation
import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense

class MultiHeadAttention(Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.depth = d_model // num_heads
        # Linear projections
        self.wq = Dense(d_model)
        self.wk = Dense(d_model)
        self.wv = Dense(d_model)
        self.wo = Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth)"""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]
        # Linear projections
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)
        # Split into multiple heads
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)
        # Scaled dot-product attention
        scaled_attention, attention_weights = self.scaled_dot_product_attention(q, k, v, mask)
        # Concatenate heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)
        # Final linear projection
        output = self.wo(concat_attention)  # (batch_size, seq_len_q, d_model)
        return output, attention_weights

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        """Calculate scaled dot-product attention"""
        matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)
        # Scale matmul_qk
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)
        # Add mask if provided
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)
        # Softmax
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)
        # Multiply by values
        output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)
        return output, attention_weights
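A quick shape check of the custom layer on random inputs (q, k, and v are the same tensor, i.e. self-attention):

mha_layer = MultiHeadAttention(d_model=256, num_heads=8)
x = tf.random.normal((2, 10, 256))       # (batch_size, seq_len, d_model)
out, weights = mha_layer(x, x, x)
print(out.shape)      # (2, 10, 256)
print(weights.shape)  # (2, 8, 10, 10): one attention map per head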
Keras Implementation
from tensorflow.keras.layers import MultiHeadAttention

# Simple multi-head attention layer (note: this built-in layer shadows the
# custom MultiHeadAttention class defined above)
mha = MultiHeadAttention(
    num_heads=8,
    key_dim=64,
    value_dim=64,
    output_shape=256,
    dropout=0.1
)

# Usage in a model (self-attention; the built-in layer only returns the
# attention scores when return_attention_scores=True)
inputs = tf.keras.Input(shape=(10, 256))
attention_output, attention_weights = mha(
    inputs, inputs, inputs, return_attention_scores=True
)
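The built-in layer also accepts masks, which are needed for padding and for autoregressive (causal) attention. A minimal sketch, assuming TensorFlow 2.10+ where use_causal_mask is available:

# Eager usage with concrete tensors
x = tf.random.normal((2, 10, 256))
padding_mask = tf.ones((2, 10, 10), dtype=tf.bool)   # (batch, seq_len_q, seq_len_k); True = may attend
masked_output = mha(x, x, x, attention_mask=padding_mask)
causal_output = mha(x, x, x, use_causal_mask=True)   # each position attends only to earlier positions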
Multi-Head Attention in Transformers
Transformer Architecture
from tensorflow.keras.layers import LayerNormalization, Dropout

class TransformerBlock(Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        # Built-in Keras layer; key_dim is the per-head dimension
        self.mha = MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            Dense(dff, activation='relu'),
            Dense(d_model)
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(dropout_rate)
        self.dropout2 = Dropout(dropout_rate)

    def call(self, x, training=False, mask=None):
        # Multi-head self-attention (the built-in layer returns a single tensor)
        attn_output = self.mha(x, x, x, attention_mask=mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)
        # Feed-forward network
        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)
        return out2
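A quick smoke test of the block on random inputs:

block = TransformerBlock(d_model=256, num_heads=8, dff=1024)
x = tf.random.normal((2, 10, 256))  # (batch_size, seq_len, d_model)
y = block(x, training=False)
print(y.shape)  # (2, 10, 256): same shape as the input, as required by the residual connections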
Complete Transformer Model
from tensorflow.keras import Model

class Transformer(Model):
    """Simplified encoder-decoder stack built from TransformerBlock.

    Note: this sketch omits positional encodings, causal masking, and the
    encoder-decoder cross-attention used in the full Transformer.
    """
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 target_vocab_size, dropout_rate=0.1):
        super(Transformer, self).__init__()
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Embedding(input_vocab_size, d_model),
            *[TransformerBlock(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)]
        ])
        self.decoder = tf.keras.Sequential([
            tf.keras.layers.Embedding(target_vocab_size, d_model),
            *[TransformerBlock(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)]
        ])
        self.final_layer = Dense(target_vocab_size)

    def call(self, inputs, training=False):
        inp, tar = inputs
        # Encoder (its output would feed decoder cross-attention in the full architecture)
        enc_output = self.encoder(inp, training=training)  # (batch_size, inp_seq_len, d_model)
        # Decoder
        dec_output = self.decoder(tar, training=training)  # (batch_size, tar_seq_len, d_model)
        # Final linear layer (logits over the target vocabulary)
        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)
        return final_output
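A quick shape check with dummy token ids (the hyperparameters are arbitrary):

model = Transformer(num_layers=2, d_model=128, num_heads=4, dff=512,
                    input_vocab_size=1000, target_vocab_size=1200)
inp = tf.random.uniform((2, 12), maxval=1000, dtype=tf.int32)  # source token ids
tar = tf.random.uniform((2, 15), maxval=1200, dtype=tf.int32)  # target token ids
logits = model((inp, tar), training=False)
print(logits.shape)  # (2, 15, 1200)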
Advantages of Multi-Head Attention
Comparison with Single-Head Attention
| Feature | Single-Head Attention | Multi-Head Attention |
|---|---|---|
| Representation | Single attention pattern | Multiple diverse attention patterns |
| Relationships | Captures one type of relationship | Captures multiple types of relationships |
| Performance | Good for simple tasks | Better for complex tasks |
| Interpretability | Single attention map | Multiple attention maps |
| Computational Cost | Lower | Higher (but parallelizable) |
| Model Capacity | Limited | Higher |
| Flexibility | Less flexible | More flexible |
Benefits of Multiple Heads
- Diverse Feature Learning: Each head can learn different aspects of the data
- Parallel Processing: Heads can be computed in parallel
- Specialization: Heads can specialize in different types of relationships
- Robustness: Multiple heads provide redundancy and robustness
- Interpretability: Multiple attention patterns reveal different insights
- Scalability: Works well with large models and datasets
Multi-Head Attention Applications
Natural Language Processing
Machine Translation
# Transformer for machine translation
transformer = Transformer(
    num_layers=6,
    d_model=512,
    num_heads=8,
    dff=2048,
    input_vocab_size=8500,
    target_vocab_size=8000,
    dropout_rate=0.1
)

# Training: the model outputs raw logits, so the loss must use from_logits=True
transformer.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
)
# train_dataset should yield ((source_ids, target_input_ids), target_label_ids) batches
transformer.fit(train_dataset, epochs=10)
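A minimal sketch of the expected dataset format, assuming src_ids and tgt_ids are already tokenized integer arrays (the random ids below are placeholders). Teacher forcing shifts the target sequence: the decoder input is the target without its last token, and the label is the target without its first token.

import numpy as np

src_ids = np.random.randint(1, 8500, size=(1000, 20))  # placeholder source token ids
tgt_ids = np.random.randint(1, 8000, size=(1000, 21))  # placeholder target token ids

train_dataset = tf.data.Dataset.from_tensor_slices(
    ((src_ids, tgt_ids[:, :-1]), tgt_ids[:, 1:])
).shuffle(1000).batch(64)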
Text Generation
# GPT-like model with multi-head attention
class GPT(tf.keras.Model):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_seq_length):
        super(GPT, self).__init__()
        self.token_embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.position_embedding = tf.keras.layers.Embedding(max_seq_length, d_model)
        self.blocks = [TransformerBlock(d_model, num_heads, d_model * 4) for _ in range(num_layers)]
        self.layernorm = LayerNormalization(epsilon=1e-6)
        self.dense = Dense(vocab_size)

    def call(self, inputs, training=False):
        seq_len = tf.shape(inputs)[1]
        positions = tf.range(seq_len)[tf.newaxis, :]
        x = self.token_embedding(inputs)
        x += self.position_embedding(positions)
        # Note: a full GPT also applies a causal mask so that each position can
        # only attend to earlier positions (see the helper below)
        for block in self.blocks:
            x = block(x, training=training)
        x = self.layernorm(x)
        return self.dense(x)  # (batch_size, seq_len, vocab_size) logits
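One way to add the missing causal masking is a small helper that builds a lower-triangular boolean mask; the usage in the comments assumes the TransformerBlock defined above, which forwards mask to the built-in layer's attention_mask argument:

def causal_mask(seq_len):
    # Lower-triangular matrix: position i may attend only to positions <= i
    mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
    return tf.cast(mask, tf.bool)

# Inside GPT.call, each block would then be invoked as:
#   mask = causal_mask(seq_len)[tf.newaxis, :, :]  # (1, seq_len, seq_len), broadcast over the batch
#   x = block(x, training=training, mask=mask)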
Computer Vision
Vision Transformer (ViT)
# Vision Transformer with multi-head attention
class VisionTransformer(tf.keras.Model):
    def __init__(self, image_size, patch_size, num_classes, d_model, num_heads, num_layers):
        super(VisionTransformer, self).__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: non-overlapping patches via a strided convolution
        self.patch_embedding = tf.keras.layers.Conv2D(d_model, kernel_size=patch_size, strides=patch_size)
        # Class token and position embedding
        self.class_token = self.add_weight(name="class_token", shape=(1, 1, d_model), initializer="zeros")
        self.position_embedding = tf.keras.layers.Embedding(num_patches + 1, d_model)
        # Transformer blocks
        self.blocks = [TransformerBlock(d_model, num_heads, d_model * 4) for _ in range(num_layers)]
        # Classification head
        self.mlp_head = tf.keras.Sequential([
            LayerNormalization(epsilon=1e-6),
            Dense(d_model, activation='tanh'),
            Dense(num_classes)
        ])

    def call(self, inputs, training=False):
        batch_size = tf.shape(inputs)[0]
        # Patch embedding
        x = self.patch_embedding(inputs)  # (batch_size, num_patches_h, num_patches_w, d_model)
        x = tf.reshape(x, (batch_size, -1, x.shape[-1]))  # (batch_size, num_patches, d_model)
        # Prepend the learnable class token
        class_token = tf.broadcast_to(self.class_token, (batch_size, 1, x.shape[-1]))
        x = tf.concat([class_token, x], axis=1)
        # Add position embedding
        positions = tf.range(tf.shape(x)[1])[tf.newaxis, :]
        x += self.position_embedding(positions)
        # Transformer blocks
        for block in self.blocks:
            x = block(x, training=training)
        # Classification from the class-token representation
        class_token = x[:, 0]
        return self.mlp_head(class_token)
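A dummy forward pass to confirm that the patching and output shapes line up (the hyperparameters are arbitrary):

vit = VisionTransformer(image_size=224, patch_size=16, num_classes=10,
                        d_model=256, num_heads=8, num_layers=4)
images = tf.random.normal((2, 224, 224, 3))
logits = vit(images, training=False)
print(logits.shape)  # (2, 10)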
Image Captioning
# Image captioning with multi-head attention
# (a simplified prefix-conditioned decoder: the pooled image feature is
# prepended to the caption embeddings as an extra sequence position)
class ImageCaptioningModel(tf.keras.Model):
    def __init__(self, vocab_size, d_model, num_heads, num_layers):
        super(ImageCaptioningModel, self).__init__()
        # Image encoder (CNN backbone with global average pooling)
        self.cnn = tf.keras.applications.EfficientNetB0(include_top=False, pooling='avg')
        self.image_projection = Dense(d_model)  # project CNN features to d_model
        # Text decoder built from TransformerBlock
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.blocks = [TransformerBlock(d_model, num_heads, d_model * 4) for _ in range(num_layers)]
        # Output projection
        self.dense = Dense(vocab_size)

    def call(self, inputs, training=False):
        images, captions = inputs
        # Encode images and project to the model dimension
        image_features = self.image_projection(self.cnn(images))  # (batch_size, d_model)
        image_features = tf.expand_dims(image_features, axis=1)   # (batch_size, 1, d_model)
        # Embed captions
        caption_embeddings = self.embedding(captions)              # (batch_size, seq_len, d_model)
        # Prepend the image feature along the sequence axis
        x = tf.concat([image_features, caption_embeddings], axis=1)
        for block in self.blocks:
            x = block(x, training=training)
        # Output projection (drop the image prefix position)
        return self.dense(x[:, 1:, :])  # (batch_size, seq_len, vocab_size)
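A dummy forward pass (EfficientNetB0 expects RGB images, commonly 224x224; the token ids here are placeholders):

captioner = ImageCaptioningModel(vocab_size=5000, d_model=256, num_heads=8, num_layers=2)
images = tf.random.normal((2, 224, 224, 3))
captions = tf.random.uniform((2, 16), maxval=5000, dtype=tf.int32)
logits = captioner((images, captions), training=False)
print(logits.shape)  # (2, 16, 5000)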
Multimodal Learning
# Multimodal model with cross-modal multi-head attention
class MultimodalModel(tf.keras.Model):
    def __init__(self, text_vocab_size, image_size, d_model, num_heads, num_layers):
        super(MultimodalModel, self).__init__()
        # Text encoder: embedding followed by a stack of TransformerBlocks
        self.text_embedding = tf.keras.layers.Embedding(text_vocab_size, d_model)
        self.text_blocks = [TransformerBlock(d_model, num_heads, d_model * 4) for _ in range(num_layers)]
        # Image encoder (small CNN)
        self.image_encoder = tf.keras.Sequential([
            tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(*image_size, 3)),
            tf.keras.layers.MaxPooling2D((2, 2)),
            tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
            tf.keras.layers.MaxPooling2D((2, 2)),
            tf.keras.layers.Conv2D(d_model, (3, 3), activation='relu'),
            tf.keras.layers.GlobalAveragePooling2D()
        ])
        # Cross-modal attention: the image feature attends over the text sequence
        self.cross_attention = MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
        # Classification head
        self.classifier = tf.keras.Sequential([
            LayerNormalization(epsilon=1e-6),
            Dense(d_model, activation='relu'),
            Dense(1, activation='sigmoid')
        ])

    def call(self, inputs, training=False):
        text, image = inputs
        # Encode text
        text_features = self.text_embedding(text)                  # (batch_size, seq_len, d_model)
        for block in self.text_blocks:
            text_features = block(text_features, training=training)
        # Encode image
        image_features = self.image_encoder(image)                 # (batch_size, d_model)
        image_features = tf.expand_dims(image_features, axis=1)    # (batch_size, 1, d_model)
        # Cross-modal attention: query = image, key/value = text
        context = self.cross_attention(image_features, text_features, text_features)
        # Binary classification from the attended image token
        return self.classifier(context[:, 0, :])
Multi-Head Attention Visualization
Attention Head Visualization
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_attention_heads(attention_weights, layer_num, head_nums, input_tokens, output_tokens):
    """Visualize multiple attention heads from a multi-head attention layer"""
    num_heads = len(head_nums)
    fig, axes = plt.subplots(1, num_heads, figsize=(15, 5))
    for i, head_num in enumerate(head_nums):
        # Get attention weights for a specific head: rows = output (query) positions,
        # columns = input (key) positions
        head_weights = attention_weights[layer_num, head_num]
        # Plot heatmap
        sns.heatmap(head_weights,
                    xticklabels=input_tokens,
                    yticklabels=output_tokens,
                    ax=axes[i],
                    cmap='viridis')
        axes[i].set_title(f'Layer {layer_num}, Head {head_num}')
        axes[i].set_xlabel('Input Sequence')
        axes[i].set_ylabel('Output Sequence')
    plt.tight_layout()
    plt.show()

# Example usage with random weights (6 layers, 8 heads, 10x10 attention maps)
attention_weights = np.random.rand(6, 8, 10, 10)
input_tokens = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
output_tokens = ['Le', 'rapide', 'renard', 'brun', 'saute', 'par-dessus', 'le', 'chien', 'paresseux', '.']
visualize_attention_heads(attention_weights, 0, [0, 1, 2, 3], input_tokens, output_tokens)
Attention Head Diversity Analysis
def analyze_attention_diversity(attention_weights):
    """Analyze diversity across attention heads"""
    num_layers, num_heads, seq_len, _ = attention_weights.shape
    # Calculate pairwise similarity between heads
    similarities = np.zeros((num_layers, num_heads, num_heads))
    for layer in range(num_layers):
        for i in range(num_heads):
            for j in range(num_heads):
                # Cosine similarity between attention patterns
                vec_i = attention_weights[layer, i].flatten()
                vec_j = attention_weights[layer, j].flatten()
                similarities[layer, i, j] = np.dot(vec_i, vec_j) / (np.linalg.norm(vec_i) * np.linalg.norm(vec_j))
    # Plot similarity matrices
    fig, axes = plt.subplots(1, num_layers, figsize=(15, 5))
    for layer in range(num_layers):
        sns.heatmap(similarities[layer], ax=axes[layer], cmap='coolwarm', vmin=0, vmax=1)
        axes[layer].set_title(f'Layer {layer} Head Similarity')
        axes[layer].set_xlabel('Head')
        axes[layer].set_ylabel('Head')
    plt.tight_layout()
    plt.show()
    return similarities
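Applied to the random attention weights from the visualization example above:

similarities = analyze_attention_diversity(attention_weights)
print(similarities.shape)  # (6, 8, 8): one head-by-head similarity matrix per layer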
Multi-Head Attention Research
Key Papers
- "Attention Is All You Need" (Vaswani et al., 2017)
- Introduced Transformer architecture with multi-head attention
- Demonstrated state-of-the-art performance in machine translation
- Foundation for modern attention-based models
- "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2019)
- Applied multi-head attention to large-scale pre-training
- Demonstrated effectiveness of bidirectional attention
- Foundation for modern language models
- "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al., 2021)
- Applied multi-head attention to computer vision
- Introduced Vision Transformer (ViT)
- Demonstrated competitive performance with CNNs
- "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (Raffel et al., 2020)
- Demonstrated effectiveness of multi-head attention across diverse NLP tasks
- Introduced T5 model architecture
- Comprehensive evaluation of transfer learning
- "Language Models are Few-Shot Learners" (Brown et al., 2020)
- Scaled multi-head attention to very large models
- Demonstrated few-shot learning capabilities
- Introduced GPT-3 architecture
Multi-Head Attention Best Practices
Implementation Guidelines
| Aspect | Recommendation | Notes |
|---|---|---|
| Number of Heads | 4-16 heads typically | More heads for larger models |
| Head Dimension | d_model / num_heads should be integer | Typically 64-128 for each head |
| Initialization | Proper weight initialization | Critical for stable training |
| Dropout | Apply dropout to attention weights | Prevents overfitting |
| Normalization | Layer normalization | Stabilizes training |
| Residuals | Use residual connections | Helps with gradient flow |
| Position Encoding | Use positional encodings | Essential for sequence order |
Training Considerations
- Memory Usage: Multi-head attention can be memory-intensive
- Computational Cost: $O(n^2)$ complexity for sequence length $n$
- Batch Size: Larger batches can improve efficiency
- Mixed Precision: Use FP16/FP32 mixed precision for acceleration
- Gradient Clipping: Helps with training stability
- Learning Rate: Typically requires careful tuning
- Warmup: Learning rate warmup can improve convergence (see the sketch below)
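The sketch below shows how some of these considerations map onto TensorFlow settings; the learning-rate schedule follows the inverse-square-root warmup form used in the original Transformer paper, and the constants are illustrative:

# Mixed precision for faster attention on recent GPUs
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Inverse-square-root learning-rate schedule with warmup
class WarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5)

# Adam with gradient clipping, as is common for Transformer training
optimizer = tf.keras.optimizers.Adam(
    learning_rate=WarmupSchedule(d_model=512),
    beta_1=0.9, beta_2=0.98, epsilon=1e-9,
    clipnorm=1.0
)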
Optimization Techniques
# Efficient multi-head attention implementations
from tensorflow.keras.layers import MultiHeadAttention

# Standard implementation
mha = MultiHeadAttention(
    num_heads=8,
    key_dim=64,
    value_dim=64,
    output_shape=512,
    dropout=0.1
)

# Memory-efficient implementation for long sequences
# (Conceptual - actual implementation may vary)
def memory_efficient_mha(query, key, value):
    # Use approximate nearest neighbor or other techniques
    # to reduce memory usage for long sequences
    pass

# Flash attention implementation (conceptual placeholder)
def flash_attention(query, key, value):
    # Optimized attention implementation with better
    # memory access patterns and reduced memory usage
    pass
Future Directions
- Sparse Multi-Head Attention: Reducing computational complexity for long sequences
- Adaptive Multi-Head Attention: Dynamically adjusting number of heads
- Efficient Implementations: Optimized attention algorithms (e.g., Flash Attention)
- Multi-Modal Attention: Cross-modal multi-head attention
- Neuromorphic Attention: Biologically-inspired multi-head attention
- Quantum Attention: Multi-head attention for quantum computing
- Explainable Attention: More interpretable multi-head attention
- Hierarchical Attention: Multi-level multi-head attention
External Resources
- Attention Is All You Need (arXiv)
- The Illustrated Transformer (Blog)
- Multi-Head Attention Explained (Towards Data Science)
- Transformer Architecture Explained (YouTube)
- Multi-Head Attention in PyTorch (PyTorch Docs)
- Efficient Transformers: A Survey (arXiv)
- FlashAttention: Fast and Memory-Efficient Exact Attention (arXiv)
- Multi-Head Attention in TensorFlow (TensorFlow Docs)