Multi-Head Attention

Advanced attention mechanism that uses multiple parallel attention heads to capture diverse relationships in data.

What is Multi-Head Attention?

Multi-Head Attention extends the basic attention mechanism by running multiple attention heads in parallel to capture diverse relationships and patterns in the input data. Instead of computing a single attention function, multi-head attention computes several attention functions in parallel, allowing the model to jointly attend to information from different representation subspaces at different positions.

Key Characteristics

  • Parallel Attention Heads: Multiple attention mechanisms working simultaneously
  • Diverse Representations: Captures different types of relationships
  • Efficient Computation: Parallel processing of attention heads
  • Feature Learning: Learns rich representations through multiple perspectives
  • Scalability: Works well with large models and datasets
  • Interpretability: Provides multiple attention patterns
  • Modularity: Can be easily integrated into various architectures

How Multi-Head Attention Works

Basic Architecture

  1. Input Projection: Project input to multiple query, key, value spaces
  2. Parallel Attention: Compute attention for each head independently
  3. Concatenation: Combine results from all attention heads
  4. Output Projection: Project concatenated results to final output space

Multi-Head Attention Diagram

Input → Linear Projections → Multiple Attention Heads → Concatenation → Output Projection → Output
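
To make the shape bookkeeping concrete, the sketch below traces a tensor through these four steps. The sizes (batch of 2, sequence length 10, d_model = 512, 8 heads) are illustrative assumptions, not values tied to any particular model:

import tensorflow as tf

batch_size, seq_len, d_model, num_heads = 2, 10, 512, 8
depth = d_model // num_heads  # 64 dimensions per head

x = tf.random.normal((batch_size, seq_len, d_model))

# 1. Input projection (one Dense layer each for Q, K, V in practice)
q = tf.keras.layers.Dense(d_model)(x)                 # (2, 10, 512)

# 2. Split into heads so attention runs in parallel per head
q = tf.reshape(q, (batch_size, seq_len, num_heads, depth))
q = tf.transpose(q, perm=[0, 2, 1, 3])                # (2, 8, 10, 64)

# ... scaled dot-product attention is applied independently to each head here ...

# 3. Concatenate the heads back together
q = tf.transpose(q, perm=[0, 2, 1, 3])
q = tf.reshape(q, (batch_size, seq_len, d_model))     # (2, 10, 512)

# 4. Output projection back to the model dimension
output = tf.keras.layers.Dense(d_model)(q)            # (2, 10, 512)
print(output.shape)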

Mathematical Foundations

Single Head Attention

For a single attention head: $$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Multi-Head Attention

For $h$ attention heads: $$ \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O $$

where each head is computed as: $$ \text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) $$

with projection matrices: $$ W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}, W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}, W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}, W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}} $$
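
As a concrete example, with the dimensions used in the original Transformer ($d_{\text{model}} = 512$, $h = 8$), each head works in a subspace of size $d_k = d_v = d_{\text{model}} / h = 64$, so $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{512 \times 64}$ and $W^O \in \mathbb{R}^{512 \times 512}$, since $h \cdot d_v = 8 \times 64 = 512$.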

Multi-Head Attention Implementation

Basic Implementation

import tensorflow as tf
from tensorflow.keras.layers import Layer, Dense

class MultiHeadAttention(Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_model = d_model
        self.depth = d_model // num_heads

        # Linear projections
        self.wq = Dense(d_model)
        self.wk = Dense(d_model)
        self.wv = Dense(d_model)
        self.wo = Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth)"""
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, q, k, v, mask=None):
        batch_size = tf.shape(q)[0]

        # Linear projections
        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        # Split into multiple heads
        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # Scaled dot-product attention
        scaled_attention, attention_weights = self.scaled_dot_product_attention(q, k, v, mask)

        # Concatenate heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        # Final linear projection
        output = self.wo(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights

    def scaled_dot_product_attention(self, q, k, v, mask=None):
        """Calculate scaled dot-product attention"""
        matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

        # Scale matmul_qk
        dk = tf.cast(tf.shape(k)[-1], tf.float32)
        scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

        # Add mask if provided
        if mask is not None:
            scaled_attention_logits += (mask * -1e9)

        # Softmax
        attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

        # Multiply by values
        output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

        return output, attention_weights
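
A quick sanity check of the layer above (the tensor sizes are arbitrary illustrative values):

# Instantiate the custom layer and run it on random inputs
mha = MultiHeadAttention(d_model=512, num_heads=8)
x = tf.random.normal((2, 10, 512))   # (batch_size, seq_len, d_model)

output, weights = mha(x, x, x)       # self-attention: query = key = value
print(output.shape)    # (2, 10, 512)
print(weights.shape)   # (2, 8, 10, 10) - one attention map per head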

Keras Implementation

import tensorflow as tf
from tensorflow.keras.layers import MultiHeadAttention

# Simple multi-head attention layer
mha = MultiHeadAttention(
    num_heads=8,
    key_dim=64,
    value_dim=64,
    output_shape=256,
    dropout=0.1
)

# Usage in a model (return_attention_scores=True is needed to get the weights back)
inputs = tf.keras.Input(shape=(10, 256))
attention_output, attention_weights = mha(
    inputs, inputs, inputs, return_attention_scores=True
)

Multi-Head Attention in Transformers

Transformer Architecture

from tensorflow.keras.layers import LayerNormalization, Dropout

class TransformerBlock(Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        # key_dim is the per-head dimension, so use d_model // num_heads
        self.mha = MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)
        self.ffn = tf.keras.Sequential([
            Dense(dff, activation='relu'),
            Dense(d_model)
        ])

        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)

        self.dropout1 = Dropout(dropout_rate)
        self.dropout2 = Dropout(dropout_rate)

    def call(self, x, training=False, mask=None):
        # Multi-head self-attention (the built-in layer returns a single tensor by default)
        attn_output = self.mha(query=x, value=x, key=x, attention_mask=mask)  # (batch_size, input_seq_len, d_model)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)  # (batch_size, input_seq_len, d_model)

        # Feed-forward network
        ffn_output = self.ffn(out1)  # (batch_size, input_seq_len, d_model)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)  # (batch_size, input_seq_len, d_model)

        return out2
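
A brief shape check of the block (sizes are illustrative):

# Run a single Transformer block on random inputs to confirm the output shape
block = TransformerBlock(d_model=512, num_heads=8, dff=2048)
x = tf.random.normal((2, 10, 512))
print(block(x, training=False).shape)  # (2, 10, 512)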

Complete Transformer Model

from tensorflow.keras import Model

class Transformer(Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 target_vocab_size, dropout_rate=0.1):
        super(Transformer, self).__init__()
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Embedding(input_vocab_size, d_model),
            *[TransformerBlock(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)]
        ])

        self.decoder = tf.keras.Sequential([
            tf.keras.layers.Embedding(target_vocab_size, d_model),
            *[TransformerBlock(d_model, num_heads, dff, dropout_rate) for _ in range(num_layers)]
        ])

        self.final_layer = Dense(target_vocab_size)

    def call(self, inputs, training=False):
        # Simplified encoder-decoder: a full Transformer decoder would additionally
        # apply a causal mask and cross-attend to enc_output; both are omitted here
        # to keep the focus on the multi-head attention blocks.
        inp, tar = inputs

        # Encoder
        enc_output = self.encoder(inp, training=training)  # (batch_size, inp_seq_len, d_model)

        # Decoder
        dec_output = self.decoder(tar, training=training)  # (batch_size, tar_seq_len, d_model)

        # Final linear layer
        final_output = self.final_layer(dec_output)  # (batch_size, tar_seq_len, target_vocab_size)

        return final_output

Advantages of Multi-Head Attention

Comparison with Single-Head Attention

| Feature | Single-Head Attention | Multi-Head Attention |
| --- | --- | --- |
| Representation | Single attention pattern | Multiple diverse attention patterns |
| Relationships | Captures one type of relationship | Captures multiple types of relationships |
| Performance | Good for simple tasks | Better for complex tasks |
| Interpretability | Single attention map | Multiple attention maps |
| Computational Cost | Lower | Higher (but parallelizable) |
| Model Capacity | Limited | Higher |
| Flexibility | Less flexible | More flexible |

Benefits of Multiple Heads

  1. Diverse Feature Learning: Each head can learn different aspects of the data
  2. Parallel Processing: Heads can be computed in parallel
  3. Specialization: Heads can specialize in different types of relationships
  4. Robustness: Multiple heads provide redundancy and robustness
  5. Interpretability: Multiple attention patterns reveal different insights
  6. Scalability: Works well with large models and datasets

Multi-Head Attention Applications

Natural Language Processing

Machine Translation

# Transformer for machine translation
transformer = Transformer(
    num_layers=6,
    d_model=512,
    num_heads=8,
    dff=2048,
    input_vocab_size=8500,
    target_vocab_size=8000,
    dropout_rate=0.1
)

# Training: the model outputs raw logits, so use from_logits=True.
# train_dataset is assumed to be a tf.data.Dataset yielding ((source, target), labels).
transformer.compile(
    optimizer='adam',
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
)
transformer.fit(train_dataset, epochs=10)

Text Generation

# GPT-like model with multi-head attention
class GPT(tf.keras.Model):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_seq_length):
        super(GPT, self).__init__()
        self.token_embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.position_embedding = tf.keras.layers.Embedding(max_seq_length, d_model)

        self.blocks = [TransformerBlock(d_model, num_heads, d_model*4) for _ in range(num_layers)]
        self.layernorm = LayerNormalization(epsilon=1e-6)
        self.dense = Dense(vocab_size)

    def call(self, inputs, training=False):
        seq_len = tf.shape(inputs)[1]
        positions = tf.range(seq_len)[tf.newaxis, :]

        x = self.token_embedding(inputs)
        x += self.position_embedding(positions)

        # Causal (look-ahead) mask so each position only attends to earlier positions
        causal_mask = tf.linalg.band_part(tf.ones((1, seq_len, seq_len)), -1, 0)

        for block in self.blocks:
            x = block(x, training=training, mask=causal_mask)

        x = self.layernorm(x)
        return self.dense(x)
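
As a usage sketch, a greedy decoding loop for the model above might look like the following; the model sizes, the assumed start token id, and the number of generated tokens are illustrative, and in practice the model would be trained before sampling:

# Greedy autoregressive generation with the GPT-like model above
gpt = GPT(vocab_size=1000, d_model=128, num_heads=4, num_layers=2, max_seq_length=64)

start_token_id = 1                         # assumed BOS token id
tokens = tf.constant([[start_token_id]])   # (batch_size=1, seq_len=1)

for _ in range(20):                        # generate up to 20 new tokens
    logits = gpt(tokens, training=False)   # (1, seq_len, vocab_size)
    next_token = tf.argmax(logits[:, -1, :], axis=-1, output_type=tf.int32)
    tokens = tf.concat([tokens, next_token[:, tf.newaxis]], axis=1)

print(tokens.numpy())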

Computer Vision

Vision Transformer (ViT)

# Vision Transformer with multi-head attention
class VisionTransformer(tf.keras.Model):
    def __init__(self, image_size, patch_size, num_classes, d_model, num_heads, num_layers):
        super(VisionTransformer, self).__init__()
        num_patches = (image_size // patch_size) ** 2

        # Patch embedding
        self.patch_embedding = tf.keras.layers.Conv2D(d_model, kernel_size=patch_size, strides=patch_size)

        # Class token and position embedding
        self.class_token = self.add_weight(name="class_token", shape=(1, 1, d_model))
        self.position_embedding = tf.keras.layers.Embedding(num_patches + 1, d_model)

        # Transformer blocks
        self.blocks = [TransformerBlock(d_model, num_heads, d_model*4) for _ in range(num_layers)]

        # Classification head
        self.mlp_head = tf.keras.Sequential([
            LayerNormalization(epsilon=1e-6),
            Dense(d_model, activation='tanh'),
            Dense(num_classes)
        ])

    def call(self, inputs, training=False):
        batch_size = tf.shape(inputs)[0]

        # Patch embedding
        x = self.patch_embedding(inputs)  # (batch_size, num_patches_h, num_patches_w, d_model)
        x = tf.reshape(x, (batch_size, -1, x.shape[-1]))  # (batch_size, num_patches, d_model)

        # Add class token
        class_token = tf.broadcast_to(self.class_token, (batch_size, 1, x.shape[-1]))
        x = tf.concat([class_token, x], axis=1)

        # Add position embedding (use the dynamic sequence length, num_patches + 1)
        positions = tf.range(tf.shape(x)[1])[tf.newaxis, :]
        x += self.position_embedding(positions)

        # Transformer blocks
        for block in self.blocks:
            x = block(x, training=training)

        # Classification from the class token representation
        class_token = x[:, 0]
        return self.mlp_head(class_token)
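
A quick instantiation check for the model above (all sizes are arbitrary illustrative values):

# Build a small ViT and run it on random images to confirm the output shape
vit = VisionTransformer(image_size=32, patch_size=4, num_classes=10,
                        d_model=64, num_heads=4, num_layers=2)
images = tf.random.normal((2, 32, 32, 3))
print(vit(images, training=False).shape)  # (2, 10)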

Image Captioning

# Image captioning with multi-head attention
class ImageCaptioningModel(tf.keras.Model):
    def __init__(self, vocab_size, d_model, num_heads, num_layers):
        super(ImageCaptioningModel, self).__init__()

        # Image encoder (CNN) plus a projection into the model dimension
        self.cnn = tf.keras.applications.EfficientNetB0(include_top=False, pooling='avg')
        self.image_projection = Dense(d_model)

        # Text decoder built from Transformer blocks
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.blocks = [TransformerBlock(d_model, num_heads, d_model * 4) for _ in range(num_layers)]

        # Output projection
        self.dense = Dense(vocab_size)

    def call(self, inputs, training=False):
        images, captions = inputs

        # Encode images and project to d_model
        image_features = self.image_projection(self.cnn(images))  # (batch_size, d_model)
        image_features = tf.expand_dims(image_features, axis=1)   # (batch_size, 1, d_model)

        # Embed captions
        caption_embeddings = self.embedding(captions)              # (batch_size, seq_len, d_model)

        # Prepend the image feature as a prefix token to the caption sequence
        x = tf.concat([image_features, caption_embeddings], axis=1)

        # Decoder blocks (a causal mask would be added for autoregressive training)
        for block in self.blocks:
            x = block(x, training=training)

        # Output projection over the caption positions (drop the image prefix)
        return self.dense(x[:, 1:, :])

Multimodal Learning

# Multimodal model with cross-modal multi-head attention
class MultimodalModel(tf.keras.Model):
    def __init__(self, text_vocab_size, image_size, d_model, num_heads, num_layers):
        super(MultimodalModel, self).__init__()

        # Text encoder: embedding followed by a stack of Transformer blocks
        self.text_embedding = tf.keras.layers.Embedding(text_vocab_size, d_model)
        self.text_blocks = [TransformerBlock(d_model, num_heads, d_model * 4) for _ in range(num_layers)]

        # Image encoder
        self.image_encoder = tf.keras.Sequential([
            tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(*image_size, 3)),
            tf.keras.layers.MaxPooling2D((2, 2)),
            tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
            tf.keras.layers.MaxPooling2D((2, 2)),
            tf.keras.layers.Conv2D(d_model, (3, 3), activation='relu'),
            tf.keras.layers.GlobalAveragePooling2D()
        ])

        # Cross-modal attention (key_dim is the per-head dimension)
        self.cross_attention = MultiHeadAttention(num_heads=num_heads, key_dim=d_model // num_heads)

        # Classification head
        self.classifier = tf.keras.Sequential([
            LayerNormalization(epsilon=1e-6),
            Dense(d_model, activation='relu'),
            Dense(1, activation='sigmoid')
        ])

    def call(self, inputs, training=False):
        text, image = inputs

        # Encode text
        text_features = self.text_embedding(text)                 # (batch_size, seq_len, d_model)
        for block in self.text_blocks:
            text_features = block(text_features, training=training)

        # Encode image
        image_features = self.image_encoder(image)                # (batch_size, d_model)
        image_features = tf.expand_dims(image_features, axis=1)   # (batch_size, 1, d_model)

        # Cross-modal attention: the image token queries the text sequence
        context = self.cross_attention(query=image_features, value=text_features, key=text_features)

        # Classification on the attended image representation
        return self.classifier(context[:, 0, :])

Multi-Head Attention Visualization

Attention Head Visualization

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def visualize_attention_heads(attention_weights, layer_num, head_nums, input_tokens, output_tokens):
    """Visualize multiple attention heads from a multi-head attention layer"""
    num_heads = len(head_nums)
    fig, axes = plt.subplots(1, num_heads, figsize=(15, 5))

    for i, head_num in enumerate(head_nums):
        # Get attention weights for specific head
        head_weights = attention_weights[layer_num, head_num]

        # Plot heatmap
        sns.heatmap(head_weights,
                   xticklabels=input_tokens,
                   yticklabels=output_tokens,
                   ax=axes[i],
                   cmap='viridis')
        axes[i].set_title(f'Layer {layer_num}, Head {head_num}')
        axes[i].set_xlabel('Input Sequence')
        axes[i].set_ylabel('Output Sequence')

    plt.tight_layout()
    plt.show()

# Example usage
attention_weights = np.random.rand(6, 8, 10, 10)  # 6 layers, 8 heads, 10x10 attention
input_tokens = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
output_tokens = ['Le', 'rapide', 'renard', 'brun', 'saute', 'par-dessus', 'le', 'chien', 'paresseux', '.']
visualize_attention_heads(attention_weights, 0, [0, 1, 2, 3], input_tokens, output_tokens)

Attention Head Diversity Analysis

def analyze_attention_diversity(attention_weights):
    """Analyze diversity across attention heads"""
    num_layers, num_heads, seq_len, _ = attention_weights.shape

    # Calculate pairwise similarity between heads
    similarities = np.zeros((num_layers, num_heads, num_heads))

    for layer in range(num_layers):
        for i in range(num_heads):
            for j in range(num_heads):
                # Cosine similarity between attention patterns
                vec_i = attention_weights[layer, i].flatten()
                vec_j = attention_weights[layer, j].flatten()
                similarities[layer, i, j] = np.dot(vec_i, vec_j) / (np.linalg.norm(vec_i) * np.linalg.norm(vec_j))

    # Plot similarity matrices
    fig, axes = plt.subplots(1, num_layers, figsize=(15, 5))
    for layer in range(num_layers):
        sns.heatmap(similarities[layer], ax=axes[layer], cmap='coolwarm', vmin=0, vmax=1)
        axes[layer].set_title(f'Layer {layer} Head Similarity')
        axes[layer].set_xlabel('Head')
        axes[layer].set_ylabel('Head')

    plt.tight_layout()
    plt.show()

    return similarities

Multi-Head Attention Research

Key Papers

  1. "Attention Is All You Need" (Vaswani et al., 2017)
    • Introduced Transformer architecture with multi-head attention
    • Demonstrated state-of-the-art performance in machine translation
    • Foundation for modern attention-based models
  2. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2019)
    • Applied multi-head attention to large-scale pre-training
    • Demonstrated effectiveness of bidirectional attention
    • Foundation for modern language models
  3. "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (Dosovitskiy et al., 2021)
    • Applied multi-head attention to computer vision
    • Introduced Vision Transformer (ViT)
    • Demonstrated competitive performance with CNNs
  4. "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (Raffel et al., 2020)
    • Demonstrated effectiveness of multi-head attention across diverse NLP tasks
    • Introduced T5 model architecture
    • Comprehensive evaluation of transfer learning
  5. "Language Models are Few-Shot Learners" (Brown et al., 2020)
    • Scaled multi-head attention to very large models
    • Demonstrated few-shot learning capabilities
    • Introduced GPT-3 architecture

Multi-Head Attention Best Practices

Implementation Guidelines

| Aspect | Recommendation | Notes |
| --- | --- | --- |
| Number of Heads | 4-16 heads typically | More heads for larger models |
| Head Dimension | d_model / num_heads should be an integer | Typically 64-128 for each head |
| Initialization | Proper weight initialization | Critical for stable training |
| Dropout | Apply dropout to attention weights | Prevents overfitting |
| Normalization | Layer normalization | Stabilizes training |
| Residuals | Use residual connections | Helps with gradient flow |
| Position Encoding | Use positional encodings | Essential for sequence order |

Training Considerations

  • Memory Usage: Multi-head attention can be memory-intensive
  • Computational Cost: $O(n^2)$ complexity for sequence length $n$
  • Batch Size: Larger batches can improve efficiency
  • Mixed Precision: Use FP16/FP32 mixed precision for acceleration (see the setup sketch after this list)
  • Gradient Clipping: Helps with training stability
  • Learning Rate: Typically requires careful tuning
  • Warmup: Learning rate warmup can improve convergence
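
A minimal sketch of how these settings can be wired together in TensorFlow. The warmup schedule follows the inverse-square-root schedule described in "Attention Is All You Need"; the clipnorm value is an illustrative assumption:

import tensorflow as tf

# Mixed precision: compute in float16, keep variables in float32
# (Keras applies loss scaling automatically when training with model.fit)
tf.keras.mixed_precision.set_global_policy('mixed_float16')

# Learning-rate warmup followed by inverse-square-root decay
class WarmupSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(
            tf.math.rsqrt(step), step * self.warmup_steps ** -1.5)

# Adam with gradient clipping by global norm (clipnorm=1.0 is an assumed value)
optimizer = tf.keras.optimizers.Adam(
    learning_rate=WarmupSchedule(d_model=512),
    beta_1=0.9, beta_2=0.98, epsilon=1e-9,
    clipnorm=1.0
)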

Optimization Techniques

# Efficient multi-head attention implementations
from tensorflow.keras.layers import MultiHeadAttention

# Standard implementation
mha = MultiHeadAttention(
    num_heads=8,
    key_dim=64,
    value_dim=64,
    output_shape=512,
    dropout=0.1
)

# Memory-lighter attention for long sequences: process queries in chunks so the
# full (seq_len_q, seq_len_k) score matrix is never materialized all at once.
# Kernel-level approaches such as FlashAttention go further by fusing these steps
# on the accelerator; this is only a high-level illustration of the idea.
def chunked_attention(query, key, value, chunk_size=128):
    # Assumes eager tensors with a statically known query length
    depth = tf.cast(tf.shape(key)[-1], query.dtype)
    outputs = []
    for start in range(0, int(query.shape[1]), chunk_size):
        q_chunk = query[:, start:start + chunk_size]              # (batch, chunk, depth)
        scores = tf.matmul(q_chunk, key, transpose_b=True) / tf.math.sqrt(depth)
        weights = tf.nn.softmax(scores, axis=-1)
        outputs.append(tf.matmul(weights, value))                 # (batch, chunk, depth_v)
    return tf.concat(outputs, axis=1)

Future Directions

  • Sparse Multi-Head Attention: Reducing computational complexity for long sequences
  • Adaptive Multi-Head Attention: Dynamically adjusting number of heads
  • Efficient Implementations: Optimized attention algorithms (e.g., Flash Attention)
  • Multi-Modal Attention: Cross-modal multi-head attention
  • Neuromorphic Attention: Biologically-inspired multi-head attention
  • Quantum Attention: Multi-head attention for quantum computing
  • Explainable Attention: More interpretable multi-head attention
  • Hierarchical Attention: Multi-level multi-head attention
