Positional Encoding
What is Positional Encoding?
Positional encoding is a technique used to inject information about the relative or absolute position of tokens in a sequence into attention-based models that lack inherent sequential processing capabilities. Unlike recurrent neural networks (RNNs) that process sequences sequentially, transformer models and other attention-based architectures process all tokens in parallel, making it necessary to explicitly provide positional information to maintain sequence order awareness.
Key Characteristics
- Sequence Order Awareness: Provides information about token positions
- Differentiable: Fully compatible with backpropagation
- Fixed or Learned: Can be pre-defined or learned during training
- Scalable: Works with sequences of varying lengths
- Model-Agnostic: Can be used with any attention-based architecture
- Efficient: Computationally inexpensive to implement
- Interpretable: Provides insights into positional relationships
Why Positional Encoding Matters
Limitations of Attention Mechanisms
Attention mechanisms and multi-head attention are permutation-equivariant: they treat input sequences as unordered sets, so reordering the tokens merely reorders the outputs (the short sketch after this list demonstrates this). Without positional information:
- Order Information Lost: "cat chases mouse" becomes indistinguishable from "mouse chases cat"
- Temporal Relationships Ignored: Sequence order and timing information is lost
- Contextual Understanding Limited: Models cannot distinguish between different positions
- Performance Degradation: Tasks requiring sequence order perform poorly
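This permutation equivariance is easy to check directly. The following minimal sketch (not from the original text; it uses tf.keras.layers.MultiHeadAttention purely for illustration, with arbitrary shapes and seed) shows that permuting the input tokens of self-attention only permutes its output, so without positional encoding the model sees no difference between the two orderings.

import numpy as np
import tensorflow as tf

# Self-attention with no positional encoding: permuting the input permutes the output.
tf.random.set_seed(0)
mha = tf.keras.layers.MultiHeadAttention(num_heads=2, key_dim=8)

x = tf.random.normal((1, 3, 16))       # (batch, seq_len=3, d_model=16)
perm = [2, 0, 1]                       # an arbitrary reordering of the tokens
x_perm = tf.gather(x, perm, axis=1)

out = mha(x, x)                        # self-attention on the original order
out_perm = mha(x_perm, x_perm)         # self-attention on the permuted order

# The permuted output equals the original output with the same permutation applied,
# i.e. attention alone carries no information about token order.
print(np.allclose(tf.gather(out, perm, axis=1).numpy(), out_perm.numpy(), atol=1e-5))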
Benefits of Positional Encoding
| Benefit | Description |
|---|---|
| Order Preservation | Maintains sequence order information |
| Temporal Awareness | Enables understanding of time/position |
| Contextual Understanding | Improves comprehension of sequential data |
| Model Performance | Enhances performance on sequence tasks |
| Flexibility | Works with variable-length sequences |
| Interpretability | Provides insights into positional patterns |
Types of Positional Encoding
Fixed Positional Encoding
Sinusoidal Positional Encoding
The original positional encoding introduced in the Transformer paper uses sinusoidal functions:
$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$ $$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) $$
where:
- $pos$ is the position in the sequence
- $i$ is the dimension index ($0 \le i < d_{\text{model}}/2$)
- $d_{\text{model}}$ is the embedding dimension
import numpy as np
import tensorflow as tf

def sinusoidal_positional_encoding(max_length, d_model):
    """Generate sinusoidal positional encoding"""
    position = np.arange(max_length)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((max_length, d_model))
    pe[:, 0::2] = np.sin(position * div_term)
    pe[:, 1::2] = np.cos(position * div_term)
    return tf.constant(pe, dtype=tf.float32)

# Usage
max_length = 100
d_model = 512
pos_encoding = sinusoidal_positional_encoding(max_length, d_model)
Visualization of Sinusoidal Encoding
import matplotlib.pyplot as plt

def plot_positional_encoding(pe, title="Sinusoidal Positional Encoding"):
    """Visualize positional encoding patterns"""
    plt.figure(figsize=(12, 8))
    plt.pcolormesh(pe.numpy().T, cmap='viridis')
    plt.xlabel('Position in Sequence')
    plt.ylabel('Embedding Dimension')
    plt.title(title)
    plt.colorbar()
    plt.show()

# Plot sinusoidal positional encoding
plot_positional_encoding(pos_encoding)
Learned Positional Encoding
Instead of using fixed functions, positional encodings can be learned during training:
from tensorflow.keras.layers import Embedding

class LearnedPositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, max_length, d_model):
        super(LearnedPositionalEncoding, self).__init__()
        self.position_embedding = Embedding(
            input_dim=max_length,
            output_dim=d_model
        )

    def call(self, inputs):
        """Add positional encoding to inputs"""
        seq_length = tf.shape(inputs)[1]
        positions = tf.range(seq_length)[tf.newaxis, :]
        return inputs + self.position_embedding(positions)

# Usage
max_length = 100
d_model = 512
pos_encoding_layer = LearnedPositionalEncoding(max_length, d_model)
Relative Positional Encoding
Encodes relative distances between tokens rather than absolute positions:
class RelativePositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, max_relative_position, d_model):
        super(RelativePositionalEncoding, self).__init__()
        self.max_relative_position = max_relative_position
        self.d_model = d_model
        self.embedding = tf.keras.layers.Embedding(
            input_dim=2 * max_relative_position + 1,
            output_dim=d_model
        )

    def call(self, inputs):
        """Compute relative position embeddings for the input sequence.

        Returns a (seq_length, seq_length, d_model) tensor of pairwise relative
        position embeddings. These are meant to be folded into the attention
        computation (e.g. added to the attention logits), not added directly to
        the token embeddings, whose shape they do not match.
        """
        seq_length = tf.shape(inputs)[1]
        # Relative position matrix: distance_mat[i, j] = j - i
        range_vec = tf.range(seq_length)
        distance_mat = range_vec[tf.newaxis, :] - range_vec[:, tf.newaxis]
        # Clip distances to the maximum relative position
        distance_mat_clipped = tf.clip_by_value(
            distance_mat,
            -self.max_relative_position,
            self.max_relative_position
        )
        # Shift values so they are valid (non-negative) embedding indices
        final_mat = distance_mat_clipped + self.max_relative_position
        # Look up one embedding per (query position, key position) pair
        return self.embedding(final_mat)
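How these pairwise embeddings enter the model is up to the surrounding attention implementation; one common pattern is to add a query-dependent relative term to the attention logits. A minimal sketch under that assumption (all shapes, names, and the d_model=64 value below are chosen purely for illustration):

import tensorflow as tf

# Sketch: fold the pairwise relative embeddings into the attention logits.
rel_layer = RelativePositionalEncoding(max_relative_position=16, d_model=64)

def attention_logits_with_relative_positions(q, k):
    """q, k: (batch, seq_len, d_model) queries and keys."""
    rel = rel_layer(q)                                   # (seq_len, seq_len, d_model)
    content_logits = tf.matmul(q, k, transpose_b=True)   # (batch, seq_len, seq_len)
    # For each query position i and key position j, dot the query with rel[i, j, :]
    relative_logits = tf.einsum('bid,ijd->bij', q, rel)
    return (content_logits + relative_logits) / tf.math.sqrt(64.0)

q = tf.random.normal((2, 10, 64))
k = tf.random.normal((2, 10, 64))
print(attention_logits_with_relative_positions(q, k).shape)   # (2, 10, 10)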
Hybrid Positional Encoding
Combines multiple positional encoding techniques:
class HybridPositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, max_length, d_model):
        super(HybridPositionalEncoding, self).__init__()
        self.sinusoidal_pe = sinusoidal_positional_encoding(max_length, d_model)
        self.learned_pe = LearnedPositionalEncoding(max_length, d_model)

    def call(self, inputs):
        """Combine sinusoidal and learned positional encoding"""
        seq_length = tf.shape(inputs)[1]
        # Sinusoidal encoding with a batch dimension for broadcasting
        sinusoidal = self.sinusoidal_pe[tf.newaxis, :seq_length, :]
        # Learned position embeddings only (not inputs plus embeddings, which
        # would add the inputs twice)
        positions = tf.range(seq_length)[tf.newaxis, :]
        learned = self.learned_pe.position_embedding(positions)
        # Combine (the 0.5/0.5 weights are arbitrary and could themselves be learned)
        return inputs + 0.5 * sinusoidal + 0.5 * learned
Positional Encoding in Transformers
Standard Transformer Implementation
from tensorflow.keras.layers import LayerNormalization, Dropout, Dense
class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, dropout_rate=0.1):
        super(TransformerBlock, self).__init__()
        self.mha = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=d_model // num_heads
        )
        self.ffn = tf.keras.Sequential([
            Dense(dff, activation='relu'),
            Dense(d_model)
        ])
        self.layernorm1 = LayerNormalization(epsilon=1e-6)
        self.layernorm2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(dropout_rate)
        self.dropout2 = Dropout(dropout_rate)

    def call(self, x, training=False, mask=None):
        # Positional encoding is added to the token embeddings once, before the
        # first block (see the models below), rather than inside every block.
        # Multi-head attention (Keras MHA returns a single tensor by default)
        attn_output = self.mha(x, x, x, attention_mask=mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)
        # Feed-forward network
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)
        return out2
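A quick shape check, with dummy values chosen here purely for illustration: the positional encoding is added to the embeddings once before the block, and the block preserves the (batch, seq_len, d_model) shape.

# Dummy token IDs -> embeddings -> add positional encoding -> one block.
d_model, vocab_size = 512, 1000
block = TransformerBlock(d_model=d_model, num_heads=8, dff=2048)
embed = tf.keras.layers.Embedding(vocab_size, d_model)
pe = sinusoidal_positional_encoding(100, d_model)

tokens = tf.random.uniform((2, 20), maxval=vocab_size, dtype=tf.int32)  # (batch, seq_len)
x = embed(tokens) + pe[:20, :]          # positional encoding broadcasts over the batch
print(block(x).shape)                   # (2, 20, 512)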
Complete Transformer Model with Positional Encoding
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff, input_vocab_size,
                 target_vocab_size, max_length, dropout_rate=0.1):
        super(Transformer, self).__init__()
        # Embedding layers
        self.token_embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = sinusoidal_positional_encoding(max_length, d_model)
        # Encoder
        self.encoder_layers = [
            TransformerBlock(d_model, num_heads, dff, dropout_rate)
            for _ in range(num_layers)
        ]
        # Decoder (simplified: no cross-attention to the encoder output and no
        # causal mask, to keep the focus on positional encoding)
        self.decoder_embedding = tf.keras.layers.Embedding(target_vocab_size, d_model)
        self.decoder_layers = [
            TransformerBlock(d_model, num_heads, dff, dropout_rate)
            for _ in range(num_layers)
        ]
        # Final layer
        self.final_layer = Dense(target_vocab_size)

    def call(self, inputs, training=False):
        inp, tar = inputs
        # Encoder: embed tokens, then add positional encoding once
        enc_output = self.token_embedding(inp)
        enc_output += self.pos_encoding[:tf.shape(inp)[1], :]
        for layer in self.encoder_layers:
            enc_output = layer(enc_output, training)
        # Decoder: the same positional encoding table is reused for target positions
        dec_output = self.decoder_embedding(tar)
        dec_output += self.pos_encoding[:tf.shape(tar)[1], :]
        for layer in self.decoder_layers:
            dec_output = layer(dec_output, training)
        # Final output
        return self.final_layer(dec_output)
Positional Encoding Variants
Rotary Positional Embedding (RoPE)
Rotary positional embedding rotates pairs of feature dimensions by position-dependent angles; in practice it is applied to the query and key vectors inside attention rather than to the token embeddings:
class RotaryPositionalEmbedding(tf.keras.layers.Layer):
    def __init__(self, d_model):
        super(RotaryPositionalEmbedding, self).__init__()
        self.d_model = d_model
        self.theta = 10000.0

    def build(self, input_shape):
        # Inverse frequencies, one per pair of dimensions
        self.inv_freq = 1.0 / (
            self.theta ** (tf.range(0, self.d_model, 2, dtype=tf.float32) / self.d_model)
        )

    def call(self, inputs, position=None):
        """Apply rotary positional embedding"""
        seq_len = tf.shape(inputs)[1]
        if position is None:
            position = tf.range(seq_len, dtype=tf.float32)   # (seq_len,)
        # Rotation angles: (seq_len, d_model / 2)
        sinusoid_inp = tf.einsum("i,j->ij", position, self.inv_freq)
        sin = tf.sin(sinusoid_inp)
        cos = tf.cos(sinusoid_inp)
        # Split features into two halves (a common RoPE variant; the original
        # formulation rotates adjacent dimension pairs instead)
        x1, x2 = tf.split(inputs, 2, axis=-1)
        # Apply the position-dependent rotation
        rotated = tf.concat([
            x1 * cos - x2 * sin,
            x2 * cos + x1 * sin
        ], axis=-1)
        return rotated
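RoPE is normally applied to the query and key projections inside attention, because the rotation makes the dot product between a rotated query and a rotated key depend only on their relative offset. A minimal usage sketch (shapes, names, and the d_model=64 value are chosen here for illustration):

rope = RotaryPositionalEmbedding(d_model=64)

q = tf.random.normal((2, 10, 64))   # (batch, seq_len, head_dim) queries
k = tf.random.normal((2, 10, 64))   # keys

# Rotate queries and keys, then compute attention logits as usual
q_rot, k_rot = rope(q), rope(k)
logits = tf.matmul(q_rot, k_rot, transpose_b=True) / tf.math.sqrt(64.0)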
ALiBi (Attention with Linear Biases)
ALiBi adds linear biases to attention scores based on relative positions:
import math

class ALiBiPositionalBias(tf.keras.layers.Layer):
    def __init__(self, num_heads, max_length):
        super(ALiBiPositionalBias, self).__init__()
        self.num_heads = num_heads
        self.max_length = max_length
        # One slope per head (a geometric sequence)
        self.slopes = tf.constant(self._get_slopes(num_heads), dtype=tf.float32)

    def _get_slopes(self, n):
        """Get geometric sequence of slopes (plain Python math, not TF ops)"""
        def get_slopes_power_of_2(n):
            start = 2 ** (-(2 ** -(math.log2(n) - 3)))
            ratio = start
            return [start * ratio ** i for i in range(n)]
        if math.log2(n).is_integer():
            return get_slopes_power_of_2(n)
        closest_power_of_2 = 2 ** math.floor(math.log2(n))
        return (get_slopes_power_of_2(closest_power_of_2) +
                self._get_slopes(2 * closest_power_of_2)[0::2][:n - closest_power_of_2])

    def call(self, attention_scores):
        """Add ALiBi biases to attention scores of shape (batch, heads, seq, seq)"""
        seq_len = tf.shape(attention_scores)[-1]
        # Pairwise position differences
        position = tf.range(seq_len, dtype=tf.float32)
        distance = position[tf.newaxis, :] - position[:, tf.newaxis]
        # One linear bias matrix per head: (num_heads, seq_len, seq_len)
        biases = -tf.abs(distance)[tf.newaxis, :, :] * self.slopes[:, tf.newaxis, tf.newaxis]
        # Add a batch dimension and add to the attention scores
        return attention_scores + biases[tf.newaxis, :, :, :]
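Note that ALiBi adds nothing to the token embeddings; the bias goes directly onto the attention logits. A minimal sketch of where the layer sits inside scaled dot-product attention (all shapes and values below are chosen for illustration):

alibi = ALiBiPositionalBias(num_heads=8, max_length=1024)

# q, k, v: (batch, num_heads, seq_len, head_dim)
q = tf.random.normal((2, 8, 16, 64))
k = tf.random.normal((2, 8, 16, 64))
v = tf.random.normal((2, 8, 16, 64))

scores = tf.matmul(q, k, transpose_b=True) / tf.math.sqrt(64.0)  # (2, 8, 16, 16)
scores = alibi(scores)                     # add per-head linear distance biases
weights = tf.nn.softmax(scores, axis=-1)
output = tf.matmul(weights, v)             # (2, 8, 16, 64)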
T5's Relative Positional Encoding
T5 uses bucketized relative positional encodings:
class T5RelativePositionalEncoding(tf.keras.layers.Layer):
    def __init__(self, num_buckets, max_distance, d_model):
        super(T5RelativePositionalEncoding, self).__init__()
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.embedding = tf.keras.layers.Embedding(num_buckets, d_model)

    def _relative_position_bucket(self, relative_position):
        """Map a relative position to a bucket index (simplified symmetric version;
        T5 itself uses separate buckets for left and right context)."""
        ret = tf.math.abs(relative_position)
        max_exact = self.num_buckets // 2
        is_small = ret < max_exact
        # Larger distances are bucketed logarithmically up to max_distance
        val_if_large = max_exact + tf.cast(
            tf.math.log(tf.cast(ret, tf.float32) / max_exact) /
            tf.math.log(self.max_distance / max_exact) *
            (self.num_buckets - max_exact),
            tf.int32
        )
        val_if_large = tf.minimum(val_if_large, self.num_buckets - 1)
        return tf.where(is_small, ret, val_if_large)

    def call(self, inputs):
        """Compute bucketized relative position embeddings.

        Returns a (seq_length, seq_length, d_model) tensor; T5 itself maps each
        bucket to one scalar bias per attention head and adds it to the attention
        logits rather than to the token embeddings.
        """
        seq_length = tf.shape(inputs)[1]
        # Relative position matrix
        range_vec = tf.range(seq_length)
        distance_mat = range_vec[tf.newaxis, :] - range_vec[:, tf.newaxis]
        # Compute buckets (the bucketing itself is not trained)
        buckets = tf.stop_gradient(self._relative_position_bucket(distance_mat))
        # Look up one embedding per (query position, key position) pair
        return self.embedding(buckets)
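To use this the way T5 does, the bucket representation has to end up as one scalar bias per head on the attention logits. The sketch below is only illustrative: the Dense(num_heads) projection and all names are assumptions, not T5's exact implementation (T5 embeds buckets directly into num_heads bias values).

num_heads = 8
t5_rel = T5RelativePositionalEncoding(num_buckets=32, max_distance=128, d_model=64)
to_head_bias = tf.keras.layers.Dense(num_heads)       # illustrative projection only

x = tf.random.normal((2, 16, 64))                     # (batch, seq_len, d_model)
rel = t5_rel(x)                                       # (seq_len, seq_len, d_model)
bias = tf.transpose(to_head_bias(rel), [2, 0, 1])     # (num_heads, seq_len, seq_len)
# bias[tf.newaxis] would then be added to attention logits of shape
# (batch, num_heads, seq_len, seq_len), exactly where the ALiBi bias was added above.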
Positional Encoding Applications
Natural Language Processing
Machine Translation
# Transformer for machine translation with positional encoding
transformer = Transformer(
    num_layers=6,
    d_model=512,
    num_heads=8,
    dff=2048,
    input_vocab_size=8500,
    target_vocab_size=8000,
    max_length=100,
    dropout_rate=0.1
)
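A quick smoke test with dummy token IDs (the shapes and values below are chosen here for illustration) shows the expected output: one logit vector over the target vocabulary per target position.

src = tf.random.uniform((4, 38), maxval=8500, dtype=tf.int32)   # source token IDs
tgt = tf.random.uniform((4, 36), maxval=8000, dtype=tf.int32)   # target token IDs
logits = transformer((src, tgt), training=False)
print(logits.shape)   # (4, 36, 8000)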
Text Generation
# GPT-like model with positional encoding
class GPT(tf.keras.Model):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_length):
        super(GPT, self).__init__()
        self.token_embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = sinusoidal_positional_encoding(max_length, d_model)
        self.blocks = [
            TransformerBlock(d_model, num_heads, d_model * 4)
            for _ in range(num_layers)
        ]
        self.layernorm = LayerNormalization(epsilon=1e-6)
        self.dense = Dense(vocab_size)

    def call(self, inputs, training=False):
        seq_len = tf.shape(inputs)[1]
        # Embed tokens and add positional encoding
        x = self.token_embedding(inputs)
        x += self.pos_encoding[:seq_len, :]
        # Transformer blocks (a real GPT would also pass a causal attention mask)
        for block in self.blocks:
            x = block(x, training)
        # Final layer
        x = self.layernorm(x)
        return self.dense(x)
Computer Vision
Vision Transformer (ViT)
# Vision Transformer with positional encoding
class VisionTransformer(tf.keras.Model):
    def __init__(self, image_size, patch_size, num_classes, d_model, num_heads, num_layers):
        super(VisionTransformer, self).__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding
        self.patch_embedding = tf.keras.layers.Conv2D(
            d_model,
            kernel_size=patch_size,
            strides=patch_size
        )
        # Class token
        self.class_token = self.add_weight(
            name="class_token",
            shape=(1, 1, d_model),
            initializer="random_normal",
            trainable=True
        )
        # Positional encoding (fixed sinusoidal here; ViT commonly uses learned embeddings)
        self.pos_encoding = sinusoidal_positional_encoding(num_patches + 1, d_model)
        # Transformer blocks
        self.blocks = [
            TransformerBlock(d_model, num_heads, d_model * 4)
            for _ in range(num_layers)
        ]
        # Classification head
        self.mlp_head = tf.keras.Sequential([
            LayerNormalization(epsilon=1e-6),
            Dense(d_model, activation='tanh'),
            Dense(num_classes)
        ])

    def call(self, inputs, training=False):
        batch_size = tf.shape(inputs)[0]
        # Patch embedding: (batch, h', w', d_model) -> (batch, num_patches, d_model)
        x = self.patch_embedding(inputs)
        x = tf.reshape(x, (batch_size, -1, x.shape[-1]))
        # Prepend the class token
        class_token = tf.broadcast_to(self.class_token, (batch_size, 1, x.shape[-1]))
        x = tf.concat([class_token, x], axis=1)
        # Add positional encoding (one vector per patch plus the class token)
        x += self.pos_encoding
        # Transformer blocks
        for block in self.blocks:
            x = block(x, training)
        # Classification from the class-token representation
        class_token = x[:, 0]
        return self.mlp_head(class_token)
Image Captioning
# Image captioning with positional encoding
class ImageCaptioningModel(tf.keras.Model):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_length):
        super(ImageCaptioningModel, self).__init__()
        # Image encoder (CNN) plus a projection to the model dimension
        self.cnn = tf.keras.applications.EfficientNetB0(
            include_top=False,
            pooling='avg'
        )
        self.image_projection = Dense(d_model)
        # Text decoder
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = sinusoidal_positional_encoding(max_length, d_model)
        self.blocks = [
            TransformerBlock(d_model, num_heads, d_model * 4)
            for _ in range(num_layers)
        ]
        # Output projection
        self.dense = Dense(vocab_size)

    def call(self, inputs, training=False):
        images, captions = inputs
        # Encode images and project the CNN features to d_model
        image_features = self.image_projection(self.cnn(images))
        image_features = tf.expand_dims(image_features, axis=1)   # (batch, 1, d_model)
        # Embed captions and add positional encoding
        x = self.embedding(captions)
        seq_len = tf.shape(x)[1]
        x += self.pos_encoding[:seq_len, :]
        # Prepend the image feature as an extra "token"
        x = tf.concat([image_features, x], axis=1)
        # Transformer blocks
        for block in self.blocks:
            x = block(x, training)
        # Output projection
        return self.dense(x)
Speech Processing
# Speech recognition with positional encoding
class SpeechRecognitionModel(tf.keras.Model):
    def __init__(self, vocab_size, d_model, num_heads, num_layers, max_length):
        super(SpeechRecognitionModel, self).__init__()
        # Audio encoder: 80-dim filterbank frames -> a single d_model feature vector
        self.audio_encoder = tf.keras.Sequential([
            tf.keras.layers.Conv1D(64, 3, activation='relu', input_shape=(None, 80)),
            tf.keras.layers.MaxPooling1D(2),
            tf.keras.layers.Conv1D(128, 3, activation='relu'),
            tf.keras.layers.MaxPooling1D(2),
            tf.keras.layers.Conv1D(d_model, 3, activation='relu'),
            tf.keras.layers.GlobalAveragePooling1D()
        ])
        # Text decoder
        self.embedding = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_encoding = sinusoidal_positional_encoding(max_length, d_model)
        self.blocks = [
            TransformerBlock(d_model, num_heads, d_model * 4)
            for _ in range(num_layers)
        ]
        # Output projection
        self.dense = Dense(vocab_size)

    def call(self, inputs, training=False):
        audio, text = inputs
        # Encode audio into a single feature vector and add a sequence dimension
        audio_features = self.audio_encoder(audio)
        audio_features = tf.expand_dims(audio_features, axis=1)   # (batch, 1, d_model)
        # Embed text and add positional encoding
        x = self.embedding(text)
        seq_len = tf.shape(x)[1]
        x += self.pos_encoding[:seq_len, :]
        # Prepend the audio feature as an extra "token"
        x = tf.concat([audio_features, x], axis=1)
        # Transformer blocks
        for block in self.blocks:
            x = block(x, training)
        # Output projection
        return self.dense(x)
Positional Encoding Research
Key Papers
- "Attention Is All You Need" (Vaswani et al., 2017)
- Introduced sinusoidal positional encoding
- Demonstrated effectiveness in Transformer architecture
- Foundation for modern positional encoding techniques
- "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" (Devlin et al., 2019)
- Used learned positional embeddings
- Demonstrated effectiveness in large-scale pre-training
- Foundation for modern language models
- "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (Raffel et al., 2020)
- Evaluated different positional encoding schemes
- Introduced T5's relative positional encoding
- Comprehensive evaluation across NLP tasks
- "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation" (Press et al., 2022)
- Introduced ALiBi (Attention with Linear Biases)
- Demonstrated ability to extrapolate to longer sequences
- Improved performance on long-range tasks
- "RoFormer: Enhanced Transformer with Rotary Position Embedding" (Su et al., 2021)
- Introduced rotary positional embedding (RoPE)
- Demonstrated improved performance and extrapolation
- Widely adopted in modern language models
Positional Encoding Best Practices
Implementation Guidelines
| Aspect | Recommendation | Notes |
|---|---|---|
| Encoding Type | Sinusoidal for general use | Works well in most cases |
| Sequence Length | Set max_length appropriately | Consider task requirements |
| Dimension | Match embedding dimension | Typically 512-1024 for transformers |
| Initialization | Proper initialization for learned PEs | Critical for stable training |
| Normalization | Normalize embeddings if needed | Can improve training stability |
| Extrapolation | Consider ALiBi or RoPE for long sequences | Better for length generalization |
| Hybrid Approaches | Combine multiple techniques | Can capture different positional aspects |
Training Considerations
- Sequence Length: Choose appropriate max_length for your task
- Batch Processing: Ensure consistent sequence lengths in batches
- Memory Usage: Positional encodings add minimal memory overhead
- Training Stability: Proper initialization prevents training issues
- Learning Rate: May need adjustment for learned positional encodings
- Regularization: Consider dropout for learned positional encodings
- Extrapolation: Test model's ability to handle longer sequences
Optimization Techniques
# Efficient positional encoding implementations

# Pre-compute positional encodings for fixed lengths
max_length = 1024
d_model = 512
precomputed_pe = sinusoidal_positional_encoding(max_length, d_model)

# Use in model
def add_positional_encoding(inputs):
    seq_len = tf.shape(inputs)[1]
    return inputs + precomputed_pe[:seq_len, :]
# Memory-efficient implementation for very long sequences
def memory_efficient_pe(d_model):
    """Generate sinusoidal positional encoding on-the-fly for any sequence length"""
    def pe_for_length(seq_len):
        position = tf.range(seq_len, dtype=tf.float32)[:, tf.newaxis]
        div_term = tf.exp(tf.range(0, d_model, 2, dtype=tf.float32) *
                          -(tf.math.log(10000.0) / d_model))
        angles = position * div_term                    # (seq_len, d_model / 2)
        # Interleave sin (even dimensions) and cos (odd dimensions); TensorFlow
        # tensors do not support item assignment, so build the result by stacking.
        pe = tf.reshape(tf.stack([tf.sin(angles), tf.cos(angles)], axis=-1),
                        (seq_len, d_model))
        return pe
    return pe_for_length
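Usage sketch (assuming the on-the-fly variant above): the returned closure can be called with whatever length the current batch needs, so nothing is precomputed or stored ahead of time.

pe_fn = memory_efficient_pe(d_model=512)
pe_2048 = pe_fn(2048)      # (2048, 512), generated only when needed
pe_64 = pe_fn(64)          # a shorter batch reuses the same closure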
Positional Encoding Analysis
Mathematical Properties
Sinusoidal Encoding Properties
- Periodicity: Each sine/cosine dimension pair has a different period $$ T_i = 2\pi \cdot 10000^{2i/d_{\text{model}}} $$
- Linear Relationships: For any fixed offset $k$, the encoding of position $pos + k$ is a linear function of the encoding of position $pos$, $$ PE_{pos+k} = R_k \, PE_{pos} $$ where $R_k$ is a block-diagonal rotation matrix that depends only on $k$; this is what lets the model learn relative positions (see the numerical check after this list)
- Bounded Values: All values lie in the range $[-1, 1]$
- Unique Representation: Each position has a unique encoding
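A quick numerical check of the rotation property, reusing the sinusoidal_positional_encoding function defined earlier (the position, offset, and dimension values below are arbitrary):

# PE at position pos + k equals a rotation of PE at position pos, with the
# rotation angles depending only on the offset k.
d_model, pos, k = 8, 5, 3
pe = sinusoidal_positional_encoding(50, d_model).numpy()

w = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))  # frequencies
rotated = np.empty(d_model)
rotated[0::2] = pe[pos, 0::2] * np.cos(w * k) + pe[pos, 1::2] * np.sin(w * k)
rotated[1::2] = pe[pos, 1::2] * np.cos(w * k) - pe[pos, 0::2] * np.sin(w * k)

print(np.allclose(rotated, pe[pos + k]))   # True: PE_{pos+k} = R_k PE_{pos}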
Relative Position Advantages
- Translation Invariance: Relative positions are translation-invariant
- Generalization: Better generalization to unseen sequence lengths
- Efficiency: Can be computed on-the-fly for arbitrary lengths
- Extrapolation: Better ability to handle longer sequences
Positional Encoding Visualization
import matplotlib.pyplot as plt
import seaborn as sns

def analyze_positional_encoding(pe, title="Positional Encoding Analysis"):
    """Analyze and visualize positional encoding properties"""
    fig, axes = plt.subplots(2, 2, figsize=(15, 12))
    pe_np = pe.numpy()
    # Plot encoding patterns
    sns.heatmap(pe_np.T, ax=axes[0, 0], cmap='viridis')
    axes[0, 0].set_title('Positional Encoding Patterns')
    axes[0, 0].set_xlabel('Position')
    axes[0, 0].set_ylabel('Dimension')
    # Plot first few dimensions
    for i in range(4):
        axes[0, 1].plot(pe_np[:, i], label=f'Dim {i}')
    axes[0, 1].set_title('First 4 Dimensions')
    axes[0, 1].set_xlabel('Position')
    axes[0, 1].set_ylabel('Value')
    axes[0, 1].legend()
    # Plot autocorrelation of the first dimension across positions
    signal = pe_np[:, 0]
    autocorr = np.correlate(signal, signal, mode='full')
    autocorr = autocorr[len(autocorr) // 2:]
    axes[1, 0].plot(autocorr[:100])
    axes[1, 0].set_title('Autocorrelation (First Dimension)')
    axes[1, 0].set_xlabel('Lag')
    axes[1, 0].set_ylabel('Correlation')
    # Plot frequency spectrum of the first dimension
    fft = np.abs(np.fft.fft(pe_np, axis=0))
    axes[1, 1].plot(fft[:50, 0])
    axes[1, 1].set_title('Frequency Spectrum (First Dimension)')
    axes[1, 1].set_xlabel('Frequency')
    axes[1, 1].set_ylabel('Magnitude')
    plt.suptitle(title)
    plt.tight_layout()
    plt.show()

# Analyze sinusoidal positional encoding
analyze_positional_encoding(pos_encoding)
Future Directions
- Adaptive Positional Encoding: Positional encodings that adapt to input content
- Content-Aware Positional Encoding: Positional encodings that consider input content
- Hierarchical Positional Encoding: Multi-level positional information
- Neuromorphic Positional Encoding: Biologically-inspired positional representations
- Quantum Positional Encoding: Positional encodings for quantum computing
- Cross-Modal Positional Encoding: Positional encodings for multimodal data
- Dynamic Positional Encoding: Positional encodings that change during training
- Explainable Positional Encoding: More interpretable positional representations
External Resources
- Attention Is All You Need (arXiv)
- The Illustrated Transformer (Blog)
- Positional Encoding Explained (Towards Data Science)
- Transformer Architecture Explained (YouTube)
- ALiBi: Attention with Linear Biases (arXiv)
- RoFormer: Rotary Position Embedding (arXiv)
- T5's Relative Positional Encoding (arXiv)
- Positional Encoding in PyTorch (PyTorch Docs)
- Positional Encoding in TensorFlow (TensorFlow Docs)