Word2Vec
What is Word2Vec?
Word2Vec is a groundbreaking word embedding technique that transforms words into dense vector representations, capturing semantic and syntactic relationships between words. Developed by Tomas Mikolov and researchers at Google in 2013, Word2Vec revolutionized natural language processing by showing that word meaning can be captured, and even manipulated arithmetically, in a learned vector space.
Key Characteristics
- Dense Representations: Compact vector representations (typically 50-300 dimensions)
- Semantic Relationships: Captures meaning through vector arithmetic
- Efficient Training: Uses shallow neural networks for fast training
- Contextual Understanding: Learns from word co-occurrence patterns
- Dimensionality Reduction: Reduces high-dimensional one-hot vectors to low-dimensional dense vectors
- Transfer Learning: Pre-trained embeddings can be used across tasks
- Scalability: Can handle large vocabularies efficiently
- Interpretability: Vectors capture human-interpretable relationships
Word2Vec Models
Word2Vec comes in two main architectures:
1. Continuous Bag of Words (CBOW)
graph LR
A[Context Words] --> B[Input Layer]
B --> C[Projection Layer]
C --> D[Output Layer]
D --> E[Target Word]
style A fill:#f9f,stroke:#333
style E fill:#f9f,stroke:#333
CBOW Architecture:
- Predicts a target word from its context words
- Uses the average of context word vectors as input
- Typically faster to train than Skip-gram
- Works well with smaller datasets
2. Skip-gram
graph LR
A[Target Word] --> B[Input Layer]
B --> C[Projection Layer]
C --> D[Output Layer]
D --> E[Context Words]
style A fill:#f9f,stroke:#333
style E fill:#f9f,stroke:#333
Skip-gram Architecture:
- Predicts context words from a target word
- Uses the target word vector to predict surrounding words
- Typically performs better on larger datasets
- Better at capturing rare words
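Both architectures are implemented from scratch later on this page; for quick experiments they are also available off the shelf in Gensim. The snippet below is a minimal sketch assuming Gensim 4.x, where `sg=1` selects Skip-gram and `sg=0` selects CBOW (the toy corpus is illustrative only).
from gensim.models import Word2Vec

# Toy corpus: an iterable of tokenized sentences (illustrative only)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["cats", "and", "dogs", "are", "animals"],
]

# sg=1 -> Skip-gram, sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=5, sg=1, negative=5, min_count=1)
cbow = Word2Vec(sentences, vector_size=100, window=5, sg=0, negative=5, min_count=1)

print(skipgram.wv["king"].shape)          # (100,)
print(skipgram.wv.most_similar("king"))   # nearest neighbours by cosine similarity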
Mathematical Foundations
Objective Function
Word2Vec optimizes the following objective:
For CBOW: $$ J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-n}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+n}) $$
For Skip-gram: $$ J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \leq j \leq n,\, j \neq 0} \log p(w_{t+j} \mid w_t) $$
Where:
- $ T $ is the number of words in the corpus
- $ w_t $ is the target word
- $ w_{t+j} $ are context words
- $ n $ is the context window size
Softmax Probability
The probability of a word given its context is computed using softmax:
$$ p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp({v'_{w}}^{\top} v_{w_I})} $$
Where:
- $ w_O $ is the output word
- $ w_I $ is the input word
- $ v_w $ and $ v'_w $ are the input and output vector representations of word $ w $
- $ W $ is the vocabulary size
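To make the cost of this formulation concrete, the following sketch computes $p(w_O \mid w_I)$ directly with NumPy on randomly initialized toy vectors; the denominator requires a dot product with every word in the vocabulary, which is exactly what the next subsection works around.
import numpy as np

rng = np.random.default_rng(0)
W, dim = 10_000, 100                          # vocabulary size, embedding dimension
v_in = rng.standard_normal((W, dim)) * 0.01   # input ("center") vectors
v_out = rng.standard_normal((W, dim)) * 0.01  # output ("context") vectors

def softmax_prob(w_o, w_i):
    """p(w_O | w_I) with a full softmax: the sum runs over all W words."""
    scores = v_out @ v_in[w_i]                # W dot products per training pair
    scores -= scores.max()                    # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[w_o] / exp_scores.sum()

print(softmax_prob(42, 7))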
Negative Sampling
To improve efficiency, Word2Vec uses negative sampling:
$$ J(\theta) = \log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right] $$
Where:
- $ \sigma $ is the sigmoid function
- $ k $ is the number of negative samples
- $ P_n(w) $ is the noise distribution, typically the unigram distribution raised to the 3/4 power
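As a worked illustration of this objective, the sketch below evaluates the negative-sampling score for one (input, output) pair and k randomly drawn negatives using toy vectors; all names and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W, dim, k = 10_000, 100, 5
v_in = rng.standard_normal((W, dim)) * 0.01   # input vectors
v_out = rng.standard_normal((W, dim)) * 0.01  # output vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(w_i, w_o, negative_ids):
    """Negative-sampling objective for one positive pair plus k negatives (to be maximized)."""
    positive = np.log(sigmoid(v_out[w_o] @ v_in[w_i]))
    negative = sum(np.log(sigmoid(-v_out[w_n] @ v_in[w_i])) for w_n in negative_ids)
    return positive + negative

negatives = rng.integers(0, W, size=k)   # in practice drawn from P_n(w), the unigram^(3/4) distribution
print(sgns_objective(7, 42, negatives))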
Implementation
PyTorch Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
import numpy as np
class Word2Vec(nn.Module):
def __init__(self, vocab_size, embedding_dim, model_type='skipgram'):
super(Word2Vec, self).__init__()
self.model_type = model_type
# Embedding layers
self.input_embeddings = nn.Embedding(vocab_size, embedding_dim)
self.output_embeddings = nn.Embedding(vocab_size, embedding_dim)
# Initialize weights
self.input_embeddings.weight.data.uniform_(-0.5, 0.5)
self.output_embeddings.weight.data.uniform_(-0.5, 0.5)
def forward(self, input_word, output_words=None):
if self.model_type == 'skipgram':
return self.forward_skipgram(input_word, output_words)
else:
return self.forward_cbow(input_word)
def forward_skipgram(self, input_word, output_words):
# Get input embedding
input_embedding = self.input_embeddings(input_word) # [batch_size, embedding_dim]
# Get output embeddings
output_embeddings = self.output_embeddings(output_words) # [batch_size, num_neg_samples+1, embedding_dim]
# Compute scores
scores = torch.bmm(output_embeddings, input_embedding.unsqueeze(2)).squeeze(2) # [batch_size, num_neg_samples+1]
return scores
def forward_cbow(self, context_words):
# Average context embeddings
context_embeddings = self.input_embeddings(context_words) # [batch_size, context_size, embedding_dim]
input_embedding = torch.mean(context_embeddings, dim=1) # [batch_size, embedding_dim]
# Get output embedding
output_embedding = self.output_embeddings.weight # [vocab_size, embedding_dim]
# Compute scores over the full vocabulary
scores = torch.matmul(input_embedding, output_embedding.t()) # [batch_size, vocab_size]
return scores
class Word2VecTrainer:
def __init__(self, corpus, embedding_dim=100, model_type='skipgram',
window_size=5, negative_samples=5, learning_rate=0.01):
self.corpus = corpus
self.embedding_dim = embedding_dim
self.model_type = model_type
self.window_size = window_size
self.negative_samples = negative_samples
self.learning_rate = learning_rate
# Build vocabulary
self.build_vocab()
# Initialize model
self.model = Word2Vec(len(self.vocab), embedding_dim, model_type)
# Loss function and optimizer
self.criterion = nn.CrossEntropyLoss()
self.optimizer = optim.SGD(self.model.parameters(), lr=learning_rate)
def build_vocab(self):
"""Build vocabulary from corpus"""
word_counts = Counter(self.corpus)
self.vocab = {word: i for i, (word, _) in enumerate(word_counts.most_common())}
self.inv_vocab = {i: word for word, i in self.vocab.items()}
self.vocab_size = len(self.vocab)
def generate_training_data(self):
"""Generate training data for Word2Vec"""
data = []
if self.model_type == 'skipgram':
for i, target_word in enumerate(self.corpus):
# Get context window
start = max(0, i - self.window_size)
end = min(len(self.corpus), i + self.window_size + 1)
# Generate positive samples
for j in range(start, end):
if j != i:
context_word = self.corpus[j]
data.append((self.vocab[target_word], self.vocab[context_word]))
# Generate negative samples
negative_samples = []
for target_idx, context_idx in data:
negatives = []
for _ in range(self.negative_samples):
# Sample from noise distribution (unigram distribution raised to 3/4 power)
neg_word = np.random.choice(
list(self.vocab.values()),
p=self.word_probs
)
while neg_word == context_idx:
neg_word = np.random.choice(
list(self.vocab.values()),
p=self.word_probs
)
negatives.append(neg_word)
negative_samples.append(negatives)
return data, negative_samples
else: # CBOW
for i, target_word in enumerate(self.corpus):
# Get context window
start = max(0, i - self.window_size)
end = min(len(self.corpus), i + self.window_size + 1)
# Get context words
context_words = []
for j in range(start, end):
if j != i:
context_words.append(self.vocab[self.corpus[j]])
# Only keep samples with full context
if len(context_words) == 2 * self.window_size:
data.append((context_words, self.vocab[target_word]))
return data
def precompute_word_probs(self):
"""Precompute word probabilities for negative sampling"""
word_counts = Counter(self.corpus)
total_words = sum(word_counts.values())
# Compute unigram distribution raised to 3/4 power
word_probs = np.zeros(len(self.vocab))
for word, idx in self.vocab.items():
word_probs[idx] = (word_counts[word] / total_words) ** 0.75
# Normalize
word_probs = word_probs / word_probs.sum()
self.word_probs = word_probs
def train(self, epochs=5):
"""Train the Word2Vec model"""
# Precompute word probabilities for negative sampling
self.precompute_word_probs()
# Generate training data
if self.model_type == 'skipgram':
data, negative_samples = self.generate_training_data()
else:
data = self.generate_training_data()
# Convert to tensors
if self.model_type == 'skipgram':
input_words = torch.LongTensor([item[0] for item in data])
output_words = torch.LongTensor([item[1] for item in data])
negative_words = torch.LongTensor(negative_samples)
else:
context_words = torch.LongTensor([item[0] for item in data])
target_words = torch.LongTensor([item[1] for item in data])
# Training loop
for epoch in range(epochs):
total_loss = 0
if self.model_type == 'skipgram':
# Skip-gram training
for i in range(len(input_words)):
# Zero gradients
self.optimizer.zero_grad()
# Get positive sample
pos_sample = output_words[i].unsqueeze(0)
# Get negative samples
neg_samples = negative_words[i]
# Combine positive and negative samples
output_words_batch = torch.cat([pos_sample, neg_samples])
# Forward pass
scores = self.model(input_words[i].unsqueeze(0), output_words_batch.unsqueeze(0))
# Create labels: the positive word sits at index 0 of the concatenated batch,
# so each example's target class for CrossEntropyLoss is 0
labels = torch.zeros(scores.size(0), dtype=torch.long)
# Compute loss
loss = self.criterion(scores, labels)
# Backward pass
loss.backward()
self.optimizer.step()
total_loss += loss.item()
else:
# CBOW training
for i in range(len(context_words)):
# Zero gradients
self.optimizer.zero_grad()
# Forward pass
scores = self.model(context_words[i].unsqueeze(0))
# Compute loss
loss = self.criterion(scores, target_words[i].unsqueeze(0))
# Backward pass
loss.backward()
self.optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(data):.4f}")
def get_embeddings(self):
"""Get the learned word embeddings"""
return self.model.input_embeddings.weight.data.cpu().numpy()
def get_word_vector(self, word):
"""Get vector for a specific word"""
if word not in self.vocab:
return None
idx = self.vocab[word]
return self.model.input_embeddings.weight.data[idx].cpu().numpy()
def find_similar_words(self, word, topn=5):
"""Find similar words using cosine similarity"""
if word not in self.vocab:
return []
# Get word vector
word_vec = self.get_word_vector(word)
if word_vec is None:
return []
# Get all word vectors
all_vecs = self.get_embeddings()
# Compute cosine similarities
similarities = []
for i, vec in enumerate(all_vecs):
if i != self.vocab[word]:
cos_sim = np.dot(word_vec, vec) / (np.linalg.norm(word_vec) * np.linalg.norm(vec))
similarities.append((self.inv_vocab[i], cos_sim))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
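A minimal usage sketch for `Word2VecTrainer` on a toy corpus (the corpus and hyperparameters are purely illustrative; meaningful embeddings require far more text):
# Toy corpus as a flat list of tokens, which is what Word2VecTrainer expects
corpus = ("the king rules the kingdom the queen rules the kingdom "
          "cats chase mice and dogs chase cats").split()

trainer = Word2VecTrainer(
    corpus,
    embedding_dim=50,
    model_type='skipgram',
    window_size=2,
    negative_samples=5,
    learning_rate=0.01,
)
trainer.train(epochs=3)

print(trainer.get_word_vector('king').shape)      # (50,)
print(trainer.find_similar_words('king', topn=3))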
TensorFlow Implementation
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Dot, Dense
from tensorflow.keras.models import Model
import numpy as np
from collections import Counter
class Word2VecTF:
def __init__(self, corpus, embedding_dim=100, model_type='skipgram',
window_size=5, negative_samples=5, learning_rate=0.01):
self.corpus = corpus
self.embedding_dim = embedding_dim
self.model_type = model_type
self.window_size = window_size
self.negative_samples = negative_samples
self.learning_rate = learning_rate
# Build vocabulary
self.build_vocab()
# Build model
self.build_model()
def build_vocab(self):
"""Build vocabulary from corpus"""
word_counts = Counter(self.corpus)
self.vocab = {word: i for i, (word, _) in enumerate(word_counts.most_common())}
self.inv_vocab = {i: word for word, i in self.vocab.items()}
self.vocab_size = len(self.vocab)
def build_model(self):
"""Build the Word2Vec model"""
if self.model_type == 'skipgram':
self.build_skipgram_model()
else:
self.build_cbow_model()
def build_skipgram_model(self):
"""Build Skip-gram model"""
# Input layers: one target word and one candidate context word per example
input_word = tf.keras.Input(shape=(1,))
output_word = tf.keras.Input(shape=(1,))
# Embedding layers
input_embedding = Embedding(
self.vocab_size, self.embedding_dim,
embeddings_initializer=tf.keras.initializers.RandomUniform(-0.5, 0.5),
name='input_embedding'
)(input_word)
output_embedding = Embedding(
self.vocab_size, self.embedding_dim,
embeddings_initializer=tf.keras.initializers.RandomUniform(-0.5, 0.5),
name='output_embedding'
)(output_word)
# Dot product between the two embeddings -> one logit per (target, context) pair
dot_product = Dot(axes=2)([output_embedding, input_embedding])
dot_product = tf.keras.layers.Flatten()(dot_product)
# Sigmoid output: 1 for a true (target, context) pair, 0 for a negative sample
output = tf.keras.layers.Activation('sigmoid')(dot_product)
# Create model
self.model = Model(inputs=[input_word, output_word], outputs=output)
self.model.compile(
optimizer=tf.keras.optimizers.SGD(learning_rate=self.learning_rate),
loss='binary_crossentropy'
)
def build_cbow_model(self):
"""Build CBOW model"""
# Input layer
context_words = tf.keras.Input(shape=(2 * self.window_size,))
# Embedding layer
embeddings = Embedding(
self.vocab_size, self.embedding_dim,
embeddings_initializer=tf.keras.initializers.RandomUniform(-0.5, 0.5),
name='embedding'
)(context_words)
# Average embeddings
avg_embedding = tf.reduce_mean(embeddings, axis=1)
# Output layer
output = Dense(self.vocab_size, activation='softmax')(avg_embedding)
# Create model
self.model = Model(inputs=context_words, outputs=output)
self.model.compile(
optimizer=tf.keras.optimizers.SGD(learning_rate=self.learning_rate),
loss='sparse_categorical_crossentropy'
)
def generate_training_data(self):
"""Generate training data for Word2Vec"""
if self.model_type == 'skipgram':
return self.generate_skipgram_data()
else:
return self.generate_cbow_data()
def generate_skipgram_data(self):
"""Generate Skip-gram training data"""
data = []
labels = []
for i, target_word in enumerate(self.corpus):
# Get context window
start = max(0, i - self.window_size)
end = min(len(self.corpus), i + self.window_size + 1)
# Generate positive samples
for j in range(start, end):
if j != i:
context_word = self.corpus[j]
data.append((self.vocab[target_word], self.vocab[context_word]))
labels.append(1) # Positive (true context) sample
# Generate negative samples
for _ in range(self.negative_samples):
# Sample from noise distribution
neg_word = np.random.choice(
list(self.vocab.values()),
p=self.word_probs
)
while neg_word == self.vocab[context_word]:
neg_word = np.random.choice(
list(self.vocab.values()),
p=self.word_probs
)
data.append((self.vocab[target_word], neg_word))
labels.append(0) # Negative sample
# Convert to numpy arrays
input_words = np.array([item[0] for item in data])
output_words = np.array([item[1] for item in data])
labels = np.array(labels)
return input_words, output_words, labels
def generate_cbow_data(self):
"""Generate CBOW training data"""
data = []
labels = []
for i, target_word in enumerate(self.corpus):
# Get context window
start = max(0, i - self.window_size)
end = min(len(self.corpus), i + self.window_size + 1)
# Get context words
context_words = []
for j in range(start, end):
if j != i:
context_words.append(self.vocab[self.corpus[j]])
# Only keep samples with full context
if len(context_words) == 2 * self.window_size:
data.append(context_words)
labels.append(self.vocab[target_word])
return np.array(data), np.array(labels)
def precompute_word_probs(self):
"""Precompute word probabilities for negative sampling"""
word_counts = Counter(self.corpus)
total_words = sum(word_counts.values())
# Compute unigram distribution raised to 3/4 power
word_probs = np.zeros(len(self.vocab))
for word, idx in self.vocab.items():
word_probs[idx] = (word_counts[word] / total_words) ** 0.75
# Normalize
word_probs = word_probs / word_probs.sum()
self.word_probs = word_probs
def train(self, epochs=5, batch_size=32):
"""Train the Word2Vec model"""
# Precompute word probabilities for negative sampling
self.precompute_word_probs()
# Generate training data
if self.model_type == 'skipgram':
input_words, output_words, labels = self.generate_training_data()
# For Skip-gram, we need to reshape the data
input_words = input_words.reshape(-1, 1)
output_words = output_words.reshape(-1, 1)
else:
context_words, target_words = self.generate_training_data()
# Training loop
for epoch in range(epochs):
if self.model_type == 'skipgram':
# Skip-gram training
history = self.model.fit(
[input_words, output_words], labels,
batch_size=batch_size,
epochs=1,
verbose=1
)
else:
# CBOW training
history = self.model.fit(
context_words, target_words,
batch_size=batch_size,
epochs=1,
verbose=1
)
print(f"Epoch {epoch+1}/{epochs}, Loss: {history.history['loss'][0]:.4f}")
def get_embeddings(self):
"""Get the learned word embeddings"""
if self.model_type == 'skipgram':
return self.model.get_layer('input_embedding').get_weights()[0]
else:
return self.model.get_layer('embedding').get_weights()[0]
def get_word_vector(self, word):
"""Get vector for a specific word"""
if word not in self.vocab:
return None
idx = self.vocab[word]
embeddings = self.get_embeddings()
return embeddings[idx]
def find_similar_words(self, word, topn=5):
"""Find similar words using cosine similarity"""
if word not in self.vocab:
return []
# Get word vector
word_vec = self.get_word_vector(word)
if word_vec is None:
return []
# Get all word vectors
all_vecs = self.get_embeddings()
# Compute cosine similarities
similarities = []
for i, vec in enumerate(all_vecs):
if i != self.vocab[word]:
cos_sim = np.dot(word_vec, vec) / (np.linalg.norm(word_vec) * np.linalg.norm(vec))
similarities.append((self.inv_vocab[i], cos_sim))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
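The TensorFlow trainer can be exercised the same way; again, the toy corpus and settings below are illustrative only:
# Same toy corpus, trained with the TensorFlow implementation above
corpus = ("the king rules the kingdom the queen rules the kingdom "
          "cats chase mice and dogs chase cats").split()

tf_trainer = Word2VecTF(
    corpus,
    embedding_dim=50,
    model_type='skipgram',
    window_size=2,
    negative_samples=5,
)
tf_trainer.train(epochs=3, batch_size=16)

print(tf_trainer.get_word_vector('king').shape)      # (50,)
print(tf_trainer.find_similar_words('king', topn=3))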
Word2Vec Applications
Semantic Relationships
Word2Vec captures semantic relationships through vector arithmetic:
# Example of semantic relationships
def demonstrate_semantic_relationships(model):
"""Demonstrate semantic relationships in Word2Vec"""
# King - Man + Woman ≈ Queen
king = model.get_word_vector('king')
man = model.get_word_vector('man')
woman = model.get_word_vector('woman')
if king is not None and man is not None and woman is not None:
queen_vector = king - man + woman
similar = model.find_similar_words_by_vector(queen_vector, topn=1)
print(f"king - man + woman ≈ {similar[0][0]}")
# Paris - France + Germany ≈ Berlin
paris = model.get_word_vector('paris')
france = model.get_word_vector('france')
germany = model.get_word_vector('germany')
if paris is not None and france is not None and germany is not None:
berlin_vector = paris - france + germany
similar = model.find_similar_words_by_vector(berlin_vector, topn=1)
print(f"paris - france + germany ≈ {similar[0][0]}")
# Car - Drive + Fly ≈ Airplane
car = model.get_word_vector('car')
drive = model.get_word_vector('drive')
fly = model.get_word_vector('fly')
if car is not None and drive is not None and fly is not None:
airplane_vector = car - drive + fly
similar = model.find_similar_words_by_vector(airplane_vector, topn=1)
print(f"car - drive + fly ≈ {similar[0][0]}")
# Helper used above; add it as a method of the Word2Vec trainer / inference model class
def find_similar_words_by_vector(self, vector, topn=5):
"""Find the words whose vectors are most similar to a given vector"""
# Get all word vectors
all_vecs = self.get_embeddings()
# Compute cosine similarities
similarities = []
for i, vec in enumerate(all_vecs):
cos_sim = np.dot(vector, vec) / (np.linalg.norm(vector) * np.linalg.norm(vec))
similarities.append((self.inv_vocab[i], cos_sim))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
Text Classification
# Word2Vec for text classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
class Word2VecClassifier:
def __init__(self, word2vec_model):
self.word2vec = word2vec_model
self.classifier = LogisticRegression(max_iter=1000)
def document_to_vector(self, document):
"""Convert document to vector using Word2Vec"""
vectors = []
for word in document:
vec = self.word2vec.get_word_vector(word)
if vec is not None:
vectors.append(vec)
if len(vectors) == 0:
return np.zeros(self.word2vec.embedding_dim)
return np.mean(vectors, axis=0)
def train(self, documents, labels):
"""Train the classifier"""
# Convert documents to vectors
X = np.array([self.document_to_vector(doc) for doc in documents])
y = np.array(labels)
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train classifier
self.classifier.fit(X_train, y_train)
# Evaluate
y_pred = self.classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
return accuracy
def predict(self, document):
"""Predict class for a document"""
vec = self.document_to_vector(document)
return self.classifier.predict([vec])[0]
def predict_proba(self, document):
"""Predict class probabilities for a document"""
vec = self.document_to_vector(document)
return self.classifier.predict_proba([vec])[0]
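A small, illustrative usage example; `trained_word2vec_model` is a placeholder for any of the trainers above (anything exposing `get_word_vector` and `embedding_dim`), and a real task would use far more labeled documents:
# Tiny illustrative dataset of tokenized documents and binary sentiment labels
documents = [
    ["great", "movie", "loved", "it"],
    ["terrible", "film", "waste", "of", "time"],
    ["fantastic", "acting", "and", "plot"],
    ["boring", "and", "predictable", "story"],
]
labels = [1, 0, 1, 0]

clf = Word2VecClassifier(trained_word2vec_model)  # placeholder: any model with get_word_vector / embedding_dim
clf.train(documents, labels)
print(clf.predict(["loved", "the", "plot"]))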
Information Retrieval
# Word2Vec for information retrieval
from sklearn.metrics.pairwise import cosine_similarity
class Word2VecRetrieval:
def __init__(self, word2vec_model):
self.word2vec = word2vec_model
def document_to_vector(self, document):
"""Convert document to vector"""
vectors = []
for word in document:
vec = self.word2vec.get_word_vector(word)
if vec is not None:
vectors.append(vec)
if len(vectors) == 0:
return np.zeros(self.word2vec.embedding_dim)
return np.mean(vectors, axis=0)
def index_documents(self, documents):
"""Index documents for retrieval"""
self.doc_vectors = []
self.documents = documents
for doc in documents:
vec = self.document_to_vector(doc)
self.doc_vectors.append(vec)
self.doc_vectors = np.array(self.doc_vectors)
def search(self, query, topn=5):
"""Search for similar documents"""
# Convert query to vector
query_vec = self.document_to_vector(query)
# Compute similarities
similarities = cosine_similarity([query_vec], self.doc_vectors)[0]
# Get top results
results = []
for i in np.argsort(similarities)[::-1][:topn]:
results.append({
'document': self.documents[i],
'similarity': similarities[i]
})
return results
def semantic_search(self, query, topn=5):
"""Semantic search using Word2Vec"""
return self.search(query, topn)
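Illustrative usage; as before, `trained_word2vec_model` is a placeholder for a trained model exposing `get_word_vector` and `embedding_dim`:
# Index a few tokenized documents and run a query (illustrative only)
docs = [
    ["the", "king", "rules", "the", "kingdom"],
    ["cats", "and", "dogs", "are", "popular", "pets"],
    ["the", "queen", "visited", "the", "castle"],
]

retriever = Word2VecRetrieval(trained_word2vec_model)  # placeholder model
retriever.index_documents(docs)

for hit in retriever.search(["royal", "palace"], topn=2):
    print(round(hit['similarity'], 3), " ".join(hit['document']))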
Recommendation Systems
# Word2Vec for recommendation systems
class Word2VecRecommender:
def __init__(self, word2vec_model):
self.word2vec = word2vec_model
def item_to_vector(self, item):
"""Convert item to vector"""
# Item can be a single word or multiple words
if isinstance(item, str):
vec = self.word2vec.get_word_vector(item)
if vec is not None:
return vec
else:
return np.zeros(self.word2vec.embedding_dim)
else:
# Multi-word item: average the vectors of its in-vocabulary words
vectors = [v for v in (self.word2vec.get_word_vector(w) for w in item) if v is not None]
if not vectors:
return np.zeros(self.word2vec.embedding_dim)
return np.mean(vectors, axis=0)
def train(self, user_history):
"""Train the recommender on user history"""
# user_history: dict of {user_id: [item1, item2, ...]}
self.user_history = user_history
self.user_vectors = {}
# Create user vectors by averaging their item vectors
for user_id, items in user_history.items():
vectors = []
for item in items:
vec = self.item_to_vector(item)
vectors.append(vec)
if len(vectors) > 0:
self.user_vectors[user_id] = np.mean(vectors, axis=0)
else:
self.user_vectors[user_id] = np.zeros(self.word2vec.embedding_dim)
def recommend(self, user_id, candidate_items, topn=5):
"""Recommend items to a user"""
if user_id not in self.user_vectors:
return []
# Get user vector
user_vec = self.user_vectors[user_id]
# Compute similarities with candidate items
similarities = []
for item in candidate_items:
item_vec = self.item_to_vector(item)
sim = cosine_similarity([user_vec], [item_vec])[0][0]
similarities.append((item, sim))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
def similar_items(self, item, topn=5):
"""Find similar items"""
return self.word2vec.find_similar_words(item, topn)
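Illustrative usage with word-like items; in a real recommender the "corpus" would be sequences of item identifiers from interaction logs, and `trained_word2vec_model` is again a placeholder:
# Items here are single words; in practice they could be product identifiers
# drawn from the same vocabulary the embeddings were trained on
user_history = {
    "user_1": ["laptop", "mouse", "keyboard"],
    "user_2": ["novel", "bookmark", "lamp"],
}

recommender = Word2VecRecommender(trained_word2vec_model)  # placeholder model
recommender.train(user_history)
print(recommender.recommend("user_1", ["monitor", "cookbook", "headphones"], topn=2))
print(recommender.similar_items("laptop", topn=3))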
Word2Vec Variants and Extensions
Word2Vec with Subword Information
# Word2Vec with subword information (similar to FastText)
class Word2VecSubword(nn.Module):
def __init__(self, vocab_size, embedding_dim, model_type='skipgram', n_grams=3):
super(Word2VecSubword, self).__init__()
self.vocab_size = vocab_size
self.embedding_dim = embedding_dim
self.model_type = model_type
self.n_grams = n_grams
# Character n-gram embeddings
self.char_embeddings = nn.Embedding(26 + 1, embedding_dim) # 26 letters + padding
# Word embeddings
self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
def get_subword_vectors(self, word):
"""Get subword vectors for a word"""
# Convert word to character indices
char_indices = [ord(c) - ord('a') + 1 for c in word.lower() if c.isalpha()]
char_indices = torch.LongTensor(char_indices)
# Get character embeddings
char_embeds = self.char_embeddings(char_indices) # [len(word), embedding_dim]
# Compute n-grams
n_gram_vectors = []
for n in range(1, self.n_grams + 1):
for i in range(len(char_indices) - n + 1):
n_gram = char_embeds[i:i+n]
n_gram_vectors.append(torch.mean(n_gram, dim=0))
if len(n_gram_vectors) == 0:
return torch.zeros(self.embedding_dim)
return torch.mean(torch.stack(n_gram_vectors), dim=0)
def forward(self, input_word, output_words=None):
"""Forward pass"""
if self.model_type == 'skipgram':
return self.forward_skipgram(input_word, output_words)
else:
return self.forward_cbow(input_word)
def forward_skipgram(self, input_word, output_words):
"""Skip-gram forward pass with subword information"""
# Get word embedding
word_embedding = self.word_embeddings(input_word) # [batch_size, embedding_dim]
# Get subword embedding (assumes an index-to-word mapping self.inv_vocab is attached to the model)
subword_embedding = self.get_subword_vectors(self.inv_vocab[input_word.item()])
subword_embedding = subword_embedding.unsqueeze(0).expand(word_embedding.size(0), -1)
# Combine word and subword embeddings
combined_embedding = word_embedding + subword_embedding
# Get output embeddings
output_embeddings = self.word_embeddings(output_words) # [batch_size, num_neg_samples+1, embedding_dim]
# Compute scores
scores = torch.bmm(output_embeddings, combined_embedding.unsqueeze(2)).squeeze(2)
return scores
Contextual Word2Vec
# Contextual Word2Vec that considers sentence context
class ContextualWord2Vec(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim, model_type='skipgram'):
super(ContextualWord2Vec, self).__init__()
self.model_type = model_type
# Word embeddings
self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
# Context encoder (LSTM)
self.context_encoder = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
# Output embeddings
self.output_embeddings = nn.Embedding(vocab_size, hidden_dim)
def forward(self, input_word, context_words, output_words=None):
"""Forward pass with context"""
if self.model_type == 'skipgram':
return self.forward_skipgram(input_word, context_words, output_words)
else:
return self.forward_cbow(input_word, context_words)
def forward_skipgram(self, input_word, context_words, output_words):
"""Skip-gram forward pass with context"""
# Get input word embedding
input_embedding = self.word_embeddings(input_word) # [batch_size, embedding_dim]
# Encode context
context_embeddings = self.word_embeddings(context_words) # [batch_size, context_size, embedding_dim]
_, (hidden, _) = self.context_encoder(context_embeddings)
# Combine input and context
combined = input_embedding + hidden.squeeze(0)
# Get output embeddings
output_embeddings = self.output_embeddings(output_words) # [batch_size, num_neg_samples+1, hidden_dim]
# Compute scores
scores = torch.bmm(output_embeddings, combined.unsqueeze(2)).squeeze(2)
return scores
Word2Vec vs Other Embedding Methods
| Feature | Word2Vec | GloVe | FastText | BERT/Transformers |
|---|---|---|---|---|
| Training Method | Predictive (neural network) | Count-based (matrix factorization) | Predictive with subword info | Contextual (transformer) |
| Context Handling | Local context window | Global co-occurrence | Local context with subwords | Full sentence context |
| Subword Info | No | No | Yes (character n-grams) | Yes (WordPiece/BytePair) |
| Training Speed | Fast | Very fast | Moderate | Slow |
| Memory Usage | Low | Low | Moderate | High |
| Rare Words | Poor handling | Poor handling | Good handling | Excellent handling |
| Contextual | No | No | No | Yes |
| Transfer Learning | Good | Good | Good | Excellent |
| Implementation | Simple | Simple | Moderate | Complex |
| Use Case | General purpose | General purpose | Morphologically rich languages | Context-dependent tasks |
| Vector Arithmetic | Excellent | Good | Good | Limited |
| Pre-trained Models | Available | Available | Available | Widely available |
| Fine-tuning | Not applicable | Not applicable | Not applicable | Yes |
Training Word2Vec
Best Practices
| Aspect | Recommendation | Notes |
|---|---|---|
| Corpus Size | Large corpus (100M+ words) | More data = better embeddings |
| Vocabulary Size | 50K-300K words | Balance between coverage and efficiency |
| Embedding Dim | 100-300 dimensions | 300 is common for good performance |
| Window Size | 5-10 words | Larger windows capture broader context |
| Model Type | Skip-gram for large datasets | CBOW is faster for smaller datasets |
| Negative Samples | 5-20 | More samples = better quality |
| Subsampling | Use subsampling of frequent words | Improves quality and training speed |
| Iterations | 5-50 epochs | More iterations for larger datasets |
| Learning Rate | 0.01-0.05 | Start high, decay over time |
| Batch Size | 100-1000 | Larger batches for stability |
| Hardware | GPU acceleration | Significant speedup for large datasets |
| Evaluation | Use intrinsic and extrinsic evaluation | Word similarity, analogy tasks, downstream tasks |
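For reference, the recommendations above map roughly onto the following Gensim 4.x configuration (a sketch; `sentences` stands in for an iterable of tokenized sentences from your corpus):
from gensim.models import Word2Vec

# 'sentences' stands in for an iterable of tokenized sentences from a large corpus
model = Word2Vec(
    sentences,
    vector_size=300,   # embedding dimension
    window=5,          # context window size
    sg=1,              # Skip-gram (use sg=0 for CBOW)
    negative=10,       # negative samples per positive pair
    sample=1e-3,       # subsampling threshold for frequent words
    min_count=5,       # drop words rarer than this
    alpha=0.025,       # initial learning rate, decayed during training
    epochs=10,
    workers=4,         # CPU threads
)
model.save("word2vec.model")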
Training Pipeline
# Complete Word2Vec training pipeline
class Word2VecPipeline:
def __init__(self, corpus_path, embedding_dim=300, model_type='skipgram',
window_size=5, negative_samples=5, min_count=5):
self.corpus_path = corpus_path
self.embedding_dim = embedding_dim
self.model_type = model_type
self.window_size = window_size
self.negative_samples = negative_samples
self.min_count = min_count
def load_corpus(self):
"""Load and preprocess corpus"""
with open(self.corpus_path, 'r', encoding='utf-8') as f:
text = f.read()
# Basic preprocessing
text = text.lower()
words = text.split()
# Filter rare words
word_counts = Counter(words)
words = [word for word in words if word_counts[word] >= self.min_count]
return words
def train_model(self, epochs=10):
"""Train Word2Vec model"""
# Load corpus
corpus = self.load_corpus()
# Initialize model
trainer = Word2VecTrainer(
corpus,
embedding_dim=self.embedding_dim,
model_type=self.model_type,
window_size=self.window_size,
negative_samples=self.negative_samples
)
# Train model
trainer.train(epochs=epochs)
return trainer
def evaluate_model(self, model):
"""Evaluate the trained model"""
# Intrinsic evaluation - nearest neighbours by cosine similarity
print("Nearest neighbours:")
print(f"most similar to 'king': {model.find_similar_words('king', 1)}")
print(f"most similar to 'paris': {model.find_similar_words('paris', 1)}")
print(f"most similar to 'car': {model.find_similar_words('car', 1)}")
# Intrinsic evaluation - analogy tasks
self.evaluate_analogies(model)
def evaluate_analogies(self, model):
"""Evaluate on analogy tasks"""
analogies = [
('king', 'man', 'woman', 'queen'),
('paris', 'france', 'germany', 'berlin'),
('big', 'bigger', 'small', 'smaller'),
('good', 'better', 'bad', 'worse'),
('jump', 'jumped', 'run', 'ran')
]
correct = 0
for a, b, c, expected in analogies:
try:
# Compute a - b + c
a_vec = model.get_word_vector(a)
b_vec = model.get_word_vector(b)
c_vec = model.get_word_vector(c)
if a_vec is not None and b_vec is not None and c_vec is not None:
result_vec = a_vec - b_vec + c_vec
similar = model.find_similar_words_by_vector(result_vec, topn=1)[0][0]
if similar == expected:
correct += 1
print(f"✓ {a} - {b} + {c} = {similar}")
else:
print(f"✗ {a} - {b} + {c} = {similar} (expected {expected})")
except Exception:
print(f"✗ {a} - {b} + {c} = ? (missing words)")
print(f"Accuracy: {correct}/{len(analogies)} = {correct/len(analogies):.2f}")
def save_model(self, model, output_path):
"""Save the trained model"""
import pickle
# Save model and metadata
model_data = {
'embeddings': model.get_embeddings(),
'vocab': model.vocab,
'inv_vocab': model.inv_vocab,
'config': {
'embedding_dim': self.embedding_dim,
'model_type': self.model_type,
'window_size': self.window_size,
'negative_samples': self.negative_samples
}
}
with open(output_path, 'wb') as f:
pickle.dump(model_data, f)
def load_model(self, model_path):
"""Load a trained model"""
import pickle
with open(model_path, 'rb') as f:
model_data = pickle.load(f)
# Create a lightweight model for inference
class InferenceModel:
def __init__(self, model_data):
self.embeddings = model_data['embeddings']
self.vocab = model_data['vocab']
self.inv_vocab = model_data['inv_vocab']
self.embedding_dim = model_data['config']['embedding_dim']
def get_word_vector(self, word):
if word not in self.vocab:
return None
idx = self.vocab[word]
return self.embeddings[idx]
def find_similar_words(self, word, topn=5):
if word not in self.vocab:
return []
word_vec = self.get_word_vector(word)
if word_vec is None:
return []
similarities = []
for i, vec in enumerate(self.embeddings):
if i != self.vocab[word]:
cos_sim = np.dot(word_vec, vec) / (np.linalg.norm(word_vec) * np.linalg.norm(vec))
similarities.append((self.inv_vocab[i], cos_sim))
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
def find_similar_words_by_vector(self, vector, topn=5):
similarities = []
for i, vec in enumerate(self.embeddings):
cos_sim = np.dot(vector, vec) / (np.linalg.norm(vector) * np.linalg.norm(vec))
similarities.append((self.inv_vocab[i], cos_sim))
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
return InferenceModel(model_data)
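An illustrative end-to-end run of the pipeline; `corpus.txt` is a placeholder path, and the analogy evaluation assumes the `find_similar_words_by_vector` helper from the Applications section has been added to the trainer class:
# Illustrative end-to-end run; 'corpus.txt' is a placeholder path to a plain-text corpus
pipeline = Word2VecPipeline('corpus.txt', embedding_dim=300, model_type='skipgram')
model = pipeline.train_model(epochs=10)
pipeline.evaluate_model(model)
pipeline.save_model(model, 'word2vec_model.pkl')

# Reload a lightweight inference-only model later
inference_model = pipeline.load_model('word2vec_model.pkl')
print(inference_model.find_similar_words('king', topn=5))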
Word2Vec Research
Key Papers
- "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., 2013)
- Introduced Word2Vec
- Demonstrated efficient training of word vectors
- Foundation for modern word embeddings
- "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013)
- Introduced Skip-gram with negative sampling
- Demonstrated word vector arithmetic
- Foundation for semantic relationships
- "Linguistic Regularities in Continuous Space Word Representations" (Mikolov et al., 2013)
- Demonstrated semantic and syntactic regularities
- Showed vector arithmetic for analogies
- Foundation for understanding word vector properties
- "word2vec Parameter Learning Explained" (Rong, 2014)
- Provided detailed explanation of Word2Vec mathematics
- Clarified training process
- Foundation for understanding implementation details
- "Enriching Word Vectors with Subword Information" (Bojanowski et al., 2017)
- Introduced subword information (FastText)
- Demonstrated improved handling of rare words
- Foundation for subword embeddings
Emerging Research Directions
- Contextual Word2Vec: Incorporating sentence context
- Multilingual Word2Vec: Cross-lingual embeddings
- Domain-Specific Word2Vec: Specialized embeddings for specific domains
- Dynamic Word2Vec: Time-evolving word representations
- Interpretable Word2Vec: More interpretable vector spaces
- Efficient Word2Vec: Faster training algorithms
- Multimodal Word2Vec: Combining text with other modalities
- Cognitive Word2Vec: Brain-inspired word representations
- Green Word2Vec: Energy-efficient training
- Few-Shot Word2Vec: Learning from limited data
- Adversarial Word2Vec: Robust word representations
- Theoretical Foundations: Better understanding of word vectors
- Hardware Acceleration: Specialized hardware for Word2Vec
Best Practices
Implementation Guidelines
| Aspect | Recommendation | Notes |
|---|---|---|
| Preprocessing | Clean text, handle case, tokenize | Remove noise, normalize text |
| Vocabulary | Filter rare words, handle OOV | Use min_count, consider subword info |
| Model Selection | Skip-gram for large datasets | CBOW for smaller datasets |
| Hyperparameters | Tune embedding dim, window size | 300 dim, window size 5-10 common |
| Training | Use negative sampling, subsampling | 5-20 negative samples, subsampling rate 1e-3 to 1e-5 |
| Evaluation | Use multiple metrics | Word similarity, analogy tasks, downstream tasks |
| Deployment | Optimize for inference | Use efficient data structures |
| Monitoring | Track training loss, evaluation metrics | Early stopping based on validation |
| Scaling | Use distributed training for large data | Consider GPU acceleration |
Common Pitfalls and Solutions
| Pitfall | Solution | Example |
|---|---|---|
| Small Corpus | Use pre-trained embeddings | GloVe, FastText pre-trained vectors |
| Rare Words | Use subword information | FastText, character n-grams |
| Overfitting | Use early stopping, regularization | Monitor validation loss |
| Slow Training | Use negative sampling, subsampling | Negative sampling with k=5-20 |
| Poor Quality | Increase corpus size, tune hyperparameters | Use larger corpus, adjust window size |
| Memory Issues | Use memory-efficient implementations | Gensim, optimized C implementations |
| Context Window | Experiment with different window sizes | Try 5, 10, 15 word windows |
| Learning Rate | Use learning rate scheduling | Start with 0.025, decay over time |
| Evaluation Bias | Use multiple evaluation metrics | Word similarity + downstream tasks |
| Domain Mismatch | Fine-tune on domain-specific data | Continue training on domain corpus |
| OOV Words | Use subword models or fallback strategies | FastText, character-level models |
| Interpretability | Use dimensionality reduction techniques | PCA, t-SNE for visualization |
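For the interpretability and visualization points above, a common sanity check is to project a few embeddings into 2D. Below is a minimal sketch with scikit-learn and matplotlib; `model` is a placeholder for any trained model exposing `get_word_vector`:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# 'model' is a placeholder for any trained model exposing get_word_vector(word)
words = ['king', 'queen', 'man', 'woman', 'paris', 'berlin', 'france', 'germany']
vectors = np.array([model.get_word_vector(w) for w in words])

# Project the embeddings onto their first two principal components
coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.title('Word2Vec embeddings projected with PCA')
plt.show()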
Future Directions
- Contextual Word Embeddings: Moving beyond static embeddings to contextual representations
- Multimodal Embeddings: Combining text with images, audio, and other modalities
- Dynamic Embeddings: Time-evolving word representations that capture language change
- Interpretable Embeddings: More human-understandable vector spaces
- Efficient Training: Faster algorithms for large-scale training
- Green Embeddings: Energy-efficient training methods
- Multilingual Embeddings: Better cross-lingual representations
- Domain-Specific Embeddings: Specialized embeddings for specific domains
- Few-Shot Learning: Learning embeddings from limited data
- Adversarial Robustness: Robust embeddings against adversarial attacks
- Cognitive Models: Brain-inspired word representations
- Theoretical Breakthroughs: Better understanding of word vector properties
- Hardware Acceleration: Specialized hardware for embedding training
External Resources
- Original Word2Vec Paper (Mikolov et al.)
- Word2Vec Parameter Learning Explained (Rong)
- Word2Vec Implementation (Gensim)
- Word2Vec Tutorial (TensorFlow)
- Word2Vec Tutorial (PyTorch)
- Pre-trained Word Vectors (GloVe)
- FastText Pre-trained Vectors
- Word2Vec Visualization (TensorBoard)
- Word Embedding Evaluation (SimLex-999)
- Word Embedding Benchmarks
- Word2Vec vs GloVe Comparison
- Word2Vec for Recommendation Systems
- Contextual Word Embeddings (ELMo)
- Multilingual Word Embeddings
- Word2Vec Hardware Acceleration