Word2Vec
What is Word2Vec?
Word2Vec is a groundbreaking word embedding technique that transforms words into dense vector representations, capturing semantic and syntactic relationships between words. Developed by Tomas Mikolov and researchers at Google in 2013, Word2Vec revolutionized natural language processing by showing that word meaning can be captured, and even manipulated arithmetically, in a learned vector space.
Key Characteristics
- Dense Representations: Compact vector representations (typically 50-300 dimensions)
- Semantic Relationships: Captures meaning through vector arithmetic
- Efficient Training: Uses shallow neural networks for fast training
- Contextual Understanding: Learns from word co-occurrence patterns
- Dimensionality Reduction: Reduces high-dimensional one-hot vectors to low-dimensional dense vectors
- Transfer Learning: Pre-trained embeddings can be used across tasks
- Scalability: Can handle large vocabularies efficiently
- Interpretability: Vectors capture human-interpretable relationships
Word2Vec Models
Word2Vec comes in two main architectures:
1. Continuous Bag of Words (CBOW)
graph LR
A[Context Words] --> B[Input Layer]
B --> C[Projection Layer]
C --> D[Output Layer]
D --> E[Target Word]
style A fill:#f9f,stroke:#333
style E fill:#f9f,stroke:#333
CBOW Architecture:
- Predicts a target word from its context words
- Uses the average of context word vectors as input
- Typically faster to train than Skip-gram
- Works well with smaller datasets
2. Skip-gram
graph LR
A[Target Word] --> B[Input Layer]
B --> C[Projection Layer]
C --> D[Output Layer]
D --> E[Context Words]
style A fill:#f9f,stroke:#333
style E fill:#f9f,stroke:#333
Skip-gram Architecture:
- Predicts context words from a target word
- Uses the target word vector to predict surrounding words
- Typically performs better on larger datasets
- Better at capturing rare words
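Both architectures are implemented from scratch later on this page; for quick experiments they are also available off the shelf in Gensim. The snippet below is a minimal sketch assuming Gensim 4.x, where `sg=1` selects Skip-gram and `sg=0` selects CBOW (the toy corpus is illustrative only).
from gensim.models import Word2Vec

# Toy corpus: an iterable of tokenized sentences (illustrative only)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["cats", "and", "dogs", "are", "animals"],
]

# sg=1 -> Skip-gram, sg=0 -> CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=5, sg=1, negative=5, min_count=1)
cbow = Word2Vec(sentences, vector_size=100, window=5, sg=0, negative=5, min_count=1)

print(skipgram.wv["king"].shape)          # (100,)
print(skipgram.wv.most_similar("king"))   # nearest neighbours by cosine similarity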
Mathematical Foundations
Objective Function
Word2Vec optimizes the following objective:
For CBOW: $$ J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-n}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+n}) $$
For Skip-gram: $$ J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \leq j \leq n,\, j \neq 0} \log p(w_{t+j} \mid w_t) $$
Where:
- $ T $ is the number of words in the corpus
- $ w_t $ is the target word
- $ w_{t+j} $ are context words
- $ n $ is the context window size
Softmax Probability
The probability of a word given its context is computed using softmax:
$$ p(w_O \mid w_I) = \frac{\exp({v'_{w_O}}^{\top} v_{w_I})}{\sum_{w=1}^{W} \exp({v'_{w}}^{\top} v_{w_I})} $$
Where:
- $ w_O $ is the output word
- $ w_I $ is the input word
- $ v_w $ and $ v'_w $ are the input and output vector representations of word $ w $
- $ W $ is the vocabulary size
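To make the cost of this formulation concrete, the following sketch computes $p(w_O \mid w_I)$ directly with NumPy on randomly initialized toy vectors; the denominator requires a dot product with every word in the vocabulary, which is exactly what the next subsection works around.
import numpy as np

rng = np.random.default_rng(0)
W, dim = 10_000, 100                          # vocabulary size, embedding dimension
v_in = rng.standard_normal((W, dim)) * 0.01   # input ("center") vectors
v_out = rng.standard_normal((W, dim)) * 0.01  # output ("context") vectors

def softmax_prob(w_o, w_i):
    """p(w_O | w_I) with a full softmax: the sum runs over all W words."""
    scores = v_out @ v_in[w_i]                # W dot products per training pair
    scores -= scores.max()                    # numerical stability
    exp_scores = np.exp(scores)
    return exp_scores[w_o] / exp_scores.sum()

print(softmax_prob(42, 7))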
Negative Sampling
To improve efficiency, Word2Vec uses negative sampling:
$$ J(\theta) = \log \sigma({v'_{w_O}}^{\top} v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-{v'_{w_i}}^{\top} v_{w_I}) \right] $$
Where:
- $ \sigma $ is the sigmoid function
- $ k $ is the number of negative samples
- $ P_n(w) $ is the noise distribution, typically the unigram distribution raised to the 3/4 power
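As a worked illustration of this objective, the sketch below evaluates the negative-sampling score for one (input, output) pair and k randomly drawn negatives using toy vectors; all names and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)
W, dim, k = 10_000, 100, 5
v_in = rng.standard_normal((W, dim)) * 0.01   # input vectors
v_out = rng.standard_normal((W, dim)) * 0.01  # output vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(w_i, w_o, negative_ids):
    """Negative-sampling objective for one positive pair plus k negatives (to be maximized)."""
    positive = np.log(sigmoid(v_out[w_o] @ v_in[w_i]))
    negative = sum(np.log(sigmoid(-v_out[w_n] @ v_in[w_i])) for w_n in negative_ids)
    return positive + negative

negatives = rng.integers(0, W, size=k)   # in practice drawn from P_n(w), the unigram^(3/4) distribution
print(sgns_objective(7, 42, negatives))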
Implementation
PyTorch Implementation
import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
import numpy as np
class Word2Vec(nn.Module):
def __init__(self, vocab_size, embedding_dim, model_type='skipgram'):
super(Word2Vec, self).__init__()
self.model_type = model_type
# Embedding layers
self.input_embeddings = nn.Embedding(vocab_size, embedding_dim)
self.output_embeddings = nn.Embedding(vocab_size, embedding_dim)
# Initialize weights
self.input_embeddings.weight.data.uniform_(-0.5, 0.5)
self.output_embeddings.weight.data.uniform_(-0.5, 0.5)
def forward(self, input_word, output_words=None):
if self.model_type == 'skipgram':
return self.forward_skipgram(input_word, output_words)
else:
return self.forward_cbow(input_word)
def forward_skipgram(self, input_word, output_words):
# Get input embedding
input_embedding = self.input_embeddings(input_word) # [batch_size, embedding_dim]
# Get output embeddings
output_embeddings = self.output_embeddings(output_words) # [batch_size, num_neg_samples+1, embedding_dim]
# Compute scores
scores = torch.bmm(output_embeddings, input_embedding.unsqueeze(2)).squeeze(2) # [batch_size, num_neg_samples+1]
return scores
def forward_cbow(self, context_words):
# Average context embeddings
context_embeddings = self.input_embeddings(context_words) # [batch_size, context_size, embedding_dim]
input_embedding = torch.mean(context_embeddings, dim=1) # [batch_size, embedding_dim]
# Get output embedding
output_embedding = self.output_embeddings.weight # [vocab_size, embedding_dim]
# Compute scores over the full vocabulary
scores = torch.matmul(input_embedding, output_embedding.t()) # [batch_size, vocab_size]
return scores
class Word2VecTrainer:
def __init__(self, corpus, embedding_dim=100, model_type='skipgram',
window_size=5, negative_samples=5, learning_rate=0.01):
self.corpus = corpus
self.embedding_dim = embedding_dim
self.model_type = model_type
self.window_size = window_size
self.negative_samples = negative_samples
self.learning_rate = learning_rate
# Build vocabulary
self.build_vocab()
# Initialize model
self.model = Word2Vec(len(self.vocab), embedding_dim, model_type)
# Loss function and optimizer
self.criterion = nn.CrossEntropyLoss()
self.optimizer = optim.SGD(self.model.parameters(), lr=learning_rate)
def build_vocab(self):
"""Build vocabulary from corpus"""
word_counts = Counter(self.corpus)
self.vocab = {word: i for i, (word, _) in enumerate(word_counts.most_common())}
self.inv_vocab = {i: word for word, i in self.vocab.items()}
self.vocab_size = len(self.vocab)
def generate_training_data(self):
"""Generate training data for Word2Vec"""
data = []
if self.model_type == 'skipgram':
for i, target_word in enumerate(self.corpus):
# Get context window
start = max(0, i - self.window_size)
end = min(len(self.corpus), i + self.window_size + 1)
# Generate positive samples
for j in range(start, end):
if j != i:
context_word = self.corpus[j]
data.append((self.vocab[target_word], self.vocab[context_word]))
# Generate negative samples
negative_samples = []
for target_idx, context_idx in data:
negatives = []
for _ in range(self.negative_samples):
# Sample from noise distribution (unigram distribution raised to 3/4 power)
neg_word = np.random.choice(
list(self.vocab.values()),
p=self.word_probs
)
while neg_word == context_idx:
neg_word = np.random.choice(
list(self.vocab.values()),
p=self.word_probs
)
negatives.append(neg_word)
negative_samples.append(negatives)
return data, negative_samples
else: # CBOW
for i, target_word in enumerate(self.corpus):
# Get context window
start = max(0, i - self.window_size)
end = min(len(self.corpus), i + self.window_size + 1)
# Get context words
context_words = []
for j in range(start, end):
if j != i:
context_words.append(self.vocab[self.corpus[j]])
# Only keep samples with full context
if len(context_words) == 2 * self.window_size:
data.append((context_words, self.vocab[target_word]))
return data
def precompute_word_probs(self):
"""Precompute word probabilities for negative sampling"""
word_counts = Counter(self.corpus)
total_words = sum(word_counts.values())
# Compute unigram distribution raised to 3/4 power
word_probs = np.zeros(len(self.vocab))
for word, idx in self.vocab.items():
word_probs[idx] = (word_counts[word] / total_words) ** 0.75
# Normalize
word_probs = word_probs / word_probs.sum()
self.word_probs = word_probs
def train(self, epochs=5):
"""Train the Word2Vec model"""
# Precompute word probabilities for negative sampling
self.precompute_word_probs()
# Generate training data
if self.model_type == 'skipgram':
data, negative_samples = self.generate_training_data()
else:
data = self.generate_training_data()
# Convert to tensors
if self.model_type == 'skipgram':
input_words = torch.LongTensor([item[0] for item in data])
output_words = torch.LongTensor([item[1] for item in data])
negative_words = torch.LongTensor(negative_samples)
else:
context_words = torch.LongTensor([item[0] for item in data])
target_words = torch.LongTensor([item[1] for item in data])
# Training loop
for epoch in range(epochs):
total_loss = 0
if self.model_type == 'skipgram':
# Skip-gram training
for i in range(len(input_words)):
# Zero gradients
self.optimizer.zero_grad()
# Get positive sample
pos_sample = output_words[i].unsqueeze(0)
# Get negative samples
neg_samples = negative_words[i]
# Combine positive and negative samples
output_words_batch = torch.cat([pos_sample, neg_samples])
# Forward pass
scores = self.model(input_words[i].unsqueeze(0), output_words_batch.unsqueeze(0))
# Create labels: the positive word sits at index 0 of the concatenated batch,
# so each example's target class for CrossEntropyLoss is 0
labels = torch.zeros(scores.size(0), dtype=torch.long)
# Compute loss
loss = self.criterion(scores, labels)
# Backward pass
loss.backward()
self.optimizer.step()
total_loss += loss.item()
else:
# CBOW training
for i in range(len(context_words)):
# Zero gradients
self.optimizer.zero_grad()
# Forward pass
scores = self.model(context_words[i].unsqueeze(0))
# Compute loss
loss = self.criterion(scores, target_words[i].unsqueeze(0))
# Backward pass
loss.backward()
self.optimizer.step()
total_loss += loss.item()
print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(data):.4f}")
def get_embeddings(self):
"""Get the learned word embeddings"""
return self.model.input_embeddings.weight.data.cpu().numpy()
def get_word_vector(self, word):
"""Get vector for a specific word"""
if word not in self.vocab:
return None
idx = self.vocab[word]
return self.model.input_embeddings.weight.data[idx].cpu().numpy()
def find_similar_words(self, word, topn=5):
"""Find similar words using cosine similarity"""
if word not in self.vocab:
return []
# Get word vector
word_vec = self.get_word_vector(word)
if word_vec is None:
return []
# Get all word vectors
all_vecs = self.get_embeddings()
# Compute cosine similarities
similarities = []
for i, vec in enumerate(all_vecs):
if i != self.vocab[word]:
cos_sim = np.dot(word_vec, vec) / (np.linalg.norm(word_vec) * np.linalg.norm(vec))
similarities.append((self.inv_vocab[i], cos_sim))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
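A minimal usage sketch for `Word2VecTrainer` on a toy corpus (the corpus and hyperparameters are purely illustrative; meaningful embeddings require far more text):
# Toy corpus as a flat list of tokens, which is what Word2VecTrainer expects
corpus = ("the king rules the kingdom the queen rules the kingdom "
          "cats chase mice and dogs chase cats").split()

trainer = Word2VecTrainer(
    corpus,
    embedding_dim=50,
    model_type='skipgram',
    window_size=2,
    negative_samples=5,
    learning_rate=0.01,
)
trainer.train(epochs=3)

print(trainer.get_word_vector('king').shape)      # (50,)
print(trainer.find_similar_words('king', topn=3))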
TensorFlow Implementation
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Dot, Dense
from tensorflow.keras.models import Model
import numpy as np
from collections import Counter
class Word2VecTF:
def __init__(self, corpus, embedding_dim=100, model_type='skipgram',
window_size=5, negative_samples=5, learning_rate=0.01):
self.corpus = corpus
self.embedding_dim = embedding_dim
self.model_type = model_type
self.window_size = window_size
self.negative_samples = negative_samples
self.learning_rate = learning_rate
# Build vocabulary
self.build_vocab()
# Build model
self.build_model()
def build_vocab(self):
"""Build vocabulary from corpus"""
word_counts = Counter(self.corpus)
self.vocab = {word: i for i, (word, _) in enumerate(word_counts.most_common())}
self.inv_vocab = {i: word for word, i in self.vocab.items()}
self.vocab_size = len(self.vocab)
def build_model(self):
"""Build the Word2Vec model"""
if self.model_type == 'skipgram':
self.build_skipgram_model()
else:
self.build_cbow_model()
def build_skipgram_model(self):
"""Build Skip-gram model"""
# Input layers: one target word and one candidate context word per example
input_word = tf.keras.Input(shape=(1,))
output_word = tf.keras.Input(shape=(1,))
# Embedding layers
input_embedding = Embedding(
self.vocab_size, self.embedding_dim,
embeddings_initializer=tf.keras.initializers.RandomUniform(-0.5, 0.5),
name='input_embedding'
)(input_word)
output_embedding = Embedding(
self.vocab_size, self.embedding_dim,
embeddings_initializer=tf.keras.initializers.RandomUniform(-0.5, 0.5),
name='output_embedding'
)(output_word)
# Dot product between the two embeddings -> one logit per (target, context) pair
dot_product = Dot(axes=2)([output_embedding, input_embedding])
dot_product = tf.keras.layers.Flatten()(dot_product)
# Sigmoid output: 1 for a true (target, context) pair, 0 for a negative sample
output = tf.keras.layers.Activation('sigmoid')(dot_product)
# Create model
self.model = Model(inputs=[input_word, output_word], outputs=output)
self.model.compile(
optimizer=tf.keras.optimizers.SGD(learning_rate=self.learning_rate),
loss='binary_crossentropy'
)
def build_cbow_model(self):
"""Build CBOW model"""
# Input layer
context_words = tf.keras.Input(shape=(2 * self.window_size,))
# Embedding layer
embeddings = Embedding(
self.vocab_size, self.embedding_dim,
embeddings_initializer=tf.keras.initializers.RandomUniform(-0.5, 0.5),
name='embedding'
)(context_words)
# Average embeddings
avg_embedding = tf.reduce_mean(embeddings, axis=1)
# Output layer
output = Dense(self.vocab_size, activation='softmax')(avg_embedding)
# Create model
self.model = Model(inputs=context_words, outputs=output)
self.model.compile(
optimizer=tf.keras.optimizers.SGD(learning_rate=self.learning_rate),
loss='sparse_categorical_crossentropy'
)
def generate_training_data(self):
"""Generate training data for Word2Vec"""
if self.model_type == 'skipgram':
return self.generate_skipgram_data()
else:
return self.generate_cbow_data()
def generate_skipgram_data(self):
"""Generate Skip-gram training data"""
data = []
labels = []
for i, target_word in enumerate(self.corpus):
# Get context window
start = max(0, i - self.window_size)
end = min(len(self.corpus), i + self.window_size + 1)
# Generate positive samples
for j in range(start, end):
if j != i:
context_word = self.corpus[j]
data.append((self.vocab[target_word], self.vocab[context_word]))
labels.append(1) # Positive (true context) sample
# Generate negative samples
for _ in range(self.negative_samples):
# Sample from noise distribution
neg_word = np.random.choice(
list(self.vocab.values()),
p=self.word_probs
)
while neg_word == self.vocab[context_word]:
neg_word = np.random.choice(
list(self.vocab.values()),
p=self.word_probs
)
data.append((self.vocab[target_word], neg_word))
labels.append(0) # Negative sample
# Convert to numpy arrays
input_words = np.array([item[0] for item in data])
output_words = np.array([item[1] for item in data])
labels = np.array(labels)
return input_words, output_words, labels
def generate_cbow_data(self):
"""Generate CBOW training data"""
data = []
labels = []
for i, target_word in enumerate(self.corpus):
# Get context window
start = max(0, i - self.window_size)
end = min(len(self.corpus), i + self.window_size + 1)
# Get context words
context_words = []
for j in range(start, end):
if j != i:
context_words.append(self.vocab[self.corpus[j]])
# Only keep samples with full context
if len(context_words) == 2 * self.window_size:
data.append(context_words)
labels.append(self.vocab[target_word])
return np.array(data), np.array(labels)
def precompute_word_probs(self):
"""Precompute word probabilities for negative sampling"""
word_counts = Counter(self.corpus)
total_words = sum(word_counts.values())
# Compute unigram distribution raised to 3/4 power
word_probs = np.zeros(len(self.vocab))
for word, idx in self.vocab.items():
word_probs[idx] = (word_counts[word] / total_words) ** 0.75
# Normalize
word_probs = word_probs / word_probs.sum()
self.word_probs = word_probs
def train(self, epochs=5, batch_size=32):
"""Train the Word2Vec model"""
# Precompute word probabilities for negative sampling
self.precompute_word_probs()
# Generate training data
if self.model_type == 'skipgram':
input_words, output_words, labels = self.generate_training_data()
# For Skip-gram, we need to reshape the data
input_words = input_words.reshape(-1, 1)
output_words = output_words.reshape(-1, 1)
else:
context_words, target_words = self.generate_training_data()
# Training loop
for epoch in range(epochs):
if self.model_type == 'skipgram':
# Skip-gram training
history = self.model.fit(
[input_words, output_words], labels,
batch_size=batch_size,
epochs=1,
verbose=1
)
else:
# CBOW training
history = self.model.fit(
context_words, target_words,
batch_size=batch_size,
epochs=1,
verbose=1
)
print(f"Epoch {epoch+1}/{epochs}, Loss: {history.history['loss'][0]:.4f}")
def get_embeddings(self):
"""Get the learned word embeddings"""
if self.model_type == 'skipgram':
return self.model.get_layer('input_embedding').get_weights()[0]
else:
return self.model.get_layer('embedding').get_weights()[0]
def get_word_vector(self, word):
"""Get vector for a specific word"""
if word not in self.vocab:
return None
idx = self.vocab[word]
embeddings = self.get_embeddings()
return embeddings[idx]
def find_similar_words(self, word, topn=5):
"""Find similar words using cosine similarity"""
if word not in self.vocab:
return []
# Get word vector
word_vec = self.get_word_vector(word)
if word_vec is None:
return []
# Get all word vectors
all_vecs = self.get_embeddings()
# Compute cosine similarities
similarities = []
for i, vec in enumerate(all_vecs):
if i != self.vocab[word]:
cos_sim = np.dot(word_vec, vec) / (np.linalg.norm(word_vec) * np.linalg.norm(vec))
similarities.append((self.inv_vocab[i], cos_sim))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
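The TensorFlow trainer can be exercised the same way; again, the toy corpus and settings below are illustrative only:
# Same toy corpus, trained with the TensorFlow implementation above
corpus = ("the king rules the kingdom the queen rules the kingdom "
          "cats chase mice and dogs chase cats").split()

tf_trainer = Word2VecTF(
    corpus,
    embedding_dim=50,
    model_type='skipgram',
    window_size=2,
    negative_samples=5,
)
tf_trainer.train(epochs=3, batch_size=16)

print(tf_trainer.get_word_vector('king').shape)      # (50,)
print(tf_trainer.find_similar_words('king', topn=3))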
Word2Vec Applications
Semantic Relationships
Word2Vec captures semantic relationships through vector arithmetic:
# Example of semantic relationships
def demonstrate_semantic_relationships(model):
"""Demonstrate semantic relationships in Word2Vec"""
# King - Man + Woman ≈ Queen
king = model.get_word_vector('king')
man = model.get_word_vector('man')
woman = model.get_word_vector('woman')
if king is not None and man is not None and woman is not None:
queen_vector = king - man + woman
similar = model.find_similar_words_by_vector(queen_vector, topn=1)
print(f"king - man + woman ≈ {similar[0][0]}")
# Paris - France + Germany ≈ Berlin
paris = model.get_word_vector('paris')
france = model.get_word_vector('france')
germany = model.get_word_vector('germany')
if paris is not None and france is not None and germany is not None:
berlin_vector = paris - france + germany
similar = model.find_similar_words_by_vector(berlin_vector, topn=1)
print(f"paris - france + germany ≈ {similar[0][0]}")
# Car - Drive + Fly ≈ Airplane
car = model.get_word_vector('car')
drive = model.get_word_vector('drive')
fly = model.get_word_vector('fly')
if car is not None and drive is not None and fly is not None:
airplane_vector = car - drive + fly
similar = model.find_similar_words_by_vector(airplane_vector, topn=1)
print(f"car - drive + fly ≈ {similar[0][0]}")
# Helper used above; add it as a method of the Word2Vec trainer / inference model class
def find_similar_words_by_vector(self, vector, topn=5):
"""Find the words whose vectors are most similar to a given vector"""
# Get all word vectors
all_vecs = self.get_embeddings()
# Compute cosine similarities
similarities = []
for i, vec in enumerate(all_vecs):
cos_sim = np.dot(vector, vec) / (np.linalg.norm(vector) * np.linalg.norm(vec))
similarities.append((self.inv_vocab[i], cos_sim))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
Text Classification
# Word2Vec for text classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np
class Word2VecClassifier:
def __init__(self, word2vec_model):
self.word2vec = word2vec_model
self.classifier = LogisticRegression(max_iter=1000)
def document_to_vector(self, document):
"""Convert document to vector using Word2Vec"""
vectors = []
for word in document:
vec = self.word2vec.get_word_vector(word)
if vec is not None:
vectors.append(vec)
if len(vectors) == 0:
return np.zeros(self.word2vec.embedding_dim)
return np.mean(vectors, axis=0)
def train(self, documents, labels):
"""Train the classifier"""
# Convert documents to vectors
X = np.array([self.document_to_vector(doc) for doc in documents])
y = np.array(labels)
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train classifier
self.classifier.fit(X_train, y_train)
# Evaluate
y_pred = self.classifier.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
return accuracy
def predict(self, document):
"""Predict class for a document"""
vec = self.document_to_vector(document)
return self.classifier.predict([vec])[0]
def predict_proba(self, document):
"""Predict class probabilities for a document"""
vec = self.document_to_vector(document)
return self.classifier.predict_proba([vec])[0]
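A small, illustrative usage example; `trained_word2vec_model` is a placeholder for any of the trainers above (anything exposing `get_word_vector` and `embedding_dim`), and a real task would use far more labeled documents:
# Tiny illustrative dataset of tokenized documents and binary sentiment labels
documents = [
    ["great", "movie", "loved", "it"],
    ["terrible", "film", "waste", "of", "time"],
    ["fantastic", "acting", "and", "plot"],
    ["boring", "and", "predictable", "story"],
]
labels = [1, 0, 1, 0]

clf = Word2VecClassifier(trained_word2vec_model)  # placeholder: any model with get_word_vector / embedding_dim
clf.train(documents, labels)
print(clf.predict(["loved", "the", "plot"]))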
Information Retrieval
# Word2Vec for information retrieval
from sklearn.metrics.pairwise import cosine_similarity
class Word2VecRetrieval:
def __init__(self, word2vec_model):
self.word2vec = word2vec_model
def document_to_vector(self, document):
"""Convert document to vector"""
vectors = []
for word in document:
vec = self.word2vec.get_word_vector(word)
if vec is not None:
vectors.append(vec)
if len(vectors) == 0:
return np.zeros(self.word2vec.embedding_dim)
return np.mean(vectors, axis=0)
def index_documents(self, documents):
"""Index documents for retrieval"""
self.doc_vectors = []
self.documents = documents
for doc in documents:
vec = self.document_to_vector(doc)
self.doc_vectors.append(vec)
self.doc_vectors = np.array(self.doc_vectors)
def search(self, query, topn=5):
"""Search for similar documents"""
# Convert query to vector
query_vec = self.document_to_vector(query)
# Compute similarities
similarities = cosine_similarity([query_vec], self.doc_vectors)[0]
# Get top results
results = []
for i in np.argsort(similarities)[::-1][:topn]:
results.append({
'document': self.documents[i],
'similarity': similarities[i]
})
return results
def semantic_search(self, query, topn=5):
"""Semantic search using Word2Vec"""
return self.search(query, topn)
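Illustrative usage; as before, `trained_word2vec_model` is a placeholder for a trained model exposing `get_word_vector` and `embedding_dim`:
# Index a few tokenized documents and run a query (illustrative only)
docs = [
    ["the", "king", "rules", "the", "kingdom"],
    ["cats", "and", "dogs", "are", "popular", "pets"],
    ["the", "queen", "visited", "the", "castle"],
]

retriever = Word2VecRetrieval(trained_word2vec_model)  # placeholder model
retriever.index_documents(docs)

for hit in retriever.search(["royal", "palace"], topn=2):
    print(round(hit['similarity'], 3), " ".join(hit['document']))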
Recommendation Systems
# Word2Vec for recommendation systems
class Word2VecRecommender:
def __init__(self, word2vec_model):
self.word2vec = word2vec_model
def item_to_vector(self, item):
"""Convert item to vector"""
# Item can be a single word or multiple words
if isinstance(item, str):
vec = self.word2vec.get_word_vector(item)
if vec is not None:
return vec
else:
return np.zeros(self.word2vec.embedding_dim)
else:
# Multi-word item: average the vectors of its in-vocabulary words
vectors = [v for v in (self.word2vec.get_word_vector(w) for w in item) if v is not None]
if not vectors:
return np.zeros(self.word2vec.embedding_dim)
return np.mean(vectors, axis=0)
def train(self, user_history):
"""Train the recommender on user history"""
# user_history: dict of {user_id: [item1, item2, ...]}
self.user_history = user_history
self.user_vectors = {}
# Create user vectors by averaging their item vectors
for user_id, items in user_history.items():
vectors = []
for item in items:
vec = self.item_to_vector(item)
vectors.append(vec)
if len(vectors) > 0:
self.user_vectors[user_id] = np.mean(vectors, axis=0)
else:
self.user_vectors[user_id] = np.zeros(self.word2vec.embedding_dim)
def recommend(self, user_id, candidate_items, topn=5):
"""Recommend items to a user"""
if user_id not in self.user_vectors:
return []
# Get user vector
user_vec = self.user_vectors[user_id]
# Compute similarities with candidate items
similarities = []
for item in candidate_items:
item_vec = self.item_to_vector(item)
sim = cosine_similarity([user_vec], [item_vec])[0][0]
similarities.append((item, sim))
# Sort by similarity
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
def similar_items(self, item, topn=5):
"""Find similar items"""
return self.word2vec.find_similar_words(item, topn)
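Illustrative usage with word-like items; in a real recommender the "corpus" would be sequences of item identifiers from interaction logs, and `trained_word2vec_model` is again a placeholder:
# Items here are single words; in practice they could be product identifiers
# drawn from the same vocabulary the embeddings were trained on
user_history = {
    "user_1": ["laptop", "mouse", "keyboard"],
    "user_2": ["novel", "bookmark", "lamp"],
}

recommender = Word2VecRecommender(trained_word2vec_model)  # placeholder model
recommender.train(user_history)
print(recommender.recommend("user_1", ["monitor", "cookbook", "headphones"], topn=2))
print(recommender.similar_items("laptop", topn=3))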
Word2Vec Variants and Extensions
Word2Vec with Subword Information
# Word2Vec with subword information (similar to FastText)
class Word2VecSubword(nn.Module):
def __init__(self, vocab_size, embedding_dim, model_type='skipgram', n_grams=3):
super(Word2VecSubword, self).__init__()
self.vocab_size = vocab_size
self.embedding_dim = embedding_dim
self.model_type = model_type
self.n_grams = n_grams
# Character n-gram embeddings
self.char_embeddings = nn.Embedding(26 + 1, embedding_dim) # 26 letters + padding
# Word embeddings
self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
def get_subword_vectors(self, word):
"""Get subword vectors for a word"""
# Convert word to character indices
char_indices = [ord(c) - ord('a') + 1 for c in word.lower() if c.isalpha()]
char_indices = torch.LongTensor(char_indices)
# Get character embeddings
char_embeds = self.char_embeddings(char_indices) # [len(word), embedding_dim]
# Compute n-grams
n_gram_vectors = []
for n in range(1, self.n_grams + 1):
for i in range(len(char_indices) - n + 1):
n_gram = char_embeds[i:i+n]
n_gram_vectors.append(torch.mean(n_gram, dim=0))
if len(n_gram_vectors) == 0:
return torch.zeros(self.embedding_dim)
return torch.mean(torch.stack(n_gram_vectors), dim=0)
def forward(self, input_word, output_words=None):
"""Forward pass"""
if self.model_type == 'skipgram':
return self.forward_skipgram(input_word, output_words)
else:
return self.forward_cbow(input_word)
def forward_skipgram(self, input_word, output_words):
"""Skip-gram forward pass with subword information"""
# Get word embedding
word_embedding = self.word_embeddings(input_word) # [batch_size, embedding_dim]
# Get subword embedding (assumes an index-to-word mapping self.inv_vocab is attached to the model)
subword_embedding = self.get_subword_vectors(self.inv_vocab[input_word.item()])
subword_embedding = subword_embedding.unsqueeze(0).expand(word_embedding.size(0), -1)
# Combine word and subword embeddings
combined_embedding = word_embedding + subword_embedding
# Get output embeddings
output_embeddings = self.word_embeddings(output_words) # [batch_size, num_neg_samples+1, embedding_dim]
# Compute scores
scores = torch.bmm(output_embeddings, combined_embedding.unsqueeze(2)).squeeze(2)
return scores
Contextual Word2Vec
# Contextual Word2Vec that considers sentence context
class ContextualWord2Vec(nn.Module):
def __init__(self, vocab_size, embedding_dim, hidden_dim, model_type='skipgram'):
super(ContextualWord2Vec, self).__init__()
self.model_type = model_type
# Word embeddings
self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
# Context encoder (LSTM)
self.context_encoder = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
# Output embeddings
self.output_embeddings = nn.Embedding(vocab_size, hidden_dim)
def forward(self, input_word, context_words, output_words=None):
"""Forward pass with context"""
if self.model_type == 'skipgram':
return self.forward_skipgram(input_word, context_words, output_words)
else:
return self.forward_cbow(input_word, context_words)
def forward_skipgram(self, input_word, context_words, output_words):
"""Skip-gram forward pass with context"""
# Get input word embedding
input_embedding = self.word_embeddings(input_word) # [batch_size, embedding_dim]
# Encode context
context_embeddings = self.word_embeddings(context_words) # [batch_size, context_size, embedding_dim]
_, (hidden, _) = self.context_encoder(context_embeddings)
# Combine input and context
combined = input_embedding + hidden.squeeze(0)
# Get output embeddings
output_embeddings = self.output_embeddings(output_words) # [batch_size, num_neg_samples+1, hidden_dim]
# Compute scores
scores = torch.bmm(output_embeddings, combined.unsqueeze(2)).squeeze(2)
return scores
Word2Vec vs Other Embedding Methods
| Feature | Word2Vec | GloVe | FastText | BERT/Transformers |
|---|---|---|---|---|
| Training Method | Predictive (neural network) | Count-based (matrix factorization) | Predictive with subword info | Contextual (transformer) |
| Context Handling | Local context window | Global co-occurrence | Local context with subwords | Full sentence context |
| Subword Info | No | No | Yes (character n-grams) | Yes (WordPiece/BytePair) |
| Training Speed | Fast | Very fast | Moderate | Slow |
| Memory Usage | Low | Low | Moderate | High |
| Rare Words | Poor handling | Poor handling | Good handling | Excellent handling |
| Contextual | No | No | No | Yes |
| Transfer Learning | Good | Good | Good | Excellent |
| Implementation | Simple | Simple | Moderate | Complex |
| Use Case | General purpose | General purpose | Morphologically rich languages | Context-dependent tasks |
| Vector Arithmetic | Excellent | Good | Good | Limited |
| Pre-trained Models | Available | Available | Available | Widely available |
| Fine-tuning | Not applicable | Not applicable | Not applicable | Yes |
Training Word2Vec
Best Practices
| Aspect | Recommendation | Notes |
|---|---|---|
| Corpus Size | Large corpus (100M+ words) | More data = better embeddings |
| Vocabulary Size | 50K-300K words | Balance between coverage and efficiency |
| Embedding Dim | 100-300 dimensions | 300 is common for good performance |
| Window Size | 5-10 words | Larger windows capture broader context |
| Model Type | Skip-gram for large datasets | CBOW is faster for smaller datasets |
| Negative Samples | 5-20 | More samples = better quality |
| Subsampling | Use subsampling of frequent words | Improves quality and training speed |
| Iterations | 5-50 epochs | More iterations for larger datasets |
| Learning Rate | 0.01-0.05 | Start high, decay over time |
| Batch Size | 100-1000 | Larger batches for stability |
| Hardware | GPU acceleration | Significant speedup for large datasets |
| Evaluation | Use intrinsic and extrinsic evaluation | Word similarity, analogy tasks, downstream tasks |
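For reference, the recommendations above map roughly onto the following Gensim 4.x configuration (a sketch; `sentences` stands in for an iterable of tokenized sentences from your corpus):
from gensim.models import Word2Vec

# 'sentences' stands in for an iterable of tokenized sentences from a large corpus
model = Word2Vec(
    sentences,
    vector_size=300,   # embedding dimension
    window=5,          # context window size
    sg=1,              # Skip-gram (use sg=0 for CBOW)
    negative=10,       # negative samples per positive pair
    sample=1e-3,       # subsampling threshold for frequent words
    min_count=5,       # drop words rarer than this
    alpha=0.025,       # initial learning rate, decayed during training
    epochs=10,
    workers=4,         # CPU threads
)
model.save("word2vec.model")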
Training Pipeline
# Complete Word2Vec training pipeline
class Word2VecPipeline:
def __init__(self, corpus_path, embedding_dim=300, model_type='skipgram',
window_size=5, negative_samples=5, min_count=5):
self.corpus_path = corpus_path
self.embedding_dim = embedding_dim
self.model_type = model_type
self.window_size = window_size
self.negative_samples = negative_samples
self.min_count = min_count
def load_corpus(self):
"""Load and preprocess corpus"""
with open(self.corpus_path, 'r', encoding='utf-8') as f:
text = f.read()
# Basic preprocessing
text = text.lower()
words = text.split()
# Filter rare words
word_counts = Counter(words)
words = [word for word in words if word_counts[word] >= self.min_count]
return words
def train_model(self, epochs=10):
"""Train Word2Vec model"""
# Load corpus
corpus = self.load_corpus()
# Initialize model
trainer = Word2VecTrainer(
corpus,
embedding_dim=self.embedding_dim,
model_type=self.model_type,
window_size=self.window_size,
negative_samples=self.negative_samples
)
# Train model
trainer.train(epochs=epochs)
return trainer
def evaluate_model(self, model):
"""Evaluate the trained model"""
# Intrinsic evaluation - nearest neighbours by cosine similarity
print("Nearest neighbours:")
print(f"most similar to 'king': {model.find_similar_words('king', 1)}")
print(f"most similar to 'paris': {model.find_similar_words('paris', 1)}")
print(f"most similar to 'car': {model.find_similar_words('car', 1)}")
# Intrinsic evaluation - analogy tasks
self.evaluate_analogies(model)
def evaluate_analogies(self, model):
"""Evaluate on analogy tasks"""
analogies = [
('king', 'man', 'woman', 'queen'),
('paris', 'france', 'germany', 'berlin'),
('big', 'bigger', 'small', 'smaller'),
('good', 'better', 'bad', 'worse'),
('jump', 'jumped', 'run', 'ran')
]
correct = 0
for a, b, c, expected in analogies:
try:
# Compute a - b + c
a_vec = model.get_word_vector(a)
b_vec = model.get_word_vector(b)
c_vec = model.get_word_vector(c)
if a_vec is not None and b_vec is not None and c_vec is not None:
result_vec = a_vec - b_vec + c_vec
similar = model.find_similar_words_by_vector(result_vec, topn=1)[0][0]
if similar == expected:
correct += 1
print(f"✓ {a} - {b} + {c} = {similar}")
else:
print(f"✗ {a} - {b} + {c} = {similar} (expected {expected})")
except Exception:
print(f"✗ {a} - {b} + {c} = ? (missing words)")
print(f"Accuracy: {correct}/{len(analogies)} = {correct/len(analogies):.2f}")
def save_model(self, model, output_path):
"""Save the trained model"""
import pickle
# Save model and metadata
model_data = {
'embeddings': model.get_embeddings(),
'vocab': model.vocab,
'inv_vocab': model.inv_vocab,
'config': {
'embedding_dim': self.embedding_dim,
'model_type': self.model_type,
'window_size': self.window_size,
'negative_samples': self.negative_samples
}
}
with open(output_path, 'wb') as f:
pickle.dump(model_data, f)
def load_model(self, model_path):
"""Load a trained model"""
import pickle
with open(model_path, 'rb') as f:
model_data = pickle.load(f)
# Create a lightweight model for inference
class InferenceModel:
def __init__(self, model_data):
self.embeddings = model_data['embeddings']
self.vocab = model_data['vocab']
self.inv_vocab = model_data['inv_vocab']
self.embedding_dim = model_data['config']['embedding_dim']
def get_word_vector(self, word):
if word not in self.vocab:
return None
idx = self.vocab[word]
return self.embeddings[idx]
def find_similar_words(self, word, topn=5):
if word not in self.vocab:
return []
word_vec = self.get_word_vector(word)
if word_vec is None:
return []
similarities = []
for i, vec in enumerate(self.embeddings):
if i != self.vocab[word]:
cos_sim = np.dot(word_vec, vec) / (np.linalg.norm(word_vec) * np.linalg.norm(vec))
similarities.append((self.inv_vocab[i], cos_sim))
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
def find_similar_words_by_vector(self, vector, topn=5):
similarities = []
for i, vec in enumerate(self.embeddings):
cos_sim = np.dot(vector, vec) / (np.linalg.norm(vector) * np.linalg.norm(vec))
similarities.append((self.inv_vocab[i], cos_sim))
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:topn]
return InferenceModel(model_data)
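An illustrative end-to-end run of the pipeline; `corpus.txt` is a placeholder path, and the analogy evaluation assumes the `find_similar_words_by_vector` helper from the Applications section has been added to the trainer class:
# Illustrative end-to-end run; 'corpus.txt' is a placeholder path to a plain-text corpus
pipeline = Word2VecPipeline('corpus.txt', embedding_dim=300, model_type='skipgram')
model = pipeline.train_model(epochs=10)
pipeline.evaluate_model(model)
pipeline.save_model(model, 'word2vec_model.pkl')

# Reload a lightweight inference-only model later
inference_model = pipeline.load_model('word2vec_model.pkl')
print(inference_model.find_similar_words('king', topn=5))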
Word2Vec Research
Key Papers
- "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., 2013)
- Introduced Word2Vec
- Demonstrated efficient training of word vectors
- Foundation for modern word embeddings
- "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013)
- Introduced Skip-gram with negative sampling
- Demonstrated word vector arithmetic
- Foundation for semantic relationships
- "Linguistic Regularities in Continuous Space Word Representations" (Mikolov et al., 2013)
- Demonstrated semantic and syntactic regularities
- Showed vector arithmetic for analogies
- Foundation for understanding word vector properties
- "word2vec Parameter Learning Explained" (Rong, 2014)
- Provided detailed explanation of Word2Vec mathematics
- Clarified training process
- Foundation for understanding implementation details
- "Enriching Word Vectors with Subword Information" (Bojanowski et al., 2017)
- Introduced subword information (FastText)
- Demonstrated improved handling of rare words
- Foundation for subword embeddings
Emerging Research Directions
- Contextual Word2Vec: Incorporating sentence context
- Multilingual Word2Vec: Cross-lingual embeddings
- Domain-Specific Word2Vec: Specialized embeddings for specific domains
- Dynamic Word2Vec: Time-evolving word representations
- Interpretable Word2Vec: More interpretable vector spaces
- Efficient Word2Vec: Faster training algorithms
- Multimodal Word2Vec: Combining text with other modalities
- Cognitive Word2Vec: Brain-inspired word representations
- Green Word2Vec: Energy-efficient training
- Few-Shot Word2Vec: Learning from limited data
- Adversarial Word2Vec: Robust word representations
- Theoretical Foundations: Better understanding of word vectors
- Hardware Acceleration: Specialized hardware for Word2Vec
Best Practices
Implementation Guidelines
| Aspect | Recommendation | Notes |
|---|---|---|
| Preprocessing | Clean text, handle case, tokenize | Remove noise, normalize text |
| Vocabulary | Filter rare words, handle OOV | Use min_count, consider subword info |
| Model Selection | Skip-gram for large datasets | CBOW for smaller datasets |
| Hyperparameters | Tune embedding dim, window size | 300 dim, window size 5-10 common |
| Training | Use negative sampling, subsampling | 5-20 negative samples, subsampling rate 1e-3 to 1e-5 |
| Evaluation | Use multiple metrics | Word similarity, analogy tasks, downstream tasks |
| Deployment | Optimize for inference | Use efficient data structures |
| Monitoring | Track training loss, evaluation metrics | Early stopping based on validation |
| Scaling | Use distributed training for large data | Consider GPU acceleration |
Common Pitfalls and Solutions
| Pitfall | Solution | Example |
|---|---|---|
| Small Corpus | Use pre-trained embeddings | GloVe, FastText pre-trained vectors |
| Rare Words | Use subword information | FastText, character n-grams |
| Overfitting | Use early stopping, regularization | Monitor validation loss |
| Slow Training | Use negative sampling, subsampling | Negative sampling with k=5-20 |
| Poor Quality | Increase corpus size, tune hyperparameters | Use larger corpus, adjust window size |
| Memory Issues | Use memory-efficient implementations | Gensim, optimized C implementations |
| Context Window | Experiment with different window sizes | Try 5, 10, 15 word windows |
| Learning Rate | Use learning rate scheduling | Start with 0.025, decay over time |
| Evaluation Bias | Use multiple evaluation metrics | Word similarity + downstream tasks |
| Domain Mismatch | Fine-tune on domain-specific data | Continue training on domain corpus |
| OOV Words | Use subword models or fallback strategies | FastText, character-level models |
| Interpretability | Use dimensionality reduction techniques | PCA, t-SNE for visualization |
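For the interpretability and visualization points above, a common sanity check is to project a few embeddings into 2D. Below is a minimal sketch with scikit-learn and matplotlib; `model` is a placeholder for any trained model exposing `get_word_vector`:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# 'model' is a placeholder for any trained model exposing get_word_vector(word)
words = ['king', 'queen', 'man', 'woman', 'paris', 'berlin', 'france', 'germany']
vectors = np.array([model.get_word_vector(w) for w in words])

# Project the embeddings onto their first two principal components
coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for word, (x, y) in zip(words, coords):
    plt.annotate(word, (x, y))
plt.title('Word2Vec embeddings projected with PCA')
plt.show()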
Future Directions
- Contextual Word Embeddings: Moving beyond static embeddings to contextual representations
- Multimodal Embeddings: Combining text with images, audio, and other modalities
- Dynamic Embeddings: Time-evolving word representations that capture language change
- Interpretable Embeddings: More human-understandable vector spaces
- Efficient Training: Faster algorithms for large-scale training
- Green Embeddings: Energy-efficient training methods
- Multilingual Embeddings: Better cross-lingual representations
- Domain-Specific Embeddings: Specialized embeddings for specific domains
- Few-Shot Learning: Learning embeddings from limited data
- Adversarial Robustness: Robust embeddings against adversarial attacks
- Cognitive Models: Brain-inspired word representations
- Theoretical Breakthroughs: Better understanding of word vector properties
- Hardware Acceleration: Specialized hardware for embedding training
External Resources
- Original Word2Vec Paper (Mikolov et al.)
- Word2Vec Parameter Learning Explained (Rong)
- Word2Vec Implementation (Gensim)
- Word2Vec Tutorial (TensorFlow)
- Word2Vec Tutorial (PyTorch)
- Pre-trained Word Vectors (GloVe)
- FastText Pre-trained Vectors
- Word2Vec Visualization (TensorBoard)
- Word Embedding Evaluation (SimLex-999)
- Word Embedding Benchmarks
- Word2Vec vs GloVe Comparison
- Word2Vec for Recommendation Systems
- Contextual Word Embeddings (ELMo)
- Multilingual Word Embeddings
- Word2Vec Hardware Acceleration