Word2Vec

Word embedding technique that represents words as dense vectors capturing semantic relationships.

What is Word2Vec?

Word2Vec is a groundbreaking word embedding technique that transforms words into dense vector representations, capturing semantic and syntactic relationships between words. Developed by Tomas Mikolov and researchers at Google in 2013, Word2Vec revolutionized natural language processing by enabling machines to understand word meanings through their vector representations.

Key Characteristics

  • Dense Representations: Compact vector representations (typically 50-300 dimensions)
  • Semantic Relationships: Captures meaning through vector arithmetic
  • Efficient Training: Uses shallow neural networks for fast training
  • Contextual Understanding: Learns from word co-occurrence patterns
  • Dimensionality Reduction: Reduces high-dimensional one-hot vectors to low-dimensional dense vectors
  • Transfer Learning: Pre-trained embeddings can be used across tasks
  • Scalability: Can handle large vocabularies efficiently
  • Interpretability: Vectors capture human-interpretable relationships
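
These properties are easy to explore with an off-the-shelf library before diving into the implementations below. A minimal sketch using gensim (assumes gensim 4.x is installed; the toy corpus and hyperparameters are illustrative only):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (real training needs far more data)
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "cat", "sat", "on", "the", "mat"],
]

# sg=1 selects Skip-gram; sg=0 would select CBOW
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1, epochs=50)

print(model.wv["king"].shape)          # dense vector, here 100 dimensions
print(model.wv.most_similar("king"))   # nearest neighbours by cosine similarity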

Word2Vec Models

Word2Vec comes in two main architectures:

1. Continuous Bag of Words (CBOW)

graph LR
    A[Context Words] --> B[Input Layer]
    B --> C[Projection Layer]
    C --> D[Output Layer]
    D --> E[Target Word]

    style A fill:#f9f,stroke:#333
    style E fill:#f9f,stroke:#333

CBOW Architecture:

  • Predicts a target word from its context words
  • Uses the average of context word vectors as input
  • Typically faster to train than Skip-gram
  • Works well with smaller datasets
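
A minimal NumPy sketch of the CBOW step described above (random illustrative embeddings): the context vectors are averaged and the average is scored against every vocabulary word.

import numpy as np

rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
W_in = rng.normal(size=(vocab_size, dim))     # input (context) embeddings
W_out = rng.normal(size=(vocab_size, dim))    # output (target) embeddings

context_ids = [2, 5, 7, 1]                    # indices of the context words
h = W_in[context_ids].mean(axis=0)            # average context vector
scores = W_out @ h                            # one score per vocabulary word
probs = np.exp(scores) / np.exp(scores).sum() # softmax over the vocabulary
print(probs.argmax())                         # predicted target word index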

2. Skip-gram

graph LR
    A[Target Word] --> B[Input Layer]
    B --> C[Projection Layer]
    C --> D[Output Layer]
    D --> E[Context Words]

    style A fill:#f9f,stroke:#333
    style E fill:#f9f,stroke:#333

Skip-gram Architecture:

  • Predicts context words from a target word
  • Uses the target word vector to predict surrounding words
  • Typically performs better on larger datasets
  • Better at capturing rare words
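
The sketch below shows how Skip-gram (target, context) training pairs are extracted from a sentence, assuming an illustrative window size of 2:

sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 2

pairs = []
for i, target in enumerate(sentence):
    start, end = max(0, i - window), min(len(sentence), i + window + 1)
    for j in range(start, end):
        if j != i:
            pairs.append((target, sentence[j]))  # (target, context) pair

print(pairs[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ('quick', 'brown')]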

Mathematical Foundations

Objective Function

Word2Vec optimizes the following objective:

For CBOW: $$ J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t | w_{t-n}, \dots, w_{t-1}, w_{t+1}, \dots, w_{t+n}) $$

For Skip-gram: $$ J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \sum_{-n \leq j \leq n,\, j \neq 0} \log p(w_{t+j} | w_t) $$

Where:

  • $ T $ is the number of words in the corpus
  • $ w_t $ is the target word
  • $ w_{t+j} $ are context words
  • $ n $ is the context window size
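
For example, with a window size of $ n = 2 $ and the sentence "the cat sat on the mat", the Skip-gram term for position $ t = 3 $ (the word "sat") is

$$ \log p(w_1 | w_3) + \log p(w_2 | w_3) + \log p(w_4 | w_3) + \log p(w_5 | w_3) $$

i.e. the model is trained to predict "the", "cat", "on", and "the" from "sat".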

Softmax Probability

The probability of a word given its context is computed using softmax:

$$ p(w_O | w_I) = \frac{\exp(v_{w_O}^\top v_{w_I})}{\sum_{w=1}^{W} \exp(v_w^\top v_{w_I})} $$

Where:

  • $ w_O $ is the output word
  • $ w_I $ is the input word
  • $ v_w $ is the vector representation of word $ w $
  • $ W $ is the vocabulary size
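
A minimal NumPy sketch of this softmax with random illustrative vectors (real implementations, including the ones below, usually keep separate input and output embedding tables):

import numpy as np

rng = np.random.default_rng(0)
W, dim = 1000, 100                        # vocabulary size and embedding dimension
V = rng.normal(scale=0.1, size=(W, dim))  # one vector v_w per vocabulary word

w_I, w_O = 3, 42                   # indices of the input and output words
logits = V @ V[w_I]                # v_w^T v_{w_I} for every w in the vocabulary
logits -= logits.max()             # stabilize the exponentials
p = np.exp(logits) / np.exp(logits).sum()
print(p[w_O])                      # p(w_O | w_I)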

Negative Sampling

To improve efficiency, Word2Vec uses negative sampling:

$$ J(\theta) = \log \sigma(v_{w_O}^\top v_{w_I}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-v_{w_i}^\top v_{w_I}) \right] $$

Where:

  • $ \sigma $ is the sigmoid function
  • $ k $ is the number of negative samples
  • $ P_n(w) $ is the noise distribution
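
A small sketch of the per-pair objective in NumPy (illustrative random vectors; k = 5 negative samples):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
dim, k = 100, 5
v_input = rng.normal(scale=0.1, size=dim)           # v_{w_I}
v_output = rng.normal(scale=0.1, size=dim)          # v_{w_O}, the observed context word
v_negatives = rng.normal(scale=0.1, size=(k, dim))  # k words drawn from P_n(w)

objective = (np.log(sigmoid(v_output @ v_input))
             + np.log(sigmoid(-(v_negatives @ v_input))).sum())
print(objective)  # training maximizes this (i.e. minimizes its negative)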

Implementation

PyTorch Implementation

import torch
import torch.nn as nn
import torch.optim as optim
from collections import Counter
import numpy as np

class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_dim, model_type='skipgram'):
        super(Word2Vec, self).__init__()
        self.model_type = model_type

        # Embedding layers
        self.input_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Initialize weights
        self.input_embeddings.weight.data.uniform_(-0.5, 0.5)
        self.output_embeddings.weight.data.uniform_(-0.5, 0.5)

    def forward(self, input_word, output_words=None):
        if self.model_type == 'skipgram':
            return self.forward_skipgram(input_word, output_words)
        else:
            return self.forward_cbow(input_word)

    def forward_skipgram(self, input_word, output_words):
        # Get input embedding
        input_embedding = self.input_embeddings(input_word)  # [batch_size, embedding_dim]

        # Get output embeddings
        output_embeddings = self.output_embeddings(output_words)  # [batch_size, num_neg_samples+1, embedding_dim]

        # Compute scores
        scores = torch.bmm(output_embeddings, input_embedding.unsqueeze(2)).squeeze(2)  # [batch_size, num_neg_samples+1]

        return scores

    def forward_cbow(self, context_words):
        # Average context embeddings
        context_embeddings = self.input_embeddings(context_words)  # [batch_size, context_size, embedding_dim]
        input_embedding = torch.mean(context_embeddings, dim=1)  # [batch_size, embedding_dim]

        # Get output embedding
        output_embedding = self.output_embeddings.weight  # [vocab_size, embedding_dim]

        # Compute scores over the vocabulary (rows = batch, columns = vocab) for CrossEntropyLoss
        scores = torch.matmul(input_embedding, output_embedding.t())  # [batch_size, vocab_size]

        return scores

class Word2VecTrainer:
    def __init__(self, corpus, embedding_dim=100, model_type='skipgram',
                 window_size=5, negative_samples=5, learning_rate=0.01):
        self.corpus = corpus
        self.embedding_dim = embedding_dim
        self.model_type = model_type
        self.window_size = window_size
        self.negative_samples = negative_samples
        self.learning_rate = learning_rate

        # Build vocabulary
        self.build_vocab()

        # Initialize model
        self.model = Word2Vec(len(self.vocab), embedding_dim, model_type)

        # Loss function and optimizer
        self.criterion = nn.CrossEntropyLoss()
        self.optimizer = optim.SGD(self.model.parameters(), lr=learning_rate)

    def build_vocab(self):
        """Build vocabulary from corpus"""
        word_counts = Counter(self.corpus)
        self.vocab = {word: i for i, (word, _) in enumerate(word_counts.most_common())}
        self.inv_vocab = {i: word for word, i in self.vocab.items()}
        self.vocab_size = len(self.vocab)

    def generate_training_data(self):
        """Generate training data for Word2Vec"""
        data = []

        if self.model_type == 'skipgram':
            for i, target_word in enumerate(self.corpus):
                # Get context window
                start = max(0, i - self.window_size)
                end = min(len(self.corpus), i + self.window_size + 1)

                # Generate positive samples
                for j in range(start, end):
                    if j != i:
                        context_word = self.corpus[j]
                        data.append((self.vocab[target_word], self.vocab[context_word]))

            # Generate negative samples
            negative_samples = []
            for target_idx, context_idx in data:
                negatives = []
                for _ in range(self.negative_samples):
                    # Sample from noise distribution (unigram distribution raised to 3/4 power)
                    neg_word = np.random.choice(
                        list(self.vocab.values()),
                        p=self.word_probs
                    )
                    while neg_word == context_idx:
                        neg_word = np.random.choice(
                            list(self.vocab.values()),
                            p=self.word_probs
                        )
                    negatives.append(neg_word)
                negative_samples.append(negatives)

            return data, negative_samples

        else:  # CBOW
            for i, target_word in enumerate(self.corpus):
                # Get context window
                start = max(0, i - self.window_size)
                end = min(len(self.corpus), i + self.window_size + 1)

                # Get context words
                context_words = []
                for j in range(start, end):
                    if j != i:
                        context_words.append(self.vocab[self.corpus[j]])

                # Only keep samples with full context
                if len(context_words) == 2 * self.window_size:
                    data.append((context_words, self.vocab[target_word]))

            return data

    def precompute_word_probs(self):
        """Precompute word probabilities for negative sampling"""
        word_counts = Counter(self.corpus)
        total_words = sum(word_counts.values())

        # Compute unigram distribution raised to 3/4 power
        word_probs = np.zeros(len(self.vocab))
        for word, idx in self.vocab.items():
            word_probs[idx] = (word_counts[word] / total_words) ** 0.75

        # Normalize
        word_probs = word_probs / word_probs.sum()
        self.word_probs = word_probs

    def train(self, epochs=5):
        """Train the Word2Vec model"""
        # Precompute word probabilities for negative sampling
        self.precompute_word_probs()

        # Generate training data
        if self.model_type == 'skipgram':
            data, negative_samples = self.generate_training_data()
        else:
            data = self.generate_training_data()

        # Convert to tensors
        if self.model_type == 'skipgram':
            input_words = torch.LongTensor([item[0] for item in data])
            output_words = torch.LongTensor([item[1] for item in data])
            negative_words = torch.LongTensor(negative_samples)
        else:
            context_words = torch.LongTensor([item[0] for item in data])
            target_words = torch.LongTensor([item[1] for item in data])

        # Training loop
        for epoch in range(epochs):
            total_loss = 0

            if self.model_type == 'skipgram':
                # Skip-gram training
                for i in range(len(input_words)):
                    # Zero gradients
                    self.optimizer.zero_grad()

                    # Get positive sample
                    pos_sample = output_words[i].unsqueeze(0)

                    # Get negative samples
                    neg_samples = negative_words[i]

                    # Combine positive and negative samples
                    output_words_batch = torch.cat([pos_sample, neg_samples])

                    # Forward pass
                    scores = self.model(input_words[i].unsqueeze(0), output_words_batch.unsqueeze(0))

                    # CrossEntropyLoss expects the index of the correct class;
                    # the positive sample was concatenated first, so its index is 0
                    labels = torch.zeros(1, dtype=torch.long)

                    # Compute loss
                    loss = self.criterion(scores, labels)

                    # Backward pass
                    loss.backward()
                    self.optimizer.step()

                    total_loss += loss.item()

            else:
                # CBOW training
                for i in range(len(context_words)):
                    # Zero gradients
                    self.optimizer.zero_grad()

                    # Forward pass
                    scores = self.model(context_words[i].unsqueeze(0))

                    # Compute loss
                    loss = self.criterion(scores, target_words[i].unsqueeze(0))

                    # Backward pass
                    loss.backward()
                    self.optimizer.step()

                    total_loss += loss.item()

            print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(data):.4f}")

    def get_embeddings(self):
        """Get the learned word embeddings"""
        return self.model.input_embeddings.weight.data.cpu().numpy()

    def get_word_vector(self, word):
        """Get vector for a specific word"""
        if word not in self.vocab:
            return None
        idx = self.vocab[word]
        return self.model.input_embeddings.weight.data[idx].cpu().numpy()

    def find_similar_words(self, word, topn=5):
        """Find similar words using cosine similarity"""
        if word not in self.vocab:
            return []

        # Get word vector
        word_vec = self.get_word_vector(word)
        if word_vec is None:
            return []

        # Get all word vectors
        all_vecs = self.get_embeddings()

        # Compute cosine similarities
        similarities = []
        for i, vec in enumerate(all_vecs):
            if i != self.vocab[word]:
                cos_sim = np.dot(word_vec, vec) / (np.linalg.norm(word_vec) * np.linalg.norm(vec))
                similarities.append((self.inv_vocab[i], cos_sim))

        # Sort by similarity
        similarities.sort(key=lambda x: x[1], reverse=True)

        return similarities[:topn]
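
A short usage sketch for the trainer above (toy corpus and tiny dimensions so it runs quickly; embeddings learned from so little text are not meaningful):

corpus = ("the king rules the kingdom the queen rules the kingdom "
          "the cat sat on the mat").split()

trainer = Word2VecTrainer(corpus, embedding_dim=16, model_type='skipgram',
                          window_size=2, negative_samples=3, learning_rate=0.05)
trainer.train(epochs=3)

print(trainer.get_word_vector('king').shape)      # (16,)
print(trainer.find_similar_words('king', topn=3))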

TensorFlow Implementation

import tensorflow as tf
from tensorflow.keras.layers import Embedding, Dot, Dense
from tensorflow.keras.models import Model
import numpy as np
from collections import Counter

class Word2VecTF:
    def __init__(self, corpus, embedding_dim=100, model_type='skipgram',
                 window_size=5, negative_samples=5, learning_rate=0.01):
        self.corpus = corpus
        self.embedding_dim = embedding_dim
        self.model_type = model_type
        self.window_size = window_size
        self.negative_samples = negative_samples
        self.learning_rate = learning_rate

        # Build vocabulary
        self.build_vocab()

        # Build model
        self.build_model()

    def build_vocab(self):
        """Build vocabulary from corpus"""
        word_counts = Counter(self.corpus)
        self.vocab = {word: i for i, (word, _) in enumerate(word_counts.most_common())}
        self.inv_vocab = {i: word for word, i in self.vocab.items()}
        self.vocab_size = len(self.vocab)

    def build_model(self):
        """Build the Word2Vec model"""
        if self.model_type == 'skipgram':
            self.build_skipgram_model()
        else:
            self.build_cbow_model()

    def build_skipgram_model(self):
        """Build Skip-gram model"""
        # Input layer
        input_word = tf.keras.Input(shape=(1,))
        output_word = tf.keras.Input(shape=(1 + self.negative_samples,))

        # Embedding layers
        input_embedding = Embedding(
            self.vocab_size, self.embedding_dim,
            embeddings_initializer=tf.keras.initializers.RandomUniform(-0.5, 0.5),
            name='input_embedding'
        )(input_word)

        output_embedding = Embedding(
            self.vocab_size, self.embedding_dim,
            embeddings_initializer=tf.keras.initializers.RandomUniform(-0.5, 0.5),
            name='output_embedding'
        )(output_word)

        # Dot product between each candidate output embedding and the input embedding
        dot_product = Dot(axes=2)([output_embedding, input_embedding])  # [batch, 1 + neg, 1]
        dot_product = tf.squeeze(dot_product, axis=-1)  # [batch, 1 + neg]

        # Softmax over the positive and negative candidates (positive is at index 0)
        output = tf.keras.layers.Softmax()(dot_product)

        # Create model
        self.model = Model(inputs=[input_word, output_word], outputs=output)
        self.model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=self.learning_rate),
            loss='sparse_categorical_crossentropy'
        )

    def build_cbow_model(self):
        """Build CBOW model"""
        # Input layer
        context_words = tf.keras.Input(shape=(2 * self.window_size,))

        # Embedding layer
        embeddings = Embedding(
            self.vocab_size, self.embedding_dim,
            embeddings_initializer=tf.keras.initializers.RandomUniform(-0.5, 0.5),
            name='embedding'
        )(context_words)

        # Average embeddings
        avg_embedding = tf.reduce_mean(embeddings, axis=1)

        # Output layer
        output = Dense(self.vocab_size, activation='softmax')(avg_embedding)

        # Create model
        self.model = Model(inputs=context_words, outputs=output)
        self.model.compile(
            optimizer=tf.keras.optimizers.SGD(learning_rate=self.learning_rate),
            loss='sparse_categorical_crossentropy'
        )

    def generate_training_data(self):
        """Generate training data for Word2Vec"""
        if self.model_type == 'skipgram':
            return self.generate_skipgram_data()
        else:
            return self.generate_cbow_data()

    def generate_skipgram_data(self):
        """Generate Skip-gram training data with negative sampling.

        Each sample pairs a target word with 1 + negative_samples candidate
        output words; the true context word is always placed at index 0, so
        the sparse_categorical_crossentropy label is 0.
        """
        input_words = []
        output_words = []
        labels = []

        for i, target_word in enumerate(self.corpus):
            # Get context window
            start = max(0, i - self.window_size)
            end = min(len(self.corpus), i + self.window_size + 1)

            for j in range(start, end):
                if j != i:
                    context_idx = self.vocab[self.corpus[j]]

                    # Positive candidate first, then negatives from the noise distribution
                    candidates = [context_idx]
                    while len(candidates) < 1 + self.negative_samples:
                        neg_word = np.random.choice(self.vocab_size, p=self.word_probs)
                        if neg_word != context_idx:
                            candidates.append(neg_word)

                    input_words.append(self.vocab[target_word])
                    output_words.append(candidates)
                    labels.append(0)  # Index of the positive candidate

        return np.array(input_words), np.array(output_words), np.array(labels)

    def generate_cbow_data(self):
        """Generate CBOW training data"""
        data = []
        labels = []

        for i, target_word in enumerate(self.corpus):
            # Get context window
            start = max(0, i - self.window_size)
            end = min(len(self.corpus), i + self.window_size + 1)

            # Get context words
            context_words = []
            for j in range(start, end):
                if j != i:
                    context_words.append(self.vocab[self.corpus[j]])

            # Only keep samples with full context
            if len(context_words) == 2 * self.window_size:
                data.append(context_words)
                labels.append(self.vocab[target_word])

        return np.array(data), np.array(labels)

    def precompute_word_probs(self):
        """Precompute word probabilities for negative sampling"""
        word_counts = Counter(self.corpus)
        total_words = sum(word_counts.values())

        # Compute unigram distribution raised to 3/4 power
        word_probs = np.zeros(len(self.vocab))
        for word, idx in self.vocab.items():
            word_probs[idx] = (word_counts[word] / total_words) ** 0.75

        # Normalize
        word_probs = word_probs / word_probs.sum()
        self.word_probs = word_probs

    def train(self, epochs=5, batch_size=32):
        """Train the Word2Vec model"""
        # Precompute word probabilities for negative sampling
        self.precompute_word_probs()

        # Generate training data
        if self.model_type == 'skipgram':
            input_words, output_words, labels = self.generate_training_data()
            # Model inputs: target word shape (N, 1), candidates shape (N, 1 + negative_samples)
            input_words = input_words.reshape(-1, 1)
        else:
            context_words, target_words = self.generate_training_data()

        # Training loop
        for epoch in range(epochs):
            if self.model_type == 'skipgram':
                # Skip-gram training
                history = self.model.fit(
                    [input_words, output_words], labels,
                    batch_size=batch_size,
                    epochs=1,
                    verbose=1
                )
            else:
                # CBOW training
                history = self.model.fit(
                    context_words, target_words,
                    batch_size=batch_size,
                    epochs=1,
                    verbose=1
                )

            print(f"Epoch {epoch+1}/{epochs}, Loss: {history.history['loss'][0]:.4f}")

    def get_embeddings(self):
        """Get the learned word embeddings"""
        if self.model_type == 'skipgram':
            return self.model.get_layer('input_embedding').get_weights()[0]
        else:
            return self.model.get_layer('embedding').get_weights()[0]

    def get_word_vector(self, word):
        """Get vector for a specific word"""
        if word not in self.vocab:
            return None
        idx = self.vocab[word]
        embeddings = self.get_embeddings()
        return embeddings[idx]

    def find_similar_words(self, word, topn=5):
        """Find similar words using cosine similarity"""
        if word not in self.vocab:
            return []

        # Get word vector
        word_vec = self.get_word_vector(word)
        if word_vec is None:
            return []

        # Get all word vectors
        all_vecs = self.get_embeddings()

        # Compute cosine similarities
        similarities = []
        for i, vec in enumerate(all_vecs):
            if i != self.vocab[word]:
                cos_sim = np.dot(word_vec, vec) / (np.linalg.norm(word_vec) * np.linalg.norm(vec))
                similarities.append((self.inv_vocab[i], cos_sim))

        # Sort by similarity
        similarities.sort(key=lambda x: x[1], reverse=True)

        return similarities[:topn]

Word2Vec Applications

Semantic Relationships

Word2Vec captures semantic relationships through vector arithmetic:

# Example of semantic relationships
def demonstrate_semantic_relationships(model):
    """Demonstrate semantic relationships in Word2Vec"""

    # King - Man + Woman ≈ Queen
    king = model.get_word_vector('king')
    man = model.get_word_vector('man')
    woman = model.get_word_vector('woman')

    if king is not None and man is not None and woman is not None:
        queen_vector = king - man + woman
        similar = model.find_similar_words_by_vector(queen_vector, topn=1)
        print(f"king - man + woman ≈ {similar[0][0]}")

    # Paris - France + Germany ≈ Berlin
    paris = model.get_word_vector('paris')
    france = model.get_word_vector('france')
    germany = model.get_word_vector('germany')

    if paris is not None and france is not None and germany is not None:
        berlin_vector = paris - france + germany
        similar = model.find_similar_words_by_vector(berlin_vector, topn=1)
        print(f"paris - france + germany ≈ {similar[0][0]}")

    # Car - Drive + Fly ≈ Airplane
    car = model.get_word_vector('car')
    drive = model.get_word_vector('drive')
    fly = model.get_word_vector('fly')

    if car is not None and drive is not None and fly is not None:
        airplane_vector = car - drive + fly
        similar = model.find_similar_words_by_vector(airplane_vector, topn=1)
        print(f"car - drive + fly ≈ {similar[0][0]}")

# Helper intended as a method of the Word2Vec trainer classes above (note the
# self parameter); demonstrate_semantic_relationships relies on it.
def find_similar_words_by_vector(self, vector, topn=5):
    """Find similar words to a given vector"""
    # Get all word vectors
    all_vecs = self.get_embeddings()

    # Compute cosine similarities
    similarities = []
    for i, vec in enumerate(all_vecs):
        cos_sim = np.dot(vector, vec) / (np.linalg.norm(vector) * np.linalg.norm(vec))
        similarities.append((self.inv_vocab[i], cos_sim))

    # Sort by similarity
    similarities.sort(key=lambda x: x[1], reverse=True)

    return similarities[:topn]

Text Classification

# Word2Vec for text classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

class Word2VecClassifier:
    def __init__(self, word2vec_model):
        self.word2vec = word2vec_model
        self.classifier = LogisticRegression(max_iter=1000)

    def document_to_vector(self, document):
        """Convert document to vector using Word2Vec"""
        vectors = []
        for word in document:
            vec = self.word2vec.get_word_vector(word)
            if vec is not None:
                vectors.append(vec)

        if len(vectors) == 0:
            return np.zeros(self.word2vec.embedding_dim)

        return np.mean(vectors, axis=0)

    def train(self, documents, labels):
        """Train the classifier"""
        # Convert documents to vectors
        X = np.array([self.document_to_vector(doc) for doc in documents])
        y = np.array(labels)

        # Split into train and test
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Train classifier
        self.classifier.fit(X_train, y_train)

        # Evaluate
        y_pred = self.classifier.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        print(f"Accuracy: {accuracy:.4f}")

        return accuracy

    def predict(self, document):
        """Predict class for a document"""
        vec = self.document_to_vector(document)
        return self.classifier.predict([vec])[0]

    def predict_proba(self, document):
        """Predict class probabilities for a document"""
        vec = self.document_to_vector(document)
        return self.classifier.predict_proba([vec])[0]

Information Retrieval

# Word2Vec for information retrieval
from sklearn.metrics.pairwise import cosine_similarity

class Word2VecRetrieval:
    def __init__(self, word2vec_model):
        self.word2vec = word2vec_model

    def document_to_vector(self, document):
        """Convert document to vector"""
        vectors = []
        for word in document:
            vec = self.word2vec.get_word_vector(word)
            if vec is not None:
                vectors.append(vec)

        if len(vectors) == 0:
            return np.zeros(self.word2vec.embedding_dim)

        return np.mean(vectors, axis=0)

    def index_documents(self, documents):
        """Index documents for retrieval"""
        self.doc_vectors = []
        self.documents = documents

        for doc in documents:
            vec = self.document_to_vector(doc)
            self.doc_vectors.append(vec)

        self.doc_vectors = np.array(self.doc_vectors)

    def search(self, query, topn=5):
        """Search for similar documents"""
        # Convert query to vector
        query_vec = self.document_to_vector(query)

        # Compute similarities
        similarities = cosine_similarity([query_vec], self.doc_vectors)[0]

        # Get top results
        results = []
        for i in np.argsort(similarities)[::-1][:topn]:
            results.append({
                'document': self.documents[i],
                'similarity': similarities[i]
            })

        return results

    def semantic_search(self, query, topn=5):
        """Semantic search using Word2Vec"""
        return self.search(query, topn)

Recommendation Systems

# Word2Vec for recommendation systems
class Word2VecRecommender:
    def __init__(self, word2vec_model):
        self.word2vec = word2vec_model

    def item_to_vector(self, item):
        """Convert item to vector"""
        # Item can be a single word or a list of words
        if isinstance(item, str):
            vec = self.word2vec.get_word_vector(item)
            if vec is not None:
                return vec
            return np.zeros(self.word2vec.embedding_dim)
        else:
            # Average the vectors of the item's words
            vectors = [self.word2vec.get_word_vector(w) for w in item]
            vectors = [v for v in vectors if v is not None]
            if len(vectors) == 0:
                return np.zeros(self.word2vec.embedding_dim)
            return np.mean(vectors, axis=0)

    def train(self, user_history):
        """Train the recommender on user history"""
        # user_history: dict of {user_id: [item1, item2, ...]}
        self.user_history = user_history
        self.user_vectors = {}

        # Create user vectors by averaging their item vectors
        for user_id, items in user_history.items():
            vectors = []
            for item in items:
                vec = self.item_to_vector(item)
                vectors.append(vec)

            if len(vectors) > 0:
                self.user_vectors[user_id] = np.mean(vectors, axis=0)
            else:
                self.user_vectors[user_id] = np.zeros(self.word2vec.embedding_dim)

    def recommend(self, user_id, candidate_items, topn=5):
        """Recommend items to a user"""
        if user_id not in self.user_vectors:
            return []

        # Get user vector
        user_vec = self.user_vectors[user_id]

        # Compute similarities with candidate items
        similarities = []
        for item in candidate_items:
            item_vec = self.item_to_vector(item)
            sim = cosine_similarity([user_vec], [item_vec])[0][0]
            similarities.append((item, sim))

        # Sort by similarity
        similarities.sort(key=lambda x: x[1], reverse=True)

        return similarities[:topn]

    def similar_items(self, item, topn=5):
        """Find similar items"""
        return self.word2vec.find_similar_words(item, topn)

Word2Vec Variants and Extensions

Word2Vec with Subword Information

# Word2Vec with subword information (similar to FastText)
class Word2VecSubword(nn.Module):
    def __init__(self, vocab_size, embedding_dim, model_type='skipgram', n_grams=3):
        super(Word2VecSubword, self).__init__()
        self.vocab_size = vocab_size
        self.embedding_dim = embedding_dim
        self.model_type = model_type
        self.n_grams = n_grams

        # Character n-gram embeddings
        self.char_embeddings = nn.Embedding(26 + 1, embedding_dim)  # 26 letters + padding

        # Word embeddings
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

    def get_subword_vectors(self, word):
        """Get subword vectors for a word"""
        # Convert word to character indices
        char_indices = [ord(c) - ord('a') + 1 for c in word.lower() if c.isalpha()]
        char_indices = torch.LongTensor(char_indices)

        # Get character embeddings
        char_embeds = self.char_embeddings(char_indices)  # [len(word), embedding_dim]

        # Compute n-grams
        n_gram_vectors = []
        for n in range(1, self.n_grams + 1):
            for i in range(len(char_indices) - n + 1):
                n_gram = char_embeds[i:i+n]
                n_gram_vectors.append(torch.mean(n_gram, dim=0))

        if len(n_gram_vectors) == 0:
            return torch.zeros(self.embedding_dim)

        return torch.mean(torch.stack(n_gram_vectors), dim=0)

    def forward(self, input_word, output_words=None):
        """Forward pass"""
        if self.model_type == 'skipgram':
            return self.forward_skipgram(input_word, output_words)
        else:
            return self.forward_cbow(input_word)

    def forward_skipgram(self, input_word, output_words):
        """Skip-gram forward pass with subword information"""
        # Get word embedding
        word_embedding = self.word_embeddings(input_word)  # [batch_size, embedding_dim]

        # Get subword embedding (assumes self.inv_vocab, mapping index -> word, is attached to the model)
        subword_embedding = self.get_subword_vectors(self.inv_vocab[input_word.item()])
        subword_embedding = subword_embedding.unsqueeze(0).expand(word_embedding.size(0), -1)

        # Combine word and subword embeddings
        combined_embedding = word_embedding + subword_embedding

        # Get output embeddings
        output_embeddings = self.word_embeddings(output_words)  # [batch_size, num_neg_samples+1, embedding_dim]

        # Compute scores
        scores = torch.bmm(output_embeddings, combined_embedding.unsqueeze(2)).squeeze(2)

        return scores

Contextual Word2Vec

# Contextual Word2Vec that considers sentence context
class ContextualWord2Vec(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, model_type='skipgram'):
        super(ContextualWord2Vec, self).__init__()
        self.model_type = model_type

        # Word embeddings
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # Context encoder (LSTM)
        self.context_encoder = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)

        # Output embeddings
        self.output_embeddings = nn.Embedding(vocab_size, hidden_dim)

    def forward(self, input_word, context_words, output_words=None):
        """Forward pass with context"""
        if self.model_type == 'skipgram':
            return self.forward_skipgram(input_word, context_words, output_words)
        else:
            return self.forward_cbow(input_word, context_words)

    def forward_skipgram(self, input_word, context_words, output_words):
        """Skip-gram forward pass with context"""
        # Get input word embedding
        input_embedding = self.word_embeddings(input_word)  # [batch_size, embedding_dim]

        # Encode context
        context_embeddings = self.word_embeddings(context_words)  # [batch_size, context_size, embedding_dim]
        _, (hidden, _) = self.context_encoder(context_embeddings)

        # Combine input and context
        combined = input_embedding + hidden.squeeze(0)

        # Get output embeddings
        output_embeddings = self.output_embeddings(output_words)  # [batch_size, num_neg_samples+1, hidden_dim]

        # Compute scores
        scores = torch.bmm(output_embeddings, combined.unsqueeze(2)).squeeze(2)

        return scores

Word2Vec vs Other Embedding Methods

| Feature | Word2Vec | GloVe | FastText | BERT/Transformers |
| --- | --- | --- | --- | --- |
| Training Method | Predictive (neural network) | Count-based (matrix factorization) | Predictive with subword info | Contextual (transformer) |
| Context Handling | Local context window | Global co-occurrence | Local context with subwords | Full sentence context |
| Subword Info | No | No | Yes (character n-grams) | Yes (WordPiece/BytePair) |
| Training Speed | Fast | Very fast | Moderate | Slow |
| Memory Usage | Low | Low | Moderate | High |
| Rare Words | Poor handling | Poor handling | Good handling | Excellent handling |
| Contextual | No | No | No | Yes |
| Transfer Learning | Good | Good | Good | Excellent |
| Implementation | Simple | Simple | Moderate | Complex |
| Use Case | General purpose | General purpose | Morphologically rich languages | Context-dependent tasks |
| Vector Arithmetic | Excellent | Good | Good | Limited |
| Pre-trained Models | Available | Available | Available | Widely available |
| Fine-tuning | Not applicable | Not applicable | Not applicable | Yes |

Training Word2Vec

Best Practices

| Aspect | Recommendation | Notes |
| --- | --- | --- |
| Corpus Size | Large corpus (100M+ words) | More data = better embeddings |
| Vocabulary Size | 50K-300K words | Balance between coverage and efficiency |
| Embedding Dim | 100-300 dimensions | 300 is common for good performance |
| Window Size | 5-10 words | Larger windows capture broader context |
| Model Type | Skip-gram for large datasets | CBOW is faster for smaller datasets |
| Negative Samples | 5-20 | More samples = better quality |
| Subsampling | Use subsampling of frequent words | Improves quality and training speed |
| Iterations | 5-50 epochs | More iterations for larger datasets |
| Learning Rate | 0.01-0.05 | Start high, decay over time |
| Batch Size | 100-1000 | Larger batches for stability |
| Hardware | GPU acceleration | Significant speedup for large datasets |
| Evaluation | Use intrinsic and extrinsic evaluation | Word similarity, analogy tasks, downstream tasks |
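
The subsampling of frequent words recommended above can be sketched as follows (each occurrence of a word with corpus frequency f(w) is kept with probability sqrt(t / f(w)), using the threshold t = 1e-3 suggested in the original paper; the corpus below is illustrative):

import numpy as np
from collections import Counter

def subsample(corpus, t=1e-3):
    """Randomly drop frequent words; keep probability is sqrt(t / f(w))."""
    counts = Counter(corpus)
    total = sum(counts.values())
    rng = np.random.default_rng(0)

    kept = []
    for word in corpus:
        freq = counts[word] / total
        keep_prob = min(1.0, np.sqrt(t / freq))
        if rng.random() < keep_prob:
            kept.append(word)
    return kept

corpus = ["the"] * 1000 + ["cat", "sat", "mat"] * 10
print(len(subsample(corpus)))  # far fewer occurrences of "the" survive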

Training Pipeline

# Complete Word2Vec training pipeline
class Word2VecPipeline:
    def __init__(self, corpus_path, embedding_dim=300, model_type='skipgram',
                 window_size=5, negative_samples=5, min_count=5):
        self.corpus_path = corpus_path
        self.embedding_dim = embedding_dim
        self.model_type = model_type
        self.window_size = window_size
        self.negative_samples = negative_samples
        self.min_count = min_count

    def load_corpus(self):
        """Load and preprocess corpus"""
        with open(self.corpus_path, 'r', encoding='utf-8') as f:
            text = f.read()

        # Basic preprocessing
        text = text.lower()
        words = text.split()

        # Filter rare words
        word_counts = Counter(words)
        words = [word for word in words if word_counts[word] >= self.min_count]

        return words

    def train_model(self, epochs=10):
        """Train Word2Vec model"""
        # Load corpus
        corpus = self.load_corpus()

        # Initialize model
        trainer = Word2VecTrainer(
            corpus,
            embedding_dim=self.embedding_dim,
            model_type=self.model_type,
            window_size=self.window_size,
            negative_samples=self.negative_samples
        )

        # Train model
        trainer.train(epochs=epochs)

        return trainer

    def evaluate_model(self, model):
        """Evaluate the trained model"""
        # Intrinsic evaluation - nearest neighbours by cosine similarity
        print("Nearest neighbours:")
        for word in ['king', 'paris', 'car']:
            neighbours = model.find_similar_words(word, topn=3)
            if neighbours:
                print(f"{word}: {[w for w, _ in neighbours]}")

        # Extrinsic evaluation - analogy tasks
        self.evaluate_analogies(model)

    def evaluate_analogies(self, model):
        """Evaluate on analogy tasks"""
        analogies = [
            ('king', 'man', 'woman', 'queen'),
            ('paris', 'france', 'germany', 'berlin'),
            ('big', 'bigger', 'small', 'smaller'),
            ('good', 'better', 'bad', 'worse'),
            ('jump', 'jumped', 'run', 'ran')
        ]

        correct = 0
        for a, b, c, expected in analogies:
            try:
                # Compute a - b + c
                a_vec = model.get_word_vector(a)
                b_vec = model.get_word_vector(b)
                c_vec = model.get_word_vector(c)

                if a_vec is not None and b_vec is not None and c_vec is not None:
                    result_vec = a_vec - b_vec + c_vec
                    similar = model.find_similar_words_by_vector(result_vec, topn=1)[0][0]

                    if similar == expected:
                        correct += 1
                        print(f"✓ {a} - {b} + {c} = {similar}")
                    else:
                        print(f"✗ {a} - {b} + {c} = {similar} (expected {expected})")
            except Exception:
                print(f"✗ {a} - {b} + {c} = ? (missing words)")

        print(f"Accuracy: {correct}/{len(analogies)} = {correct/len(analogies):.2f}")

    def save_model(self, model, output_path):
        """Save the trained model"""
        import pickle

        # Save model and metadata
        model_data = {
            'embeddings': model.get_embeddings(),
            'vocab': model.vocab,
            'inv_vocab': model.inv_vocab,
            'config': {
                'embedding_dim': self.embedding_dim,
                'model_type': self.model_type,
                'window_size': self.window_size,
                'negative_samples': self.negative_samples
            }
        }

        with open(output_path, 'wb') as f:
            pickle.dump(model_data, f)

    def load_model(self, model_path):
        """Load a trained model"""
        import pickle

        with open(model_path, 'rb') as f:
            model_data = pickle.load(f)

        # Create a lightweight model for inference
        class InferenceModel:
            def __init__(self, model_data):
                self.embeddings = model_data['embeddings']
                self.vocab = model_data['vocab']
                self.inv_vocab = model_data['inv_vocab']
                self.embedding_dim = model_data['config']['embedding_dim']

            def get_word_vector(self, word):
                if word not in self.vocab:
                    return None
                idx = self.vocab[word]
                return self.embeddings[idx]

            def find_similar_words(self, word, topn=5):
                if word not in self.vocab:
                    return []

                word_vec = self.get_word_vector(word)
                if word_vec is None:
                    return []

                similarities = []
                for i, vec in enumerate(self.embeddings):
                    if i != self.vocab[word]:
                        cos_sim = np.dot(word_vec, vec) / (np.linalg.norm(word_vec) * np.linalg.norm(vec))
                        similarities.append((self.inv_vocab[i], cos_sim))

                similarities.sort(key=lambda x: x[1], reverse=True)
                return similarities[:topn]

            def find_similar_words_by_vector(self, vector, topn=5):
                similarities = []
                for i, vec in enumerate(self.embeddings):
                    cos_sim = np.dot(vector, vec) / (np.linalg.norm(vector) * np.linalg.norm(vec))
                    similarities.append((self.inv_vocab[i], cos_sim))

                similarities.sort(key=lambda x: x[1], reverse=True)
                return similarities[:topn]

        return InferenceModel(model_data)
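
A usage sketch for the pipeline (the corpus path and output file names are placeholders):

pipeline = Word2VecPipeline('corpus.txt', embedding_dim=100,
                            model_type='skipgram', window_size=5, min_count=5)

trainer = pipeline.train_model(epochs=10)
pipeline.save_model(trainer, 'word2vec_model.pkl')

model = pipeline.load_model('word2vec_model.pkl')
print(model.find_similar_words('king', topn=5))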

Word2Vec Research

Key Papers

  1. "Efficient Estimation of Word Representations in Vector Space" (Mikolov et al., 2013)
    • Introduced Word2Vec
    • Demonstrated efficient training of word vectors
    • Foundation for modern word embeddings
  2. "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013)
    • Introduced Skip-gram with negative sampling
    • Demonstrated word vector arithmetic
    • Foundation for semantic relationships
  3. "Linguistic Regularities in Continuous Space Word Representations" (Mikolov et al., 2013)
    • Demonstrated semantic and syntactic regularities
    • Showed vector arithmetic for analogies
    • Foundation for understanding word vector properties
  4. "word2vec Parameter Learning Explained" (Rong, 2014)
    • Provided detailed explanation of Word2Vec mathematics
    • Clarified training process
    • Foundation for understanding implementation details
  5. "Enriching Word Vectors with Subword Information" (Bojanowski et al., 2017)
    • Introduced subword information (FastText)
    • Demonstrated improved handling of rare words
    • Foundation for subword embeddings

Emerging Research Directions

  • Contextual Word2Vec: Incorporating sentence context
  • Multilingual Word2Vec: Cross-lingual embeddings
  • Domain-Specific Word2Vec: Specialized embeddings for specific domains
  • Dynamic Word2Vec: Time-evolving word representations
  • Interpretable Word2Vec: More interpretable vector spaces
  • Efficient Word2Vec: Faster training algorithms
  • Multimodal Word2Vec: Combining text with other modalities
  • Cognitive Word2Vec: Brain-inspired word representations
  • Green Word2Vec: Energy-efficient training
  • Few-Shot Word2Vec: Learning from limited data
  • Adversarial Word2Vec: Robust word representations
  • Theoretical Foundations: Better understanding of word vectors
  • Hardware Acceleration: Specialized hardware for Word2Vec

Best Practices

Implementation Guidelines

| Aspect | Recommendation | Notes |
| --- | --- | --- |
| Preprocessing | Clean text, handle case, tokenize | Remove noise, normalize text |
| Vocabulary | Filter rare words, handle OOV | Use min_count, consider subword info |
| Model Selection | Skip-gram for large datasets | CBOW for smaller datasets |
| Hyperparameters | Tune embedding dim, window size | 300 dim, window size 5-10 common |
| Training | Use negative sampling, subsampling | 5-20 negative samples, subsampling rate 1e-3 to 1e-5 |
| Evaluation | Use multiple metrics | Word similarity, analogy tasks, downstream tasks |
| Deployment | Optimize for inference | Use efficient data structures |
| Monitoring | Track training loss, evaluation metrics | Early stopping based on validation |
| Scaling | Use distributed training for large data | Consider GPU acceleration |

Common Pitfalls and Solutions

| Pitfall | Solution | Example |
| --- | --- | --- |
| Small Corpus | Use pre-trained embeddings | GloVe, FastText pre-trained vectors |
| Rare Words | Use subword information | FastText, character n-grams |
| Overfitting | Use early stopping, regularization | Monitor validation loss |
| Slow Training | Use negative sampling, subsampling | Negative sampling with k=5-20 |
| Poor Quality | Increase corpus size, tune hyperparameters | Use larger corpus, adjust window size |
| Memory Issues | Use memory-efficient implementations | Gensim, optimized C implementations |
| Context Window | Experiment with different window sizes | Try 5, 10, 15 word windows |
| Learning Rate | Use learning rate scheduling | Start with 0.025, decay over time |
| Evaluation Bias | Use multiple evaluation metrics | Word similarity + downstream tasks |
| Domain Mismatch | Fine-tune on domain-specific data | Continue training on domain corpus |
| OOV Words | Use subword models or fallback strategies | FastText, character-level models |
| Interpretability | Use dimensionality reduction techniques | PCA, t-SNE for visualization |
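
For the small-corpus and OOV pitfalls above, pre-trained vectors are often the simplest remedy. A sketch using gensim's downloader (assumes gensim is installed; the named model is large, so the first call triggers a sizeable download):

import gensim.downloader as api

# Pre-trained 300-dimensional vectors trained on Google News
wv = api.load('word2vec-google-news-300')

print(wv['king'].shape)  # (300,)
print(wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=1))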

Future Directions

  • Contextual Word Embeddings: Moving beyond static embeddings to contextual representations
  • Multimodal Embeddings: Combining text with images, audio, and other modalities
  • Dynamic Embeddings: Time-evolving word representations that capture language change
  • Interpretable Embeddings: More human-understandable vector spaces
  • Efficient Training: Faster algorithms for large-scale training
  • Green Embeddings: Energy-efficient training methods
  • Multilingual Embeddings: Better cross-lingual representations
  • Domain-Specific Embeddings: Specialized embeddings for specific domains
  • Few-Shot Learning: Learning embeddings from limited data
  • Adversarial Robustness: Robust embeddings against adversarial attacks
  • Cognitive Models: Brain-inspired word representations
  • Theoretical Breakthroughs: Better understanding of word vector properties
  • Hardware Acceleration: Specialized hardware for embedding training

External Resources