Cosine Similarity

Mathematical measure that calculates the cosine of the angle between two vectors to determine their similarity regardless of magnitude.

What is Cosine Similarity?

Cosine similarity is a mathematical measure that calculates the cosine of the angle between two vectors in a multi-dimensional space. It determines how similar two vectors are regardless of their magnitude, focusing on the orientation (direction) rather than the scale of the vectors.

Key Concepts

Cosine Similarity Formula

The cosine similarity between two vectors A and B is defined as:

$$\text{cosine\_similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

Where:

  • $A \cdot B$ = dot product of vectors A and B
  • $\|A\|$ = magnitude (Euclidean norm) of vector A
  • $\|B\|$ = magnitude of vector B
  • $A_i, B_i$ = the $i$-th components of vectors A and B
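
For example, with $A = (1, 0)$ and $B = (1, 1)$:

$$\text{cosine\_similarity}(A, B) = \frac{1 \cdot 1 + 0 \cdot 1}{\sqrt{1^2 + 0^2}\,\sqrt{1^2 + 1^2}} = \frac{1}{\sqrt{2}} \approx 0.7071$$

which matches the 45° angle between the two vectors, since $\cos 45° \approx 0.7071$.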

Geometric Interpretation

graph LR
    A[Origin] --> B[Vector A]
    A --> C[Vector B]
    B --> D[Angle θ]
    C --> D

    style A fill:#f9f,stroke:#333
    style D fill:#f9f,stroke:#333

The cosine similarity measures the angle θ between two vectors:

  • 1: Vectors point in the same direction (angle = 0°)
  • 0: Vectors are orthogonal (angle = 90°)
  • -1: Vectors point in opposite directions (angle = 180°)

Properties

| Property | Description | Value |
|---|---|---|
| Range | Possible values | [-1, 1] |
| Scale Invariance | Independent of vector magnitude | Yes |
| Symmetry | cosine(A, B) = cosine(B, A) | Yes |
| Normalization | Works best with normalized vectors | Recommended |
| Dimensionality | Works in any dimensional space | Any |
| Interpretability | Intuitive geometric interpretation | High |
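
To make the scale-invariance and symmetry rows concrete, here is a minimal NumPy check (the cos_sim helper is defined inline for illustration):

import numpy as np
from numpy.linalg import norm

def cos_sim(a, b):
    return np.dot(a, b) / (norm(a) * norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(cos_sim(a, b))       # ~0.9746
print(cos_sim(10 * a, b))  # unchanged: scale invariance
print(cos_sim(b, a))       # unchanged: symmetry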

Applications

Natural Language Processing

  • Semantic Search: Find similar documents
  • Text Classification: Categorize similar texts
  • Plagiarism Detection: Identify similar content
  • Recommendation Systems: Find similar items
  • Question Answering: Retrieve relevant answers

Information Retrieval

  • Document Similarity: Find similar documents
  • Query Matching: Match queries to documents
  • Clustering: Group similar items
  • Deduplication: Identify duplicate content
  • Content Moderation: Identify similar inappropriate content

Machine Learning

  • Feature Similarity: Measure feature relationships
  • Model Interpretation: Understand model decisions
  • Anomaly Detection: Identify unusual patterns
  • Collaborative Filtering: User-item similarity (see the sketch after this list)
  • Transfer Learning: Measure domain similarity
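
As an illustration of the collaborative-filtering item above, the sketch below computes user-user cosine similarity from a small, invented ratings matrix:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings matrix (rows = users, columns = items); 0 = unrated
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [0, 1, 5, 4],
])

# Pairwise cosine similarity between users
user_sim = cosine_similarity(ratings)
print(np.round(user_sim, 3))
# Users 0 and 1 rate similarly (high similarity);
# user 2 prefers different items (low similarity to both)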

Computer Vision

  • Image Similarity: Find similar images
  • Face Recognition: Identify faces
  • Object Detection: Find similar objects
  • Content-Based Retrieval: Retrieve similar visual content
  • Style Transfer: Measure style similarity

Implementation

Mathematical Implementation

import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    """
    Calculate cosine similarity between two vectors

    Args:
        a: First vector
        b: Second vector

    Returns:
        Cosine similarity between a and b
    """
    # Convert to numpy arrays if not already
    a = np.array(a)
    b = np.array(b)

    # Calculate dot product
    dot_product = np.dot(a, b)

    # Calculate magnitudes
    magnitude_a = norm(a)
    magnitude_b = norm(b)

    # Avoid division by zero
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0

    # Calculate cosine similarity
    similarity = dot_product / (magnitude_a * magnitude_b)

    return similarity

# Example usage
vector1 = [1, 2, 3, 4]
vector2 = [2, 4, 6, 8]  # Same direction, different magnitude
vector3 = [-1, -2, -3, -4]  # Opposite direction
vector4 = [2, -1, 0, 0]  # Orthogonal to vector1 (dot product is 0)

print(f"Similarity between identical direction vectors: {cosine_similarity(vector1, vector2):.4f}")
print(f"Similarity between opposite vectors: {cosine_similarity(vector1, vector3):.4f}")
print(f"Similarity between orthogonal vectors: {cosine_similarity(vector1, vector4):.4f}")
print(f"Similarity of vector with itself: {cosine_similarity(vector1, vector1):.4f}")

Practical Implementation with Text

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Example documents
documents = [
    "The cat sits on the mat",
    "A feline is resting on the rug",
    "Dogs are playing in the park",
    "Canines are running around the garden",
    "The sun is shining brightly"
]

# Create TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Calculate cosine similarity matrix
cosine_sim_matrix = cosine_similarity(tfidf_matrix)

print("Document similarity matrix:")
print(np.round(cosine_sim_matrix, 3))

# Find most similar documents
def find_most_similar(query_idx, similarity_matrix, documents, k=2):
    similarities = similarity_matrix[query_idx]
    similar_indices = np.argsort(similarities)[-k-1:-1][::-1]  # Exclude self

    print(f"\nMost similar to: '{documents[query_idx]}'")
    for i, idx in enumerate(similar_indices):
        print(f"{i+1}: '{documents[idx]}' (similarity: {similarities[idx]:.3f})")

# Example queries
find_most_similar(0, cosine_sim_matrix, documents)  # "The cat sits on the mat"
find_most_similar(2, cosine_sim_matrix, documents)  # "Dogs are playing in the park"
find_most_similar(4, cosine_sim_matrix, documents)  # "The sun is shining brightly"
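
Building on the same fitted vectorizer, an unseen query can be scored against the whole corpus; the query string here is an invented example:

# Continuing from the snippet above: score a new query against the corpus
query = "A cat is lying on a rug"
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, tfidf_matrix).flatten()
best_idx = int(np.argmax(scores))
print(f"Best match: '{documents[best_idx]}' (similarity: {scores[best_idx]:.3f})")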

Implementation with Embeddings

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load pre-trained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Example sentences
sentences = [
    "The cat sits on the mat",
    "A feline is resting on the rug",
    "Dogs are playing in the park",
    "Canines are running around the garden",
    "The sun is shining brightly",
    "Artificial intelligence is transforming industries"
]

# Generate embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)
embeddings_np = embeddings.cpu().numpy()

# Calculate cosine similarity matrix
cosine_sim_matrix = cosine_similarity(embeddings_np)

print("Sentence similarity matrix:")
print(np.round(cosine_sim_matrix, 3))

# Visualize similarity
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
sns.heatmap(cosine_sim_matrix, annot=True, xticklabels=sentences, yticklabels=sentences,
            cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Cosine Similarity Between Sentences")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
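
sentence-transformers also ships a small utility that computes the same matrix directly on the tensors returned by model.encode, avoiding the NumPy round trip (assuming the same model and sentences as above):

from sentence_transformers import util

# Cosine similarity on the torch tensors from model.encode(..., convert_to_tensor=True)
sim = util.cos_sim(embeddings, embeddings)
print(sim.shape)  # torch.Size([6, 6])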

Comparison with Other Metrics

| Metric | Formula | Range | Scale Invariant | Magnitude Sensitive | Use Case |
|---|---|---|---|---|---|
| Cosine Similarity | $\frac{A \cdot B}{\Vert A \Vert \, \Vert B \Vert}$ | [-1, 1] | Yes | No | Text, embeddings |
| Euclidean Distance | $\sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$ | [0, ∞) | No | Yes | Spatial data |
| Dot Product | $\sum_{i=1}^{n} A_i B_i$ | (-∞, ∞) | No | Yes | When magnitude matters |
| Manhattan Distance | $\sum_{i=1}^{n} \lvert A_i - B_i \rvert$ | [0, ∞) | No | Yes | Grid-like data |
| Pearson Correlation | $\frac{\mathrm{cov}(A,B)}{\sigma_A \sigma_B}$ | [-1, 1] | Yes | No | Statistical relationships |
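
To see the differences side by side, the sketch below evaluates each metric on one pair of vectors pointing the same way but with different magnitudes (a minimal illustration using SciPy's distance functions):

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction, twice the magnitude

print(1 - distance.cosine(a, b))  # 1.0: cosine ignores the magnitude gap
print(distance.euclidean(a, b))   # ~3.742: grows with the magnitude gap
print(np.dot(a, b))               # 28.0: dot product is magnitude sensitive
print(distance.cityblock(a, b))   # 6.0: Manhattan distance
print(np.corrcoef(a, b)[0, 1])    # 1.0: perfectly correlated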

Advantages and Limitations

Advantages

  • Scale Invariance: Independent of vector magnitude
  • Efficient Computation: Simple mathematical operations
  • Interpretability: Intuitive geometric interpretation
  • Works in High Dimensions: Effective for embeddings
  • Normalization Friendly: Works well with normalized vectors

Limitations

  • Direction Only: Considers orientation and discards all magnitude information
  • Ignores Magnitude: Vectors of very different lengths score 1.0 if they point the same way (demonstrated below)
  • Curse of Dimensionality: In very high dimensions, similarity values tend to concentrate, reducing discriminative power
  • Sparse Data: Vectors that share no nonzero dimensions always score exactly 0
  • Negative Values: Can be counterintuitive for some applications
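
A quick check of the magnitude-blindness caveat above:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.array([[1.0, 1.0]])
b = np.array([[100.0, 100.0]])  # same direction, vastly different magnitude
print(cosine_similarity(a, b))  # [[1.]]: the magnitude gap is invisible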

Best Practices

Data Preparation

  • Normalization: Normalize vectors for consistent results (see the sketch after this list)
  • Dimensionality: Consider dimensionality reduction for high-D vectors
  • Sparsity: Handle sparse vectors appropriately
  • Data Quality: Ensure high-quality input data
  • Preprocessing: Clean and preprocess text data
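
For the normalization point above: once vectors are L2-normalized, cosine similarity reduces to a plain dot product, which is why many systems normalize embeddings once up front. A minimal sketch:

import numpy as np
from numpy.linalg import norm

v = np.array([3.0, 4.0])
w = np.array([1.0, 7.0])

v_unit = v / norm(v)  # after this, norm(v_unit) == 1
w_unit = w / norm(w)

# For unit vectors, the dot product equals the cosine similarity
print(np.dot(v_unit, w_unit))
print(np.dot(v, w) / (norm(v) * norm(w)))  # same value, ~0.8768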

Implementation

  • Vectorization: Use appropriate vectorization methods
  • Efficiency: Optimize for large-scale computations
  • Batch Processing: Process vectors in batches (see the sketch after this list)
  • Caching: Cache frequent similarity computations
  • Parallelization: Use parallel processing for large datasets
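
One common way to address the efficiency and batch-processing points above is to normalize an embedding matrix once and compute similarities as matrix multiplications against query batches (a sketch with synthetic data):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 384))  # synthetic embeddings (rows = items)

# Normalize rows once; cosine similarity then becomes a matrix product
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

queries = X_unit[:32]        # a batch of 32 queries
scores = queries @ X_unit.T  # shape (32, 10000)

# Top-5 most similar items per query (descending)
top5 = np.argsort(scores, axis=1)[:, -5:][:, ::-1]
print(top5.shape)  # (32, 5)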

Application-Specific

  • Text Similarity: Combine with TF-IDF or embeddings
  • Recommendation Systems: Use with collaborative filtering
  • Clustering: Combine with clustering algorithms
  • Search: Use with vector databases
  • Evaluation: Use appropriate benchmarks

Research and Advancements

Key Papers

  1. "A Vector Space Model for Automatic Indexing" (Salton et al., 1975)
    • Introduced vector space model for information retrieval
    • Foundation for cosine similarity in NLP
  2. "Introduction to Modern Information Retrieval" (Salton & McGill, 1983)
    • Comprehensive treatment of cosine similarity
    • Applications in information retrieval
  3. "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013)
    • Word2Vec embeddings
    • Cosine similarity for word relationships
  4. "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (Reimers & Gurevych, 2019)
    • Sentence embeddings optimized for cosine similarity
    • State-of-the-art text similarity

Emerging Research Directions

  • Learned Similarity Metrics: Adaptive similarity measures
  • Context-Aware Similarity: Similarity that considers context
  • Multimodal Similarity: Similarity across different modalities
  • Explainable Similarity: Interpretable similarity measures
  • Efficient Similarity: Optimized similarity computation
  • Dynamic Similarity: Similarity that adapts over time
  • Domain-Specific Similarity: Specialized similarity measures
  • Few-Shot Similarity: Similarity with limited examples

External Resources