Cosine Similarity
Mathematical measure that calculates the cosine of the angle between two vectors to determine their similarity regardless of magnitude.
What is Cosine Similarity?
Cosine similarity is a mathematical measure that calculates the cosine of the angle between two vectors in a multi-dimensional space. It determines how similar two vectors are regardless of their magnitude, focusing on the orientation (direction) rather than the scale of the vectors.
Key Concepts
Cosine Similarity Formula
The cosine similarity between two vectors A and B is defined as:
$$\text{cosine\_similarity}(A, B) = \frac{A \cdot B}{|A|\,|B|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
Where:
- $A \cdot B$ = dot product of vectors A and B
- $|A|$ = magnitude (Euclidean norm) of vector A
- $|B|$ = magnitude of vector B
- $A_i, B_i$ = components of vectors A and B
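As a quick worked example (values chosen purely for illustration), take $A = (1, 2, 2)$ and $B = (2, 4, 4)$, the same direction at twice the length:
$$\text{cosine\_similarity}(A, B) = \frac{(1)(2) + (2)(4) + (2)(4)}{\sqrt{1^2 + 2^2 + 2^2}\,\sqrt{2^2 + 4^2 + 4^2}} = \frac{18}{3 \cdot 6} = 1$$
Doubling the length of a vector changes nothing; only the direction matters.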
Geometric Interpretation
graph LR
A[Origin] --> B[Vector A]
A --> C[Vector B]
B --> D[Angle θ]
C --> D
style A fill:#f9f,stroke:#333
style D fill:#f9f,stroke:#333
Cosine similarity is the cosine of the angle θ between the two vectors:
- 1: Vectors point in the same direction (angle = 0°)
- 0: Vectors are orthogonal (angle = 90°)
- -1: Vectors point in opposite directions (angle = 180°)
Properties
| Property | Description | Value |
|---|---|---|
| Range | Possible values | [-1, 1] |
| Scale Invariance | Independent of vector magnitude | Yes |
| Symmetry | cosine(A,B) = cosine(B,A) | Yes |
| Normalization | Works best with normalized vectors | Recommended |
| Dimensionality | Works in any dimensional space | Any |
| Interpretability | Intuitive geometric interpretation | High |
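The range, scale-invariance, and symmetry rows above can be checked numerically. A minimal numpy sketch (the helper `cos_sim` and the example vectors are purely illustrative; a fuller implementation appears under Implementation below):

import numpy as np

def cos_sim(x, y):
    # Cosine similarity: dot product divided by the product of the norms
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(-1.0 <= cos_sim(a, b) <= 1.0)                   # True (range)
print(np.isclose(cos_sim(10 * a, b), cos_sim(a, b)))  # True (scale invariance)
print(np.isclose(cos_sim(a, b), cos_sim(b, a)))       # True (symmetry)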
Applications
Natural Language Processing
- Semantic Search: Find similar documents
- Text Classification: Categorize similar texts
- Plagiarism Detection: Identify similar content
- Recommendation Systems: Find similar items
- Question Answering: Retrieve relevant answers
Information Retrieval
- Document Similarity: Find similar documents
- Query Matching: Match queries to documents
- Clustering: Group similar items
- Deduplication: Identify duplicate content
- Content Moderation: Identify similar inappropriate content
Machine Learning
- Feature Similarity: Measure feature relationships
- Model Interpretation: Understand model decisions
- Anomaly Detection: Identify unusual patterns
- Collaborative Filtering: User-item similarity
- Transfer Learning: Measure domain similarity
Computer Vision
- Image Similarity: Find similar images
- Face Recognition: Identify faces
- Object Detection: Find similar objects
- Content-Based Retrieval: Retrieve similar visual content
- Style Transfer: Measure style similarity
Implementation
Mathematical Implementation
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    """
    Calculate the cosine similarity between two vectors.

    Args:
        a: First vector
        b: Second vector

    Returns:
        Cosine similarity between a and b
    """
    # Convert to numpy arrays if not already
    a = np.array(a)
    b = np.array(b)

    # Calculate the dot product
    dot_product = np.dot(a, b)

    # Calculate the magnitudes (Euclidean norms)
    magnitude_a = norm(a)
    magnitude_b = norm(b)

    # Avoid division by zero for zero vectors
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0

    # Cosine similarity = dot product / product of magnitudes
    return dot_product / (magnitude_a * magnitude_b)
# Example usage
vector1 = [1, 2, 3, 4]
vector2 = [2, 4, 6, 8]      # Same direction, different magnitude
vector3 = [-1, -2, -3, -4]  # Opposite direction
vector4 = [-2, 1, 0, 0]     # Orthogonal to vector1 (dot product is 0)

print(f"Similarity between same-direction vectors: {cosine_similarity(vector1, vector2):.4f}")  # 1.0000
print(f"Similarity between opposite vectors: {cosine_similarity(vector1, vector3):.4f}")        # -1.0000
print(f"Similarity between orthogonal vectors: {cosine_similarity(vector1, vector4):.4f}")      # 0.0000
print(f"Similarity of a vector with itself: {cosine_similarity(vector1, vector1):.4f}")         # 1.0000
Practical Implementation with Text
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Example documents
documents = [
"The cat sits on the mat",
"A feline is resting on the rug",
"Dogs are playing in the park",
"Canines are running around the garden",
"The sun is shining brightly"
]
# Create TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
# Calculate cosine similarity matrix
cosine_sim_matrix = cosine_similarity(tfidf_matrix)
print("Document similarity matrix:")
print(np.round(cosine_sim_matrix, 3))
# Find most similar documents
def find_most_similar(query_idx, similarity_matrix, documents, k=2):
    similarities = similarity_matrix[query_idx]
    # Sort ascending, drop the last index (the document itself), then reverse to best-first
    similar_indices = np.argsort(similarities)[-k-1:-1][::-1]
    print(f"\nMost similar to: '{documents[query_idx]}'")
    for i, idx in enumerate(similar_indices):
        print(f"{i+1}: '{documents[idx]}' (similarity: {similarities[idx]:.3f})")
# Example queries
find_most_similar(0, cosine_sim_matrix, documents) # "The cat sits on the mat"
find_most_similar(2, cosine_sim_matrix, documents) # "Dogs are playing in the park"
find_most_similar(4, cosine_sim_matrix, documents) # "The sun is shining brightly"
Implementation with Embeddings
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Load pre-trained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Example sentences
sentences = [
"The cat sits on the mat",
"A feline is resting on the rug",
"Dogs are playing in the park",
"Canines are running around the garden",
"The sun is shining brightly",
"Artificial intelligence is transforming industries"
]
# Generate embeddings
embeddings = model.encode(sentences, convert_to_tensor=True)
embeddings_np = embeddings.cpu().numpy()
# Calculate cosine similarity matrix
cosine_sim_matrix = cosine_similarity(embeddings_np)
print("Sentence similarity matrix:")
print(np.round(cosine_sim_matrix, 3))
# Visualize similarity
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(10, 8))
sns.heatmap(cosine_sim_matrix, annot=True, xticklabels=sentences, yticklabels=sentences,
            cmap='coolwarm', vmin=-1, vmax=1)
plt.title("Cosine Similarity Between Sentences")
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
Comparison with Other Metrics
| Metric | Formula | Range | Scale Invariant | Magnitude Sensitive | Use Case |
|---|---|---|---|---|---|
| Cosine Similarity | $\frac{A \cdot B}{\lVert A\rVert\,\lVert B\rVert}$ | [-1, 1] | Yes | No | Text, embeddings |
| Euclidean Distance | $\sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$ | [0, ∞) | No | Yes | Spatial data |
| Dot Product | $A \cdot B = \sum_{i=1}^{n} A_i B_i$ | (-∞, ∞) | No | Yes | Magnitude matters |
| Manhattan Distance | $\sum_{i=1}^{n} \lvert A_i - B_i \rvert$ | [0, ∞) | No | Yes | Grid-like data |
| Pearson Correlation | $\frac{\operatorname{cov}(A,B)}{\sigma_A \sigma_B}$ | [-1, 1] | Yes | No | Statistical relationships |
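To make the differences concrete, here is a small sketch (vectors chosen only for illustration) that evaluates each metric on the same pair of vectors with numpy and scipy:

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])  # same direction, twice the magnitude

print("Cosine similarity:  ", 1 - distance.cosine(a, b))  # 1.0   (direction only)
print("Euclidean distance: ", distance.euclidean(a, b))   # ~5.48 (magnitude matters)
print("Dot product:        ", np.dot(a, b))               # 60.0  (grows with magnitude)
print("Manhattan distance: ", distance.cityblock(a, b))   # 10.0  (magnitude matters)
print("Pearson correlation:", np.corrcoef(a, b)[0, 1])    # 1.0   (perfect linear relationship)

Cosine similarity and Pearson correlation ignore the scale difference between the two vectors, while the distance metrics and the dot product do not.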
Advantages and Limitations
Advantages
- Scale Invariance: Independent of vector magnitude
- Efficient Computation: Simple mathematical operations
- Interpretability: Intuitive geometric interpretation
- Works in High Dimensions: Effective for embeddings
- Normalization Friendly: Works well with normalized vectors
Limitations
- Direction Only: Considers only orientation, so vectors pointing the same way score 1 regardless of length
- Ignores Magnitude: May miss important differences in scale (e.g., document length or rating strength)
- Curse of Dimensionality: Discrimination degrades in very high dimensions, where similarity scores bunch together (see the sketch after this list)
- Sparse Data: Extremely sparse vectors often share no non-zero dimensions and score exactly 0
- Negative Values: Can be counterintuitive for some applications
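The dimensionality point can be made concrete: for randomly drawn vectors, cosine similarities bunch more and more tightly around 0 as the number of dimensions grows, which makes scores harder to tell apart. A small illustrative sketch (random Gaussian vectors rather than real embeddings):

import numpy as np

rng = np.random.default_rng(0)

for dim in (2, 10, 100, 10_000):
    x = rng.normal(size=(1000, dim))
    y = rng.normal(size=(1000, dim))
    # Row-wise cosine similarity between 1,000 random vector pairs
    sims = np.sum(x * y, axis=1) / (np.linalg.norm(x, axis=1) * np.linalg.norm(y, axis=1))
    print(f"dim={dim:>6}: mean={sims.mean():+.3f}, std={sims.std():.3f}")

The spread of the scores shrinks roughly as $1/\sqrt{d}$, so in very high dimensions most random pairs look almost equally (dis)similar.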
Best Practices
Data Preparation
- Normalization: Normalize vectors for consistent results
- Dimensionality: Consider dimensionality reduction for high-D vectors
- Sparsity: Handle sparse vectors appropriately
- Data Quality: Ensure high-quality input data
- Preprocessing: Clean and preprocess text data
Implementation
- Vectorization: Use appropriate vectorization methods
- Efficiency: Optimize for large-scale computations (see the sketch after this list)
- Batch Processing: Process vectors in batches
- Caching: Cache frequent similarity computations
- Parallelization: Use parallel processing for large datasets
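Consistent with the Normalization and Batch Processing advice above, a common optimization is to L2-normalize all vectors once, after which cosine similarity reduces to a plain dot product and a full similarity matrix falls out of a single matrix multiplication. A minimal numpy sketch (array sizes are placeholders):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 384))  # placeholder: 1,000 embedding vectors

# L2-normalize once; afterwards cosine similarity is just a dot product
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

# All-pairs cosine similarity in one matrix multiplication
sim_matrix = X_norm @ X_norm.T    # shape (1000, 1000)

print(sim_matrix.shape)                       # (1000, 1000)
print(np.allclose(np.diag(sim_matrix), 1.0))  # True: every vector matches itself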
Application-Specific
- Text Similarity: Combine with TF-IDF or embeddings
- Recommendation Systems: Use with collaborative filtering
- Clustering: Combine with clustering algorithms
- Search: Use with vector databases (see the sketch after this list)
- Evaluation: Use appropriate benchmarks
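As a toy stand-in for the vector-database search mentioned above, cosine similarity can drive a brute-force top-k lookup. A minimal scikit-learn sketch (the corpus and query embeddings here are random placeholders):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
corpus_vectors = rng.normal(size=(10_000, 128))  # placeholder corpus embeddings
query_vector = rng.normal(size=(1, 128))         # placeholder query embedding

# Similarity of the query against every corpus vector
scores = cosine_similarity(query_vector, corpus_vectors)[0]

# Indices of the k highest-scoring vectors, ordered best-first
k = 5
top_k = np.argpartition(scores, -k)[-k:]
top_k = top_k[np.argsort(scores[top_k])[::-1]]
print(top_k, scores[top_k])

Real vector databases replace the exhaustive scan with approximate nearest-neighbor indexes, but the scoring function is the same.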
Research and Advancements
Key Papers
- "A Vector Space Model for Automatic Indexing" (Salton et al., 1975)
- Introduced vector space model for information retrieval
- Foundation for cosine similarity in NLP
- "Introduction to Modern Information Retrieval" (Salton & McGill, 1983)
- Comprehensive treatment of cosine similarity
- Applications in information retrieval
- "Distributed Representations of Words and Phrases and their Compositionality" (Mikolov et al., 2013)
- Word2Vec embeddings
- Cosine similarity for word relationships
- "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" (Reimers & Gurevych, 2019)
- Sentence embeddings optimized for cosine similarity
- State-of-the-art text similarity
Emerging Research Directions
- Learned Similarity Metrics: Adaptive similarity measures
- Context-Aware Similarity: Similarity that considers context
- Multimodal Similarity: Similarity across different modalities
- Explainable Similarity: Interpretable similarity measures
- Efficient Similarity: Optimized similarity computation
- Dynamic Similarity: Similarity that adapts over time
- Domain-Specific Similarity: Specialized similarity measures
- Few-Shot Similarity: Similarity with limited examples