Weaviate
Open-source vector search engine with built-in knowledge graph capabilities for AI applications.
What is Weaviate?
Weaviate is an open-source vector search engine that combines vector similarity search with knowledge graph capabilities. It enables semantic search, question answering, and other AI-powered applications by providing a flexible, schema-based approach to storing and querying both vectors and structured data.
Key Concepts
Weaviate Architecture
graph TD
A[Weaviate] --> B[Core Engine]
A --> C[Storage Layer]
A --> D[API Layer]
A --> E[Modules]
B --> B1[Vector Index]
B --> B2[Graph Engine]
B --> B3[Query Processor]
C --> C1[Vector Storage]
C --> C2[Object Storage]
C --> C3[Metadata Storage]
D --> D1[REST API]
D --> D2[GraphQL API]
D --> D3[gRPC]
E --> E1[NLP Modules]
E --> E2[Image Modules]
E --> E3[Custom Modules]
style A fill:#f9f,stroke:#333
Core Features
- Hybrid Search: Combine vector and keyword search
- Knowledge Graph: Built-in graph capabilities
- Schema Flexibility: Dynamic schema definition
- Modular Design: Extensible with modules
- Multi-Tenancy: Support for multiple tenants
- Real-Time Updates: Instant index updates
- Multi-Language Support: SDKs for multiple languages
Approaches and Architecture
Data Model
Weaviate uses a flexible, schema-based data model:
classDiagram
class Class {
+name: string
+description: string
+vectorizer: string
+properties: Property[]
}
class Property {
+name: string
+dataType: string[]
+description: string
+indexFilterable: boolean
+indexSearchable: boolean
}
class Object {
+id: uuid
+class: string
+properties: map
+vector: float[]
}
Class "1" --> "*" Property
Class "1" --> "*" Object
Index Types
| Index Type | Description | Use Case |
|---|---|---|
| HNSW | Hierarchical Navigable Small World | Default, high performance |
| Flat | Brute-force exact search | Small datasets, exact search |
| Dynamic | Starts as a flat index and switches to HNSW once the collection grows | General purpose, collections of unknown size |
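The index type is chosen per class in the schema. As a minimal sketch (the FAQ class is hypothetical, and the flat index assumes a Weaviate version that supports it), a small collection that needs exact results can opt out of the default HNSW index:

import weaviate

client = weaviate.Client("http://localhost:8080")

faq_class = {
    "class": "FAQ",                    # hypothetical class with a small, static corpus
    "vectorizer": "text2vec-transformers",
    "vectorIndexType": "flat",         # brute-force exact search instead of the default HNSW
    "properties": [
        {"name": "question", "dataType": ["text"]},
        {"name": "answer", "dataType": ["text"]}
    ]
}
client.schema.create_class(faq_class)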
Module System
Weaviate's modular architecture allows for extensibility:
| Module Type | Description | Example Modules |
|---|---|---|
| Vectorizers | Convert data to vectors | text2vec, img2vec |
| Generative | Generate text from vectors | generative-openai |
| Q&A | Question answering | qna-openai |
| NER | Named entity recognition | ner-transformers |
| Summarization | Text summarization | sum-transformers |
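Modules are enabled on the server side of a deployment (for example via the ENABLE_MODULES environment variable); from the Python client you can inspect what the running instance actually exposes. A small sketch using the v3 client:

import weaviate

client = weaviate.Client("http://localhost:8080")

# The meta endpoint reports the server version and the modules it has enabled
meta = client.get_meta()
print("Weaviate version:", meta["version"])
print("Enabled modules:", ", ".join(sorted(meta.get("modules", {}).keys())))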
Mathematical Foundations
Hybrid Search
Weaviate's hybrid search fuses a vector-similarity score with a keyword (BM25) relevance score:
$$S = \alpha \cdot S_v + (1 - \alpha) \cdot S_k$$
Where:
- $S$ = final fused score
- $S_v$ = vector similarity score
- $S_k$ = keyword (BM25) relevance score
- $\alpha$ = weighting factor ($0 \le \alpha \le 1$); $\alpha = 1$ is pure vector search, $\alpha = 0$ is pure keyword search
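As a toy illustration of this weighting (inside Weaviate the two scores are first rank- or score-normalized over the result set; this sketch only shows the blend itself):

def fused_score(s_v: float, s_k: float, alpha: float = 0.5) -> float:
    """Blend a vector-similarity score and a keyword score, as in hybrid search."""
    return alpha * s_v + (1 - alpha) * s_k

# alpha = 0.75 favours the vector signal: 0.75 * 0.9 + 0.25 * 0.4 = 0.775
print(fused_score(s_v=0.9, s_k=0.4, alpha=0.75))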
Graph Traversal
Cross-references between objects form a knowledge graph that queries can traverse; a chain of reference hops between two objects corresponds to a path:
$$P = \text{Path}(s, t, G)$$
Where:
- $P$ = path of cross-reference hops between source and target
- $s$ = source object
- $t$ = target object
- $G$ = knowledge graph formed by objects and their cross-references
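A single traversal hop can be expressed directly in a Get query by requesting the cross-reference as a GraphQL fragment. A minimal sketch, assuming the Article class and its related_articles reference property defined in the implementation section below:

import weaviate

client = weaviate.Client("http://localhost:8080")

# Follow one hop of the related_articles cross-reference in a single query
response = (
    client.query
    .get("Article", ["title", "related_articles { ... on Article { title category } }"])
    .with_limit(3)
    .do()
)

for article in response["data"]["Get"]["Article"]:
    related = article.get("related_articles") or []
    print(article["title"], "->", [r["title"] for r in related])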
Applications
Semantic Applications
- Semantic Search: Context-aware search
- Question Answering: AI-powered Q&A systems
- Knowledge Management: Organizational knowledge graphs
- Content Discovery: Personalized content recommendations
- Enterprise Search: Unified search across data sources
AI Systems
- Chatbots: Context-aware conversational AI
- Recommendation Systems: Personalized recommendations
- Content Moderation: Automated content filtering
- Fraud Detection: Anomaly detection
- Decision Support: Data-driven decision making
Industry-Specific
- Healthcare: Medical knowledge graphs
- Finance: Risk assessment and analysis
- E-commerce: Product recommendations
- Media: Content discovery and personalization
- Education: Personalized learning systems
Implementation
Basic Usage
import weaviate
import numpy as np
from weaviate.util import generate_uuid5

# Connect to Weaviate
client = weaviate.Client("http://localhost:8080")

# Define schema
schema = {
    "classes": [
        {
            "class": "Article",
            "description": "A news article",
            "vectorizer": "text2vec-transformers",
            "properties": [
                {
                    "name": "title",
                    "dataType": ["text"],
                    "description": "The title of the article",
                    "indexSearchable": True
                },
                {
                    "name": "content",
                    "dataType": ["text"],
                    "description": "The content of the article",
                    "indexSearchable": True
                },
                {
                    "name": "category",
                    "dataType": ["string"],
                    "description": "The category of the article",
                    "indexFilterable": True
                },
                {
                    "name": "published_at",
                    "dataType": ["date"],
                    "description": "The publication date"
                },
                {
                    "name": "views",
                    "dataType": ["int"],
                    "description": "Number of views"
                }
            ]
        }
    ]
}

# Create schema
client.schema.create(schema)

# Generate sample data
articles = []
for i in range(100):
    article = {
        "title": f"Article {i}: AI Advancements in {['Healthcare', 'Finance', 'Education', 'Technology', 'Science'][i % 5]}",
        "content": f"This is the content of article {i} discussing recent advancements in AI for {['healthcare', 'financial services', 'education', 'technology', 'scientific research'][i % 5]}. The article covers topics such as machine learning, deep learning, and natural language processing.",
        "category": ["Technology", "Science", "Health", "Business", "Education"][i % 5],
        "published_at": f"2023-{i % 12 + 1:02d}-{i % 28 + 1:02d}T00:00:00Z",
        "views": i * 100
    }
    articles.append(article)

# Batch insert
client.batch.configure(batch_size=10)
with client.batch as batch:
    for article in articles:
        batch.add_data_object(
            data_object=article,
            class_name="Article",
            uuid=generate_uuid5(article, "Article")
        )

# Vector search
response = (
    client.query
    .get("Article", ["title", "content", "category", "views"])
    .with_near_text({"concepts": ["machine learning advancements"]})
    .with_limit(5)
    .do()
)

print("Vector search results:")
for article in response["data"]["Get"]["Article"]:
    print(f"Title: {article['title']}")
    print(f"Category: {article['category']}")
    print(f"Views: {article['views']}")
    print(f"Content snippet: {article['content'][:100]}...")
    print()

# Hybrid search
response = (
    client.query
    .get("Article", ["title", "category", "views"])
    .with_hybrid(
        query="machine learning",
        alpha=0.5  # Balance between vector and keyword search
    )
    .with_limit(5)
    .do()
)

print("Hybrid search results:")
for article in response["data"]["Get"]["Article"]:
    print(f"Title: {article['title']}")
    print(f"Category: {article['category']}")
    print(f"Views: {article['views']}")
    print()
Graph Operations
# Add a cross-reference property to create a knowledge graph
client.schema.property.create(
    "Article",
    {
        "name": "related_articles",
        "dataType": ["Article"],
        "description": "Related articles"
    }
)

# Create relationships between articles
for i in range(10):
    source_id = generate_uuid5(articles[i], "Article")
    target_id = generate_uuid5(articles[(i + 10) % 100], "Article")
    client.data_object.reference.add(
        from_uuid=source_id,
        from_property_name="related_articles",
        to_uuid=target_id,
        from_class_name="Article",
        to_class_name="Article"
    )

# Graph traversal - look up the starting article by id
article_id = generate_uuid5(articles[0], "Article")
response = (
    client.query
    .get("Article", ["title", "category"])
    .with_where({
        "path": ["id"],
        "operator": "Equal",
        "valueString": article_id
    })
    .with_limit(1)
    .do()
)

if response["data"]["Get"]["Article"]:
    current_article = response["data"]["Get"]["Article"][0]
    print(f"Current article: {current_article['title']}")

# Follow the related_articles references of that article
response = (
    client.query
    .get(
        "Article",
        ["title", "related_articles { ... on Article { title category } }"]
    )
    .with_where({
        "path": ["id"],
        "operator": "Equal",
        "valueString": article_id
    })
    .with_limit(1)
    .do()
)

print("Related articles:")
results = response["data"]["Get"]["Article"]
if results and results[0].get("related_articles"):
    for related in results[0]["related_articles"]:
        print(f"  - {related['title']} ({related['category']})")
Module Integration
# The corresponding modules (e.g. generative-openai and qna-openai) must be
# enabled in the Weaviate server configuration before these queries will work.

# Example using the generative module
response = (
    client.query
    .get("Article", ["title", "content"])
    .with_near_text({"concepts": ["machine learning"]})
    .with_limit(1)
    .with_generate(
        single_prompt="Summarize this article in 3 bullet points: {content}"
    )
    .do()
)

print("Generated summary:")
if response["data"]["Get"]["Article"]:
    article = response["data"]["Get"]["Article"][0]
    print(f"Title: {article['title']}")
    print("Summary:")
    print(article["_additional"]["generate"]["singleResult"])
    print()

# Example using the Q&A module
# ask performs its own semantic retrieval, so no separate near_text operator is used;
# the answer is returned under _additional.answer and must be requested explicitly
response = (
    client.query
    .get(
        "Article",
        ["title", "content", "_additional { answer { hasAnswer result certainty } }"]
    )
    .with_limit(1)
    .with_ask({
        "question": "What are the main applications of deep learning mentioned?",
        "properties": ["content"]
    })
    .do()
)

print("Q&A results:")
if response["data"]["Get"]["Article"]:
    article = response["data"]["Get"]["Article"][0]
    answer = article["_additional"]["answer"]
    print(f"Title: {article['title']}")
    print(f"Answer: {answer['result']}")
    if answer.get("certainty") is not None:
        print(f"Certainty: {answer['certainty']:.2f}")
Filtering and Aggregations
# Filtering
response = (
    client.query
    .get("Article", ["title", "category", "views"])
    .with_where({
        "operator": "And",
        "operands": [
            {
                "path": ["category"],
                "operator": "Equal",
                "valueString": "Technology"
            },
            {
                "path": ["views"],
                "operator": "GreaterThan",
                "valueInt": 500
            }
        ]
    })
    .with_limit(5)
    .do()
)

print("Filtered results (Technology articles with >500 views):")
for article in response["data"]["Get"]["Article"]:
    print(f"  - {article['title']} ({article['views']} views)")

# Aggregations - count articles per category
response = (
    client.query
    .aggregate("Article")
    .with_group_by_filter(["category"])
    .with_fields("groupedBy { value }")
    .with_meta_count()
    .do()
)

print("\nCategory distribution:")
for group in response["data"]["Aggregate"]["Article"]:
    print(f"  - {group['meta']['count']} articles in {group['groupedBy']['value']}")

# Near vector search with filtering
sample_vector = np.random.rand(384).tolist()  # Assuming 384-dim vectors (e.g. a MiniLM-based text2vec-transformers model)
response = (
    client.query
    .get("Article", ["title", "category"])
    .with_near_vector({
        "vector": sample_vector,
        "certainty": 0.7
    })
    .with_where({
        "path": ["views"],
        "operator": "GreaterThan",
        "valueInt": 300
    })
    .with_limit(3)
    .do()
)

print("\nNear vector search with filtering:")
for article in response["data"]["Get"]["Article"]:
    print(f"  - {article['title']} ({article['category']})")
Performance Optimization
Index Configuration
| Parameter | Description | Recommendation |
|---|---|---|
| vectorIndexType | Type of vector index | HNSW for most cases |
| vectorIndexConfig | HNSW configuration | Tune maxConnections and efConstruction |
| ef | Size of dynamic list during search | Start with 64, increase as needed |
| cleanupIntervalSeconds | Index cleanup interval | Adjust based on update frequency |
| maxConnections | Maximum connections in HNSW | Typically 64-128 |
| efConstruction | Size of dynamic list during construction | Typically 128-512 |
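These parameters are set per class under vectorIndexConfig. A sketch of a class tuned for somewhat higher recall (the Document class and the values are illustrative starting points, not benchmarked recommendations):

import weaviate

client = weaviate.Client("http://localhost:8080")

tuned_class = {
    "class": "Document",               # hypothetical class
    "vectorizer": "text2vec-transformers",
    "vectorIndexType": "hnsw",
    "vectorIndexConfig": {
        "ef": 128,                     # size of the dynamic candidate list at query time
        "efConstruction": 256,         # candidate list size while building the graph
        "maxConnections": 64,          # maximum edges per node in the HNSW graph
        "cleanupIntervalSeconds": 300
    },
    "properties": [
        {"name": "body", "dataType": ["text"]}
    ]
}
client.schema.create_class(tuned_class)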
Query Optimization
- Limit Results: Use appropriate limit values
- Selective Fields: Only request needed fields
- Batch Operations: Use batch for inserts/updates
- Caching: Cache frequent queries
- Filter Early: Apply filters before vector search
- Pagination: Use cursor-based pagination for large result sets (see the sketch below)
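A sketch of the cursor approach with the v3 Python client, assuming the Article class from the implementation section and a client/server version that supports the cursor API (with_after); the cursor pages by object id and cannot be combined with search operators such as nearText:

import weaviate

client = weaviate.Client("http://localhost:8080")

# Page through all Article objects using the id-based cursor
cursor = None
while True:
    query = (
        client.query
        .get("Article", ["title"])
        .with_additional(["id"])
        .with_limit(100)
    )
    if cursor is not None:
        query = query.with_after(cursor)
    page = query.do()["data"]["Get"]["Article"]
    if not page:
        break
    for article in page:
        print(article["title"])
    cursor = page[-1]["_additional"]["id"]  # last id becomes the next cursor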
Benchmarking
import time

def benchmark_query(query_builder, iterations=10):
    """Benchmark query performance."""
    # Warm-up
    for _ in range(3):
        query_builder.do()

    # Benchmark
    start_time = time.time()
    for _ in range(iterations):
        result = query_builder.do()
    total_time = time.time() - start_time

    avg_time = total_time / iterations
    qps = 1 / avg_time

    print("Benchmark results:")
    print(f"  Average time: {avg_time:.4f}s")
    print(f"  Queries per second: {qps:.2f}")
    print(f"  Objects returned: {len(result['data']['Get']['Article'])}")

    return {
        "avg_time": avg_time,
        "qps": qps,
        "objects_returned": len(result['data']['Get']['Article'])
    }

# Benchmark different query types
print("Benchmarking vector search:")
vector_results = benchmark_query(
    client.query.get("Article", ["title"]).with_near_text({"concepts": ["machine learning"]}).with_limit(10)
)

print("\nBenchmarking hybrid search:")
hybrid_results = benchmark_query(
    client.query.get("Article", ["title"]).with_hybrid(query="machine learning", alpha=0.5).with_limit(10)
)

print("\nBenchmarking filtered search:")
filtered_results = benchmark_query(
    client.query.get("Article", ["title"])
    .with_near_text({"concepts": ["machine learning"]})
    .with_where({
        "path": ["views"],
        "operator": "GreaterThan",
        "valueInt": 500
    })
    .with_limit(10)
)

# Compare results
print("\nComparison:")
print(f"Vector search: {vector_results['qps']:.2f} QPS")
print(f"Hybrid search: {hybrid_results['qps']:.2f} QPS")
print(f"Filtered search: {filtered_results['qps']:.2f} QPS")
Challenges
Technical Challenges
- Schema Design: Complex schema modeling
- Vector Indexing: Balancing accuracy and performance
- Graph Operations: Efficient graph traversal
- Module Integration: Managing module dependencies
- Scaling: Horizontal scaling challenges
Practical Challenges
- Deployment: Setting up Weaviate clusters
- Monitoring: Tracking performance metrics
- Data Migration: Moving data between instances
- Module Configuration: Configuring vectorizers
- Query Optimization: Writing efficient queries
Operational Challenges
- Resource Management: Balancing CPU/memory usage
- Cost Management: Optimizing infrastructure costs
- Disaster Recovery: Ensuring data durability
- Security: Managing access controls
- Compliance: Meeting regulatory requirements
Research and Advancements
Key Developments
- 2021: Weaviate established itself as an open-source vector search engine whose data model combines vector similarity search with a knowledge graph of cross-referenced objects
- 2022: Hybrid search was introduced, fusing vector similarity with keyword (BM25) relevance to improve search quality
- 2023: The module ecosystem expanded, with generative, Q&A, and other modules making the engine extensible for retrieval-augmented applications
Emerging Research Directions
- Adaptive Indexing: Indexes that adapt to data distribution
- Multi-Modal Search: Search across different data types
- Privacy-Preserving Search: Secure similarity search
- Explainable Search: Interpretable search results
- Real-Time Indexing: Instant index updates
- Edge Deployment: Local vector search
- Federated Search: Search across multiple instances
- AutoML Integration: Automated machine learning pipelines
Best Practices
Design
- Schema Planning: Design schema for specific use cases
- Class Hierarchy: Plan class relationships carefully
- Property Design: Define properties with appropriate data types
- Vectorizer Selection: Choose appropriate vectorizer
- Module Selection: Select modules based on requirements
Implementation
- Start Small: Begin with simple schema
- Iterative Development: Build incrementally
- Monitor Performance: Track query latency and throughput
- Optimize Queries: Tune query performance
- Use Batching: Batch operations for efficiency
Production
- Scale Gradually: Monitor and scale as needed
- Implement Monitoring: Track system health
- Plan for Disaster Recovery: Implement backup strategies
- Secure Access: Implement access controls
- Optimize Configuration: Tune Weaviate parameters
Maintenance
- Update Regularly: Keep Weaviate updated
- Monitor Indexes: Track index health
- Optimize Schema: Refine schema as requirements evolve
- Backup Data: Regularly backup important data
- Document Configuration: Document system configuration