Weaviate

Open-source vector search engine with built-in knowledge graph capabilities for AI applications.

What is Weaviate?

Weaviate is an open-source vector search engine that combines vector similarity search with knowledge graph capabilities. It enables semantic search, question answering, and other AI-powered applications by providing a flexible, schema-based approach to storing and querying both vectors and structured data.

Key Concepts

Weaviate Architecture

graph TD
    A[Weaviate] --> B[Core Engine]
    A --> C[Storage Layer]
    A --> D[API Layer]
    A --> E[Modules]

    B --> B1[Vector Index]
    B --> B2[Graph Engine]
    B --> B3[Query Processor]

    C --> C1[Vector Storage]
    C --> C2[Object Storage]
    C --> C3[Metadata Storage]

    D --> D1[REST API]
    D --> D2[GraphQL API]
    D --> D3[gRPC]

    E --> E1[NLP Modules]
    E --> E2[Image Modules]
    E --> E3[Custom Modules]

    style A fill:#f9f,stroke:#333

Core Features

  1. Hybrid Search: Combine vector and keyword search
  2. Knowledge Graph: Built-in graph capabilities
  3. Schema Flexibility: Dynamic schema definition
  4. Modular Design: Extensible with modules
  5. Multi-Tenancy: Support for multiple isolated tenants (see the sketch after this list)
  6. Real-Time Updates: Instant index updates
  7. Multi-Language Support: SDKs for multiple languages
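
Multi-tenancy can be illustrated with a short sketch. This is a minimal example, assuming Weaviate 1.20+ and a recent v3 Python client; the multiTenancyConfig schema key, the Tenant helper, and schema.add_class_tenants are the relevant pieces, and the class and tenant names are hypothetical:

import weaviate
from weaviate import Tenant

client = weaviate.Client("http://localhost:8080")

# Opt the class into multi-tenancy at creation time
client.schema.create_class({
    "class": "Document",
    "vectorizer": "text2vec-transformers",
    "multiTenancyConfig": {"enabled": True},
})

# Each tenant gets its own shard; every read/write must name a tenant
client.schema.add_class_tenants("Document", [Tenant(name="customer-a")])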

Approaches and Architecture

Data Model

Weaviate uses a flexible, schema-based data model:

classDiagram
    class Class {
        +name: string
        +description: string
        +vectorizer: string
        +properties: Property[]
    }

    class Property {
        +name: string
        +dataType: string[]
        +description: string
        +indexFilterable: boolean
        +indexSearchable: boolean
    }

    class Object {
        +id: uuid
        +class: string
        +properties: map
        +vector: float[]
    }

    Class "1" --> "*" Property
    Class "1" --> "*" Object

Index Types

| Index Type | Description | Use Case |
| --- | --- | --- |
| HNSW | Hierarchical Navigable Small World graph | Default; high performance at scale |
| Flat | Brute-force exact search | Small datasets, exact results |
| Dynamic | Starts as a flat index and switches to HNSW as the collection grows | General purpose |
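
The index type is chosen at class-creation time. A minimal sketch with the v3 Python client, assuming Weaviate is running locally (the class name is hypothetical, and the flat and dynamic types require newer Weaviate versions):

import weaviate

client = weaviate.Client("http://localhost:8080")

# "vectorIndexType" accepts "hnsw" (the default), "flat", or "dynamic"
client.schema.create_class({
    "class": "SmallCollection",
    "vectorizer": "text2vec-transformers",
    "vectorIndexType": "flat",  # brute-force exact search
})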

Module System

Weaviate's modular architecture allows for extensibility:

| Module Type | Description | Example Modules |
| --- | --- | --- |
| Vectorizers | Convert data to vectors | text2vec-transformers, img2vec-neural |
| Generative | Generate text from retrieved objects (RAG) | generative-openai |
| Q&A | Extractive question answering | qna-openai, qna-transformers |
| NER | Named entity recognition | ner-transformers |
| Summarization | Text summarization | sum-transformers |
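
Modules are attached to a class through its moduleConfig. A short sketch of wiring a vectorizer module into a class definition (passed to client.schema.create_class), assuming text2vec-transformers is enabled on the server; vectorizeClassName and skip are standard module config keys:

article_class = {
    "class": "Article",
    "vectorizer": "text2vec-transformers",
    "moduleConfig": {
        # Do not fold the class name into the vector
        "text2vec-transformers": {"vectorizeClassName": False}
    },
    "properties": [
        {
            "name": "title",
            "dataType": ["text"],
            "moduleConfig": {
                # Include this property when computing the vector
                "text2vec-transformers": {"skip": False}
            }
        }
    ],
}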

Mathematical Foundations

Weaviate's hybrid search fuses a vector similarity score with a keyword (BM25) relevance score:

$$S = \alpha \cdot S_v + (1 - \alpha) \cdot S_k$$

Where:

  • $S$ = final hybrid score
  • $S_v$ = normalized vector similarity score
  • $S_k$ = normalized keyword (BM25) relevance score
  • $\alpha$ = weighting factor ($0 \le \alpha \le 1$): $\alpha = 1$ gives pure vector search, $\alpha = 0$ pure keyword search
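
The fusion itself is simple arithmetic. A toy illustration of the formula (not Weaviate's internal implementation, which also normalizes the two scores before fusing them):

def hybrid_score(s_v: float, s_k: float, alpha: float = 0.5) -> float:
    """Alpha-weighted fusion of a vector score and a keyword score."""
    return alpha * s_v + (1 - alpha) * s_k

# alpha = 1.0 reduces to pure vector search, alpha = 0.0 to pure keyword search
print(hybrid_score(0.9, 0.4, alpha=0.75))  # 0.775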

Graph Traversal

Cross-references turn Weaviate objects into a knowledge graph that queries can traverse. Conceptually, relating two objects amounts to finding a path between them:

$$P = \text{ShortestPath}(s, t, G)$$

Where:

  • $P$ = path between source and target
  • $s$ = source node
  • $t$ = target node
  • $G$ = knowledge graph
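
As an illustration of the idea only (Weaviate itself traverses cross-references through GraphQL, as shown in the Graph Operations section below, not through user-supplied path-finding code), a shortest path in an unweighted reference graph can be found with breadth-first search:

from collections import deque

def shortest_path(graph: dict, s: str, t: str):
    """BFS shortest path from s to t in a {node: [neighbors]} graph."""
    parents = {s: None}
    queue = deque([s])
    while queue:
        node = queue.popleft()
        if node == t:  # reconstruct the path by walking back to s
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # no path exists

# Article A references B, which references C
print(shortest_path({"A": ["B"], "B": ["C"], "C": []}, "A", "C"))  # ['A', 'B', 'C']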

Applications

Semantic Applications

  • Semantic Search: Context-aware search
  • Question Answering: AI-powered Q&A systems
  • Knowledge Management: Organizational knowledge graphs
  • Content Discovery: Personalized content recommendations
  • Enterprise Search: Unified search across data sources

AI Systems

  • Chatbots: Context-aware conversational AI
  • Recommendation Systems: Personalized recommendations
  • Content Moderation: Automated content filtering
  • Fraud Detection: Anomaly detection
  • Decision Support: Data-driven decision making

Industry-Specific

  • Healthcare: Medical knowledge graphs
  • Finance: Risk assessment and analysis
  • E-commerce: Product recommendations
  • Media: Content discovery and personalization
  • Education: Personalized learning systems

Implementation

Basic Usage

import weaviate
import numpy as np
from weaviate.util import generate_uuid5

# Connect to Weaviate (examples use the v3 Python client: pip install "weaviate-client<4")
client = weaviate.Client("http://localhost:8080")

# Define schema
schema = {
    "classes": [
        {
            "class": "Article",
            "description": "A news article",
            "vectorizer": "text2vec-transformers",
            "properties": [
                {
                    "name": "title",
                    "dataType": ["text"],
                    "description": "The title of the article",
                    "indexSearchable": True
                },
                {
                    "name": "content",
                    "dataType": ["text"],
                    "description": "The content of the article",
                    "indexSearchable": True
                },
                {
                    "name": "category",
                    "dataType": ["string"],
                    "description": "The category of the article",
                    "indexFilterable": True
                },
                {
                    "name": "published_at",
                    "dataType": ["date"],
                    "description": "The publication date"
                },
                {
                    "name": "views",
                    "dataType": ["int"],
                    "description": "Number of views"
                }
            ]
        }
    ]
}

# Create schema
client.schema.create(schema)

# Generate sample data
articles = []
for i in range(100):
    article = {
        "title": f"Article {i}: AI Advancements in {['Healthcare', 'Finance', 'Education', 'Technology', 'Science'][i % 5]}",
        "content": f"This is the content of article {i} discussing recent advancements in AI for {['healthcare', 'financial services', 'education', 'technology', 'scientific research'][i % 5]}. The article covers topics such as machine learning, deep learning, and natural language processing.",
        "category": ["Technology", "Science", "Health", "Business", "Education"][i % 5],
        "published_at": f"2023-{i % 12 + 1:02d}-{i % 28 + 1:02d}T00:00:00Z",
        "views": i * 100
    }
    articles.append(article)

# Batch insert
with client.batch as batch:
    batch.batch_size = 10
    for article in articles:
        batch.add_data_object(
            data_object=article,
            class_name="Article",
            uuid=generate_uuid5(article, "Article")
        )

# Vector search
response = (
    client.query
    .get("Article", ["title", "content", "category", "views"])
    .with_near_text({"concepts": ["machine learning advancements"]})
    .with_limit(5)
    .do()
)

print("Vector search results:")
for article in response["data"]["Get"]["Article"]:
    print(f"Title: {article['title']}")
    print(f"Category: {article['category']}")
    print(f"Views: {article['views']}")
    print(f"Content snippet: {article['content'][:100]}...")
    print()

# Hybrid search
response = (
    client.query
    .get("Article", ["title", "category", "views"])
    .with_hybrid(
        query="machine learning",
        alpha=0.5  # Balance between vector and keyword search
    )
    .with_limit(5)
    .do()
)

print("Hybrid search results:")
for article in response["data"]["Get"]["Article"]:
    print(f"Title: {article['title']}")
    print(f"Category: {article['category']}")
    print(f"Views: {article['views']}")
    print()

Graph Operations

# Add cross-references to create a knowledge graph
client.schema.property.create(
    "Article",
    {
        "name": "related_articles",
        "dataType": ["Article"],
        "description": "Related articles"
    }
)

# Create relationships between articles
for i in range(10):
    source_id = generate_uuid5(articles[i], "Article")
    target_id = generate_uuid5(articles[(i + 10) % 100], "Article")

    client.data_object.reference.add(
        from_uuid=source_id,
        from_property_name="related_articles",
        to_uuid=target_id
    )

# Graph traversal - find related articles
article_id = generate_uuid5(articles[0], "Article")
response = (
    client.query
    .get("Article", ["title", "category"])
    .with_where({
        "path": ["id"],
        "operator": "Equal",
        "valueString": article_id
    })
    .with_limit(1)
    .do()
)

if response["data"]["Get"]["Article"]:
    current_article = response["data"]["Get"]["Article"][0]
    print(f"Current article: {current_article['title']}")

    # Get related articles by traversing the cross-reference from the
    # source article (the reference is selected as a GraphQL sub-field)
    response = (
        client.query
        .get(
            "Article",
            ["title", "related_articles { ... on Article { title category } }"]
        )
        .with_where({
            "path": ["id"],
            "operator": "Equal",
            "valueString": article_id
        })
        .do()
    )

    print("Related articles:")
    source = response["data"]["Get"]["Article"][0]
    for related in source.get("related_articles") or []:
        print(f"  - {related['title']} ({related['category']})")

Module Integration

# Modules are enabled server-side, e.g. via the ENABLE_MODULES environment
# variable in Weaviate's docker-compose configuration

# Example using the generative module (e.g. generative-openai)
response = (
    client.query
    .get("Article", ["title", "content"])
    .with_near_text({"concepts": ["machine learning"]})
    .with_limit(1)
    .with_generate(
        single_prompt="Summarize this article in 3 bullet points: {content}"
    )
    .do()
)

print("Generated summary:")
if response["data"]["Get"]["Article"]:
    article = response["data"]["Get"]["Article"][0]
    print(f"Title: {article['title']}")
    print("Summary:")
    print(article["_additional"]["generate"]["singleResult"])
    print()

# Example using a Q&A module (e.g. qna-transformers); ask is itself a
# search operator, so it replaces near_text, and the extractive answer
# comes back under _additional.answer
response = (
    client.query
    .get(
        "Article",
        ["title", "_additional { answer { hasAnswer result certainty } }"]
    )
    .with_ask({
        "question": "What are the main applications of deep learning mentioned?",
        "properties": ["content"]
    })
    .with_limit(1)
    .do()
)

print("Q&A results:")
if response["data"]["Get"]["Article"]:
    article = response["data"]["Get"]["Article"][0]
    answer = article["_additional"]["answer"]
    print(f"Title: {article['title']}")
    print(f"Answer: {answer['result']}")
    if answer.get("certainty") is not None:
        print(f"Certainty: {answer['certainty']:.2f}")

Filtering and Aggregations

# Filtering
response = (
    client.query
    .get("Article", ["title", "category", "views"])
    .with_where({
        "operator": "And",
        "operands": [
            {
                "path": ["category"],
                "operator": "Equal",
                "valueString": "Technology"
            },
            {
                "path": ["views"],
                "operator": "GreaterThan",
                "valueInt": 500
            }
        ]
    })
    .with_limit(5)
    .do()
)

print("Filtered results (Technology articles with >500 views):")
for article in response["data"]["Get"]["Article"]:
    print(f"  - {article['title']} ({article['views']} views)")

# Aggregations: group by category and count the objects in each group
response = (
    client.query
    .aggregate("Article")
    .with_group_by_filter(["category"])
    .with_fields("groupedBy { value }")
    .with_meta_count()
    .do()
)

print("\nCategory distribution:")
for group in response["data"]["Aggregate"]["Article"]:
    print(f"  - {group['meta']['count']} articles in {group['groupedBy']['value']}")

# Near vector search with filtering
sample_vector = np.random.rand(384).tolist()  # Query vector; its length must match the vectorizer's output dimensionality (384 here)
response = (
    client.query
    .get("Article", ["title", "category"])
    .with_near_vector({
        "vector": sample_vector,
        "certainty": 0.7
    })
    .with_where({
        "path": ["views"],
        "operator": "GreaterThan",
        "valueInt": 300
    })
    .with_limit(3)
    .do()
)

print("\nNear vector search with filtering:")
for article in response["data"]["Get"]["Article"]:
    print(f"  - {article['title']} ({article['category']})")

Performance Optimization

Index Configuration

| Parameter | Description | Recommendation |
| --- | --- | --- |
| vectorIndexType | Type of vector index | HNSW for most cases |
| vectorIndexConfig | HNSW configuration | Tune maxConnections and efConstruction |
| ef | Size of the dynamic candidate list during search | Start with 64; increase for higher recall |
| cleanupIntervalSeconds | Index cleanup interval | Adjust based on update frequency |
| maxConnections | Maximum connections per node in the HNSW graph | Typically 64-128 |
| efConstruction | Size of the dynamic candidate list during construction | Typically 128-512 |
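
Some of these parameters can be changed on a live class. A minimal sketch using the v3 client's schema.update_config, with the caveat that only search-time parameters such as ef are mutable after the index is built; efConstruction and maxConnections must be set at class-creation time:

# Raise the search-time candidate list for higher recall
client.schema.update_config(
    "Article",
    {"vectorIndexConfig": {"ef": 128}}
)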

Query Optimization

  1. Limit Results: Use appropriate limit values
  2. Selective Fields: Only request needed fields
  3. Batch Operations: Use batch for inserts/updates
  4. Caching: Cache frequent queries
  5. Filter Early: Apply filters before vector search
  6. Pagination: Use a cursor for large result sets (see the sketch after this list)
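
For item 6, a sketch of cursor-based pagination with the v3 client's with_after, which pages by object id and cannot be combined with search operators such as near_text:

cursor = None
while True:
    query = (
        client.query
        .get("Article", ["title"])
        .with_additional(["id"])  # the cursor is the last object's id
        .with_limit(50)
    )
    if cursor is not None:
        query = query.with_after(cursor)
    page = query.do()["data"]["Get"]["Article"]
    if not page:
        break  # no more results
    for article in page:
        ...  # process each article here
    cursor = page[-1]["_additional"]["id"]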

Benchmarking

import time

def benchmark_query(query_builder, iterations=10):
    """Benchmark query performance"""
    # Warm-up
    for _ in range(3):
        query_builder.do()

    # Benchmark
    start_time = time.time()
    for _ in range(iterations):
        result = query_builder.do()
    total_time = time.time() - start_time

    avg_time = total_time / iterations
    qps = 1 / avg_time

    print(f"Benchmark results:")
    print(f"  Average time: {avg_time:.4f}s")
    print(f"  Queries per second: {qps:.2f}")
    print(f"  Objects returned: {len(result['data']['Get']['Article'])}")

    return {
        "avg_time": avg_time,
        "qps": qps,
        "objects_returned": len(result['data']['Get']['Article'])
    }

# Benchmark different query types
print("Benchmarking vector search:")
vector_results = benchmark_query(
    client.query.get("Article", ["title"]).with_near_text({"concepts": ["machine learning"]}).with_limit(10)
)

print("\nBenchmarking hybrid search:")
hybrid_results = benchmark_query(
    client.query.get("Article", ["title"]).with_hybrid(query="machine learning", alpha=0.5).with_limit(10)
)

print("\nBenchmarking filtered search:")
filtered_results = benchmark_query(
    client.query.get("Article", ["title"])
    .with_near_text({"concepts": ["machine learning"]})
    .with_where({
        "path": ["views"],
        "operator": "GreaterThan",
        "valueInt": 500
    })
    .with_limit(10)
)

# Compare results
print("\nComparison:")
print(f"Vector search: {vector_results['qps']:.2f} QPS")
print(f"Hybrid search: {hybrid_results['qps']:.2f} QPS")
print(f"Filtered search: {filtered_results['qps']:.2f} QPS")

Challenges

Technical Challenges

  • Schema Design: Complex schema modeling
  • Vector Indexing: Balancing accuracy and performance
  • Graph Operations: Efficient graph traversal
  • Module Integration: Managing module dependencies
  • Scaling: Horizontal scaling challenges

Practical Challenges

  • Deployment: Setting up Weaviate clusters
  • Monitoring: Tracking performance metrics
  • Data Migration: Moving data between instances
  • Module Configuration: Configuring vectorizers
  • Query Optimization: Writing efficient queries

Operational Challenges

  • Resource Management: Balancing CPU/memory usage
  • Cost Management: Optimizing infrastructure costs
  • Disaster Recovery: Ensuring data durability
  • Security: Managing access controls
  • Compliance: Meeting regulatory requirements

Research and Advancements

Key Developments

  1. "Weaviate: A Vector Search Engine with Graph Capabilities" (2021)
    • Introduced Weaviate architecture
    • Combined vector search and knowledge graphs
  2. "Hybrid Search in Vector Databases" (2022)
    • Combined vector and keyword search
    • Improved search relevance
  3. "Modular Vector Search Engines" (2023)
    • Introduced modular architecture
    • Extensible with custom modules

Emerging Research Directions

  • Adaptive Indexing: Indexes that adapt to data distribution
  • Multi-Modal Search: Search across different data types
  • Privacy-Preserving Search: Secure similarity search
  • Explainable Search: Interpretable search results
  • Real-Time Indexing: Instant index updates
  • Edge Deployment: Local vector search
  • Federated Search: Search across multiple instances
  • AutoML Integration: Automated machine learning pipelines

Best Practices

Design

  • Schema Planning: Design schema for specific use cases
  • Class Hierarchy: Plan class relationships carefully
  • Property Design: Define properties with appropriate data types
  • Vectorizer Selection: Choose appropriate vectorizer
  • Module Selection: Select modules based on requirements

Implementation

  • Start Small: Begin with simple schema
  • Iterative Development: Build incrementally
  • Monitor Performance: Track query latency and throughput
  • Optimize Queries: Tune query performance
  • Use Batching: Batch operations for efficiency

Production

  • Scale Gradually: Monitor and scale as needed
  • Implement Monitoring: Track system health
  • Plan for Disaster Recovery: Implement backup strategies
  • Secure Access: Implement access controls
  • Optimize Configuration: Tune Weaviate parameters

Maintenance

  • Update Regularly: Keep Weaviate updated
  • Monitor Indexes: Track index health
  • Optimize Schema: Refine schema as requirements evolve
  • Backup Data: Regularly backup important data
  • Document Configuration: Document system configuration

External Resources