Weaviate

Open-source vector search engine with built-in knowledge graph capabilities for AI applications.

What is Weaviate?

Weaviate is an open-source vector search engine that combines vector similarity search with knowledge graph capabilities. It enables semantic search, question answering, and other AI-powered applications by providing a flexible, schema-based approach to storing and querying both vectors and structured data.

Key Concepts

Weaviate Architecture

graph TD
    A[Weaviate] --> B[Core Engine]
    A --> C[Storage Layer]
    A --> D[API Layer]
    A --> E[Modules]

    B --> B1[Vector Index]
    B --> B2[Graph Engine]
    B --> B3[Query Processor]

    C --> C1[Vector Storage]
    C --> C2[Object Storage]
    C --> C3[Metadata Storage]

    D --> D1[REST API]
    D --> D2[GraphQL API]
    D --> D3[gRPC]

    E --> E1[NLP Modules]
    E --> E2[Image Modules]
    E --> E3[Custom Modules]

    style A fill:#f9f,stroke:#333

Core Features

  1. Hybrid Search: Combine vector and keyword search
  2. Knowledge Graph: Built-in graph capabilities
  3. Schema Flexibility: Dynamic schema definition
  4. Modular Design: Extensible with modules
  5. Multi-Tenancy: Support for multiple isolated tenants (see the sketch after this list)
  6. Real-Time Updates: Instant index updates
  7. Multi-Language Support: SDKs for multiple languages
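
Multi-tenancy can be illustrated with a short sketch. This is a minimal example, assuming Weaviate 1.20+ and a recent v3 Python client; the multiTenancyConfig schema key, the Tenant helper, and schema.add_class_tenants are the relevant pieces, and the class and tenant names are hypothetical:

import weaviate
from weaviate import Tenant

client = weaviate.Client("http://localhost:8080")

# Opt the class into multi-tenancy at creation time
client.schema.create_class({
    "class": "Document",
    "vectorizer": "text2vec-transformers",
    "multiTenancyConfig": {"enabled": True},
})

# Each tenant gets its own shard; every read/write must name a tenant
client.schema.add_class_tenants("Document", [Tenant(name="customer-a")])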

Approaches and Architecture

Data Model

Weaviate uses a flexible, schema-based data model:

classDiagram
    class Class {
        +name: string
        +description: string
        +vectorizer: string
        +properties: Property[]
    }

    class Property {
        +name: string
        +dataType: string[]
        +description: string
        +indexFilterable: boolean
        +indexSearchable: boolean
    }

    class Object {
        +id: uuid
        +class: string
        +properties: map
        +vector: float[]
    }

    Class "1" --> "*" Property
    Class "1" --> "*" Object

Index Types

| Index Type | Description | Use Case |
| --- | --- | --- |
| HNSW | Hierarchical Navigable Small World graph | Default; high performance at scale |
| Flat | Brute-force exact search | Small datasets, exact results |
| Dynamic | Starts as a flat index and switches to HNSW as the collection grows | General purpose |
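
The index type is chosen at class-creation time. A minimal sketch with the v3 Python client, assuming Weaviate is running locally (the class name is hypothetical, and the flat and dynamic types require newer Weaviate versions):

import weaviate

client = weaviate.Client("http://localhost:8080")

# "vectorIndexType" accepts "hnsw" (the default), "flat", or "dynamic"
client.schema.create_class({
    "class": "SmallCollection",
    "vectorizer": "text2vec-transformers",
    "vectorIndexType": "flat",  # brute-force exact search
})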

Module System

Weaviate's modular architecture allows for extensibility:

| Module Type | Description | Example Modules |
| --- | --- | --- |
| Vectorizers | Convert data to vectors | text2vec-transformers, img2vec-neural |
| Generative | Generate text from retrieved objects (RAG) | generative-openai |
| Q&A | Extractive question answering | qna-openai, qna-transformers |
| NER | Named entity recognition | ner-transformers |
| Summarization | Text summarization | sum-transformers |
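
Modules are attached to a class through its moduleConfig. A short sketch of wiring a vectorizer module into a class definition (passed to client.schema.create_class), assuming text2vec-transformers is enabled on the server; vectorizeClassName and skip are standard module config keys:

article_class = {
    "class": "Article",
    "vectorizer": "text2vec-transformers",
    "moduleConfig": {
        # Do not fold the class name into the vector
        "text2vec-transformers": {"vectorizeClassName": False}
    },
    "properties": [
        {
            "name": "title",
            "dataType": ["text"],
            "moduleConfig": {
                # Include this property when computing the vector
                "text2vec-transformers": {"skip": False}
            }
        }
    ],
}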

Mathematical Foundations

Weaviate's hybrid search fuses a vector similarity score with a keyword (BM25) relevance score:

$$S = \alpha \cdot S_v + (1 - \alpha) \cdot S_k$$

Where:

  • $S$ = final hybrid score
  • $S_v$ = normalized vector similarity score
  • $S_k$ = normalized keyword (BM25) relevance score
  • $\alpha$ = weighting factor ($0 \le \alpha \le 1$): $\alpha = 1$ gives pure vector search, $\alpha = 0$ pure keyword search
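
The fusion itself is simple arithmetic. A toy illustration of the formula (not Weaviate's internal implementation, which also normalizes the two scores before fusing them):

def hybrid_score(s_v: float, s_k: float, alpha: float = 0.5) -> float:
    """Alpha-weighted fusion of a vector score and a keyword score."""
    return alpha * s_v + (1 - alpha) * s_k

# alpha = 1.0 reduces to pure vector search, alpha = 0.0 to pure keyword search
print(hybrid_score(0.9, 0.4, alpha=0.75))  # 0.775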

Graph Traversal

Cross-references turn Weaviate objects into a knowledge graph that queries can traverse. Conceptually, relating two objects amounts to finding a path between them:

$$P = \text{ShortestPath}(s, t, G)$$

Where:

  • $P$ = path between source and target
  • $s$ = source node
  • $t$ = target node
  • $G$ = knowledge graph
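
As an illustration of the idea only (Weaviate itself traverses cross-references through GraphQL, as shown in the Graph Operations section below, not through user-supplied path-finding code), a shortest path in an unweighted reference graph can be found with breadth-first search:

from collections import deque

def shortest_path(graph: dict, s: str, t: str):
    """BFS shortest path from s to t in a {node: [neighbors]} graph."""
    parents = {s: None}
    queue = deque([s])
    while queue:
        node = queue.popleft()
        if node == t:  # reconstruct the path by walking back to s
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # no path exists

# Article A references B, which references C
print(shortest_path({"A": ["B"], "B": ["C"], "C": []}, "A", "C"))  # ['A', 'B', 'C']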

Applications

Semantic Applications

  • Semantic Search: Context-aware search
  • Question Answering: AI-powered Q&A systems
  • Knowledge Management: Organizational knowledge graphs
  • Content Discovery: Personalized content recommendations
  • Enterprise Search: Unified search across data sources

AI Systems

  • Chatbots: Context-aware conversational AI
  • Recommendation Systems: Personalized recommendations
  • Content Moderation: Automated content filtering
  • Fraud Detection: Anomaly detection
  • Decision Support: Data-driven decision making

Industry-Specific

  • Healthcare: Medical knowledge graphs
  • Finance: Risk assessment and analysis
  • E-commerce: Product recommendations
  • Media: Content discovery and personalization
  • Education: Personalized learning systems

Implementation

Basic Usage

import weaviate
import numpy as np
from weaviate.util import generate_uuid5

# Connect to Weaviate (examples use the v3 Python client: pip install "weaviate-client<4")
client = weaviate.Client("http://localhost:8080")

# Define schema
schema = {
    "classes": [
        {
            "class": "Article",
            "description": "A news article",
            "vectorizer": "text2vec-transformers",
            "properties": [
                {
                    "name": "title",
                    "dataType": ["text"],
                    "description": "The title of the article",
                    "indexSearchable": True
                },
                {
                    "name": "content",
                    "dataType": ["text"],
                    "description": "The content of the article",
                    "indexSearchable": True
                },
                {
                    "name": "category",
                    "dataType": ["string"],
                    "description": "The category of the article",
                    "indexFilterable": True
                },
                {
                    "name": "published_at",
                    "dataType": ["date"],
                    "description": "The publication date"
                },
                {
                    "name": "views",
                    "dataType": ["int"],
                    "description": "Number of views"
                }
            ]
        }
    ]
}

# Create schema
client.schema.create(schema)

# Generate sample data
articles = []
for i in range(100):
    article = {
        "title": f"Article {i}: AI Advancements in {['Healthcare', 'Finance', 'Education', 'Technology', 'Science'][i % 5]}",
        "content": f"This is the content of article {i} discussing recent advancements in AI for {['healthcare', 'financial services', 'education', 'technology', 'scientific research'][i % 5]}. The article covers topics such as machine learning, deep learning, and natural language processing.",
        "category": ["Technology", "Science", "Health", "Business", "Education"][i % 5],
        "published_at": f"2023-{i % 12 + 1:02d}-{i % 28 + 1:02d}T00:00:00Z",
        "views": i * 100
    }
    articles.append(article)

# Batch insert
with client.batch as batch:
    batch.batch_size = 10
    for article in articles:
        batch.add_data_object(
            data_object=article,
            class_name="Article",
            uuid=generate_uuid5(article, "Article")
        )

# Vector search
response = (
    client.query
    .get("Article", ["title", "content", "category", "views"])
    .with_near_text({"concepts": ["machine learning advancements"]})
    .with_limit(5)
    .do()
)

print("Vector search results:")
for article in response["data"]["Get"]["Article"]:
    print(f"Title: {article['title']}")
    print(f"Category: {article['category']}")
    print(f"Views: {article['views']}")
    print(f"Content snippet: {article['content'][:100]}...")
    print()

# Hybrid search
response = (
    client.query
    .get("Article", ["title", "category", "views"])
    .with_hybrid(
        query="machine learning",
        alpha=0.5  # Balance between vector and keyword search
    )
    .with_limit(5)
    .do()
)

print("Hybrid search results:")
for article in response["data"]["Get"]["Article"]:
    print(f"Title: {article['title']}")
    print(f"Category: {article['category']}")
    print(f"Views: {article['views']}")
    print()

Graph Operations

# Add cross-references to create a knowledge graph
client.schema.property.create(
    "Article",
    {
        "name": "related_articles",
        "dataType": ["Article"],
        "description": "Related articles"
    }
)

# Create relationships between articles
for i in range(10):
    source_id = generate_uuid5(articles[i], "Article")
    target_id = generate_uuid5(articles[(i + 10) % 100], "Article")

    client.data_object.reference.add(
        from_uuid=source_id,
        from_property_name="related_articles",
        to_uuid=target_id
    )

# Graph traversal - find related articles
article_id = generate_uuid5(articles[0], "Article")
response = (
    client.query
    .get("Article", ["title", "category"])
    .with_where({
        "path": ["id"],
        "operator": "Equal",
        "valueString": article_id
    })
    .with_limit(1)
    .do()
)

if response["data"]["Get"]["Article"]:
    current_article = response["data"]["Get"]["Article"][0]
    print(f"Current article: {current_article['title']}")

    # Get related articles by traversing the cross-reference from the
    # source article (the reference is selected as a GraphQL sub-field)
    response = (
        client.query
        .get(
            "Article",
            ["title", "related_articles { ... on Article { title category } }"]
        )
        .with_where({
            "path": ["id"],
            "operator": "Equal",
            "valueString": article_id
        })
        .do()
    )

    print("Related articles:")
    source = response["data"]["Get"]["Article"][0]
    for related in source.get("related_articles") or []:
        print(f"  - {related['title']} ({related['category']})")

Module Integration

# Modules are enabled server-side, e.g. via the ENABLE_MODULES environment
# variable in Weaviate's docker-compose configuration

# Example using the generative module (e.g. generative-openai)
response = (
    client.query
    .get("Article", ["title", "content"])
    .with_near_text({"concepts": ["machine learning"]})
    .with_limit(1)
    .with_generate(
        single_prompt="Summarize this article in 3 bullet points: {content}"
    )
    .do()
)

print("Generated summary:")
if response["data"]["Get"]["Article"]:
    article = response["data"]["Get"]["Article"][0]
    print(f"Title: {article['title']}")
    print("Summary:")
    print(article["_additional"]["generate"]["singleResult"])
    print()

# Example using a Q&A module (e.g. qna-transformers); ask is itself a
# search operator, so it replaces near_text, and the extractive answer
# comes back under _additional.answer
response = (
    client.query
    .get(
        "Article",
        ["title", "_additional { answer { hasAnswer result certainty } }"]
    )
    .with_ask({
        "question": "What are the main applications of deep learning mentioned?",
        "properties": ["content"]
    })
    .with_limit(1)
    .do()
)

print("Q&A results:")
if response["data"]["Get"]["Article"]:
    article = response["data"]["Get"]["Article"][0]
    answer = article["_additional"]["answer"]
    print(f"Title: {article['title']}")
    print(f"Answer: {answer['result']}")
    if answer.get("certainty") is not None:
        print(f"Certainty: {answer['certainty']:.2f}")

Filtering and Aggregations

# Filtering
response = (
    client.query
    .get("Article", ["title", "category", "views"])
    .with_where({
        "operator": "And",
        "operands": [
            {
                "path": ["category"],
                "operator": "Equal",
                "valueString": "Technology"
            },
            {
                "path": ["views"],
                "operator": "GreaterThan",
                "valueInt": 500
            }
        ]
    })
    .with_limit(5)
    .do()
)

print("Filtered results (Technology articles with >500 views):")
for article in response["data"]["Get"]["Article"]:
    print(f"  - {article['title']} ({article['views']} views)")

# Aggregations: group by category and count the objects in each group
response = (
    client.query
    .aggregate("Article")
    .with_group_by_filter(["category"])
    .with_fields("groupedBy { value }")
    .with_meta_count()
    .do()
)

print("\nCategory distribution:")
for group in response["data"]["Aggregate"]["Article"]:
    print(f"  - {group['meta']['count']} articles in {group['groupedBy']['value']}")

# Near vector search with filtering
sample_vector = np.random.rand(384).tolist()  # Query vector; its length must match the vectorizer's output dimensionality (384 here)
response = (
    client.query
    .get("Article", ["title", "category"])
    .with_near_vector({
        "vector": sample_vector,
        "certainty": 0.7
    })
    .with_where({
        "path": ["views"],
        "operator": "GreaterThan",
        "valueInt": 300
    })
    .with_limit(3)
    .do()
)

print("\nNear vector search with filtering:")
for article in response["data"]["Get"]["Article"]:
    print(f"  - {article['title']} ({article['category']})")

Performance Optimization

Index Configuration

| Parameter | Description | Recommendation |
| --- | --- | --- |
| vectorIndexType | Type of vector index | HNSW for most cases |
| vectorIndexConfig | HNSW configuration | Tune maxConnections and efConstruction |
| ef | Size of the dynamic candidate list during search | Start with 64; increase for higher recall |
| cleanupIntervalSeconds | Index cleanup interval | Adjust based on update frequency |
| maxConnections | Maximum connections per node in the HNSW graph | Typically 64-128 |
| efConstruction | Size of the dynamic candidate list during construction | Typically 128-512 |
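
Some of these parameters can be changed on a live class. A minimal sketch using the v3 client's schema.update_config, with the caveat that only search-time parameters such as ef are mutable after the index is built; efConstruction and maxConnections must be set at class-creation time:

# Raise the search-time candidate list for higher recall
client.schema.update_config(
    "Article",
    {"vectorIndexConfig": {"ef": 128}}
)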

Query Optimization

  1. Limit Results: Use appropriate limit values
  2. Selective Fields: Only request needed fields
  3. Batch Operations: Use batch for inserts/updates
  4. Caching: Cache frequent queries
  5. Filter Early: Apply filters before vector search
  6. Pagination: Use a cursor for large result sets (see the sketch after this list)
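
For item 6, a sketch of cursor-based pagination with the v3 client's with_after, which pages by object id and cannot be combined with search operators such as near_text:

cursor = None
while True:
    query = (
        client.query
        .get("Article", ["title"])
        .with_additional(["id"])  # the cursor is the last object's id
        .with_limit(50)
    )
    if cursor is not None:
        query = query.with_after(cursor)
    page = query.do()["data"]["Get"]["Article"]
    if not page:
        break  # no more results
    for article in page:
        ...  # process each article here
    cursor = page[-1]["_additional"]["id"]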

Benchmarking

import time

def benchmark_query(query_builder, iterations=10):
    """Benchmark query performance"""
    # Warm-up
    for _ in range(3):
        query_builder.do()

    # Benchmark
    start_time = time.time()
    for _ in range(iterations):
        result = query_builder.do()
    total_time = time.time() - start_time

    avg_time = total_time / iterations
    qps = 1 / avg_time

    print(f"Benchmark results:")
    print(f"  Average time: {avg_time:.4f}s")
    print(f"  Queries per second: {qps:.2f}")
    print(f"  Objects returned: {len(result['data']['Get']['Article'])}")

    return {
        "avg_time": avg_time,
        "qps": qps,
        "objects_returned": len(result['data']['Get']['Article'])
    }

# Benchmark different query types
print("Benchmarking vector search:")
vector_results = benchmark_query(
    client.query.get("Article", ["title"]).with_near_text({"concepts": ["machine learning"]}).with_limit(10)
)

print("\nBenchmarking hybrid search:")
hybrid_results = benchmark_query(
    client.query.get("Article", ["title"]).with_hybrid(query="machine learning", alpha=0.5).with_limit(10)
)

print("\nBenchmarking filtered search:")
filtered_results = benchmark_query(
    client.query.get("Article", ["title"])
    .with_near_text({"concepts": ["machine learning"]})
    .with_where({
        "path": ["views"],
        "operator": "GreaterThan",
        "valueInt": 500
    })
    .with_limit(10)
)

# Compare results
print("\nComparison:")
print(f"Vector search: {vector_results['qps']:.2f} QPS")
print(f"Hybrid search: {hybrid_results['qps']:.2f} QPS")
print(f"Filtered search: {filtered_results['qps']:.2f} QPS")

Challenges

Technical Challenges

  • Schema Design: Complex schema modeling
  • Vector Indexing: Balancing accuracy and performance
  • Graph Operations: Efficient graph traversal
  • Module Integration: Managing module dependencies
  • Scaling: Horizontal scaling challenges

Practical Challenges

  • Deployment: Setting up Weaviate clusters
  • Monitoring: Tracking performance metrics
  • Data Migration: Moving data between instances
  • Module Configuration: Configuring vectorizers
  • Query Optimization: Writing efficient queries

Operational Challenges

  • Resource Management: Balancing CPU/memory usage
  • Cost Management: Optimizing infrastructure costs
  • Disaster Recovery: Ensuring data durability
  • Security: Managing access controls
  • Compliance: Meeting regulatory requirements

Research and Advancements

Key Developments

  1. "Weaviate: A Vector Search Engine with Graph Capabilities" (2021)
    • Introduced Weaviate architecture
    • Combined vector search and knowledge graphs
  2. "Hybrid Search in Vector Databases" (2022)
    • Combined vector and keyword search
    • Improved search relevance
  3. "Modular Vector Search Engines" (2023)
    • Introduced modular architecture
    • Extensible with custom modules

Emerging Research Directions

  • Adaptive Indexing: Indexes that adapt to data distribution
  • Multi-Modal Search: Search across different data types
  • Privacy-Preserving Search: Secure similarity search
  • Explainable Search: Interpretable search results
  • Real-Time Indexing: Instant index updates
  • Edge Deployment: Local vector search
  • Federated Search: Search across multiple instances
  • AutoML Integration: Automated machine learning pipelines

Best Practices

Design

  • Schema Planning: Design schema for specific use cases
  • Class Hierarchy: Plan class relationships carefully
  • Property Design: Define properties with appropriate data types
  • Vectorizer Selection: Choose appropriate vectorizer
  • Module Selection: Select modules based on requirements

Implementation

  • Start Small: Begin with simple schema
  • Iterative Development: Build incrementally
  • Monitor Performance: Track query latency and throughput
  • Optimize Queries: Tune query performance
  • Use Batching: Batch operations for efficiency

Production

  • Scale Gradually: Monitor and scale as needed
  • Implement Monitoring: Track system health
  • Plan for Disaster Recovery: Implement backup strategies
  • Secure Access: Implement access controls
  • Optimize Configuration: Tune Weaviate parameters

Maintenance

  • Update Regularly: Keep Weaviate updated
  • Monitor Indexes: Track index health
  • Optimize Schema: Refine schema as requirements evolve
  • Backup Data: Regularly backup important data
  • Document Configuration: Document system configuration

External Resources