Enhanced Full-Text Indexing in GreptimeDB v0.14: Bloom vs Tantivy Backend Analysis

Log analysis in GreptimeDB v0.14 gets a substantial upgrade. The release introduces dual-backend full-text indexing, with a choice of Bloom filter and Tantivy backends, plus new query operators that make text search more intuitive and efficient. This is more than an incremental update: it rethinks how an observability database handles text search at scale.

The Evolution of Full-Text Search

Traditional log analysis forces uncomfortable trade-offs: you get either fast indexing with limited search capabilities or powerful search with massive storage overhead. GreptimeDB v0.14 eases this trade-off by providing two specialized backends, each optimized for a different use case, so you can choose the one that fits your workload.

The new release brings 247 merged pull requests including 100 feature enhancements, with significant focus on full-text indexing improvements that directly impact how organizations handle log analysis workflows.

New Query Operators: `matches_term` and `@@`

Version 0.14 introduces intuitive text matching with the new `matches_term` function and its `@@` operator shorthand:

```sql
-- Using the matches_term function
SELECT * FROM logs WHERE matches_term(message, 'error') OR matches_term(message, 'fail');

-- Using the @@ operator (shorthand)
SELECT * FROM logs WHERE message @@ 'error' OR message @@ 'fail';
```

These operators provide exact term matching with boundary detection:

Case-sensitive matching for precise results: searching for error does not match Error
Word boundary detection prevents partial matches: searching for error does not match errors
Multi-word phrase support for more complex search patterns
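
To make these three rules concrete, here is a small Python sketch that mimics them outside the database. It is an illustration only, not GreptimeDB's implementation: the function name `matches_term_sketch` is hypothetical, and it assumes a boundary is any non-alphanumeric character or the start or end of the text.

```python
def matches_term_sketch(text: str, term: str) -> bool:
    """Return True if `term` occurs in `text` as a whole term.

    Assumption (not taken from GreptimeDB's source): a term boundary is the
    start or end of the text, or any non-alphanumeric character.
    """
    if not term:
        return False
    start = 0
    while True:
        idx = text.find(term, start)  # case-sensitive substring search
        if idx == -1:
            return False
        end = idx + len(term)
        left_ok = idx == 0 or not text[idx - 1].isalnum()
        right_ok = end == len(text) or not text[end].isalnum()
        if left_ok and right_ok:
            return True
        start = idx + 1  # keep scanning for a later occurrence


print(matches_term_sketch("disk error: retrying", "error"))   # True
print(matches_term_sketch("3 errors found", "error"))         # False (partial match)
print(matches_term_sketch("Error in module", "error"))        # False (case differs)
print(matches_term_sketch("upstream connection timeout after 5s",
                          "connection timeout"))              # True (phrase)
```

The phrase case works because the whole search string is treated as one unit and only its outer edges are checked for boundaries.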

Dual-Backend Architecture: Bloom vs Tantivy

GreptimeDB v0.14's dual-backend approach allows users to choose the optimal indexing strategy based on their specific workload characteristics:

Bloom Backend: Optimized for General-Purpose Search

Characteristic	Details
Best For	General-purpose log search across diverse patterns
Storage Overhead	~10% of raw data size (extremely efficient)
Query Performance	Consistent across all query types
Memory Usage	Minimal impact on system resources

Example Storage Comparison:

Raw log data: 10GB
Bloom index: 1GB
Total storage: 11GB (10% overhead)
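
A Bloom filter is a compact, probabilistic set: it may report a token as present when it is not (a false positive), but it never misses a token that was added. The sketch below is a generic illustration of how such filters can be used to skip data that cannot contain a search term. It assumes log lines are grouped into segments and split on whitespace; the class and function names are hypothetical, and this is not GreptimeDB's actual index layout.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, token: str):
        # Derive several bit positions per token by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, token: str) -> None:
        for pos in self._positions(token):
            self.bits |= 1 << pos

    def might_contain(self, token: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(token))


def build_segment_filters(segments):
    """Build one filter per segment (a segment is a list of log lines)."""
    filters = []
    for lines in segments:
        bloom = BloomFilter()
        for line in lines:
            for token in line.split():
                bloom.add(token)
        filters.append(bloom)
    return filters


def candidate_segments(filters, term):
    """Return indexes of segments that still need to be scanned for `term`."""
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(term)]


segments = [["disk error on node1"], ["request ok"]]
filters = build_segment_filters(segments)
print(candidate_segments(filters, "error"))  # always includes segment 0
```

Because the filter stores only bits rather than the tokens themselves, its size is fixed by `num_bits`, which is the intuition behind the low storage overhead described above.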

Tantivy Backend: Precision-Focused Architecture

Characteristic	Details
Best For	High-selectivity queries (TraceID, unique identifiers)
Storage Overhead	~100% of raw data size (inverted index)
Selective Queries	5x faster than Bloom for unique lookups
General Queries	5x slower than Bloom for broad searches

Example Storage Comparison:

Raw log data: 10GB
Tantivy index: 10GB
Total storage: 20GB (100% overhead)
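
The two storage examples follow from the same arithmetic. The short sketch below reproduces them so you can plug in your own data volume; the function names are hypothetical, and the ratios are the approximate figures quoted in this article rather than guaranteed values.

```python
def index_size_gb(raw_gb: float, overhead_ratio: float) -> float:
    """Estimated index size for a given raw data size."""
    return raw_gb * overhead_ratio


def total_storage_gb(raw_gb: float, overhead_ratio: float) -> float:
    """Raw data plus its full-text index."""
    return raw_gb + index_size_gb(raw_gb, overhead_ratio)


# Approximate overhead ratios quoted in this article.
OVERHEAD = {"bloom": 0.10, "tantivy": 1.00}

for backend, ratio in OVERHEAD.items():
    index = index_size_gb(10, ratio)
    total = total_storage_gb(10, ratio)
    print(f"{backend}: index {index:.1f} GB, total {total:.1f} GB")
# bloom: index 1.0 GB, total 11.0 GB
# tantivy: index 10.0 GB, total 20.0 GB
```

At 500 GB of raw logs, the same ratios give roughly 550 GB with Bloom versus 1,000 GB with Tantivy, so the gap grows linearly with data volume.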

Performance Benchmarking Results

Real-world testing reveals significant performance differences based on query selectivity:

High-Selectivity Queries (TraceID, UserID)

Backend	Relative Performance (Bloom = 1x)
Tantivy	5x faster
Bloom	1x (baseline)
LIKE Query	50x slower

Low-Selectivity Queries (Common Terms)

Backend	Relative Performance
Bloom	1x (baseline)
Tantivy	5x slower
LIKE Query	1x (equivalent)

Choosing the Right Backend Strategy

Decision matrix for backend selection:

Use Bloom Backend When:

Diverse query patterns across your log corpus
Storage efficiency is a primary concern
Consistent performance matters more than peak speed
Budget constraints limit infrastructure resources

Use Tantivy Backend When:

Trace ID lookups dominate your query patterns
Unique identifier searches are performance-critical
Storage costs are less constraining than query speed
High-precision matching is essential
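
The two checklists can be condensed into a small helper. The sketch below is one possible encoding, not an official rule: the `Workload` fields, the function name, and the 50% threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Fraction (0.0-1.0) of queries that look up trace IDs or other unique identifiers.
    unique_id_query_share: float
    # True when storage cost or infrastructure budget is the primary concern.
    storage_constrained: bool


def choose_backend(workload: Workload, threshold: float = 0.5) -> str:
    """Pick a full-text backend name from a rough workload description.

    The threshold is a hypothetical cutoff, not a documented recommendation.
    """
    if workload.storage_constrained:
        return "bloom"      # storage efficiency wins over peak lookup speed
    if workload.unique_id_query_share >= threshold:
        return "tantivy"    # unique-identifier lookups dominate
    return "bloom"          # diverse query patterns, consistent performance


print(choose_backend(Workload(unique_id_query_share=0.8, storage_constrained=False)))  # tantivy
print(choose_backend(Workload(unique_id_query_share=0.8, storage_constrained=True)))   # bloom
print(choose_backend(Workload(unique_id_query_share=0.1, storage_constrained=False)))  # bloom
```

Checking the storage constraint first reflects the roughly tenfold difference in index size between the two backends; adjust the order if query latency matters more in your environment.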

Advanced Configuration Options

Backend-specific configuration allows fine-tuning for optimal performance:

```sql
-- Bloom backend configuration
CREATE TABLE logs_bloom (
    message STRING FULLTEXT INDEX WITH (backend = 'bloom'),
    service STRING,
    ts TIMESTAMP,
    TIME INDEX (ts)
);

-- Tantivy backend configuration
CREATE TABLE logs_tantivy (
    message STRING FULLTEXT INDEX WITH (backend = 'tantivy'),
    trace_id STRING,
    ts TIMESTAMP,
    TIME INDEX (ts)
);
```

Integration with Time-Series Queries

Full-text search combines seamlessly with time-series filtering:

```sql
-- Hybrid query combining text search and time filtering
SELECT service, COUNT(*) AS error_count
FROM logs
WHERE ts > now() - INTERVAL '1 hour'
  AND message @@ 'timeout'
  AND service != 'health-check'
GROUP BY service
ORDER BY error_count DESC;
```

This unified approach eliminates the need for separate log and metrics storage systems.

Memory and Resource Optimization

GreptimeDB's columnar storage provides significant advantages for text indexing:

Memory Efficiency

Bloom filters: 400MB memory usage for a 10GB dataset
Traditional solutions: often require 12GB+ of memory
Memory efficiency: roughly 32x better than Elasticsearch

Compression Benefits

Structured log parsing: parsed logs take about 13% of the storage used by raw logs
Column-specific compression: Optimized for each data type
Automatic data lifecycle: Intelligent tiering based on access patterns

Migration and Adoption Strategy

Transitioning to enhanced full-text indexing:

Evaluate query patterns to determine optimal backend choice
Start with Bloom backend for general-purpose workloads
Migrate high-selectivity queries to Tantivy backend
Monitor performance metrics to validate configuration choices
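
For the first step, one way to evaluate query patterns is to measure how selective typical searches are, that is, what fraction of rows each one matches. The following is a minimal sketch, assuming you can sample the matched-row counts of recent queries; the function names and the 0.1% cutoff are hypothetical.

```python
from typing import Sequence


def selectivity(matched_rows: int, total_rows: int) -> float:
    """Fraction of rows a query matches; smaller means more selective."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    return matched_rows / total_rows


def high_selectivity_share(matched_counts: Sequence[int],
                           total_rows: int,
                           cutoff: float = 0.001) -> float:
    """Share of sampled queries that match at most `cutoff` of all rows."""
    if not matched_counts:
        return 0.0
    selective = sum(1 for matched in matched_counts
                    if selectivity(matched, total_rows) <= cutoff)
    return selective / len(matched_counts)


# Matched-row counts for four sampled queries against a 1,000,000-row table.
sampled = [1, 3, 500_000, 2]
print(high_selectivity_share(sampled, total_rows=1_000_000))  # 0.75
```

A high share suggests a workload dominated by identifier-style lookups, while a low share points to broad searches over common terms.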

Real-World Performance Impact

Organizations migrating to GreptimeDB v0.14's enhanced indexing report:

50-80% reduction in query response times
70% decrease in storage costs compared to Elasticsearch
Simplified operations with unified metrics and logs storage

The dual-backend architecture enables organizations to optimize for their specific use cases without compromising on functionality or performance.

Ready to enhance your log analysis capabilities? GreptimeDB v0.14's full-text indexing delivers the performance and flexibility needed for modern observability workloads.

About Greptime

GreptimeDB is an open-source, cloud-native database purpose-built for real-time observability. Built in Rust and optimized for cloud-native environments, it provides unified storage and processing for metrics, logs, and traces, delivering sub-second insights from edge to cloud at any scale.

GreptimeDB OSS – The open-sourced database for small to medium-scale observability and IoT use cases, ideal for personal projects or dev/test environments.
GreptimeDB Enterprise – A robust observability database with enhanced security, high availability, and enterprise-grade support.
GreptimeCloud – A fully managed, serverless DBaaS with elastic scaling and zero operational overhead. Built for teams that need speed, flexibility, and ease of use out of the box.

🚀 We’re open to contributors—get started with issues labeled good first issue and connect with our community.
