
Scaling Log Management! Migrating from Loki to GreptimeDB at OB Cloud

Background

OceanBase, launched in 2010, is a natively distributed database independently developed by Ant Group. Its unified architecture combines distributed scalability with centralized performance, delivers full Oracle/MySQL compatibility, supports diverse workloads including Transaction Processing (TP) and real-time Analytical Processing (AP), and natively integrates vector search and multi-modal hybrid search. OceanBase has helped more than 2,000 customers upgrade their databases across industries including financial services, telecom, retail, and internet.

In 2022, OceanBase launched its cloud database service, OB Cloud, to help customers build modern data architectures and simplify their tech stack with an integrated cloud database architecture. Operating across 170+ availability zones in 50+ regions, OB Cloud leverages infrastructure from Alibaba Cloud, Huawei Cloud, Tencent Cloud, AWS, and Google Cloud to provide consistent global performance, meeting diverse business growth needs.

The Loki Performance Challenge

OB Cloud initially deployed Grafana Loki across multiple cloud environments to unify log storage and streamline operations. Inspired by Prometheus, Loki is a log aggregation system that indexes only log metadata (labels), not raw log content. Its support for object storage also lowers storage costs, making it a popular choice. OB Cloud's Loki-based log storage architecture is shown below:

(Figure 1: OB Cloud's Loki-based Log Storage Architecture)

Fluent Bit agents deployed on each node collect application pod logs and ingest them into Loki. The log viewer invokes the log query service to retrieve logs based on search conditions (e.g., keywords) and renders the results. The log query service constructs Loki-compatible queries and executes them via Loki's API.
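To make the query path concrete, here is a minimal Python sketch of how a query service might turn search conditions into a LogQL query and a request URL for Loki's `query_range` endpoint. The label names, the base URL, and the helper names are illustrative assumptions, not OB Cloud's actual implementation.

```python
from typing import Dict, Optional
from urllib.parse import urlencode


def _quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_logql(labels: Dict[str, str], keyword: Optional[str] = None) -> str:
    """Build a LogQL query: a label selector plus an optional line filter."""
    selector = ", ".join(f"{k}={_quote(v)}" for k, v in sorted(labels.items()))
    query = "{" + selector + "}"
    if keyword:
        # `|=` keeps only log lines that contain the keyword.
        query += " |= " + _quote(keyword)
    return query


def build_query_range_url(base_url: str, labels: Dict[str, str],
                          keyword: Optional[str], start_ns: int,
                          end_ns: int, limit: int = 100) -> str:
    """Build the request URL for Loki's /loki/api/v1/query_range endpoint."""
    params = {
        "query": build_logql(labels, keyword),
        "start": start_ns,  # Unix epoch in nanoseconds
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


# Hypothetical labels and host name, for illustration only.
example_query = build_logql({"host": "node-1", "app": "observer"}, "timeout")
example_url = build_query_range_url(
    "http://loki.example.internal:3100",
    {"host": "node-1"}, "timeout",
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_000_300_000_000_000,
)
```

Only the label selector is served by Loki's index; the `|=` line filter is evaluated by scanning the selected log streams.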

As workloads scaled, Loki's limitations became significant. Queries against large log volumes frequently timed out. Furthermore, Loki's indexing is restricted to labels, offering no acceleration for searches within the log body text. As a result, the query service had to restrict default query ranges to just a few minutes.

Migrating to GreptimeDB

After evaluating alternatives, OB Cloud migrated to GreptimeDB for log management. In the new architecture, Fluent Bit agents write directly to GreptimeDB, while the query service uses GreptimeDB's SQL interface for retrieval. The transition yielded immediate improvements: queries that previously timed out on Loki now consistently complete in about one second or less, enabling users to search across hours or days of logs rather than minutes.

(Figure 2: OB Cloud's GreptimeDB-based Log Storage Architecture)

Technical Practices

Multi-Cloud Native Deployment Architecture

To support major global cloud vendors such as Alibaba Cloud, Huawei Cloud, Tencent Cloud, AWS, and Google Cloud, OB Cloud needs an internal log service that works across multiple clouds. Its GreptimeDB deployment architecture is illustrated below:

(Figure 3: GreptimeDB Multi-Cloud Deployment Architecture with OB Cloud)

OB Cloud deploys dedicated GreptimeDB clusters within each cloud environment, integrating directly with each cloud vendor's native object storage (S3, OSS, COS). GreptimeDB's native support for multiple object storage backends fits OB Cloud's requirements well, and its unified SQL interface simplifies integration. Kubernetes-native deployment and management, together with a built-in dashboard that simplifies usage and debugging, make day-to-day operations significantly more convenient.

Pipeline-Based Log Processing

Fluent Bit outputs JSON-formatted logs that GreptimeDB processes through customizable pipelines. These pipelines extract critical fields like hostnames and filenames into indexed columns for efficient filtering.

During implementation, we found that some OB Cloud applications produce log entries containing multiple newlines, which caused the simple dissect processor to fail. For these logs, we used the regex processor to extract log fields.

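```yaml
processors:
  # Split each record into a leading timestamp and the rest of the message.
  # (?s:.*) lets the message group span embedded newlines.
  - regex:
      fields:
        - message
      patterns:
        - '^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(\.\d+)?)[ ,](?P<message>(?s:.*))$'
      ignore_missing: true
  # Parse the extracted timestamp, trying several fractional-second precisions.
  - date:
      fields:
        - message_timestamp
      formats:
        - '%Y-%m-%d %H:%M:%S%.3f'
        - '%Y-%m-%d %H:%M:%S%.6f'
        - '%Y-%m-%d %H:%M:%S%.9f'
        - '%Y-%m-%d %H:%M:%S%.f'
        - '%Y-%m-%d %H:%M:%S'
      timezone: 'Asia/Shanghai'
      ignore_missing: true
transform:
  - fields:
      - message
    type: string
  # file and host become tag columns for filtering.
  - fields:
      - file
      - host
    type: string
    index: tag
  # Rename message_timestamp to timestamp and use it as the time index.
  - fields:
      - message_timestamp, timestamp
    type: epoch, ns
    index: timestamp
```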
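The effect of the `(?s:.*)` group in the pattern above can be checked outside the pipeline. The following Python sketch applies the same pattern to a made-up multi-line log entry and contrasts it with a variant whose message group cannot cross newlines. Python's `re` module stands in for the pipeline's regex engine here, which is an assumption about equivalent behavior for this pattern; the sample log line and the `extract` helper are hypothetical. The `<field>_<group>` naming mirrors the `message_timestamp` field used in the configuration.

```python
import re

# Same pattern as in the pipeline: the message group may span newlines.
MULTI_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(\.\d+)?)"
    r"[ ,](?P<message>(?s:.*))$"
)

# Variant without the (?s:...) flag: "." stops at the first newline.
SINGLE_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(\.\d+)?)"
    r"[ ,](?P<message>.*)$"
)


def extract(field: str, raw: str) -> dict:
    """Return named groups as '<field>_<group>' keys, or {} if no match."""
    match = MULTI_LINE.match(raw)
    if match is None:
        return {}
    return {f"{field}_{name}": value
            for name, value in match.groupdict().items()}


# Made-up entry with a stack trace spread over several lines.
raw = ("2024-05-01 12:00:00.123 ERROR request failed\n"
       "  at handler()\n"
       "  at main()")

fields = extract("message", raw)
single_line_match = SINGLE_LINE.match(raw)  # fails on multi-line input
```

With the scoped `s` flag, the whole stack trace ends up in one message field; without it, the entry does not match at all, which mirrors the kind of failure a line-oriented split runs into.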

Tuning Fluent Bit

During traffic spikes, we observed "mem buf overlimit" warnings in Fluent Bit's logs, which indicate backpressure. Diagnosis showed that the issue stemmed from the interaction between Fluent Bit's backpressure mechanism and our configuration:

  • Fluent Bit uses an in-memory buffer (mem_buf) to hold collected logs. If this buffer fills up, Fluent Bit pauses collection, triggering the mem buf overlimit message – indicating backpressure.
  • The mem_buf_limit parameter controls the buffer size and thus Fluent Bit's memory usage.
  • Fluent Bit flushes data from the mem_buf to its outputs at intervals defined by the flush parameter.
  • If flush is set too high relative to mem_buf_limit, Fluent Bit might not send data fast enough to keep pace with high log generation rates, causing delays during peaks.
  • Log file rotation could cause delays because Fluent Bit's default interval for checking the file list is 60 seconds (Refresh_Interval). Reducing this interval minimizes collection latency after rotation.

Fine-tuning Fluent Bit's flush, mem_buf_limit, and Refresh_Interval parameters proved highly effective in reducing log delays and backpressure occurrences.
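The relationship between flush and mem_buf_limit can be reasoned about with a rough back-of-the-envelope model. The Python sketch below assumes, as a deliberate simplification, that logs accumulate in the buffer at a constant rate and that each flush fully drains it; real Fluent Bit behavior also depends on chunking, retries, and output throughput. The function names and the traffic figures are hypothetical.

```python
MIB = 1024 * 1024


def bytes_between_flushes(ingest_bytes_per_s: float, flush_s: float) -> float:
    """Data that piles up in the buffer during one flush interval."""
    return ingest_bytes_per_s * flush_s


def would_pause(ingest_bytes_per_s: float, flush_s: float,
                mem_buf_limit_bytes: float) -> bool:
    """True if the buffer would exceed its limit before the next flush."""
    return bytes_between_flushes(ingest_bytes_per_s, flush_s) > mem_buf_limit_bytes


def max_flush_interval(ingest_bytes_per_s: float,
                       mem_buf_limit_bytes: float) -> float:
    """Longest flush interval (seconds) that stays within the limit."""
    return mem_buf_limit_bytes / ingest_bytes_per_s


# Hypothetical peak: 4 MiB/s of logs against a 10 MiB buffer limit.
peak_rate = 4 * MIB
limit = 10 * MIB

pauses_with_5s_flush = would_pause(peak_rate, 5, limit)  # 20 MiB > 10 MiB
pauses_with_1s_flush = would_pause(peak_rate, 1, limit)  # 4 MiB <= 10 MiB
longest_safe_flush = max_flush_interval(peak_rate, limit)
```

Under this model, either lowering flush or raising mem_buf_limit removes the pause, at the cost of more frequent writes or higher memory usage respectively.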

Advanced Indexing Strategies

GreptimeDB offers more powerful indexing capabilities than Loki, letting users select the optimal index type for specific query patterns. A common pattern in OB Cloud is searching log text for keywords. Loki could only perform brute-force text matching for this, which scaled poorly.

GreptimeDB's brute-force text search is already fast, and it also provides indexes to accelerate phrase matching. Users can enable them as needed:

```sql
CREATE TABLE db_log (
  ts TIMESTAMP TIME INDEX,
  message TEXT FULLTEXT INDEX
);
```

Use the matches_term function to find logs containing the term "system failure":

```sql
SELECT * FROM db_log WHERE matches_term(message, 'system failure');
```

OB Cloud creates indexes on relevant log text fields to accelerate keyword searches.

For structured logs, commonly searched fields can be extracted into dedicated columns. Indexes (such as secondary indexes) can then be created on these columns, allowing direct filtering and significantly faster queries.
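As an illustration of how a query service might combine both techniques, the Python sketch below builds a SQL statement that filters on an extracted column and a time range before applying matches_term to the log text. It assumes a table that, unlike the minimal db_log example, also has a host column; the helper names, column list, and sample values are hypothetical. Quoting uses the standard SQL convention of doubling single quotes; a production service would more likely rely on its client library's parameter binding.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_log_query(table: str, host: str, term: str,
                    start: str, end: str, limit: int = 100) -> str:
    """Filter by an extracted column and time range, then match a term."""
    return (
        f"SELECT ts, host, message FROM {table} "
        f"WHERE host = {sql_literal(host)} "
        f"AND ts >= {sql_literal(start)} AND ts < {sql_literal(end)} "
        f"AND matches_term(message, {sql_literal(term)}) "
        f"ORDER BY ts DESC LIMIT {int(limit)}"
    )


# Hypothetical search: one host, one day, one term.
query = build_log_query(
    "db_log", "node-1", "system failure",
    "2024-05-01 00:00:00", "2024-05-02 00:00:00", limit=50,
)
```

Putting the column and time-range predicates alongside the text match lets the database narrow the candidate rows before any log text has to be examined.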

Results: Scaling Without Compromise

The new log architecture powered by GreptimeDB is now live across all OB Cloud environments, processing hundreds of millions of log entries daily. The migration from Loki to GreptimeDB delivered substantial performance and cost improvements: query response times improved by 10x, queries that previously tended to time out now complete in under a second, and Total Cost of Ownership (TCO) fell by 30%.

Enhanced Log Management at Scale

Log query response times and reliability have improved dramatically. With diverse index types and efficient keyword search, GreptimeDB makes it faster to locate specific logs within massive datasets, which speeds up troubleshooting. This makes it easier for OB Cloud users to manage large-scale business data, particularly during troubleshooting and performance monitoring.

Cloud-Native Deployment Simplified

GreptimeDB's cloud-native design and native object storage compatibility ensure operational flexibility in heterogeneous environments, simplifying OB Cloud's multi-cloud deployment and management. Crucially, it also allows OB Cloud to deliver a consistent service experience worldwide. For enterprise users, this multi-cloud flexibility enhances system resilience and scalability while maintaining high availability.

Improved User Experience and Scalability

Optimizations in OB Cloud's log processing, particularly the fine-tuning of Fluent Bit, have further boosted system scalability and stability. The seamless integration between GreptimeDB and Fluent Bit maintains efficient log collection and storage even under heavy-load scenarios. By tuning caching and ingestion configurations, OB Cloud ensures the logging system remains responsive during traffic spikes, which enhances the overall user experience.
