向量检索实现智能化数据查询 — GreptimeDB v0.10 新功能详解#3

在数字化浪潮下，数据如潮水般涌来，我们时刻被海量数据环绕。假设你置身一座超大型图书馆，馆内藏有数以亿计的书籍、文档、图片及视频等资料。此时若要快速找出与某张特定图片内容相似的资料，或是主题与某段文字描述相近的文档，传统检索方式往往难以胜任。

传统检索多基于关键词匹配，它在处理简单、明确的查询时或许有效，但面对复杂的语义理解和相似性判断就会捉襟见肘。比如，当你想查找含义相近但用词不同的文本、日志，或者风格类似的图片时，传统检索常常无法给出精准结果。而向量检索正是这一问题的根本解决之道。

向量检索：迈向更智能的信息获取

向量检索不同于传统关键词精准匹配，通过将文本、图像、音频等各类数据转化为数字向量，这些向量如同数据的“数字指纹”，蕴含关键特征信息。通过计算向量间的相似度，能迅速定位与目标数据相似的其他数据。例如，搜索图片时，向量检索先将图片转化为向量，并在海量的图片向量库中进行比对，找到与原图最相似的图像。

向量检索为高效、精准获取信息开辟了新途径。GreptimeDB v0.10 基于蚂蚁集团开源的 VSAG 向量检索库（https://github.com/antgroup/vsag），引入了向量数据类型和强大的向量检索能力，下面让我们一同深入了解，看它如何在数据海洋中崭露头角。

GreptimeDB 同步开源了 VSAG 的 Rust binding 库，感兴趣的朋友可以前往 GitHub 查看代码： https://github.com/GreptimeTeam/VSAG-sys

本文将基于 Python3.12 来举例，写作例子参考了《Vector Databases: A Beginner’s Guide!》，在此致谢。

文章源：https://medium.com/data-and-beyond/vector-databases-a-beginners-guide-b050cbbe9ca0f

安装依赖

首先安装 Python 一些必要的依赖库：

shell

pip3 install wget --quiet
pip3 install openai==1.3.3 --quiet
pip3 install sentence-transformers --quiet
pip3 install pandas --quiet
pip3 install sqlalchemy --quiet
pip3 install mysql-connector-python --quiet

其中 sentence-transformers 用于文本的向量化，sqlalchemy 和 mysql-connector-python 用于和 GreptimeDB 连接并进行读写操作，其他都是工具库。

下载模型和数据集

我们采用 AG 语料库的新闻作为数据集，总共 2000 条记录。首先，导入必要的依赖包：

python

import json
import os
import pandas as pd
import wget
from sentence_transformers import SentenceTransformer
import sqlalchemy as sa
from sqlalchemy import create_engine

下载模型：

Python

model = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')

下载并解析数据集：

Python


cvs_file_path = 'https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/AG_news_samples.csv'
file_path = 'AG_news_samples.csv'

if not os.path.exists(file_path):
    wget.download(cvs_file_path, file_path)
    print('File downloaded successfully.')
else:
    print('File already exists in the local file system.')

df = pd.read_csv('AG_news_samples.csv')
data = df.to_dict(orient='records')

创建表

假设你已经按照 GreptimeDB 安装指南正确安装了单机版本的 GreptimeDB（也可以用 GreptimeCloud 托管服务来测试本例子），我们通过 MySQL 客户端连上数据库或者打开 Dashboard，创建表。

GreptimeDB 安装指南：https://docs.greptime.cn/getting-started/installation/greptimedb-standalone

GreptimeCloud 免费注册：https://greptime.com/product/cloud

sql

CREATE TABLE IF NOT EXISTS news_articles (
    title STRING FULLTEXT,
    description STRING FULLTEXT,
    genre STRING,
    embedding VECTOR(768),
    ts timestamp default current_timestamp(),
    PRIMARY KEY(title),
    TIME INDEX(ts)
);

其中：

title、 description 和 genre 对应文章的标题、描述和类型信息，都是 STRING 类型，并且标题和描述做了全文索引。
embedding 设置为 768 维的向量类型 VECTOR。
GreptimeDB 的表模型强制要求有时间戳列，因为我们测试的数据集没有文章的创建时间，因此这里设置 ts 默认值为 current_timestamp。

写入数据

接下来我们对数据集的 description 做 embedding：

python

descriptions = [row['description'] for row in data]
all_embeddings = model.encode(descriptions)

GreptimeDB 在使用 SQL 插入向量数据类型的时候，需要将向量转为字符串来处理，因此我们编写一个函数来将向量数组字符串化，并处理数据集：

python

def embedding_s(embedding):
    return f"[{','.join(map(str, embedding))}]"
    
for row, embedding in zip(data, all_embeddings):
    row['embedding'] = embedding_s(embedding)

如果您使用 gRPC 客户端来写入就不需要这种转换操作，可以直接写入二进制。

连接数据库：

python

connection_string = "mysql+mysqlconnector://root:@0.0.0.0:4002/public"
conn = create_engine(connection_string, echo=True).connect()

这里我们连接的是本地的 0.0.0.0:4002 端口的 GreptimeDB，并且数据库是 public，用户名为 root，密码为空。如果您使用 GreptimeCloud 测试，请替换为正确的参数。

数据写入：

python

statement = sa.text('''
    INSERT INTO news_articles (
        title,
        description,
        genre,
        embedding
    )
    VALUES (
        :title,
        :description,
        :label,
        :embedding
    )
''')

for i in range(0, len(data), 100):
    conn.execute(statement, data[i:i + 100])

我们按照 100 每批的方式将数据写入到 GreptimeDB。如果一切顺利，我们可以尝试在 MySQL 客户端或者 dashboard 执行一条查询：

sql

SELECT title, description, genre, vec_to_string(embedding) 
   FROM news_articles LIMIT 1\G;

你将看到如下结果：

我们通过 vec_to_string 将向量类型字符串化并输出。

向量检索

接下来我们尝试执行向量检索，搜索近似语义的文章：

python

search_query = 'China Sports'
search_embedding = embedding_s(model.encode(search_query))

query_statement = sa.text('''
    SELECT
        title,
        description,
        genre,
        vec_dot_product(embedding, :embedding) AS score
    FROM news_articles
    ORDER BY score DESC
    LIMIT 10
''')

results = pd.DataFrame(conn.execute(query_statement, dict(embedding=search_embedding)))
print(results)

如果一切顺利，你将看到输出：

bash

                                               title                                        description     genre     score
0  Yao Ming, Rockets in Shanghai for NBA #39;s fi...  SHANGHAI, China The Houston Rockets have arriv...    Sports  0.487205
1         Day 6 Roundup: China back on winning track  After two days of gloom, China was back on the...    Sports  0.438824
2                     NBA brings its game to Beijing  BEIJING The NBA has reached booming, basketbal...    Sports  0.434785
3                  China supreme heading for Beijing  ATHENS: China, the dominant force in world div...    Sports  0.414838
4          IBM, Unisys work to rejuvenate mainframes  Big Blue adds features, beefs up training effo...  Sci/Tech  0.403031
5                        China set for F1 Grand Prix  A bird #39;s eye view of the circuit at Shangh...    Sports  0.401599
6      China Computer Maker Acquires IBM PC Biz (AP)  AP - China's biggest computer maker, Lenovo Gr...  Sci/Tech  0.382631
7  Wagers on oil price prove a slippery slope for...  State-owned, running a monopoly on imports of ...  Business  0.374331
8               Microsoft Order Cancelled by Beijing  The Chinese city of Beijing has cancelled an o...  Sci/Tech  0.359765
9     Trading Losses at Chinese Firm Coming to Light  The disclosure this week that a Singapore-list...  Business  0.352431

姚明是头条新闻！

我们这里先将搜索关键字做了 embedding，然后使用了 vec_dot_product 函数来计算两个向量的点积作为相似度的分数并排序，限制输出结果为 10 个。

更多向量函数请参考文档 https://docs.greptime.com/nightly/reference/sql/functions/vector/

我们再试下基于全文索引的匹配结果：

python

search_query = 'China Sports'
query_statement = sa.text('''
    SELECT
        title,
        description,
        genre
    FROM news_articles
    WHERE matches(description, :search_query)
    LIMIT 10
''')

results = pd.DataFrame(conn.execute(query_statement, dict(search_query=search_query)))
print(results)

可以看到结果和之前完全不一样：

Java


                                               title                                        description     genre
0                 Beijing signs pact for Asean trade  VIENTIANE, Laos China moved yet another step c...     World
1               China to strengthen coal mine safety  China will take tough measures this winter to ...     World
2               Grace Park, Koch share lead in Korea  Jeju Island, South Korea (Sports Network) - Gr...    Sports
3  T. rex #39;s smaller ancestor was covered with...  Fossil remains of the oldest and smallest know...  Sci/Tech
4           Boston Red Sox Team Report - September 6  (Sports Network) - Two of the top teams in the...    Sports
5           China's Economic Boom Still Roaring (AP)  AP - China's economic boom is still roaring de...  Business
6                      Guo tucks away gold for China  China's Guo Jingjing easily won the women's 3-...    Sports
7               Microsoft Order Cancelled by Beijing  The Chinese city of Beijing has cancelled an o...  Sci/Tech
8  Powell wins backing for new NKorea pressure, b...  AFP - US Secretary of State Colin Powell wrap...     World
9  U.S. to Urge China to Push for More N.Korea Talks   BEIJING (Reuters) - Secretary of State Colin ...     World