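Vector stores and retrievers

This tutorial will familiarize you with LangChain's vector store and retriever abstractions.

These abstractions are designed to support retrieval of data from vector databases and other sources, so that it can be integrated into LLM workflows.

They are important for applications that fetch external data during model inference, such as retrieval-augmented generation (RAG) applications.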

Concepts

This guide focuses on retrieval of text data and covers the following core concepts:

Documents
Vector stores
Retrievers

Setup

Installation

This tutorial requires the following packages:

pip

pip install langchain langchain-chroma langchain-openai

conda

conda install langchain langchain-chroma langchain-openai -c conda-forge

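Documents

LangChain provides a Document abstraction, which represents a unit of text and its associated metadata. It has two main attributes:

page_content: a string holding the text content of the document
metadata: a dictionary of key-value pairs describing the document

The metadata can capture any relevant information, such as the document's title, author, or creation date.

Note that an individual Document object often represents a chunk of a larger document.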
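To make the chunking idea concrete, here is a minimal sketch that splits one longer text into overlapping character chunks, each carrying its own metadata. It uses plain dictionaries with the same two fields rather than LangChain's Document class, and the helper name chunk_text, the chunk sizes, and the chunk_index metadata key are illustrative assumptions, not part of LangChain.

```python
def chunk_text(text, source, chunk_size=40, overlap=10):
    """Split `text` into overlapping character chunks shaped like Documents."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for index, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "page_content": text[start:start + chunk_size],
            # Every chunk keeps a pointer back to the document it came from.
            "metadata": {"source": source, "chunk_index": index},
        })
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


long_text = "Dogs are great companions. " * 4
chunks = chunk_text(long_text, source="mammal-pets-doc")
for chunk in chunks:
    print(chunk["metadata"], repr(chunk["page_content"]))
```

Each chunk shares the same source value, which is how a retrieved fragment can later be traced back to the document it was cut from.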

Let's create some sample documents:

from langchain_core.documents import Document

documents = [
    Document(
        page_content="Dogs are great companions, known for their loyalty and friendliness.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Cats are independent pets that often enjoy their own space.",
        metadata={"source": "mammal-pets-doc"},
    ),
    Document(
        page_content="Goldfish are popular pets for beginners, requiring relatively simple care.",
        metadata={"source": "fish-pets-doc"},
    ),
    Document(
        page_content="Parrots are intelligent birds capable of mimicking human speech.",
        metadata={"source": "bird-pets-doc"},
    ),
    Document(
        page_content="Rabbits are social animals that need plenty of space to hop around.",
        metadata={"source": "mammal-pets-doc"},
    ),
]

API Reference: Document

Here we created five documents, each with metadata pointing to one of three distinct "sources".

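Vector stores

Vector search is a common way to store and search over unstructured data such as text.

The idea is to convert text into numeric vectors and store them. Given a query, the system converts it into a vector in the same way and then uses a vector similarity metric to identify the most relevant content in the store.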
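The sketch below walks through that idea end to end in plain Python. It is a toy under stated assumptions: the embed function is a simple word-count vector standing in for a real embedding model, and search, stored, and vocabulary are hypothetical names, not LangChain APIs.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().replace(".", "").replace(",", "").split()


def embed(text, vocabulary):
    """Toy embedding: count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


texts = [
    "Cats are independent pets.",
    "Goldfish are popular pets for beginners.",
    "Parrots are intelligent birds.",
]
vocabulary = sorted({word for text in texts for word in tokenize(text)})
# "Storing" means keeping each text next to its vector.
stored = [(text, embed(text, vocabulary)) for text in texts]


def search(query, k=1):
    """Embed the query the same way, then rank stored texts by similarity."""
    query_vector = embed(query, vocabulary)
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


print(search("intelligent birds"))
```

A real embedding model captures meaning rather than exact word overlap, but the store-then-compare flow is the same.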

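LangChain VectorStore objects provide methods for adding text and Document objects to the store and for querying them with various similarity metrics. They are usually initialized with an embedding model, which determines how text is translated into numeric vectors.

LangChain includes integrations with many vector store technologies:

Some are hosted by a cloud provider and require specific credentials to use
Some (such as Postgres) run on separate infrastructure that can be operated locally or through a third party
Others can run in-memory, which suits lightweight workloads

In this tutorial we demonstrate LangChain vector stores using Chroma, which includes an in-memory implementation that makes it easy to get started.

To instantiate a vector store, we usually need to provide an embedding model that specifies how text is converted into numeric vectors. Here we use the OpenAIEmbeddings class, which the code below points at an OpenAI-compatible endpoint.

Tip

If you cannot use OpenAI embeddings, LangChain supports other embedding models as well.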

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Use an embedding model you have access to
api_key = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
model_name = "BAAI/bge-large-zh-v1.5"
base_url = "https://api.siliconflow.cn/v1/"

vectorstore = Chroma.from_documents(
    documents,
    embedding=OpenAIEmbeddings(
        model=model_name,
        openai_api_key=api_key,
        openai_api_base=base_url,
    ),
)

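API Reference: OpenAIEmbeddings

Calling .from_documents here adds the documents to the vector store.

VectorStore also implements methods for adding documents, which can be called after the object is instantiated.

Most implementations let you connect to an existing vector store, for example by providing a client, an index name, or other information. See the documentation for a specific integration for details.

Once we have instantiated a VectorStore that contains documents, we can query it. VectorStore includes methods for querying:

Synchronously and asynchronously
By string query and by vector
With and without returning similarity scores
By similarity and by maximum marginal relevance (which balances similarity to the query against diversity in the retrieved results)

The output of these methods generally includes a list of Document objects.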
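Maximum marginal relevance is the least obvious item in that list, so here is a minimal sketch of the selection rule in plain Python. It is an illustration of the general technique, not LangChain's or Chroma's implementation; the function name mmr, the parameter name lambda_mult, and the two-dimensional vectors are assumptions made for the example.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Return indices of k candidates, trading relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, candidates[i])
            # Penalize candidates that resemble something already selected.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.1],   # very relevant
    [1.0, 0.11],  # near-duplicate of the first candidate
    [0.6, -0.8],  # less relevant, but different
]
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # [0, 2]
print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [0, 1]
```

With lambda_mult set to 1.0 the rule reduces to plain similarity ranking and returns the two near-duplicates; lowering it swaps the duplicate for the more distinct candidate.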

Examples

Return documents based on similarity to a string query:

vectorstore.similarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Async query:

await vectorstore.asimilarity_search("cat")

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Return scores:

# Note that providers implement different scores; Chroma here
# returns a distance metric that should vary inversely with
# similarity.

vectorstore.similarity_search_with_score("cat")

[(Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
  0.3751849830150604),
 (Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
  0.48316916823387146),
 (Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
  0.49601367115974426),
 (Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'}),
  0.4972994923591614)]
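The scores above are distances, so smaller means more similar. Which distance a store reports depends on how it is configured; the sketch below assumes unit-length vectors and squared Euclidean distance, a case where the relationship to cosine similarity is exact (distance = 2 - 2 * similarity). The vectors and names are made up for illustration.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    # Inputs are unit vectors here, so the dot product equals the cosine.
    return sum(x * y for x, y in zip(a, b))


def squared_l2_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


query = normalize([1.0, 0.0])
docs = {
    "close": normalize([0.9, 0.1]),
    "medium": normalize([0.5, 0.5]),
    "far": normalize([0.0, 1.0]),
}

for name, vec in docs.items():
    sim = cosine_similarity(query, vec)
    dist = squared_l2_distance(query, vec)
    print(f"{name}: similarity={sim:.3f} distance={dist:.3f}")

# Sorting by ascending distance gives the same order as descending similarity.
by_distance = sorted(docs, key=lambda n: squared_l2_distance(query, docs[n]))
by_similarity = sorted(docs, key=lambda n: cosine_similarity(query, docs[n]), reverse=True)
```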

Return documents based on similarity to an embedded query:

embedding = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=api_key,
    openai_api_base=base_url,
).embed_query("cat")

vectorstore.similarity_search_by_vector(embedding)

[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Dogs are great companions, known for their loyalty and friendliness.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Rabbits are social animals that need plenty of space to hop around.', metadata={'source': 'mammal-pets-doc'}),
 Document(page_content='Parrots are intelligent birds capable of mimicking human speech.', metadata={'source': 'bird-pets-doc'})]

Retrievers

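LangChain VectorStore objects do not subclass Runnable, so they cannot be plugged directly into LangChain Expression Language (LCEL) chains.

LangChain Retrievers are Runnables, so they implement a standard set of methods (for example, synchronous and asynchronous invoke and batch operations) and are designed to be incorporated into LCEL chains.

We can create a simple version of this ourselves without subclassing Retriever. Once we choose which method to use for retrieving documents, we can easily create a runnable. Below we build one around the similarity_search method: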

from langchain_core.documents import Document
from langchain_core.runnables import RunnableLambda

retriever = RunnableLambda(vectorstore.similarity_search).bind(k=1)  # select top result

retriever.batch(["cat", "shark"])

API Reference: Document | RunnableLambda

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]
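As a rough mental model of the two calls used above, bind fixes a keyword argument ahead of time and batch applies the same callable to each input. The plain-Python analogy below uses functools.partial and a list comprehension; it is only an analogy, since real Runnables also handle configuration, callbacks, and concurrency. The similarity_search function here is a hypothetical stand-in, not the vector store method.

```python
from functools import partial


def similarity_search(query, k=4):
    """Stand-in for a vector store search: returns k fake hits for the query."""
    return [f"{query}-hit-{i}" for i in range(k)]


# Roughly what .bind(k=1) does: fix a keyword argument ahead of time.
top_one = partial(similarity_search, k=1)


def batch(func, inputs):
    """Roughly what .batch() does: apply the same callable to each input."""
    return [func(item) for item in inputs]


results = batch(top_one, ["cat", "shark"])
print(results)  # [['cat-hit-0'], ['shark-hit-0']]
```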

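Vector stores implement an as_retriever method that generates a Retriever, specifically a VectorStoreRetriever. These retrievers include search_type and search_kwargs attributes that identify which method of the underlying vector store to call and how to parameterize it. For instance, we can replicate the above with the following: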

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 1},
)

retriever.batch(["cat", "shark"])

[[Document(page_content='Cats are independent pets that often enjoy their own space.', metadata={'source': 'mammal-pets-doc'})],
 [Document(page_content='Goldfish are popular pets for beginners, requiring relatively simple care.', metadata={'source': 'fish-pets-doc'})]]

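VectorStoreRetriever supports the search types "similarity" (the default), "mmr" (maximum marginal relevance, described above), and "similarity_score_threshold". The last of these lets us threshold the documents returned by the retriever by similarity score.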
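Conceptually, score thresholding is a filter over scored results. The sketch below shows that filter in plain Python, assuming relevance scores where higher means more similar; distance-style scores such as the ones Chroma returned earlier would need the comparison flipped or a conversion first. The scores and the helper name filter_by_score are invented for illustration.

```python
def filter_by_score(scored_docs, score_threshold):
    """Keep (doc, score) pairs whose relevance score meets the threshold."""
    return [(doc, score) for doc, score in scored_docs if score >= score_threshold]


# Hypothetical relevance scores, higher is better.
scored = [
    ("Cats are independent pets.", 0.82),
    ("Dogs are great companions.", 0.55),
    ("Goldfish are popular pets.", 0.31),
]
kept = filter_by_score(scored, score_threshold=0.5)
print(kept)
```

Unlike a fixed k, a threshold can return fewer results, or none at all, when nothing in the store is close enough to the query.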

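Retrievers can easily be incorporated into more complex applications, such as retrieval-augmented generation (RAG) applications that combine a given question with retrieved context into a prompt for an LLM. Below we show a minimal example.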

pip install -qU langchain-openai

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url=base_url,
    api_key=api_key,
    model="Pro/deepseek-ai/DeepSeek-V3",
)

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

message = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""

prompt = ChatPromptTemplate.from_messages([("human", message)])

rag_chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm

API Reference: ChatPromptTemplate | RunnablePassthrough

response = rag_chain.invoke("tell me about cats")

print(response.content)

Cats are independent pets that often enjoy their own space.
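To see what reaches the model in a chain like the one above, the following sketch reproduces only the data flow in plain Python: the question is sent to the retriever and also passed through unchanged, and both values fill the prompt template. The LLM call is left out, and toy_retriever and build_prompt are hypothetical stand-ins rather than LangChain components.

```python
TEMPLATE = """
Answer this question using the provided context only.

{question}

Context:
{context}
"""


def toy_retriever(question):
    """Hypothetical stand-in for the retriever: returns a fixed context list."""
    return ["Cats are independent pets that often enjoy their own space."]


def build_prompt(question):
    # Step 1: the same input feeds both branches of the dict.
    inputs = {"context": toy_retriever(question), "question": question}
    # Step 2: the template is filled from that dict; the list is rendered as text.
    return TEMPLATE.format(**inputs)


prompt_text = build_prompt("tell me about cats")
print(prompt_text)
```

Printing the filled prompt like this is a quick way to check what context the model was actually given when an answer looks wrong.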

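Retrieval strategies can be rich and complex. For example:

We can infer hard rules and filters from a query (e.g., "use documents published after 2020");
We can return documents that are linked to the retrieved context in some way (e.g., via a document taxonomy);
We can generate multiple embeddings for each unit of context;
We can ensemble results from multiple retrievers;
We can assign weights to documents, for example to weight recent documents higher.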
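As one illustration of combining results from multiple retrievers, the sketch below merges two ranked lists with reciprocal rank fusion, a common rank-merging technique. It is a generic example rather than a description of any particular LangChain retriever; the function name, the constant c=60, and the sample hit lists are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, weights=None, c=60):
    """Merge ranked lists: each hit earns weight / (c + rank) from every list."""
    weights = weights or [1.0] * len(result_lists)
    scores = defaultdict(float)
    for results, weight in zip(result_lists, weights):
        for rank, doc in enumerate(results, start=1):
            scores[doc] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical ranked results from two different retrievers.
keyword_hits = ["cats", "dogs", "rabbits"]
vector_hits = ["dogs", "parrots", "cats"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['dogs', 'cats', 'parrots', 'rabbits']
```

Documents that appear high in both lists rise to the top, and the optional weights let one retriever count for more than another.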

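The retrievers section of the how-to guides covers these and other built-in retrieval strategies.

It is also straightforward to extend the BaseRetriever class to implement a custom retriever. See our how-to guide for details.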