kg-gen: an AI tool for extracting knowledge graphs from any text

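Project overview

Welcome! kg-gen uses AI to help you extract knowledge graphs from any plain text. It handles both small and large text inputs, as well as messages in conversation format.
Why generate knowledge graphs? kg-gen is useful if you want to:

- Create a graph to assist RAG (retrieval-augmented generation)
- Create synthetic graph data for model training and testing
- Structure any text as a graph
- Analyze the relationships between concepts in a source text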

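Through LiteLLM, kg-gen supports both API-based and local model providers, including OpenAI, Ollama, Anthropic, Gemini, Deepseek, and others. It also uses DSPy for structured output generation.
To try it out, run the scripts in tests/.
Instructions for running MINE, our KG benchmark, are in MINE/.
Read the paper: KGGen: Extracting Knowledge Graphs from Plain Text with Language Models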
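
The default model used later in this article, "openai/gpt-4o", follows a "provider/model" naming pattern. The sketch below only illustrates that pattern with a hypothetical helper, `split_model_id`, which is not part of kg-gen or LiteLLM. The identifier "ollama/llama3" is an illustrative assumption, not a value taken from this article.

```python
def split_model_id(model_id: str):
    """Split a 'provider/model' identifier at its first slash.

    Hypothetical helper for illustration; not part of kg-gen or LiteLLM.
    """
    provider, sep, model = model_id.partition("/")
    if not sep or not provider or not model:
        raise ValueError(f"expected 'provider/model', got {model_id!r}")
    return provider, model


print(split_model_id("openai/gpt-4o"))   # ('openai', 'gpt-4o')
print(split_model_id("ollama/llama3"))   # illustrative identifier only
```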

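Quick start

Install the module:

pip install kg-gen

Then import and use kg-gen. You can provide your text input in one of two formats:

- A single string
- A list of message objects (each with a role and content)
Here are some example snippets: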

from kg_gen import KGGen

# Initialize KGGen with optional configuration
kg = KGGen(
    model="openai/gpt-4o",  # Default model
    temperature=0.0,        # Default temperature
    api_key="YOUR_API_KEY"  # Optional if set in environment
)
# EXAMPLE 1: Single string with context
text_input = "Linda is Josh's mother. Ben is Josh's brother. Andrew is Josh's father."
graph_1 = kg.generate(
    input_data=text_input,
    context="Family relationships"
)
# Output:
# entities={'Linda', 'Ben', 'Andrew', 'Josh'}
# edges={'is brother of', 'is father of', 'is mother of'}
# relations={('Ben', 'is brother of', 'Josh'),
#            ('Andrew', 'is father of', 'Josh'),
#            ('Linda', 'is mother of', 'Josh')}
# EXAMPLE 2: Large text with chunking and clustering
with open('large_text.txt', 'r') as f:
    large_text = f.read()

# Example input text:
# """
# Neural networks are a type of machine learning model. Deep learning is a subset of machine learning
# that uses multiple layers of neural networks. Supervised learning requires training data to learn
# patterns. Machine learning is a type of AI technology that enables computers to learn from data.
# AI, also known as artificial intelligence, is related to the broader field of artificial intelligence.
# Neural nets (NN) are commonly used in ML applications. Machine learning (ML) has revolutionized
# many fields of study.
# ...
# """

graph_2 = kg.generate(
    input_data=large_text,
    chunk_size=5000,  # Process text in chunks of 5000 chars
    cluster=True      # Cluster similar entities and relations
)
# Output:
# entities={'neural networks', 'deep learning', 'machine learning', 'AI', 'artificial intelligence',
#           'supervised learning', 'unsupervised learning', 'training data', ...}
# edges={'is type of', 'requires', 'is subset of', 'uses', 'is related to', ...}
# relations={('neural networks', 'is type of', 'machine learning'),
#            ('deep learning', 'is subset of', 'machine learning'),
#            ('supervised learning', 'requires', 'training data'),
#            ('machine learning', 'is type of', 'AI'),
#            ('AI', 'is related to', 'artificial intelligence'), ...}
# entity_clusters={
#   'artificial intelligence': {'AI', 'artificial intelligence'},
#   'machine learning': {'machine learning', 'ML'},
#   'neural networks': {'neural networks', 'neural nets', 'NN'}
#   ...
# }
# edge_clusters={
#   'is type of': {'is type of', 'is a type of', 'is a kind of'},
#   'is related to': {'is related to', 'is connected to', 'is associated with'},
#   ...
# }
# EXAMPLE 3: Messages array
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
]
graph_3 = kg.generate(input_data=messages)
# Output:
# entities={'Paris', 'France'}
# edges={'has capital'}
# relations={('France', 'has capital', 'Paris')}
# EXAMPLE 4: Combining multiple graphs
text1 = "Linda is Joe's mother. Ben is Joe's brother."

# Second input text; note that Joseph also goes by Joe.
text2 = "Andrew is Joseph's father. Judy is Andrew's sister. Joseph also goes by Joe."

graph4_a = kg.generate(input_data=text1)
graph4_b = kg.generate(input_data=text2)

# Combine the graphs
combined_graph = kg.aggregate([graph4_a, graph4_b])

# Optionally cluster the combined graph
clustered_graph = kg.cluster(
    combined_graph,
    context="Family relationships"
)
# Output:
# entities={'Linda', 'Ben', 'Andrew', 'Joe', 'Joseph', 'Judy'}
# edges={'is mother of', 'is father of', 'is brother of', 'is sister of'}
# relations={('Linda', 'is mother of', 'Joe'),
#            ('Ben', 'is brother of', 'Joe'),
#            ('Andrew', 'is father of', 'Joe'),
#            ('Judy', 'is sister of', 'Andrew')}
# entity_clusters={
#   'Joe': {'Joe', 'Joseph'},
#   ...
# }
# edge_clusters={ ... }
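
Once a graph has been generated, its relations can be used like any other set of (subject, predicate, object) triples. The following is a minimal sketch that assumes the triples are available as a plain Python set, as in the printed output of Example 1. How the real graph object exposes them is not shown in this article, and `build_adjacency` is a hypothetical helper.

```python
from collections import defaultdict

# Triples copied from the Example 1 output.
relations = {
    ("Ben", "is brother of", "Josh"),
    ("Andrew", "is father of", "Josh"),
    ("Linda", "is mother of", "Josh"),
}


def build_adjacency(relations):
    """Map each subject to a sorted list of (predicate, object) pairs."""
    adjacency = defaultdict(list)
    for subject, predicate, obj in relations:
        adjacency[subject].append((predicate, obj))
    return {subject: sorted(pairs) for subject, pairs in adjacency.items()}


adjacency = build_adjacency(relations)
print(adjacency["Linda"])  # [('is mother of', 'Josh')]
```

An adjacency map like this is one convenient shape for lookups such as "what do we know about Linda?", for instance when feeding graph facts into a RAG prompt.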

Features

Chunking large texts
For long texts, you can specify a `chunk_size` parameter to process the text in chunks:

graph = kg.generate(
    input_data=large_text,
    chunk_size=5000  # Process in chunks of 5000 characters
)
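
To make the idea of character-based chunking concrete, here is a minimal sketch that greedily packs whole sentences into chunks of at most `chunk_size` characters. It is an assumption about how such chunking could work, not kg-gen's actual implementation, and `chunk_text` is a hypothetical helper.

```python
import re


def chunk_text(text: str, chunk_size: int):
    """Greedily pack whole sentences into chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than chunk_size is split by raw length.
        while len(sentence) > chunk_size:
            chunks.append(sentence[:chunk_size])
            sentence = sentence[chunk_size:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks


text = "Linda is Josh's mother. Ben is Josh's brother. Andrew is Josh's father."
chunks = chunk_text(text, chunk_size=50)
print(len(chunks))  # 2
```

Splitting at sentence boundaries keeps each statement intact, which matters because a relation cut in half across two chunks cannot be extracted from either.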

Clustering similar entities and relations
You can cluster similar entities and relations, either during generation or afterwards:

# During generation
graph = kg.generate(
    input_data=text,
    cluster=True,
    context="Optional context to guide clustering"
)

# Or after generation
clustered_graph = kg.cluster(
    graph,
    context="Optional context to guide clustering"
)
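
The cluster maps shown in the Quick start output pair a representative name with the set of aliases it covers. The sketch below shows one thing such maps make possible: rewriting every triple to use the representative names so that duplicates collapse. Whether kg-gen itself rewrites relations this way is not stated in this article; `canonicalize` is a hypothetical helper and the sample triples are made up for illustration.

```python
def canonicalize(relations, entity_clusters, edge_clusters):
    """Replace every alias in each triple with its cluster representative."""
    entity_alias = {
        alias: rep for rep, aliases in entity_clusters.items() for alias in aliases
    }
    edge_alias = {
        alias: rep for rep, aliases in edge_clusters.items() for alias in aliases
    }
    return {
        (entity_alias.get(s, s), edge_alias.get(p, p), entity_alias.get(o, o))
        for s, p, o in relations
    }


# Illustrative triples that use several aliases for the same concepts.
relations = {
    ("NN", "is a type of", "ML"),
    ("neural networks", "is type of", "machine learning"),
    ("machine learning", "is type of", "AI"),
}
entity_clusters = {
    "artificial intelligence": {"AI", "artificial intelligence"},
    "machine learning": {"machine learning", "ML"},
    "neural networks": {"neural networks", "neural nets", "NN"},
}
edge_clusters = {"is type of": {"is type of", "is a type of", "is a kind of"}}

merged = canonicalize(relations, entity_clusters, edge_clusters)
print(len(merged))  # 2
```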

Aggregating multiple graphs
You can combine multiple graphs with the aggregate method:

graph1 = kg.generate(input_data=text1)
graph2 = kg.generate(input_data=text2)
combined_graph = kg.aggregate([graph1, graph2])
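
Conceptually, combining graphs amounts to taking the union of their entities, edges, and relations. The sketch below illustrates this with `SimpleGraph`, a hypothetical stand-in that is not kg-gen's own graph class, and with hand-written sample graphs rather than real model output. It is an assumption about what aggregation means, not the library's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleGraph:
    """Hypothetical stand-in for a generated graph; not kg-gen's own class."""
    entities: set = field(default_factory=set)
    edges: set = field(default_factory=set)
    relations: set = field(default_factory=set)


def aggregate(graphs):
    """Union the entities, edges, and relations of several graphs."""
    combined = SimpleGraph()
    for graph in graphs:
        combined.entities |= graph.entities
        combined.edges |= graph.edges
        combined.relations |= graph.relations
    return combined


graph_a = SimpleGraph(
    entities={"Linda", "Ben", "Joe"},
    edges={"is mother of", "is brother of"},
    relations={("Linda", "is mother of", "Joe"), ("Ben", "is brother of", "Joe")},
)
graph_b = SimpleGraph(
    entities={"Andrew", "Joseph", "Judy"},
    edges={"is father of", "is sister of"},
    relations={("Andrew", "is father of", "Joseph"), ("Judy", "is sister of", "Andrew")},
)

combined = aggregate([graph_a, graph_b])
print(sorted(combined.entities))
# ['Andrew', 'Ben', 'Joe', 'Joseph', 'Judy', 'Linda']
```

Note that a plain union keeps "Joe" and "Joseph" as separate entities, which is why a clustering pass over the combined graph is useful afterwards.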

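Processing message arrays
When processing a message array, kg-gen:

- Preserves the role information of each message
- Maintains message order and boundaries
- Can extract entities and relations:
  - between concepts mentioned in the messages
  - between speakers (roles) and concepts
  - across multiple messages in the conversation

For example, given this conversation: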

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."}
]

The generated graph might include entities such as:

- "user"
- "assistant"
- "France"
- "Paris"

and relations such as:

- ("user", "asks about", "France")
- ("assistant", "states", "Paris")
- ("Paris", "is capital of", "France")
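
One plausible way to keep role information and message order when a conversation is handed to a text-only extraction step is to render each message as a "role: content" line. The sketch below shows that idea; it is an assumption for illustration, not a description of kg-gen's internals, and `messages_to_text` is a hypothetical helper.

```python
def messages_to_text(messages):
    """Render a list of {'role', 'content'} dicts as one 'role: content' line each."""
    lines = []
    for index, message in enumerate(messages):
        missing = {"role", "content"} - message.keys()
        if missing:
            raise ValueError(f"message {index} is missing keys: {sorted(missing)}")
        lines.append(f"{message['role']}: {message['content']}")
    # Joining in list order preserves both message order and boundaries.
    return "\n".join(lines)


messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
conversation = messages_to_text(messages)
print(conversation)
```

Keeping the role as a prefix on every line is what lets speaker-to-concept relations such as ("user", "asks about", "France") be recovered from the text.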

API reference

The KGGen class

Constructor parameters

model: str = "openai/gpt-4o" – the model used for generation
temperature: float = 0.0 – the sampling temperature for the model
api_key: Optional[str] = None – the API key for model access

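generate() method parameters

input_data: Union[str, List[Dict]] – a text string or a list of message dicts
model: Optional[str] – overrides the default model
api_key: Optional[str] – overrides the default API key
context: str = "" – a description of the data context
chunk_size: Optional[int] – the size of the text chunks to process
cluster: bool = False – whether to cluster the graph after generation
temperature: Optional[float] – overrides the default temperature
output_folder: optional path for saving partial progress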

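cluster() method parameters

graph – the graph to cluster
context: str = "" – a description of the data context
model: Optional[str] – overrides the default model
temperature: Optional[float] – overrides the default temperature
api_key: Optional[str] – overrides the default API key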

aggregate() method parameters

graphs: list of graphs – the graphs to combine

Project link

http://github.com/stair-lab/kg-gen


(By GitHubStore)
