Building AI agents that can reliably carry out complex, multi-step tasks has become one of the hottest frontiers in LLM applications. With existing orchestration frameworks, developers can stand up a working prototype quickly. Frustratingly, though, these agents often behave like an unreliable intern in real-world settings: they get confused by slightly altered instructions, make frequent mistakes, and struggle to learn from their failures. This unreliability is the main roadblock keeping today's AI agent applications from reaching production.
The Challenges of Training AI Agents
An ideal AI agent should behave like an experienced employee: planning on its own, using tools, and learning from outcomes. Yet when applied to real agent workloads, existing RL training frameworks expose three core limitations:
- Poor support for multi-turn interaction: most existing frameworks focus on single-turn exchanges (input -> output -> score). That works for tasks like solving math problems, but it cannot properly support the multi-turn workflows agents naturally require, with multi-step decisions and continuous interaction with an environment.
- Low GPU utilization: during the rollout phase of agent training, the model often has to wait for real-world actions to finish (browsing a web page, calling an API). While it waits, expensive GPUs sit idle, dragging down overall training efficiency and driving up cost.
- Hard to integrate with existing codebases: real agents are usually embedded deep inside complex business logic and orchestration frameworks such as CrewAI or Mastra. Existing RL training pipelines force developers into heavy refactoring, are far from plug-and-play, and carry a high integration cost.
Introducing ART
To address these problems head-on, an open-source reinforcement learning (RL) framework called Agent Reinforcement Trainer (ART) was created. It offers a new paradigm for the problem: on-the-job training for AI agents.

The project targets exactly these limitations of existing agent training frameworks, and ART resolves them through two architectural innovations:
- Decoupled frontend and backend: ART splits the training loop in two. The frontend defines the agent's behavior and reward computation; it can run in the developer's local environment and plug straight into existing tools and codebases. The backend handles LLM inference and training and can be deployed on any cloud GPU server. This decoupling removes the integration and environment-dependency problems.
- An OpenAI-compatible inference endpoint: to maximize ecosystem compatibility, ART exposes an industry-standard, OpenAI-compatible chat API endpoint during training. Developers can wire ART in with almost no changes to their existing inference code, whether they use the native OpenAI SDK or a wrapper such as LiteLLM (see the sketch right after this list).
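Because the training-time endpoint speaks the standard OpenAI chat protocol, existing agent code can usually point at ART with almost no changes. A minimal sketch, reusing the `TrainableModel`, `LocalBackend`, and `openai_client()` calls that appear in the full example later in this article; the model and project names below are placeholders:

import art
from art.local import LocalBackend

# Declare a trainable model backed by an open-weights base model.
model = art.TrainableModel(
    name="my-agent",                        # placeholder name
    project="my-project",                   # placeholder project
    base_model="Qwen/Qwen2.5-3B-Instruct",
)

# Register it with a backend (local here; a remote GPU backend looks the same
# from the agent's point of view). Notebook-style top-level await, as in the
# full example below.
backend = LocalBackend(path="./.art")
await model.register(backend)

# The agent keeps using the plain OpenAI SDK; ART hands back a client that
# already targets its OpenAI-compatible inference endpoint.
client = model.openai_client()
completion = await client.chat.completions.create(
    model=model.name,
    messages=[{"role": "user", "content": "hello"}],
)
print(completion.choices[0].message.content)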
Under the hood, ART builds on best-in-class tooling such as vLLM (inference) and TRL and Unsloth (training). Optimizations like offloading vLLM's KV cache to CPU memory during training further reduce GPU memory usage, enough that even a 7B model can be trained comfortably on Google Colab's free tier.
Usage
The simplified workflow for optimizing an agent with ART: user code runs the agent in its environment and scores its performance, and ART uses those trajectories and scores to iteratively train and improve the agent.
Below is a simple end-to-end ART example (the 2048 game).
from dotenv import load_dotenv
import random
from typing import TypedDict
from typing import Literal
import string
import xml.etree.ElementTree as ET
load_dotenv()
WINNING_VALUE = 128
# Class that keeps track of state for a single game of 2048
class TwentyFortyEightGame(TypedDict):
id: str
board: list[list[int | None]]
# Randomly populates a cell on the board with a 2 or 4
def populate_random_cell(game: TwentyFortyEightGame) -> None:
all_clear_coordinates = [
(i, j)
for i in range(len(game["board"]))
for j in range(len(game["board"][i]))
if game["board"][i][j] is None
]
random_clear_coordinates = random.choice(all_clear_coordinates)
# 90% chance to populate a 2, 10% chance to populate a 4
game["board"][random_clear_coordinates[0]][random_clear_coordinates[1]] = (
2 if random.random() < 0.9 else 4
)
# Generates a new game of 2048
def generate_game(board_length: int = 4) -> TwentyFortyEightGame:
# random 6 character string
id = "".join(random.choices(string.ascii_letters + string.digits, k=6))
game = {
"id": id,
"board": [[None for _ in range(board_length)] for _ in range(board_length)],
}
# populate two random cells
populate_random_cell(game)
populate_random_cell(game)
return game
# Renders the board in a human-readable format
def render_board(game: TwentyFortyEightGame) -> str:
board = game["board"]
# print something like this:
# _ | 2 | _ | 4
# 4 | 8 | 2 | 16
# 16 | 32 | 64 | 128
# _ | 2 | 2 | 4
# where _ is an empty cell
max_cell_width = max(
[len(str(cell)) for row in board for cell in row if cell is not None]
)
board_str = ""
for row in board:
# pad the cells with spaces to make them the same width
board_str += "|".join(
[
str(cell).rjust(max_cell_width)
if cell is not None
else"_".rjust(max_cell_width)
for cell in row
]
)
board_str += "\n"
return board_str
# condense, privileging matches at the start of the sequence
# sequences should be passed starting with cells that are the furthest in the direction in which the board is being condensed
def condense_sequence(sequence: list[int | None]) -> list[int | None]:
condensed_sequence = []
gapless_sequence = [cell for cell in sequence if cell is not None]
i = 0
while i < len(gapless_sequence):
if (
i + 1 < len(gapless_sequence)
and gapless_sequence[i] == gapless_sequence[i + 1]
):
condensed_sequence.append(gapless_sequence[i] * 2)
i += 2
else:
condensed_sequence.append(gapless_sequence[i])
i += 1
# pad the sequence with None at the end
return condensed_sequence + [None] * (4 - len(condensed_sequence))
# Condenses the board in a given direction
def condense_board(
game: TwentyFortyEightGame, direction: Literal["left", "right", "up", "down"]
) -> None:
if direction == "left":
for row in game["board"]:
condensed_row = condense_sequence(row)
for i in range(len(row)):
row[i] = condensed_row[i]
if direction == "right":
for row in game["board"]:
reversed_row = row[::-1]
# reverse the row before and after condensing
condensed_row = condense_sequence(reversed_row)[::-1]
for i in range(len(row)):
row[i] = condensed_row[i]
if direction == "up":
for col_index in range(len(game["board"][0])):
column = [row[col_index] for row in game["board"]]
condensed_column = condense_sequence(column)
for row_index in range(len(column)):
game["board"][row_index][col_index] = condensed_column[row_index]
if direction == "down":
for col_index in range(len(game["board"][0])):
column = [row[col_index] for row in game["board"]]
reversed_column = column[::-1]
condensed_column = condense_sequence(reversed_column)[::-1]
for row_index in range(len(column)):
game["board"][row_index][col_index] = condensed_column[row_index]
# Applies an agent move to the game board
def apply_agent_move(game: TwentyFortyEightGame, move_xml: str) -> None:
direction = None
# parse the move
try:
root = ET.fromstring(move_xml)
direction = root.text
except Exception:
raise ValueError("Invalid xml")
if direction not in ["left", "right", "up", "down"]:
raise ValueError("Invalid direction")
condense_board(game, direction)
populate_random_cell(game)
# Returns the maximum cell value on the board
def max_cell_value(game: TwentyFortyEightGame) -> int:
return max([cell for row in game["board"] for cell in row if cell is not None])
# Returns True if the game is finished
def check_game_finished(game: TwentyFortyEightGame) -> bool:
if max_cell_value(game) >= WINNING_VALUE:
return True
# check if any cell is empty
if any(cell is None for row in game["board"] for cell in row):
return False
return True
# Returns the sum of all the cell values on the board
def total_board_value(game: TwentyFortyEightGame) -> int:
return sum([cell for row in game["board"] for cell in row if cell is not None])
import art
from art.local import LocalBackend
from dotenv import load_dotenv
load_dotenv()
random.seed(42)
# Declare the model
model = art.TrainableModel(
name="agent-002",
project="2048-multi-turn",
base_model="Qwen/Qwen2.5-3B-Instruct",
)
# To run on a T4, we need to override some config defaults.
model._internal_config = art.dev.InternalModelConfig(
init_args=art.dev.InitArgs(
max_seq_length=8192,
),
engine_args=art.dev.EngineArgs(
enforce_eager=True,
gpu_memory_utilization=0.8,
),
)
# Initialize the server
backend = LocalBackend(
# Normally we don't want to run the server in-process, but for the output
# to show up properly on Google Colab we'll enable this.
in_process=True,
path="./.art",
)
# Register the model with the local Backend (sets up logging, inference, and training)
await model.register(backend)
import art
import math
import os
import requests
from pydantic import BaseModel
import weave
if os.getenv("WANDB_API_KEY", ""):
weave.init(model.project, settings={"print_call_link": False})
class Scenario2048(BaseModel):
step: int
@weave.op
@art.retry(exceptions=(requests.ReadTimeout,))
async def rollout(model: art.Model, scenario: Scenario2048) -> art.Trajectory:
client = model.openai_client()
game = generate_game()
move_number = 0
trajectory = art.Trajectory(
messages_and_choices=[
{
"role": "system",
"content": "You are an excellent 2048 player. Always choose the move most likely to lead to combine cells to eventually reach the number 2048. Optional moves are 'left', 'right', 'up', 'down'. Return your move as an XML object with a single property 'move', like so: <move>left</move>",
}
],
metadata={
"game_id": game["id"],
"notebook-id": "2048",
"step": scenario.step,
},
reward=0,
)
while True:
trajectory.messages_and_choices.append(
{"role": "user", "content": render_board(game)}
)
try:
messages = trajectory.messages()
chat_completion = await client.chat.completions.create(
max_completion_tokens=128,
messages=messages,
model=model.name,
stream=False,
)
except Exception as e:
print("caught exception generating chat completion", e)
raise e
choice = chat_completion.choices[0]
content = choice.message.content
assert isinstance(content, str)
trajectory.messages_and_choices.append(choice)
try:
apply_agent_move(game, content)
move_number += 1
except ValueError:
trajectory.reward = -1
break
if check_game_finished(game):
max_value = max_cell_value(game)
board_value = total_board_value(game)
trajectory.metrics["max_value"] = max_value
trajectory.metrics["board_value"] = board_value
trajectory.metrics["move_number"] = move_number
# try to get as close to the winning value as possible
# otherwise, try to maximize number of high cells on board
# but above all else: WIN THE GAME!
if max_value < WINNING_VALUE:
# scale max value logarithmically between 0 for 2 and 1 for WINNING_VALUE
max_value_reward = (math.log(max_value, 2) - 1) / (
math.log(WINNING_VALUE, 2) - 1
)
# scale board value logarithmically between 0 for 2 * 16 and 1 for WINNING_VALUE * 16
board_value_reward = (math.log(board_value, 2) - 1) / (
math.log(WINNING_VALUE * 16, 2) - 1
)
# combine the two rewards, with max value having a higher weight
trajectory.reward = max_value_reward + (board_value_reward * 0.2)
else:
# double reward if the agent wins
trajectory.reward = 2
break
return trajectory
for i in range(await model.get_step(), 10):
train_groups = await art.gather_trajectory_groups(
(
art.TrajectoryGroup(rollout(model, Scenario2048(step=i)) for _ in range(18))
for _ in range(1)
),
pbar_desc="gather",
max_exceptions=18,
)
await model.delete_checkpoints()
await model.train(
train_groups,
config=art.TrainConfig(learning_rate=1e-5),
# Lowering the logprob_calculation_chunk_size is a memory saving measure
# to allow longer sequences (up to 8192 tokens) to be processed on a T4.
_config={"logprob_calculation_chunk_size": 8},
)
Load the model and run an evaluation:
import torch
from unsloth import FastLanguageModel
# example: .art/2048-multi-turn/models/001/0003
lora_model_path = (
f".art/{model.project}/models/{model.name}/{await model.get_step():04d}"
)
print(f"loading model from {lora_model_path}\n")
peft_model, tokenizer = FastLanguageModel.from_pretrained(
model_name=lora_model_path,
max_seq_length=16384,
dtype=torch.bfloat16,
load_in_4bit=True,
)
FastLanguageModel.for_inference(peft_model)
game = generate_game()
move_number = 0
messages = [
{
"role": "system",
"content": "You are an excellent 2048 player. Always choose the move most likely to lead to combine cells to eventually reach the number 2048. Optional moves are 'left', 'right', 'up', 'down'. Return your move as an XML object with a single property 'move', like so: <move>left</move>",
},
]
while not check_game_finished(game):
rendered_board = render_board(game)
messages.append({"role": "user", "content": rendered_board})
inputs = tokenizer.apply_chat_template(
messages, return_tensors="pt", add_generation_prompt=True
).to("cuda")
content = ""
def get_completion() -> str:
with torch.no_grad():
outputs = peft_model.generate(
input_ids=inputs,
max_new_tokens=100,
do_sample=True,
temperature=0.7,
top_p=0.9,
)
return tokenizer.decode(
outputs[0][inputs.shape[1] :], skip_special_tokens=True
)
try:
content = get_completion()
except Exception as e:
print("caught exception generating chat completion", e)
raise e
messages.append({"role": "assistant", "content": content})
try:
apply_agent_move(game, content)
move_number += 1
except ValueError:
raise ValueError(f"Invalid move on move {move_number}: {content}")
# print the board every 10 moves
if move_number % 10 == 0:
print(f"\nmove {move_number}")
print(f"board:\n{rendered_board}")
print(f"agent move: {content}")
print(f"updated board:\n{render_board(game)}")
print(f"game finished in {move_number} moves")
max_value = max_cell_value(game)
board_value = total_board_value(game)
if max_value >= WINNING_VALUE:
print("game won! 💪")
else:
print("game lost! 😢")
print(f"final board:\n\n{render_board(game)}")
print(f"max value: {max_value}")
print(f"board value: {board_value}")
That completes the full pipeline from training to evaluation. A sample run produces output like this:
loading model from .art/2048-multi-turn/models/agent-002/0010
==((====))== Unsloth 2025.5.1: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.5.post1.
\\ /| NVIDIA H100 PCIe. Num GPUs = 1. Max memory: 79.179 GB. Platform: Linux.
O^O/ \_/ \ Torch: 2.6.0+cu124. CUDA: 9.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\ / Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
"-____-" Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
move 10
board:
_|2|4|2
_|_|4|4
2|_|4|2
_|_|_|_
agent move: <move>down</move>
updated board:
_|_|_|_
4|_|_|2
_|_|4|4
2|2|8|2
...
Performance
In a real-world case study called ART·E, the team tackled the task of answering natural-language questions by searching an email inbox. They used the public Enron email dataset and, to get around the lack of training data, synthesized roughly 4,000 high-quality question-answer pairs with GPT-4.1. The agent was given three basic tools: search emails, read an email, and return an answer. During training, ART's GRPO loop generated 4 different execution trajectories per question and optimized against a composite reward function that not only rewards correct answers but also penalizes hallucinations and encourages finishing the task in fewer steps (a rough sketch of a reward of this shape follows below).
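The actual ART·E reward implementation lives in the notebook linked below; purely as an illustration (the signature, weights, and helper flags here are invented for this sketch, not taken from ART·E), a composite reward of that shape could look like:

def composite_reward(
    answer_correct: bool,   # did the final answer match the reference?
    hallucinated: bool,     # e.g. the answer cites an email that was never retrieved
    num_steps: int,         # tool calls used before answering
    max_steps: int = 10,
) -> float:
    # Reward correct answers, penalize hallucinations, and add a small
    # efficiency bonus for finishing in fewer steps. Weights are placeholders.
    reward = 1.0 if answer_correct else 0.0
    if hallucinated:
        reward -= 1.0
    if answer_correct:
        reward += 0.1 * (max_steps - min(num_steps, max_steps)) / max_steps
    return reward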
Full code: https://colab.research.google.com/github/openpipe/art/blob/main/examples/art-e/art-e.ipynb#scrollTo=thQ4e6tqk_I1
In the end, the trained model beat the strong o3 model on both accuracy and efficiency, and the entire training run took less than a day on a single H100 GPU at a cost of roughly $80, a good demonstration of ART's effectiveness and economy.

ART is powerful, but before adopting it the developers recommend checking that your task meets the following conditions:
- An existing baseline success rate: an open-source model should already succeed at your task roughly 30% of the time. RL is good at optimizing and improving, but if the task is far too hard for the model, it may not learn effectively.
- A quantifiable success criterion: you need a clear way to decide whether an attempt succeeded and to quantify how well it did. This reward can be objective (e.g. "does the answer match the reference?") or subjective (e.g. "does a judge LLM consider the result satisfactory?"); see the sketch after this list.
- A repeatable environment: training requires the agent to attempt the same scenario many times in parallel, so the task should not make irreversible changes to the real world during training.
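For the second condition, the reward can be as simple as an exact-match check against a reference answer, or a score from a judge LLM. A minimal sketch of both (the judge model name and prompt are placeholders, not part of ART):

def exact_match_reward(predicted: str, reference: str) -> float:
    # Objective criterion: exact match against a known-good answer.
    return 1.0 if predicted.strip() == reference.strip() else 0.0

async def judge_reward(client, question: str, predicted: str) -> float:
    # Subjective criterion: ask a judge LLM whether the answer is satisfactory.
    # `client` is any OpenAI-compatible async client; the model name is a placeholder.
    response = await client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\nAnswer: {predicted}\n"
                "Reply with only YES if the answer is satisfactory, otherwise NO."
            ),
        }],
    )
    text = (response.choices[0].message.content or "").upper()
    return 1.0 if "YES" in text else 0.0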
Summary
Much like the thinking behind context engineering and prompt optimization tools such as DSPy (a declarative approach to programming LLMs) and APE/Weavel Ape (a prompt optimization tool), the key to raising the engineering bar is closing the loop end to end, reducing implementation complexity, and iterating continuously. Agent Reinforcement Trainer (ART) takes aim at exactly the current AI agent difficulties, offering a training mechanism that is efficient, easy to integrate, and able to learn from experience, so that agents keep getting better.
GitHub: https://github.com/OpenPipe/ART
(By: AI工程化)