We are entering the second half of AI [1]: environments and evals are becoming as important as algorithms. In my vision, a truly useful consumer AI will be a general agent that operates in both GUI and bash environments with real-time voice support. In this post, I list relevant pointers for building such an agent.
Agent framework design
Agent Lightning by Microsoft Research [4]
I don’t understand all of its broad claims, but here are some things I do understand:
- Breaking multi-step agent trajectories into transitions (state, action, reward, next state) is better for learning than masking approaches. From the paper [4]: “masking-based approaches not only require tight coupling between training and agent execution logic, but also disrupt the continuity of tokens in LLMs, which is assumed in the widely used position encoding approaches, such as Rotary Positional Embeddings (RoPE). Additionally, masking introduces significant complexity in code verification, debugging, and kernel design, often resulting in reduced efficiency when masks become intricate”.
- The RL training framework stays standard (inference and training). The additional agent logic can be implemented separately from the GPU nodes, communicating with the inference nodes through an OpenAI-like API [3].
- It mentions OpenTelemetry [5], an open-source observability framework. This is probably useful for logging rich features in agent trajectories.
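The transition decomposition above can be sketched in a few lines. This is my own illustration of the idea, not Agent Lightning’s actual API: each LLM call in a trajectory becomes an independent training sample, with credit assigned per step rather than one long masked sequence.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str       # prompt/observation fed to the LLM at this step
    action: str      # tokens the LLM generated
    reward: float    # credit assigned to this step
    next_state: str  # observation after executing the action

def trajectory_to_transitions(steps, final_reward):
    """steps: list of (prompt, completion, observation) tuples.
    Assigns the terminal reward to the last transition; intermediate
    transitions get 0 (one simple credit-assignment choice)."""
    transitions = []
    for i, (prompt, completion, obs) in enumerate(steps):
        r = final_reward if i == len(steps) - 1 else 0.0
        transitions.append(Transition(prompt, completion, r, obs))
    return transitions

steps = [
    ("open browser", "<click url bar>", "url bar focused"),
    ("url bar focused", "<type example.com>", "page loaded"),
]
ts = trajectory_to_transitions(steps, final_reward=1.0)
```

Each transition can then be fed to a standard single-step RL update, with no masking over tool-output tokens.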

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs [20, 21]
ReTool teaches models to use code during reasoning. It is exciting that they share engineering details. They follow VeRL’s agentic-loop design [22].


It remains a question to me whether, in step 4, while a worker is waiting on a tool-execution result for one coroutine, it can run token generation for another coroutine in parallel. It would be nice if it could; otherwise, it is not truly asynchronous.
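The fully asynchronous behavior I am hoping for can be sketched with asyncio. This is purely illustrative (not VeRL’s code): while one rollout awaits a simulated tool call, the event loop keeps “decoding” for another rollout.

```python
import asyncio

log = []

async def run_tool(rollout_id):
    log.append(f"{rollout_id}: tool start")
    await asyncio.sleep(0.05)           # simulated sandbox execution
    log.append(f"{rollout_id}: tool done")

async def generate_tokens(rollout_id, n):
    for _ in range(n):
        await asyncio.sleep(0.005)      # simulated decode step
    log.append(f"{rollout_id}: generation done")

async def main():
    # rollout A waits on a tool while rollout B keeps decoding
    await asyncio.gather(run_tool("A"), generate_tokens("B", 3))

asyncio.run(main())
```

In the log, rollout B finishes its generation while rollout A is still blocked on the tool, which is exactly the overlap a fully asynchronous worker should achieve.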
For tool execution, they use the SandboxFusion library [27] to deploy the tool-execution environment.
They note that there is still bubble time in inference [22], although I think this can be mitigated with a good replay-buffer design.
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use [23]
It supports multi-turn asynchronous RL. However, the asynchronous generation is not truly asynchronous: it seems the model cannot start the next round of token generation until all tool calls for a turn have finished.
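The turn-level barrier can be made concrete with a toy sketch (my illustration, not VerlTool’s code): waiting on all tool calls of a turn is an `asyncio.gather`, so the turn takes as long as the slowest call, and fast calls idle at the barrier. A truly asynchronous design would instead resume each rollout as its own call returns.

```python
import asyncio
import time

async def tool_call(seconds):
    await asyncio.sleep(seconds)   # simulated tool latency
    return seconds

async def turn_with_barrier():
    t0 = time.monotonic()
    # generation for the next turn cannot start until BOTH calls return
    results = await asyncio.gather(tool_call(0.01), tool_call(0.05))
    elapsed = time.monotonic() - t0
    return results, elapsed

results, elapsed = asyncio.run(turn_with_barrier())
# elapsed is bounded below by the slowest call (0.05 s)
```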

Moreover, the environment design seems very simple. It is not Docker-based, which may eventually be needed for GUI agents.

AWorld: Orchestrating the Training Recipe for Agentic AI [25]
This work is from Ant Group. They use Kubernetes to manage environments. In the paper, they improve GAIA accuracy from 21% to 32%; in their GitHub repo [26], they even reach 81%. Based on the repo [26], their training is built on VeRL, which means it is not truly asynchronous RL. See the explanation in the Slime section below.
Slime [28]
Slime is, in my view, close to optimal in RL training efficiency. It explains [29] why engine-based RL training frameworks (e.g., VeRL) lack the concept of “continuous batching” and are therefore not efficient enough. Slime has another advantage: its sgl-router component, through which complex agent environments can interact directly via an OpenAI-compatible API [30].
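A toy model makes the continuous-batching argument concrete (my sketch, not Slime’s code): if finished sequences leave the batch after every decode step and queued ones join immediately, GPU slots rarely idle; a static batch must instead wait for its longest member before admitting new work.

```python
from collections import deque

def continuous_batching_steps(lengths, max_batch=2):
    """lengths: remaining tokens per request. Returns total decode steps."""
    queue = deque(lengths)
    running = []
    steps = 0
    while queue or running:
        while queue and len(running) < max_batch:
            running.append(queue.popleft())   # admit new sequences each step
        steps += 1
        running = [n - 1 for n in running if n - 1 > 0]  # one token each
    return steps

def static_batching_steps(lengths, max_batch=2):
    """Batch is fixed until all members finish (no mid-flight admission)."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps

lengths = [2, 8, 2, 2]
cont = continuous_batching_steps(lengths)
stat = static_batching_steps(lengths)
```

With one long request in the mix, the static schedule wastes the slot next to it, while the continuous schedule backfills it with queued requests.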

[24] claims that SGLang is faster than vLLM in multi-turn environments.
AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning [32, 33]
AREAL looks like a legitimate alternative to Slime. Its paper [33] clearly lays out a truly asynchronous design for maximizing RL efficiency.
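The core of the truly asynchronous pattern can be sketched as a version-tagged buffer (my illustration of the general idea, not AREAL’s actual code): rollout workers keep producing trajectories under whatever policy version they hold, and the trainer consumes them but bounds how stale a sample may be.

```python
from collections import deque

class StalenessBuffer:
    def __init__(self, max_staleness=1):
        self.buf = deque()
        self.max_staleness = max_staleness

    def put(self, sample, policy_version):
        # rollout workers never block on the trainer
        self.buf.append((sample, policy_version))

    def get_batch(self, current_version, n):
        batch = []
        while self.buf and len(batch) < n:
            sample, v = self.buf.popleft()
            if current_version - v <= self.max_staleness:
                batch.append(sample)   # fresh enough to train on
            # else: discard (or apply an off-policy correction instead)
        return batch

buf = StalenessBuffer(max_staleness=1)
buf.put("traj_v0", 0)
buf.put("traj_v1", 1)
buf.put("traj_v2", 2)
batch = buf.get_batch(current_version=2, n=3)
```

The staleness bound is the knob: loosen it and throughput rises but the data drifts further off-policy; tighten it and you approach synchronous training again.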

Some other similar projects:
rStar2-Agent: Agentic Reasoning Technical Report [31]
Environments for Executing Agents
Most likely, we will use containers [19] to execute agents. Here is an intro to containers (and a comparison with virtual machines) [15]. Popular container technologies are Docker [18] and enroot [16]. Here is another comparison between enroot and Docker [17].
Docker-based environments include:
1. ComputerRL [34] (based on OSWorld [35])
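For orientation, the basic workflow with either tool looks roughly like this. The image name and paths are placeholders, not from any of the projects above.

```shell
# Docker: build an agent sandbox image and run it interactively
docker build -t agent-env .
docker run --rm -it agent-env bash

# enroot: import an image from a registry, create a container, start it
# (enroot runs unprivileged, without a root daemon)
enroot import docker://ubuntu
enroot create ubuntu.sqsh
enroot start ubuntu
```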
Communication between LLMs and Agent Logics
We could follow OpenAI’s API standards [2, 3, 6, 7]. Alternatively, we may need to set up an MCP client-server paradigm [9, 10]. A detailed comparison between the OpenAI API and Anthropic’s MCP can be found here [8]. For computer use, there are specific tutorials for the OpenAI API [11] and for MCP [12, 14]. There is also a non-MCP computer-use implementation example from Anthropic [13] (a very good example of a computer-use implementation, by the way!).
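In the OpenAI function-calling style, exposing a tool to the model means passing a JSON schema in the request; the schema shape below follows OpenAI’s documented format, while the bash tool itself is my hypothetical example.

```python
import json

# Tool definition passed as `tools=[bash_tool]` in a chat-completions request
bash_tool = {
    "type": "function",
    "function": {
        "name": "run_bash",
        "description": "Execute a shell command and return stdout/stderr.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "The command to run.",
                }
            },
            "required": ["command"],
        },
    },
}

# The model replies with a `tool_calls` entry whose `arguments` field is a
# JSON string; the agent loop parses it and executes the tool:
args = json.loads('{"command": "ls -la"}')
```

An MCP server advertises essentially the same information (tool name, description, input schema) but over a standardized client-server protocol rather than per-request payloads.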
RL Algorithm
Because agent tasks are usually long-horizon, simple policy-gradient methods like GRPO may not work well. We need to either reduce GRPO's variance [34, 38] or adopt actor-critic methods [36]. The UI-TARS paper [37] from ByteDance also notes that PPO has a consistent advantage over GRPO, as does another paper [39].
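To see where the variance problem comes from, here is GRPO's group-relative advantage in a few lines: sample a group of rollouts per prompt and normalize each reward by the group mean and standard deviation. With a sparse terminal reward over a long horizon, every token in a rollout shares this single advantage, which is what motivates critics and finer-grained baselines.

```python
import statistics

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: z-score each rollout's reward
    against its group (one group = rollouts for the same prompt)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# four rollouts of one prompt, sparse 0/1 terminal rewards
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Every token of a successful rollout gets the same positive advantage here, including the steps that contributed nothing, which is exactly the credit-assignment noise that grows with horizon length.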
Memory
Long-horizon agent tasks may also benefit from memory. As of 10/23/2025, there are two memory papers I appreciate: the first, ReasoningBank [40], has the model keep exploring, retrieving old memories, self-judging, and appending new memories at test time; the second is similar, but condenses memory into reusable skills [41].
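The retrieve-attempt-judge-append loop can be sketched as follows. This is a toy version of the idea (keyword overlap standing in for real embedding similarity), not the papers' implementations.

```python
class MemoryBank:
    def __init__(self):
        self.entries = []  # (keyword set, distilled lesson) pairs

    def retrieve(self, task, k=2):
        # toy similarity: keyword overlap with the task description
        scored = sorted(
            self.entries,
            key=lambda e: -len(e[0] & set(task.split())),
        )
        return [lesson for _, lesson in scored[:k]]

    def append(self, task, lesson):
        # after self-judging an attempt, distill it into a lesson
        self.entries.append((set(task.split()), lesson))

bank = MemoryBank()
bank.append("book a flight online", "confirm the date before paying")
bank.append("compile a rust crate", "run cargo check first")

# a new task retrieves the most related past lessons as extra context
hints = bank.retrieve("book a train ticket online")
```

The skill-condensation variant [41] would go one step further and merge recurring lessons into named, reusable procedures instead of storing them verbatim.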
References
[1] https://ysymyth.github.io/The-Second-Half/
[2] https://cookbook.openai.com/examples/agents_sdk/app_assistant_voice_agents
[3] https://platform.openai.com/docs/guides/agents
[4] Agent Lightning: Train ANY AI Agents with Reinforcement Learning: https://arxiv.org/pdf/2508.03680
[5] OpenTelemetry: https://opentelemetry.io/docs/getting-started/dev/
[6] https://platform.openai.com/docs/guides/function-calling?ref=jeffreybowdoin.com
[7] https://platform.openai.com/docs/api-reference/responses
[8] https://jeffreybowdoin.com/blog/openai-responses-api-vs-mcp/
[9] https://modelcontextprotocol.io/specification/2025-06-18/server
[10] https://github.com/modelcontextprotocol/python-sdk
[11] https://platform.openai.com/docs/guides/tools-computer-use
[12] https://github.com/CursorTouch/Windows-MCP
[13] https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/computer-use-tool
[14] https://github.com/domdomegg/computer-use-mcp
[16] https://github.com/NVIDIA/enroot/tree/main
[19] https://aws.amazon.com/what-is/containerization/
[20] https://arxiv.org/abs/2504.11536
[22] https://verl.readthedocs.io/en/latest/advance/agent_loop.html
[23] https://arxiv.org/html/2509.01055v1
[24] https://www.runpod.io/blog/sglang-vs-vllm-kv-cache
[25] https://arxiv.org/pdf/2508.20404
[26] https://github.com/inclusionAI/AWorld
[27] https://github.com/bytedance/SandboxFusion
[28] https://github.com/THUDM/slime
[30] https://lmsys.org/blog/2025-07-09-slime/
[31] https://www.arxiv.org/abs/2508.20722
[32] https://github.com/inclusionAI/AReaL
[33] AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning: https://arxiv.org/pdf/2505.24298
[34] ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents: https://arxiv.org/abs/2508.14040
[35] OSWorld Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments: https://arxiv.org/abs/2404.07972
[36] DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning: https://arxiv.org/abs/2406.11896
[37] UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning: https://arxiv.org/abs/2509.02544
[38] Group-in-Group Policy Optimization for LLM Agent Training: https://arxiv.org/abs/2505.10978
[39] A Practitioner’s Guide to Multi-turn Agentic Reinforcement Learning: https://arxiv.org/abs/2510.01132
[40] ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory: https://arxiv.org/abs/2509.25140
[41] Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors: https://arxiv.org/abs/2509.13237