Beyond the Wrapper: Building High-Throughput Reasoning Agents with Async Kernels

Community Article · Published December 26, 2025

A critique of the mainstream agent abstraction — backed by a concrete implementation journey and validated by how serious RL systems actually work.

By Jen Wei

The "Latency Disguised as Intelligence" Problem

The current application layer of AI is suffering from a massive abstraction error. We have confused "Agentic Behavior" with "API Plumbing."

If you look at the source code of most popular agent frameworks today, you will find a Python while loop wrapping an HTTP client. The workflow looks like this:

  1. Send Prompt to OpenAI.
  2. Wait 2s (Network Latency).
  3. Receive "Call Tool" token.
  4. Wait 3s (Tool Execution).
  5. Send Tool Output back.
  6. Wait 2s (Network Latency).

From a Systems Engineering perspective, this is a disaster. Your GPU utilization is effectively 0% for the majority of the wall-clock time. You are paying for a Ferrari (H100) to sit at red lights.
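
As a concrete sketch of that workflow (illustrative only: `chat` and the entries in `tools` are hypothetical stand-ins for the provider's HTTP client and the local tool runtime), the whole "agent" often reduces to this:

def naive_agent_loop(chat, tools, prompt, max_turns=8):
    # Everything here runs on the client; the GPU serving the model is idle
    # whenever this process is waiting on the network or on a tool.
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_turns):
        reply = chat(messages)                        # ~2s blocked on network
        messages.append({"role": "assistant", "content": reply["content"]})
        call = reply.get("tool_call")
        if call is None:                              # model answered directly
            return reply["content"]
        result = tools[call["name"]](**call["args"])  # ~3s blocked on the tool
        messages.append({"role": "tool", "content": str(result)})
    return None

Nothing overlaps: while one trajectory waits on its tool, no other trajectory is being decoded.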

While the industry builds wrappers, the research community (ByteDance, DeepSeek, AI2) has moved on to Systems Engineering. This is the story of how I tried to bridge that gap—from manual KV management to reverse-engineering the async kernels that power state-of-the-art Reasoning Models.


The Typical Workflow

Figure 1: Search-guided rubric deep research rollout (AI2)

Figure 2: Text-based sandbox code rollout (ByteDance Seed)

This pattern shows up consistently in modern RL training stacks, regardless of the specific model or lab:

  1. A shared core loop: interleaved token generation with tool execution, delayed feedback, and reward computation.
  2. As a result, the policy does not execute in a single uninterrupted forward pass; it repeatedly suspends, waits, and resumes.

At scale, an “agent” is just a workload with unpredictable suspension points.
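
One way to make that concrete is to model a rollout as a coroutine that suspends whenever it needs something the GPU cannot give it. This is an illustrative sketch (the policy_generate callable and the segment fields are assumptions, not code from any of the cited stacks):

def rollout(policy_generate, prompt):
    # A trajectory as a generator: it yields (suspends) at every tool call
    # and is resumed by a scheduler once the tool output is available.
    context = prompt
    while True:
        segment = policy_generate(context)   # GPU-bound: decode until a stop token
        context += segment.text
        if segment.stop_reason != "tool_call":
            return context                   # trajectory finished (StopIteration carries the value)
        # Suspension point: hand off the tool call, get the result back later.
        tool_output = yield (segment.tool_name, segment.tool_args)
        context += tool_output

A training stack's job is then to keep the GPU busy while any given rollout sits at one of these yield points.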

Once you view agents this way, the limitations of synchronous inference abstractions become obvious.


Phase 1: The "Manual Transmission" Implementation

Influenced by the ReTool (ByteDance) and VerlTool papers, I set out to build a training loop that owned the compute rather than renting it.

The goal was to implement Group Relative Policy Optimization (GRPO), the algorithm behind DeepSeek-R1, for an agent that could use a Python sandbox. To do this efficiently, I needed to implement two specific optimizations from the literature:

  1. Interpreter Feedback Masking: Blocking external tool outputs from the loss calculation so the model isn't penalized for the sandbox's text.
  2. KV-Cache Reuse: Caching the "Thought" tokens so we don't re-compute attention after every tool execution.
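
The first optimization is conceptually simple: build a mask that is 1 over model-generated tokens and 0 over interpreter output, and multiply it into the per-token loss. A minimal sketch (the tensor layout and names are mine, not TRL's):

import torch

def build_ids_and_loss_mask(segments):
    # `segments` is a list of (token_ids, is_model_generated) pairs, e.g.
    # [(thought_ids, True), (interpreter_ids, False), (answer_ids, True)].
    ids, mask = [], []
    for token_ids, is_model in segments:
        ids.append(token_ids)
        mask.append(torch.full_like(token_ids, int(is_model)))
    input_ids = torch.cat(ids)
    loss_mask = torch.cat(mask).float()
    return input_ids, loss_mask

# In the GRPO loss: (per_token_loss * loss_mask).sum() / loss_mask.sum(),
# so the model is never penalized for text the sandbox produced.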

I built a custom generation loop on top of the structure of Hugging Face TRL's GRPOTrainer, manually managing the past_key_values and DynamicCache.

The Implementation: (Here is the core logic where I handle the generation cycle and interpreter injection)

def _retool_generate_with_interpreter(
    self, prompt_ids_batch, attention_mask_batch,
    eos_id, interpreter_id, code_id, max_turns=10,
):
    # ... setup ...
    for turn_idx in range(max_turns):
        # 1. Generate tokens using custom loop with KV Cache
        newly_generated_tokens, current_kv = self._custom_generate(
            input_ids=current_input_id,
            past_key_values=current_kv,
            # ...
        )
        # (last_token_id / full_text are decoded from newly_generated_tokens; elided)

        # 2. Check for Code Block detection
        if last_token_id == code_id[1]:
            code_match = re.search(r'<code>(.*?)</code>', full_text, re.DOTALL)
            if code_match:
                # 3. Execute Code (Synchronous blocking here!)
                interpreter_text = self._execute_code(code_match.group(1))

                # 4. Format Feedback & Reuse KV Cache for next turn
                formatted_feedback = f"{...}{interpreter_text}{...}"
                interpreter_ids = self.processing_class(formatted_feedback, ...).input_ids
                current_input_id = interpreter_ids

The Post-Mortem: This code works. It correctly masks the loss and reuses the KV cache. But it has a fatal flaw.

Look at this line in my _custom_generate function:

current_ids = torch.cat([current_ids, next_token], dim=1)

In the deep learning world, torch.cat is a silent killer. Every time the model generates a token or appends tool output, PyTorch must allocate a new, larger contiguous block of VRAM and copy the data over. As reasoning traces grew long (10k+ tokens), my VRAM fragmented into "Swiss cheese," and I hit OOM (Out Of Memory) errors long before the GPU memory was actually full.
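
To put a rough number on it (an illustrative back-of-the-envelope calculation; the 4096-wide fp16 tensor is an assumed stand-in for one layer's key or value cache, not a measurement from my run):

def cat_copy_cost_gb(seq_len, width=4096, dtype_bytes=2):
    # Growing a tensor one position at a time with torch.cat re-copies all
    # existing rows at every step: 1 + 2 + ... + (seq_len - 1) row copies.
    rows_copied = seq_len * (seq_len - 1) // 2
    return rows_copied * width * dtype_bytes / 1e9

print(f"~{cat_copy_cost_gb(10_000):.0f} GB of copy traffic for one 10k-token trace, per tensor")

And every one of those reallocations leaves a slightly-too-small hole behind, which is exactly the fragmentation that triggered the premature OOMs.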

I had built a Manual Transmission. This failure mode is hard to avoid in userland generation loops that manage contiguous tensors by hand.


Phase 2: The "Split Brain" Reactor Pattern

While I was debugging memory fragmentation, AI2 released Deep Research Tulu. I dug into their infrastructure (specifically tool_vllm.py) to see how they handled the async tooling problem during RL training.

What I found was a scheduler-level override that turns the inference engine into an Async Reactor.

The Problem: GRPO + Tool Divergence

GRPO requires generating a group of outputs (e.g., n=3 or n=16) for every prompt to estimate the baseline reward.
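
Concretely, the group is the baseline: each sample's reward is normalized against its siblings, roughly A_i = (r_i - mean(r_1..r_n)) / std(r_1..r_n), so all n completions for a prompt must finish before any of them can be scored.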

  • If you send n=3 to a standard inference engine (like vLLM), it bundles them into a single batch.
  • The Conflict: If Sequence A calls a tool (needs to wait) but Sequence B wants to think (needs GPU), the engine stalls. You cannot "pause" one-third of a tensor batch.

The Solution: The "Split Brain" Hack

The solution is to intercept the request before it hits the engine and unroll the parallelism.

Instead of one request for 3 samples, the system manually splits it into 3 independent "Strangers."

# The Logic (Simplified from AI2's implementation)
def _validate_and_add_requests(self, prompts, params, ...):
    # 1. Intercept the Group Size (n=3)
    # 2. Hack the params to force n=1
    self.single_n_sampling_params.n = 1

    # 3. Unroll the loop manually: one sub-request per (prompt i, sample j)
    for i, prompt in enumerate(prompts):
        for j in range(params.n):
            # Create unique IDs for the "Strangers"
            request_id = f"{i}-{j}"

            # Submit them as independent agents
            self.llm_engine.add_request(request_id, ..., self.single_n_sampling_params)

Why This Is Brilliant (The Async "How")

This unrolling allows the implementation of a Client-Side Event Loop that acts as a Traffic Controller between the CPU and GPU.

  1. The Fast Lane (GPU): All 3 agents start generating tokens.
  2. The Divergence: Agent 1 emits </code>. The Event Loop detects the stop token.
  3. The Offload: Agent 1 is pulled out of the GPU queue and sent to a ThreadPoolExecutor to run the Python code.
  4. The Saturation: Crucially, Agent 2 and Agent 3 keep running. The GPU never waits.
  5. The Re-Injection: When Agent 1 finishes its tool execution (the future completes), the system constructs a new request containing the history + tool output and re-submits it to vLLM. Because vLLM uses PagedAttention (non-contiguous memory), re-injecting this request is nearly free (O(1)) compared to my torch.cat nightmare (O(N)).

Here is the snippet showing how the Python Event Loop manages this handoff without blocking the main process:

# Initialize executor for non-blocking I/O
self.executor = ThreadPoolExecutor(max_workers=20)

# Inside the main generation loop
if o.text.endswith(stop_str):
    # 1. Identify the tool
    tool = self.tools[stop_str]
    
    # 2. Submit the work to the thread pool (NON-BLOCKING)
    future = self.executor.submit(tool, o.text)
    
    # 3. Store the 'Future' object and the current state
    self.pending_tool_futures[output.request_id] = (future, o, output)
    
    # 4. Break out of the loop to let the GPU continue processing other requests
    output_processed = True
    break
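
The snippet above is only the submission half. The other half of the reactor drains completed futures and re-submits the extended context; here is a sketch of that side under my own naming (only pending_tool_futures, llm_engine, and single_n_sampling_params come from the snippets above, the rest is assumed):

def _poll_pending_tool_futures(self):
    # Sketch: collect the tool calls whose futures have completed...
    finished = [
        rid for rid, (fut, _, _) in self.pending_tool_futures.items() if fut.done()
    ]
    for request_id in finished:
        future, completion, request_output = self.pending_tool_futures.pop(request_id)
        tool_text = future.result()

        # ...rebuild the prompt as "everything decoded so far + tool output"...
        new_prompt = request_output.prompt + completion.text + tool_text

        # ...and hand it straight back to the engine. PagedAttention means no
        # contiguous client-side buffer ever has to be grown or copied.
        self.llm_engine.add_request(request_id, new_prompt, self.single_n_sampling_params)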

Conclusion: Own the System

The transition from "API Wrapper" to "Systems Engineer" is painful. You have to care about memory fragmentation, scheduler states, and event loops.

But the performance difference isn't 20% or 30%. It is an order of magnitude.

  • Wrapper Agents: ~0% GPU Utilization (Network Bound).
  • My Manual Implementation: ~40% GPU Utilization (Memory Bound).
  • Async Reactor Pattern: ~95% GPU Utilization (Compute Bound).

If we want to build the next generation of Reasoning Models, we need to stop treating the LLM as a black box function call and start treating it as a system resource we schedule against.

