Agent Loop Controller
What This Covers
This scenario focuses on an agent loop controller that alternates LLM reasoning and tool execution. The design task is to assess whether the loop’s completion behavior is robust for multi-step work, not just whether it performs well on happy-path benchmarks.
The System
Current State
- A production agent handles internal analyst tasks such as document lookup, SQL generation, and lightweight synthesis across 3 tools.
- It processes about 18,000 runs/day, with a maximum loop budget of 12 steps and a median of 3 steps per successful run.
- The controller currently supports step limits, token budgets, and a configurable completion detector that can end a run when the model appears to have produced a user-ready response (a minimal sketch follows this list).
- Internal evaluation shows 94% success on single-tool tasks and 81% success on multi-step benchmark tasks.
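For concreteness, here is a minimal sketch of a loop with these three controls. Every name and signature (`StepResult`, `llm_step`, `run_tool`, `is_complete`, the finalize step) is a placeholder assumed for illustration; the scenario does not show the team's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Illustrative sketch only: all names and signatures are placeholders.

@dataclass
class StepResult:
    text: str                  # model output for this reasoning turn
    tool_call: Optional[dict]  # requested tool invocation, if any
    tokens_used: int

@dataclass
class LoopBudget:
    max_steps: int = 12        # matches the 12-step budget above
    max_tokens: int = 20_000   # hypothetical token ceiling

def run_agent(
    task: str,
    llm_step: Callable[[List[str]], StepResult],  # one reasoning turn
    run_tool: Callable[[dict], str],              # tool executor
    is_complete: Callable[[StepResult], bool],    # configurable completion detector
    budget: LoopBudget = LoopBudget(),
) -> str:
    history: List[str] = [task]
    tokens = 0
    for _ in range(budget.max_steps):
        step = llm_step(history)
        tokens += step.tokens_used
        if tokens > budget.max_tokens:
            break                                 # token budget exhausted
        if step.tool_call is not None:
            history.append(run_tool(step.tool_call))
            continue
        if is_complete(step):
            # Current policy: even when the detector fires, route through
            # an explicit controller-managed finalize step before ending.
            return finalize(history, step)
        history.append(step.text)
    return finalize(history, None)                # budget exhausted: best effort

def finalize(history: List[str], last: Optional[StepResult]) -> str:
    # Placeholder for the finalize step the proposal wants to skip.
    return last.text if last is not None else history[-1]
```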
Proposed Change
- The team wants to reduce median (P50) latency from 8.4s to under 6.5s by ending runs as soon as the model appears done, instead of always requiring an explicit controller-managed finalize step (see the sketch after this list).
- They also want one shared completion policy across all task types to simplify orchestration and reduce prompt complexity.
- Product wants this shipped in the next release because perceived responsiveness is now a top customer complaint.
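Sketched against the same placeholder interfaces as above, the proposed policy ends the run the moment the shared completion detector fires and drops the finalize pass entirely:

```python
# Proposed variant (hypothetical, reusing the placeholders above): return
# immediately when the shared completion detector fires; no finalize pass.

def run_agent_early_exit(task, llm_step, run_tool, is_complete,
                         budget=LoopBudget()):
    history, tokens = [task], 0
    for _ in range(budget.max_steps):
        step = llm_step(history)
        tokens += step.tokens_used
        if tokens > budget.max_tokens:
            break
        if step.tool_call is not None:
            history.append(run_tool(step.tool_call))
            continue
        if is_complete(step):
            return step.text       # end immediately: no finalize pass
        history.append(step.text)
    return history[-1]             # budget exhausted
```

Note that a single shared `is_complete` predicate now terminates every task type, which is exactly what makes the unified policy both attractive and risky.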
Worked Example: Design Tradeoff
In a previous revision, the team optimized for persistence: if a tool result looked incomplete or unhelpful, the controller allowed the model to try again as long as the step budget had not been exhausted. The original reasoning was that complex tasks often require iteration, and overly restrictive controls would cut off valid recovery behavior.
During rollout, usage looked healthy overall, but a trace review surfaced a pattern in a small but expensive slice of runs: token burn rose sharply with little additional output, and some traces showed the same tool call being issued many times in a row with effectively unchanged arguments. Final outputs were often empty or very thin despite high spend. The issue was hard to spot at first because repeated tool use also occurs in legitimate multi-step work.
The design was adjusted by adding lightweight progress checks to the loop controller: repeated actions were compared at the argument level, ambiguous tool responses were normalized, and the loop required some evidence of state change before permitting additional retries. The team kept flexibility for genuine iterative work, but added guardrails so “more steps” no longer automatically meant “more progress.”
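A minimal sketch of what such progress checks could look like, assuming hypothetical names and thresholds (`ProgressGuard` and `max_repeats=2` are illustrative, not the team's tuned implementation):

```python
import hashlib
import json
from typing import Dict, Optional

# Sketch of the lightweight progress checks described above. Names and
# the max_repeats threshold are assumptions, not the team's tuned values.

def fingerprint(tool_name: str, args: dict) -> str:
    # Argument-level comparison: canonicalize args so reordered keys do
    # not masquerade as a "new" call.
    canon = json.dumps(args, sort_keys=True, default=str)
    return hashlib.sha256(f"{tool_name}:{canon}".encode()).hexdigest()

def normalize(response: str) -> str:
    # Normalize ambiguous tool responses so the state-change check
    # compares substance rather than whitespace or casing noise.
    return " ".join(response.split()).lower()

class ProgressGuard:
    """Deny further retries of an identical call unless state is changing."""

    def __init__(self, max_repeats: int = 2):   # assumed threshold
        self.max_repeats = max_repeats
        self.call_counts: Dict[str, int] = {}
        self.last_state: Optional[str] = None

    def allow_retry(self, tool_name: str, args: dict, last_response: str) -> bool:
        fp = fingerprint(tool_name, args)
        self.call_counts[fp] = self.call_counts.get(fp, 0) + 1
        state = normalize(last_response)
        changed = state != self.last_state      # evidence of state change
        self.last_state = state
        # Identical calls stay allowed only while the observed state moves.
        return changed or self.call_counts[fp] <= self.max_repeats
```

The guard errs toward permitting retries while the observed state is still changing, which preserves legitimate iteration; only exact-argument repeats against a static state get cut off.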
The Design Question
The team now wants to make the controller more responsive by letting it stop whenever the model appears to have produced a complete answer, even if that answer appears before the loop reaches the explicit finalize step that currently closes every run. This could improve latency and simplify prompts, but it may also change behavior on tasks where intermediate outputs look polished before the work is actually complete.
What design would you recommend for deciding when the loop should end, and what safeguards would you put in place before rolling it out broadly?
Anchor Data
| Task type | Avg steps | P50 latency | P95 latency | Benchmark pass rate | % runs ending on model-produced answer |
|---|---|---|---|---|---|
| Single lookup | 1.8 | 3.1s | 5.4s | 96% | 88% |
| Retrieval + summary | 2.6 | 5.2s | 8.7s | 91% | 74% |
| Retrieval + transform + summary | 4.9 | 9.6s | 14.8s | 79% | 41% |
| Tool chain with validation | 5.4 | 10.1s | 15.9s | 77% | 38% |
Projected with new policy:
| Metric | Current | Proposed |
|---|---|---|
| Overall P50 latency | 8.4s | 6.3s |
| Overall token cost/run (tokens) | 14.2k | 11.1k |
| Overall benchmark pass rate | 86% | 85% |
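One caveat the projection leaves implicit: aggregates can mask where a regression lands. A back-of-envelope check, with the mix weights and projected per-task rates invented purely for illustration (only the current per-task rates come from the table above):

```python
# Hypothetical arithmetic only. The scenario reports aggregates (86% -> 85%)
# and current per-task rates, but neither the benchmark mix nor projected
# per-task rates. Equal weights and the "proposed" values are invented.

MIX = {"single": 0.25, "retr_summary": 0.25,
       "retr_transform": 0.25, "chain_validate": 0.25}      # assumed mix
CURRENT = {"single": 0.96, "retr_summary": 0.91,
           "retr_transform": 0.79, "chain_validate": 0.77}  # from the table
PROPOSED = {"single": 0.98, "retr_summary": 0.93,
            "retr_transform": 0.74, "chain_validate": 0.75} # invented

def aggregate(rates: dict) -> float:
    return sum(MIX[k] * rates[k] for k in MIX)

print(f"current aggregate:  {aggregate(CURRENT):.3f}")   # ~0.86 (reported: 86%)
print(f"proposed aggregate: {aggregate(PROPOSED):.3f}")  # 0.850 (reported: 85%)
# A ~1-point aggregate drop coexists with a 5-point drop on retr_transform.
```

Under these assumed numbers, the aggregate moves one point while the longest multi-step workflows lose several, so the aggregate row alone cannot settle whether the policy is safe.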
Current Observations
- Product review sessions found the proposed outputs “faster and usually complete enough” for common user queries.
- In offline testing, simpler tasks retained quality, while some longer workflows produced answers that were plausible but shorter than expected.
- One engineer noted that a subset of benchmark traces stopped at nearly the same point in the workflow across different prompts.
- Another team member believes the remaining gap is mostly due to task difficulty, not controller behavior.
Constraints
- The release branch is cut in 3 weeks.
- The team can afford one additional evaluation pass and limited trace instrumentation, but not a large new benchmark build.
- Any solution that adds more than 300ms median latency is unlikely to be approved by product.
What You’ll Be Evaluated On
- Tradeoffs: How well you identify competing concerns and justify your design choices
- Gaps: Whether you find what’s missing in the data or proposal
- Prevention: Whether you anticipate what could go wrong and propose safeguards
- Clarity: How clearly and concretely you explain your thinking
- Prioritization: Whether you focus on the most important decisions first
- Reasoning quality: Whether your conclusions follow from the evidence and constraints provided