The Agentic Loop from Scratch: Building Function-Calling AI Without a Framework
There are good reasons to reach for LangChain, LlamaIndex, or one of the other agent frameworks when you are building an LLM-powered application. They solve real problems. The reason to build without one is educational: frameworks hide complexity, and the complexity they hide is exactly the complexity you need to understand if you want to build reliable agentic systems.
I built a full agentic loop in Go from scratch: 21 built-in tools, a voice pipeline from wake word to speech output, automatic context summarization, and a background monitoring daemon. This post is about what I learned, where the frameworks earn their keep, and where rolling your own gives you capabilities you cannot get any other way.
Why Build Without a Framework
The honest answer: I wanted to understand what was actually happening. When an agent built on LangChain calls a tool, retries on failure, and eventually produces a response, there are several layers of abstraction between the LLM API call and your code. When you build it yourself, you write every one of those layers. You understand every failure mode because you have implemented every failure mode's handler, or discovered it by watching the system fail.
The other reason: Go. The major agent frameworks are Python-first. Go's strong typing, goroutine model, and performance characteristics are genuinely useful for the kind of always-on, latency-sensitive system I wanted to build. The voice pipeline in particular benefits from Go's concurrency model. Translating a Python framework to Go idioms would have produced worse Go than writing it natively.
The Tool Definition Contract
Before writing any agent code, you need to solve a representation problem: how do you describe to an LLM what tools are available, what arguments they take, and what they return?
OpenAI's JSON schema format has become the lingua franca for this. Anthropic supports it. Google supports a proto-compatible version of it. Every tool in this system is defined using OpenAI-format JSON schema:
type SkillDefinition struct {
	Name        string          `json:"name"`
	Description string          `json:"description"`
	Parameters  json.RawMessage `json:"parameters"` // OpenAI JSON schema
}
The Parameters field is raw JSON because tool schemas can be arbitrarily complex (nested objects, arrays, enums, required fields), and trying to represent that structure in Go types would be either too rigid or too loose. Keeping it as raw JSON means the schema sent to the API is exactly what the provider's documentation specifies, with no lossy translation through an intermediate type system.
Twenty-one tools are registered at startup. Each tool provides its own schema and a handler function. The executor maps tool names to handlers at initialization time.
The Agentic Loop
The loop is simpler than most descriptions of it suggest. Here is the actual pattern:
1. Append the user message to the conversation history.
2. Call the LLM with the full conversation history and all tool definitions.
3. If the response contains tool calls, execute each tool call and append the results as tool-role messages.
4. Go back to step 2 with the updated conversation history.
5. When the LLM returns a response with no tool calls, that response is the final answer.
func (a *Agent) RunLoop(ctx context.Context, userMessage string) (string, error) {
	a.state.Messages = append(a.state.Messages, ChatMessage{
		Role:    "user",
		Content: userMessage,
	})
	for {
		result, err := a.backend.SendMessageSync(ctx, a.state.Messages, a.tools)
		if err != nil {
			return "", err
		}
		// Always record the assistant turn, including any tool call requests,
		// so the model sees what it already tried on the next iteration.
		a.state.Messages = append(a.state.Messages, ChatMessage{
			Role:      "assistant",
			Content:   result.Content,
			ToolCalls: result.ToolCalls,
		})
		if len(result.ToolCalls) == 0 {
			// Final response: no more tool calls
			return result.Content, nil
		}
		// Execute tool calls and append results
		for _, call := range result.ToolCalls {
			toolResult := a.executor.Execute(ctx, call)
			a.state.Messages = append(a.state.Messages, ChatMessage{
				Role:       "tool",
				ToolCallID: call.ID,
				Content:    toolResult,
			})
		}
	}
}
The critical detail is the message accumulation pattern. Each tool call and its result must be appended to the conversation before the next LLM call. The LLM needs to see its own previous tool call requests alongside the results, because that is how it knows what it already tried and what to do next. Dropping tool call messages from the history causes the LLM to repeat tool calls or become confused about the state of the conversation.
The role: tool message type is not a nice-to-have; it is part of the protocol. Tool results sent as role: assistant or role: user messages will cause unpredictable behavior because the LLM interprets message roles as part of the conversational context.
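Concretely, one tool-using turn should leave a sequence like this in history (field names follow the snippets in this post; the values are illustrative):

```go
package main

import "fmt"

type ToolCall struct {
	ID        string
	Name      string
	Arguments string
}

type ChatMessage struct {
	Role       string
	Content    string
	ToolCalls  []ToolCall
	ToolCallID string
}

// exampleTurn shows the full message sequence one tool-using turn leaves
// behind. Both the assistant's request and the tool result survive in history.
func exampleTurn() []ChatMessage {
	return []ChatMessage{
		{Role: "user", Content: "What's the weather in Oslo?"},
		// The assistant turn that requested the tool stays in history.
		{Role: "assistant", ToolCalls: []ToolCall{
			{ID: "call_1", Name: "get_weather", Arguments: `{"city":"Oslo"}`},
		}},
		// The result is linked back to the request by ToolCallID, with role "tool".
		{Role: "tool", ToolCallID: "call_1", Content: `{"temp_c": 4, "conditions": "overcast"}`},
		// After seeing the result, the assistant answers with no tool calls.
		{Role: "assistant", Content: "It's 4 degrees and overcast in Oslo."},
	}
}

func main() {
	for _, m := range exampleTurn() {
		fmt.Println(m.Role)
	}
}
```

The ID linkage is what lets the model pair results with requests when it issued several calls in one turn.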
Multi-Step Tool Chaining
A single user request often requires multiple tool calls in sequence. "What is the weather in the city where the most recent earthquake over magnitude 5.0 occurred?" requires at minimum: a seismic data lookup, extracting the city from that result, then a weather API call. The agent handles this naturally because the loop continues until there are no more tool calls.
Parallel tool calls are a feature of the OpenAI API format. The LLM can return multiple tool calls in a single response when it determines they can be executed independently. The executor handles this:
func (e *Executor) ExecuteAll(ctx context.Context, calls []ToolCall) []ToolResult {
	results := make([]ToolResult, len(calls))
	var wg sync.WaitGroup
	for i, call := range calls {
		wg.Add(1)
		go func(idx int, tc ToolCall) {
			defer wg.Done()
			handler, ok := e.handlers[tc.Name]
			if !ok {
				results[idx] = ToolResult{Error: fmt.Sprintf("unknown tool: %s", tc.Name)}
				return
			}
			results[idx] = handler(ctx, tc.Arguments)
		}(i, call)
	}
	wg.Wait()
	return results
}
Parallel execution matters for latency. If the LLM requests three independent API calls, running them concurrently reduces the round-trip time from three serial waits to one. For a voice interface where total latency is the dominant user experience metric, this is meaningful.
Context Window Management
Every conversation produces tokens. Context windows are finite. The system needs to handle this gracefully rather than failing with a context length error in production.
The StreamState struct tracks token counts and cost estimates throughout a session:
type StreamState struct {
	Messages      []ChatMessage
	TotalTokens   int
	EstimatedCost float64
	SummaryCount  int
	MaxTokens     int // Model-specific context limit
}
When TotalTokens approaches 80% of MaxTokens, the loop triggers a summarization pass before processing the next message. The summarizer calls the LLM directly with a prompt asking it to compress the conversation history into a concise summary, then replaces the message history with a single assistant message containing that summary.
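The trigger check itself is simple. A sketch, where summarize stands in for the direct LLM call with the compression prompt (the function name and the crude token re-estimate are illustrative):

```go
package main

import "fmt"

type ChatMessage struct {
	Role    string
	Content string
}

type StreamState struct {
	Messages     []ChatMessage
	TotalTokens  int
	SummaryCount int
	MaxTokens    int
}

// maybeSummarize compresses history when usage crosses 80% of the limit.
// summarize stands in for a direct LLM call with a compression prompt.
func maybeSummarize(s *StreamState, summarize func([]ChatMessage) string) bool {
	if float64(s.TotalTokens) < 0.8*float64(s.MaxTokens) {
		return false
	}
	summary := summarize(s.Messages)
	// Replace the history with a single assistant message holding the summary.
	s.Messages = []ChatMessage{{Role: "assistant", Content: summary}}
	s.TotalTokens = len(summary) / 4 // rough re-estimate; real code would count tokens properly
	s.SummaryCount++
	return true
}

func main() {
	s := &StreamState{
		TotalTokens: 90_000,
		MaxTokens:   100_000,
		Messages: []ChatMessage{
			{Role: "user", Content: "..."},
			{Role: "assistant", Content: "..."},
		},
	}
	maybeSummarize(s, func(ms []ChatMessage) string { return "summary of session" })
	fmt.Println(len(s.Messages), s.SummaryCount) // 1 1
}
```

Running the check before processing the next message, rather than after a failure, is what keeps the hard context-length error from ever reaching the user.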
This is lossy compression. The LLM's compressed summary will omit details from earlier in the conversation. For most conversational tasks, this is acceptable because the recent context matters most, and the summary preserves the key facts and decisions from the earlier part of the session. For tasks that require precise recall of earlier conversation content (legal analysis, complex code review), you need a different strategy: either a larger context window, external memory with retrieval, or explicit user-controlled checkpointing.
The first time I hit the context limit without proper handling, the API returned a hard error and the session was lost. Implementing the summarization trigger properly means sessions can run indefinitely without hitting that failure mode.
The Voice Pipeline
The voice pipeline adds a latency-constrained dimension that changes how you think about every component. The pipeline is:
- Wake word detection (local, ~50ms)
- Audio capture until silence detected
- OpenAI Whisper STT (1–2 seconds)
- LLM + tool execution (1–3 seconds depending on tool count)
- ElevenLabs TTS synthesis (0.5–1 second)
- Audio playback
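The stages above map naturally onto a Go channel pipeline, which is where the concurrency model earns its keep: each stage runs in its own goroutine and hands off through a channel. A minimal sketch, with stand-in stage bodies where the real system calls Whisper, the agent loop, and ElevenLabs:

```go
package main

import (
	"fmt"
	"strings"
)

// pipeline wires captured utterances through STT and the agent stage.
// The stage bodies are stand-ins; real stages call external services.
func pipeline(utterances <-chan []byte) <-chan string {
	transcripts := make(chan string)
	go func() {
		defer close(transcripts)
		for audio := range utterances {
			// STT stand-in: real code sends audio to Whisper here.
			transcripts <- fmt.Sprintf("transcript(%d bytes)", len(audio))
		}
	}()
	replies := make(chan string)
	go func() {
		defer close(replies)
		for t := range transcripts {
			// Agent + TTS stand-in: real code runs the loop and synthesizes audio.
			replies <- strings.ToUpper(t)
		}
	}()
	return replies
}

func main() {
	in := make(chan []byte, 1)
	in <- []byte{1, 2, 3}
	close(in)
	for r := range pipeline(in) {
		fmt.Println(r)
	}
}
```

The payoff of this shape is that a slow stage backpressures the one before it instead of dropping audio, and closing the input channel shuts the whole pipeline down cleanly.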
End-to-end latency is typically 4.5–6.5 seconds from the end of the user's utterance to the start of the spoken response. That is acceptable for information retrieval. It is marginal for back-and-forth conversational use. It is too slow for anything requiring rapid interaction.
Wake word detection runs entirely locally. This matters for two reasons: privacy (no audio is transmitted until after the wake word is detected) and latency (no network round trip for the most frequent event, which is the non-wake-word audio that is correctly ignored).
The Whisper STT step is the most variable. Short utterances transcribe in under a second. Long, complex utterances with technical terminology can take two to three seconds. I experimented with a local Whisper model (whisper.cpp) as an alternative to the API call, but the transcription quality on domain-specific technical terms was noticeably worse without fine-tuning.
ElevenLabs TTS produces noticeably more natural speech than alternatives I tested, but their API latency is non-trivial. For long responses, streaming TTS reduces perceived latency significantly by starting audio playback before the full synthesis is complete. The implementation maintains a buffer of synthesized audio chunks and begins playback when the buffer has enough content to prevent stuttering.
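The buffering logic can be sketched as: accumulate chunks until a threshold, then drain and pass everything through. The function below is illustrative; real code would size startBytes from the audio bitrate and feed an audio device rather than a callback:

```go
package main

import "fmt"

// playWhenBuffered collects synthesized chunks and starts "playback" only
// once enough audio is buffered to avoid stuttering, then streams the rest.
func playWhenBuffered(chunks <-chan []byte, startBytes int, play func([]byte)) {
	var pending [][]byte
	buffered := 0
	started := false
	for c := range chunks {
		if started {
			play(c)
			continue
		}
		pending = append(pending, c)
		buffered += len(c)
		if buffered >= startBytes {
			// Threshold reached: flush the buffer and stream from here on.
			started = true
			for _, p := range pending {
				play(p)
			}
			pending = nil
		}
	}
	// Synthesis ended before the threshold: play whatever we have.
	for _, p := range pending {
		play(p)
	}
}

func main() {
	ch := make(chan []byte, 3)
	for i := 0; i < 3; i++ {
		ch <- make([]byte, 100)
	}
	close(ch)
	played := 0
	playWhenBuffered(ch, 150, func(b []byte) { played += len(b) })
	fmt.Println(played) // 300
}
```

The final flush matters: short responses can finish synthesizing before the threshold is reached, and without it they would never play at all.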
The Background Daemon Pattern
One component in the system runs as a persistent background goroutine rather than as a one-off tool call. The agent can spawn it, configure it, and query it for status, but the monitoring work happens continuously outside the main request loop.
type MonitorDaemon struct {
	ctx     context.Context
	cancel  context.CancelFunc
	alertCh chan Alert
	mu      sync.RWMutex
}
The key challenge: how does background work feed back into the agent's context? The solution is an alert channel that the agent checks at the start of each loop iteration. If there are pending alerts from background workers, they are injected as system messages before the user's request is processed. This gives the agent awareness of background state without requiring a separate polling mechanism.
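The check at the top of the loop is a non-blocking drain of the alert channel. A sketch (the drainAlerts helper and the message prefix are illustrative):

```go
package main

import "fmt"

type Alert struct {
	Message string
}

type ChatMessage struct {
	Role    string
	Content string
}

// drainAlerts does a non-blocking read of all pending alerts and converts
// them to system messages for injection before the user's request.
func drainAlerts(alertCh <-chan Alert) []ChatMessage {
	var injected []ChatMessage
	for {
		select {
		case a := <-alertCh:
			injected = append(injected, ChatMessage{
				Role:    "system",
				Content: "Background alert: " + a.Message,
			})
		default:
			// Channel empty: never block the user's request on background state.
			return injected
		}
	}
}

func main() {
	ch := make(chan Alert, 2)
	ch <- Alert{Message: "disk usage above 90%"}
	ch <- Alert{Message: "monitored service restarted"}
	msgs := drainAlerts(ch)
	fmt.Println(len(msgs)) // 2
}
```

The default case is the important part: if there are no alerts, the agent pays essentially nothing for the check.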
What Frameworks Would Have Given Me
After building this, I have a much clearer picture of what agent frameworks actually provide.
Retry logic with exponential backoff. When an LLM API call fails with a rate limit error, the correct behavior is to wait and retry with exponential backoff and jitter. I implemented this, but it took a few production failures to get the parameters right. Frameworks provide this out of the box, tuned to each provider's specific rate limit behavior.
Timeout handling across nested async operations. A tool call that involves an external API can hang indefinitely. Propagating context cancellation correctly through tool execution, through LLM calls, through the loop itself, requires careful attention to Go's context propagation patterns. Getting this wrong means hung goroutines in production. Frameworks typically handle this well.
Streaming with tool calls. Streaming and tool calls interact in a non-obvious way. When the LLM is streaming a response that contains tool calls, the tool call data arrives in chunks that must be reassembled before the executor can process them. Handling partial tool call JSON across multiple stream chunks is fiddly to get right. This is one of the areas where framework code has had more time to mature than my implementation.
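The reassembly step looks roughly like this. The delta shape mirrors OpenAI-style streaming payloads, where each chunk carries an index plus fragments of the name and arguments; the field names here are illustrative and should be matched to your provider's actual payload:

```go
package main

import "fmt"

// ToolCallDelta is one streamed fragment of a tool call.
type ToolCallDelta struct {
	Index     int
	ID        string
	Name      string
	Arguments string // partial JSON, concatenated across chunks
}

type ToolCall struct {
	ID        string
	Name      string
	Arguments string
}

// assemble merges streamed deltas into complete tool calls, keyed by index.
// Only once a call's arguments form complete JSON can the executor run it.
func assemble(deltas []ToolCallDelta) []ToolCall {
	byIndex := map[int]*ToolCall{}
	maxIdx := -1
	for _, d := range deltas {
		tc, ok := byIndex[d.Index]
		if !ok {
			tc = &ToolCall{}
			byIndex[d.Index] = tc
		}
		if d.ID != "" {
			tc.ID = d.ID
		}
		if d.Name != "" {
			tc.Name = d.Name
		}
		tc.Arguments += d.Arguments // fragments arrive in order per index
		if d.Index > maxIdx {
			maxIdx = d.Index
		}
	}
	out := make([]ToolCall, maxIdx+1)
	for i, tc := range byIndex {
		out[i] = *tc
	}
	return out
}

func main() {
	calls := assemble([]ToolCallDelta{
		{Index: 0, ID: "call_1", Name: "get_weather", Arguments: `{"ci`},
		{Index: 0, Arguments: `ty":"Oslo"}`},
	})
	fmt.Println(calls[0].Arguments) // {"city":"Oslo"}
}
```

The subtle part is that arguments can split anywhere, including mid-key, so nothing can be parsed until the stream signals the call is complete.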
Observability and tracing. Frameworks like LangChain have dedicated tracing tools for visualizing the full loop, each tool call, and the LLM inputs and outputs at each step. Building equivalent observability from scratch requires significant additional instrumentation work.
When to Use a Framework vs. Roll Your Own
Use a framework when: you need to ship quickly, your team is more comfortable in Python, you need observability tooling that frameworks provide, or your use case fits within the patterns the framework was designed for.
Roll your own when: you need tight latency control, you are building in a language the frameworks do not support well, you need behavior that the framework abstracts away, or you are building something where understanding the internals is itself the goal.
The best outcome from building this from scratch was not the finished system. It was the clear mental model of what an agentic loop actually is: a simple state machine over a message history, with tool calls as a mechanism for the LLM to gather information it needs before producing a final response. That clarity makes every framework I use now more comprehensible, and every debugging session faster.
Part of an ongoing series on building AI systems from scratch in Go.