Building a Provider-Agnostic LLM Interface in Go: Nine Providers, One Abstraction

Anthony G. Tellez · 9 min read

Tags: Go · LLM · API Design · Claude · OpenAI · SDK · Abstraction · AI Engineering · Anthropic · Gemini · Streaming

Provider lock-in in LLM applications is a real cost that compounds over time. You build your application against OpenAI's SDK, then a newer model from Anthropic consistently outperforms it on your specific task, and suddenly you are looking at a refactor that touches every piece of code that calls an LLM. I built a personal AI engineering project to solve this problem properly: a provider-agnostic interface in Go that currently supports nine providers. The design decisions along the way taught me a lot about where abstraction helps and where it creates more problems than it solves.

The Provider Landscape and Why Lock-In Happens

When you start an LLM project, you typically reach for whatever SDK your language community has best support for. In Go, that usually means the OpenAI Go client or Anthropic's SDK. The first integration works fine. Then you want to try a different provider. You import their SDK, and now you have two entirely different mental models in your codebase: different message structs, different streaming patterns, different error handling, different tool call formats. The integration cost for every additional provider is not additive; it is multiplicative, because every place in your application that calls an LLM now has to know which provider it is talking to.

The solution is a common interface, but designing that interface well requires understanding what these providers actually have in common and where they genuinely differ.

The Interface

The core abstraction ended up being four methods:

// LLMBackend is the provider-agnostic interface every backend implements.
type LLMBackend interface {
    // SendMessageStream sends the conversation and streams the response
    // through callback as tokens arrive.
    SendMessageStream(ctx context.Context, messages []ChatMessage,
        tools []SkillDefinition, callback StreamCallback) error
    // SendMessageSync sends the conversation and blocks until the full
    // response is available.
    SendMessageSync(ctx context.Context, messages []ChatMessage,
        tools []SkillDefinition) (*ChatCompletionResult, error)
    // SetSystemPrompt sets the system instruction for subsequent calls.
    SetSystemPrompt(prompt string)
    // Close releases any resources held by the backend.
    Close() error
}

This is deliberately minimal. The interface does not include methods for listing models, checking rate limits, managing embeddings, or any of the other capabilities these providers offer. Those are provider-specific operations that belong in provider-specific code. The interface only captures what every LLM in this system needs to do: accept a conversation history with optional tools, return a response, and support streaming.

The ChatMessage type is a simple role/content struct. The SkillDefinition type is deliberately kept in OpenAI JSON schema format rather than a provider-neutral tool schema format; I explain that decision in detail below.
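The shapes involved are roughly the following. This is an illustrative sketch, not the project's actual definitions; field names are assumptions:

```go
package main

// ChatMessage is a minimal role/content pair.
type ChatMessage struct {
	Role    string `json:"role"`    // "system", "user", "assistant", or "tool"
	Content string `json:"content"`
}

// SkillDefinition mirrors the OpenAI function-tool shape: a name,
// a description, and a raw JSON Schema object for the parameters.
type SkillDefinition struct {
	Name        string         `json:"name"`
	Description string         `json:"description"`
	Parameters  map[string]any `json:"parameters"` // JSON Schema, kept as-is
}
```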

The Three-Backend Architecture

Nine providers, three backend implementations. This is the key structural insight.

OpenAI-compatible SDK handles OpenAI, Anthropic (via their OpenAI compatibility endpoint), Venice.ai, DigitalOcean AI, ElevenLabs, and OpenRouter. The OpenAI-compatible API has become the de facto standard for LLM inference. Almost every provider either natively supports it or offers a compatibility layer. This backend covers six providers with a single implementation.

Native Google GenAI SDK handles Gemini and Vertex AI. Google's API has enough structural differences, particularly around how it handles system instructions, tool definitions, and streaming, that trying to shim it through the OpenAI-compatible path would have required more adapter code than just writing a native backend. The Google GenAI SDK also unlocks Vertex AI-specific features like grounding and enterprise billing that are not available through compatibility endpoints.

Native xAI SDK handles Grok with Collections and RAG support. xAI's SDK includes capabilities specific to their platform that are not available through their OpenAI-compatible endpoint, particularly around their Collections feature for retrieval-augmented generation. This warranted a native backend.

The lesson here: do not fight provider-specific capabilities. Write a thin native backend for providers where their unique features matter to your use case, and use the compatibility layer for everything else.

URL-Based Provider Detection

Rather than requiring callers to specify which backend to use via configuration flags, I implemented a ProviderCapabilities registry that detects the provider from the API endpoint URL.

// Detection happens at initialization time
// https://api.openai.com/v1                    -> OpenAI-compatible backend
// https://api.anthropic.com/v1                 -> OpenAI-compatible backend (via compat)
// https://generativelanguage.googleapis.com    -> Google GenAI backend
// https://api.x.ai/v1                          -> xAI backend

The practical benefit: you can switch providers by changing one environment variable (the API endpoint URL) rather than changing code. The tradeoff is that URL-based detection is a heuristic, not a contract. If a provider changes their endpoint structure or you are running a local proxy, the detection can fail or misfire.

For this project, the heuristic works reliably enough. In a production system where correctness matters more than convenience, I would add an explicit provider override field that takes precedence over URL detection.
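A minimal sketch of what such detection can look like, including the explicit override suggested above. The override strings and the substring matching are assumptions; the real registry may key on full URL patterns:

```go
package main

import "strings"

type BackendKind int

const (
	BackendOpenAICompat BackendKind = iota
	BackendGoogleGenAI
	BackendXAI
)

// detectBackend maps an API endpoint URL to a backend implementation.
// An explicit override, if set, takes precedence over the URL heuristic,
// which matters when running behind a local proxy.
func detectBackend(endpoint, override string) BackendKind {
	switch override {
	case "google":
		return BackendGoogleGenAI
	case "xai":
		return BackendXAI
	case "openai-compat":
		return BackendOpenAICompat
	}
	switch {
	case strings.Contains(endpoint, "generativelanguage.googleapis.com"):
		return BackendGoogleGenAI
	case strings.Contains(endpoint, "api.x.ai"):
		return BackendXAI
	default:
		// Everything else is assumed to speak the OpenAI-compatible API.
		return BackendOpenAICompat
	}
}
```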

The Streaming Problem

This is where the real engineering work was. Server-sent events streaming for LLM responses sounds straightforward: the provider sends tokens as they are generated and your client receives and displays them progressively. The reality is significantly messier.

The specific problem I encountered: several providers that advertise streaming support do not actually stream in the way the term implies. They batch 80% or more of the tokens into the first SSE chunk, which arrives within 500 milliseconds of the request. From a technical standpoint, they are streaming because the connection stays open and events arrive over time. From a user experience standpoint, the screen is blank for half a second and then most of the response appears at once with a small trickle at the end.

I call this "dump detection." The detection algorithm measures the ratio of tokens in the first chunk to total tokens and the timing of that first chunk. If more than 70% of the content arrives in the first event within 500ms, the provider is dumping rather than streaming.

The solution for dump providers: intercept the content, buffer it, and simulate character-by-character output at a configurable speed. The default is 40 characters per second, which feels natural for reading. This is a UX shim, not a functional one. The content is complete before display begins, but it reads as if it is streaming in real time.

This felt slightly wrong to implement, but user testing confirmed it was the right call. Showing the response all at once after a half-second blank screen reads as slower and more jarring than character-by-character output, even though the total time to first character is longer with the simulated approach.
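The pacing shim is simple once the content is buffered. A minimal sketch, assuming a per-chunk callback; the real implementation likely also honors context cancellation:

```go
package main

import "time"

// simulateStream replays already-buffered content through the streaming
// callback at a fixed character rate (default 40 chars/sec), so a dumped
// response still reads as a live stream.
func simulateStream(content string, charsPerSec int, emit func(chunk string)) {
	if charsPerSec <= 0 {
		charsPerSec = 40
	}
	delay := time.Second / time.Duration(charsPerSec)
	for _, r := range content {
		emit(string(r)) // one rune at a time, preserving multi-byte characters
		time.Sleep(delay)
	}
}
```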

Tool Calling Compatibility

Tool calling (function calling) is where provider compatibility breaks down most visibly. The core concept is consistent: you provide a schema for functions the LLM can call, the LLM returns a structured tool call request, you execute the function, you return the result.

The incompatibilities are in the details. Anthropic's native tool format uses a different schema structure than OpenAI's. Google's function calling uses proto-compatible types. The response format for tool calls differs across providers. Error handling when tool calls fail varies.

My approach: keep all tool definitions in OpenAI JSON schema format throughout the application. The backend implementations are responsible for translating from OpenAI format to whatever the provider natively expects. For providers using the OpenAI-compatible endpoint, no translation is needed. For Google's native backend, I maintain a translator function that converts OpenAI tool schemas to Google's FunctionDeclaration format.

Creating a provider-neutral tool schema format would require every caller to use the neutral format and every backend to translate from it. Adding one translation layer per provider does not reduce complexity; it moves complexity from callers to backends while adding a new translation format that nobody outside this codebase understands.
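The translator for the Google backend is conceptually a field-by-field mapping. Here is a sketch using local stand-in types; the real Google GenAI SDK types (FunctionDeclaration, Schema) are richer and proto-backed, and a real translator recurses into nested property schemas:

```go
package main

// openAITool is a stand-in for an OpenAI-format tool definition.
type openAITool struct {
	Name        string
	Description string
	Parameters  map[string]any // OpenAI-style JSON Schema object
}

// googleDeclaration is a stand-in for Google's FunctionDeclaration shape.
type googleDeclaration struct {
	Name        string
	Description string
	Parameters  map[string]any
}

// mapSchemaType maps a JSON Schema primitive type name onto the
// uppercase proto-style type names Google's API uses.
func mapSchemaType(jsonType string) string {
	switch jsonType {
	case "string":
		return "STRING"
	case "number":
		return "NUMBER"
	case "integer":
		return "INTEGER"
	case "boolean":
		return "BOOLEAN"
	case "array":
		return "ARRAY"
	default:
		return "OBJECT"
	}
}

// toGoogleDeclaration converts one OpenAI-format tool definition,
// rewriting the top-level "type" field along the way.
func toGoogleDeclaration(t openAITool) googleDeclaration {
	params := make(map[string]any, len(t.Parameters))
	for k, v := range t.Parameters {
		if k == "type" {
			if s, ok := v.(string); ok {
				params[k] = mapSchemaType(s)
				continue
			}
		}
		params[k] = v
	}
	return googleDeclaration{Name: t.Name, Description: t.Description, Parameters: params}
}
```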

Context Window Management

Context window management lives at the abstraction layer because it is provider-agnostic behavior. Every conversation accumulates tokens. Every provider has a limit. Exceeding that limit causes hard failures.

The approach: maintain a running token count in the conversation state. When the count approaches the model's context limit (configurable threshold, default 80%), trigger a summarization pass before the next message. The summarizer calls the LLM itself to produce a compressed summary of conversation history, then replaces the history with the summary as a single assistant message.

type ConversationState struct {
    Messages        []ChatMessage
    TotalTokens     int
    EstimatedCost   float64
    SummaryCount    int
}

Token counting at the abstraction layer means using approximate counting (word-based heuristics) rather than exact tokenizer counts. This is acceptable for threshold triggering because being 10% off on token counts does not matter when your trigger threshold has a 20% safety margin. For cost estimation, approximate counting is also fine since these are estimates anyway.
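A word-based estimator and the 80% trigger check might look like this. The roughly-1.33-tokens-per-word rule of thumb is an assumption, not the project's exact heuristic:

```go
package main

import "strings"

// estimateTokens approximates a token count from word count using a
// common rule of thumb (~0.75 words per token). Exact tokenizer counts
// are unnecessary because the trigger leaves a 20% safety margin.
func estimateTokens(text string) int {
	words := len(strings.Fields(text))
	return words + words/3 // ~1.33 tokens per word
}

// needsSummarization reports whether the running total has crossed the
// configurable fraction (default 0.8) of the model's context limit.
func needsSummarization(totalTokens, contextLimit int, threshold float64) bool {
	if threshold <= 0 {
		threshold = 0.8
	}
	return float64(totalTokens) >= threshold*float64(contextLimit)
}
```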

Leaky Abstraction: Where I Accepted It

The abstraction is not watertight, and in a few specific cases I chose not to make it so.

Model names are provider-specific strings. The abstraction does not try to map "gpt-4o" to "claude-3-5-sonnet-20241022" to a provider-neutral "best-available" model. Callers specify model names directly, and those names are provider-specific. Adding a model mapping layer would be fragile because providers rename and deprecate models frequently, and callers usually have a reason for choosing a specific model.

Tool schemas stay in OpenAI format rather than a neutral format, as described above. If you use a provider whose native tool format differs significantly from OpenAI's and the translation layer has edge cases, you might encounter issues with complex nested schemas. I documented this as a known limitation rather than trying to solve every edge case.

Provider-specific errors surface up through the abstraction as generic errors. If Anthropic returns a specific error code for content policy violations and your caller needs to handle that case differently than a rate limit error, the current abstraction loses that signal. For my use case, all errors are handled the same way (log, retry, or surface to user), so this is acceptable.

The principle: accept leaky abstraction where the cost of plugging the leak exceeds the cost of the leak itself. Abstracting away model names would cost more in maintenance than it would save in portability. Keeping OpenAI tool format saves translation complexity at the cost of potential edge cases with non-OpenAI providers.

What This Buys You

With this abstraction in place, switching providers in the application requires changing one configuration pair: the API endpoint URL and key. The rest of the application is unaffected. Benchmarking different providers for a specific task means changing configuration per test run without touching application logic. New providers that support the OpenAI-compatible API are zero-code additions.

The total backend implementation code is around 1,200 lines across three files. The application code that uses the interface does not know or care which backend is running beneath it. That separation has been worth the initial investment.


Part of an ongoing series on building AI systems from scratch in Go.