April 23, 2026

10 min read

Prompt Caching: The Developer's Guide to Cutting LLM Costs by Up to 90%

What Prompt Caching Actually Is

Every time you send a request to an LLM, the model has to process every token in your prompt from scratch. For most applications, a huge chunk of that prompt is identical across requests: the system message, tool definitions, a long document you're analyzing, retrieved context from a RAG pipeline. You're paying full price to recompute something the model just finished computing two seconds ago.

Prompt caching solves this by reusing the intermediate computations from a previous request rather than redoing the work. Under the hood, what's being cached is the KV (key-value) cache from the attention mechanism. As a transformer processes tokens, it computes key and value vectors for each token at every attention layer, and those cached tensors are exactly what gets reused when a prompt prefix matches a cached entry. The model picks up mid-computation instead of starting over.

The savings are real. Anthropic discounts cached reads by 90%, roughly 10x cheaper than regular input tokens, and OpenAI applies a 50% discount to cached prefix tokens. On latency, AWS reports up to an 85% reduction in response time when the cache hits. For an AI agent making dozens of tool calls per session, each one re-sending a 4,000-token system prompt, the math gets compelling fast.
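A quick sketch of the arithmetic makes the point concrete. The pricing model below follows the Anthropic-style structure (a write premium on the first call, a discounted read rate afterward); the dollar price is an illustrative parameter, not a current list price:

```python
def session_input_cost(prompt_tokens: int, calls: int, price_per_token: float,
                       write_premium: float = 0.25, read_discount: float = 0.90) -> dict:
    """Compare input cost for a repeated prompt prefix with and without caching.

    Illustrative model: the first call pays the cache-write premium,
    every later call pays the discounted cached-read rate.
    """
    uncached = calls * prompt_tokens * price_per_token
    cached = (prompt_tokens * price_per_token * (1 + write_premium)
              + (calls - 1) * prompt_tokens * price_per_token * (1 - read_discount))
    return {"uncached": uncached, "cached": cached,
            "savings_pct": round(100 * (1 - cached / uncached), 1)}

# 50 calls re-sending a 4,000-token prefix at an assumed $3 per million input tokens
print(session_input_cost(4_000, 50, 3e-6))
```

For that hypothetical session the cached path comes out well under a fifth of the uncached cost, which is why agentic workloads are the canonical caching win.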

How Each Provider Implements It

Anthropic (Claude)

Anthropic's implementation uses explicit cache control markers. You add "cache_control": {"type": "ephemeral"} to the content blocks you want cached. The minimum cacheable block is 1,024 tokens, and you can define up to four cache breakpoints per request. Anthropic charges a 25% premium on the first cache write (the compute to store it), then discounts cached reads at 90% off regular input pricing.
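In request terms, the marker sits on the content block itself. Here is a minimal payload sketch; the model name and system text are placeholders, so check the Anthropic docs for the full schema:

```python
import json

# Sketch of a Messages API request body with one cache breakpoint after the
# large static system block. Everything up to and including the marked block
# becomes the cacheable prefix.
request_body = {
    "model": "claude-sonnet-4-5",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "<several thousand tokens of static instructions>",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "What changed in the Q3 report?"}
    ],
}
print(json.dumps(request_body, indent=2)[:200])
```

The dynamic user turn sits after the breakpoint, so it never disturbs the cached prefix.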

Anthropic also introduced automatic prompt caching in early 2026, where the API automatically identifies and caches long, stable prefixes without requiring explicit markers. This helps significantly for teams already in production who haven't restructured their prompts for caching. Official documentation: Anthropic Prompt Caching Docs.

OpenAI

OpenAI's approach is automatic and transparent. No special markers or API changes are needed: once a prompt prefix reaches 1,024 tokens, OpenAI's infrastructure automatically checks for a cached prefix and applies a 50% discount on those tokens if found. You can see the savings reflected in the usage object of the API response, under prompt_tokens_details.cached_tokens.

The tradeoff is less control. You cannot force a cache write, cannot inspect what's cached, and the cache is per-model, per-organization. Cache entries are evicted after periods of inactivity, with typical lifetimes of five to ten minutes. For long-running agent sessions, a cache that was warm at the start of a session may be cold by turn 20. Details: OpenAI Prompt Caching Guide.

Google Gemini

Gemini's implementation, called Context Caching, works differently. You explicitly create a cached content object via the API with a specified time-to-live, then reference that cache in subsequent requests. The minimum size is 32,768 tokens, which makes it better suited for large stable documents than typical system prompts. You pay a per-hour storage cost for the cached content, plus a reduced per-token rate when it's used.

This model works well for analyzing a large codebase across many requests, processing a long legal document in multiple passes, or running repeated queries against a fixed dataset. Less useful for short system prompts or conversational agents. Google's documentation: Gemini Context Caching.
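Because Gemini charges for cache storage by the hour, it's worth checking the break-even point before committing. A back-of-the-envelope sketch, where every price is an illustrative parameter rather than a real rate card:

```python
def context_cache_worth_it(cached_tokens: int, requests_per_hour: float,
                           input_price: float, cached_price: float,
                           storage_price_per_token_hour: float) -> bool:
    """True if hourly savings from cached reads exceed hourly storage cost.

    Illustrative model only: ignores cache creation cost and assumes a
    steady request rate over the cache's lifetime.
    """
    hourly_savings = requests_per_hour * cached_tokens * (input_price - cached_price)
    hourly_storage = cached_tokens * storage_price_per_token_hour
    return hourly_savings > hourly_storage

# A 100k-token document queried 20 times an hour, with made-up per-token prices
print(context_cache_worth_it(100_000, 20, 1.25e-6, 0.31e-6, 1e-6))
```

The same function with a trickle of traffic, say one request every two hours, flips to False, which matches the intuition that explicit context caching only pays off under sustained access.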

Why Your Cache Hit Rate Is Probably Zero

The most common failure mode is prompt structure. For caching to work, the prefix of your prompt must be identical to a previously cached prefix. That sounds obvious, but a lot of real application code defeats this without anyone realizing it.

Consider a system prompt that includes the current date and time. You're probably doing this so the model knows when today is. Every request now has a unique prefix, and your cache hit rate is exactly 0%. Same problem if you inject the user's name, a session ID, or any dynamic value into the system prompt or early in the context. Anything that varies between requests will break cache reuse for everything that follows it in the prompt.

The fix is to structure your prompts so all static content comes first, and all dynamic content comes last. Static system instructions, tool definitions, and large reference documents should occupy the beginning of the context. User-specific details, conversation history, and the actual user message go at the end. This is the opposite of how many prompts are intuitively written, and changing it is often the single highest-leverage optimization available.
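One way to enforce the static-first ordering is to make prompt assembly a function whose signature separates the stable parts from the per-request parts. Names here are illustrative:

```python
def build_messages(static_system: str, tool_docs: str, reference_docs: str,
                   user_context: str, user_message: str) -> list[dict]:
    """Assemble a chat request with all cache-stable content first.

    The system message (instructions + tool docs + reference material) is
    identical across requests, so it forms a reusable prefix. Anything
    user-specific lives in the final user turn, after the cacheable boundary.
    """
    system = "\n\n".join([static_system, tool_docs, reference_docs])
    return [
        {"role": "system", "content": system},                             # stable prefix
        {"role": "user", "content": f"{user_context}\n\n{user_message}"},  # dynamic tail
    ]

a = build_messages("rules", "tools", "docs", "user=alice", "hello")
b = build_messages("rules", "tools", "docs", "user=bob", "hi there")
assert a[0] == b[0]  # the shared prefix is byte-identical across users
```

Making the split explicit in the function signature also makes it hard for a future change to quietly inject a timestamp or session ID into the prefix.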

A second common issue appears in agents. Simply appending each tool result to the end of the conversation is actually cache-friendly: the existing prefix stays intact and only the new tokens are processed. The trouble starts when context grows unbounded and the framework responds by truncating, reordering, or rewriting earlier turns on every request, which shifts the prefix and guarantees a miss each time. If you need to compact history, do it infrequently and at stable boundaries, for example by summarizing a batch of old turns in one step and then leaving the summary untouched, so you pay a single cache miss rather than one per turn.
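One compaction sketch that keeps cache misses rare: leave recent turns intact and, only when history exceeds a threshold, collapse the oldest turns into a single summary block in one batch. The summarizer is stubbed out here; in practice it would be an LLM call or a heuristic:

```python
def compact_history(messages: list[dict], max_messages: int = 20,
                    keep_recent: int = 8) -> list[dict]:
    """Collapse old turns into one summary block once history gets long.

    Compaction happens in a single batch at a stable boundary, so the
    prefix changes once (one cache miss) instead of shifting every turn.
    """
    if len(messages) <= max_messages:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # stub: replace with an LLM or heuristic summary
    return [{"role": "user", "content": f"[Summary of earlier turns]\n{summary}"}] + recent

def summarize(turns: list[dict]) -> str:
    return f"{len(turns)} earlier messages omitted."

history = [{"role": "user", "content": f"turn {i}"} for i in range(30)]
assert len(compact_history(history)) == 9  # 1 summary block + 8 recent turns
```

Between compactions the summary block is part of the stable prefix, so every intermediate turn still benefits from caching.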

Practical Patterns That Work

RAG with a Stable Document Corpus

If you retrieve the same documents repeatedly across requests (internal documentation, product specs, a reference knowledge base), those documents are excellent cache candidates. Place them in a fixed position in the prompt, before any user-specific content. For Anthropic, wrap the document block with a cache control marker. For OpenAI, ensure the documents always appear at the same position so the prefix stays consistent.

Retrieval order matters here. If your RAG pipeline retrieves documents in different orders depending on the query, the prefix changes and you lose the cache. Consider sorting retrieved documents by a stable key (document ID, for example) before assembling the prompt. This one change can dramatically improve cache hit rates in production RAG systems.
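A small change in the assembly step makes retrieval order deterministic. Field names here are illustrative:

```python
def assemble_context(retrieved: list[dict]) -> str:
    """Join retrieved documents in a stable order so the prompt prefix is
    identical whenever the same set of documents comes back, regardless
    of the retriever's ranking order on a given query."""
    ordered = sorted(retrieved, key=lambda d: d["doc_id"])
    return "\n\n".join(f"[{d['doc_id']}]\n{d['text']}" for d in ordered)

run1 = assemble_context([{"doc_id": "b", "text": "spec"}, {"doc_id": "a", "text": "faq"}])
run2 = assemble_context([{"doc_id": "a", "text": "faq"}, {"doc_id": "b", "text": "spec"}])
assert run1 == run2  # same docs, different retrieval order, identical prefix
```

The tradeoff is that sorting discards relevance ranking within the context block, so it's worth checking that answer quality holds up before shipping the change.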

Tool Definitions in Agentic Systems

Tool definitions are high-value caching targets. A moderately complex AI agent might define 15 to 20 tools, which can easily consume 2,000 to 4,000 tokens. Since tool definitions almost never change between requests in a session, they should be positioned early in the prompt and marked for caching.

In multi-agent setups where agents share a common tool registry, caching the tool definitions once and reusing across all agent calls compounds the savings. In a system making 50 tool calls per user session with a 3,000-token tool definition block, caching alone could reduce input costs by 30% or more on that component.

This is one of the first optimizations applied when deploying agents in production at Genta, especially in high-frequency agentic loops where LLM calls repeat with largely identical context on every turn.

Long System Prompts and Personas

Customer-facing AI applications often have extensive system prompts: brand voice guidelines, compliance instructions, domain-specific rules, worked examples. These can run to several thousand tokens. If the system prompt is the same across all users, it's a perfect cache candidate.

The failure mode here is personalization. If you add user-specific context to the system prompt, you've broken the cache for every user. Move personalization to a user message or a late-stage turn in the conversation instead.

Cache Lifetimes and What Expires When

The documentation doesn't stress this enough: caches expire. Anthropic's ephemeral caches have a five-minute time-to-live that refreshes on each cache hit, with an optional one-hour TTL available at a higher write cost. OpenAI's automatic caches have similarly short lifetimes. Google's context cache is more explicit because you set the TTL yourself and pay for storage time.

For latency-sensitive applications, the first request in a cold session always pays full price. Some teams run lightweight warming requests at session start if they know a user is about to begin an expensive multi-turn interaction. Others run periodic background requests against their most common prompt configurations to keep caches warm between real user sessions.

Cache invalidation also happens automatically when the model version changes. Anthropic and OpenAI both note that prompt caches are model-specific. Any model version upgrade clears the cache, so plan for a temporary cost spike whenever you upgrade in production.

How to Tell If It's Actually Working

OpenAI surfaces cache hits directly in the API response. In the usage object, look for prompt_tokens_details.cached_tokens. A value close to your full prompt token count means the cache is working. A value of zero on repeated, identical requests means it is not, and the prompt structure needs investigation.

Anthropic's response includes a cache_read_input_tokens field. If you're sending requests that should be hitting the cache and this field is consistently zero, check: is your prompt prefix changing between requests? Are you falling below the 1,024-token minimum? Is there a timing gap between requests that causes cache expiry?

Track your average cached token ratio over time as a production metric. Even a 40% cache hit rate on input tokens significantly changes the cost profile of a high-volume LLM application. Teams that haven't measured this often discover that structural changes to their prompts would have saved meaningful money from the start.
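A monitoring hook can compute that ratio directly from the usage payload. The field names below follow the response shapes described above; treat this as a sketch and verify against your SDK's actual response objects:

```python
def cached_ratio(usage: dict) -> float:
    """Fraction of input tokens served from cache, for either provider's
    usage shape. Returns 0.0 when no cache fields are present."""
    # OpenAI shape: prompt_tokens plus prompt_tokens_details.cached_tokens
    if "prompt_tokens" in usage:
        cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
        total = usage["prompt_tokens"]
    # Anthropic shape: cached reads and writes are counted in separate fields
    else:
        cached = usage.get("cache_read_input_tokens", 0)
        total = (cached + usage.get("cache_creation_input_tokens", 0)
                 + usage.get("input_tokens", 0))
    return cached / total if total else 0.0

openai_usage = {"prompt_tokens": 5000, "prompt_tokens_details": {"cached_tokens": 4096}}
anthropic_usage = {"input_tokens": 200, "cache_read_input_tokens": 3800}
assert round(cached_ratio(openai_usage), 2) == 0.82
assert cached_ratio(anthropic_usage) == 0.95
```

Logging this per request and aggregating it as a dashboard metric turns "is caching working?" from a guess into a number.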

What Prompt Caching Cannot Do

Prompt caching only covers input tokens. Output tokens are always billed at full price, so applications that generate long outputs won't see the same cost leverage. The economics are strongest when input tokens dominate output tokens, which is common in analysis, classification, and structured extraction tasks.

Caching also doesn't help with the first request in any session, or after cache expiry. For very low-traffic applications making only a few requests per hour, the cache may never be warm when you need it. The strategy matters most for applications with consistent, repetitive access patterns.

Finally, caching is not semantic. It's byte-for-byte prefix matching. A single character difference in the cached prefix results in a cache miss. You cannot cache similar prompts, only identical prefixes. Templating systems and prompt construction utilities need to produce deterministic output for caching to be reliable in production. Whitespace, token ordering, field ordering in JSON tool schemas: all of it counts.
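For JSON tool schemas in particular, deterministic serialization is cheap insurance. In Python, for example, dicts preserve insertion order, so two logically identical schemas built in different order serialize to different bytes unless you canonicalize:

```python
import json

schema_a = {"name": "search",
            "parameters": {"type": "object",
                           "properties": {"q": {"type": "string"}}}}
# The same schema with keys inserted in a different order
schema_b = {"parameters": {"properties": {"q": {"type": "string"}},
                           "type": "object"},
            "name": "search"}

def canonical(schema: dict) -> str:
    """Serialize with sorted keys and fixed separators for byte-stable output."""
    return json.dumps(schema, sort_keys=True, separators=(",", ":"))

assert json.dumps(schema_a) != json.dumps(schema_b)  # naive dumps differ
assert canonical(schema_a) == canonical(schema_b)    # canonical form is cache-safe
```

Apply the same discipline to any templating layer that assembles the prefix: identical inputs must produce identical bytes.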

For teams building production AI systems, combining prompt caching with other cost control strategies (model routing, structured output parsing, batching) is where the real leverage is. These optimizations compound. A system with good cache utilization, a sensible model routing policy, and efficient context management can cost 60 to 70% less than a naive implementation using the same underlying models. That's before any change to what the system actually does. If you want to see how this fits into a broader production architecture, the post on context engineering for AI agents covers the surrounding patterns in depth.

We’re Here to Help

Ready to transform your operations? We're here to help. Contact us today to learn more about our innovative solutions and expert services.
