
DigitalOcean (GradientAI) Writing: Prompt Caching for Anthropic and OpenAI Models: Building Cost-Efficient AI Systems


digitalocean.com/digitalocean.com/blog/prompt-caching-with-digital-ocean


By Satyam Namdeo

Updated: March 17, 2026 (9 min read)


Large Language Models (LLMs) have become a foundational component of modern AI applications, from developer copilots and documentation assistants to advanced troubleshooting tools. As these applications scale, one challenge quickly becomes apparent: token costs can grow rapidly when large prompts are repeatedly sent to the model.

A common architecture for production AI systems includes long system instructions, tool schemas, retrieved knowledge base documents, and conversation history. These components can easily add thousands of tokens per request. When applications handle thousands or millions of requests per day, repeatedly processing the same static prompt content becomes expensive.

To address this problem, prompt caching has emerged as an essential optimization technique supported by major model providers such as Anthropic and OpenAI.

Prompt caching allows repeated prompt segments to be reused across requests, significantly reducing both latency and cost. In this article, we will explore:

What prompt caching is and how it works

How Anthropic and OpenAI implement caching

The billing implications and cost advantages

Real-world use cases

A realistic production architecture that can reduce token costs by 70–90%

We will also show how prompt caching can be implemented when using models via DigitalOcean.

What is Prompt Caching?

Prompt caching is a mechanism where large portions of a prompt that remain identical across requests are stored and reused instead of being reprocessed every time.

Since content such as system instructions, tool schemas, guardrails, and documentation rarely changes, reprocessing it on every request wastes computation and increases token costs. Prompt caching solves this by:

Storing previously processed prompt segments.

Reusing those segments when identical requests appear again.

Charging a much lower price for cached tokens.

This optimization is especially powerful in production systems where large static prompts are combined with small dynamic queries.

How Prompt Caching Works

At a high level, prompt caching works by identifying prefix tokens that remain identical across multiple requests.

If a request begins with the same sequence of tokens as a previous request, the model provider can reuse the previously processed representation rather than recomputing it.

The workflow looks like this:

On the initial request, the full prompt is processed and the static segments are stored in the cache.

On subsequent requests:

The provider detects the identical prefix tokens

Cached tokens are reused

Only the new tokens are processed

This approach reduces compute work significantly because processing a large prompt is one of the most expensive parts of LLM inference.
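The prefix-matching idea above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: prompts are plain lists of strings standing in for tokens, and the names `shared_prefix_len` and `ToyPrefixCache` are hypothetical. Real providers apply their own matching rules on the server side, which this does not attempt to reproduce.

```python
from typing import List, Tuple


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Return how many leading tokens two token sequences have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Remembers earlier prompts and reports how much of a new prompt is reusable."""

    def __init__(self) -> None:
        self._seen: List[List[str]] = []

    def process(self, tokens: List[str]) -> Tuple[int, int]:
        """Return (reused_tokens, newly_processed_tokens) for this prompt."""
        # The reusable part is the longest prefix shared with any earlier prompt.
        reused = max(
            (shared_prefix_len(tokens, old) for old in self._seen),
            default=0,
        )
        self._seen.append(list(tokens))
        return reused, len(tokens) - reused


# Stand-ins for the static prompt segments (not real tokens).
static = ["SYS", "TOOLS", "DOC"]
cache = ToyPrefixCache()

first = cache.process(static + ["question-1"])      # nothing cached yet
second = cache.process(static + ["question-2"])     # static prefix is reused
reordered = cache.process(["question-3"] + static)  # prefix differs, no reuse
```

The last call shows why ordering matters in this model: the same static content placed after the dynamic part no longer forms a shared prefix, so nothing is reused.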

Advantages of Prompt Caching

Prompt caching provides several important benefits for production AI systems.

1. Major Cost Reduction

Prompt caching can significantly reduce the cost of running LLM applications because tokens reused from earlier requests are billed at a much lower rate than newly processed tokens. For example, with GPT-5, standard input tokens cost about $1.25 per million tokens, while cached input tokens cost only $0.125 per million tokens, making cached tokens around 10× cheaper.

2. Reduced Latency

Since cached prompt segments do not need to be recomputed, the model can process requests faster. This improves user experience in interactive applications such as chat assistants, coding copilots, and documentation search tools.

3. Improved Scalability

Applications handling large traffic volumes benefit significantly because caching prevents redundant computation across thousands of requests.

This makes AI systems more economically viable at scale.
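To make the economics concrete, the sketch below computes a blended input price as a function of how much of the input is served from cache. It uses the GPT-5 figures quoted under "Major Cost Reduction" ($1.25 and $0.125 per million tokens). The function name `effective_input_price` is hypothetical, and the model deliberately ignores anything beyond those two rates.

```python
# USD per million input tokens, taken from the GPT-5 figures quoted in this article.
STANDARD_PER_M = 1.25
CACHED_PER_M = 0.125


def effective_input_price(cached_fraction: float) -> float:
    """Blended USD price per million input tokens for a given cached share.

    cached_fraction is the share of input tokens billed at the cached rate,
    between 0.0 (no cache hits) and 1.0 (everything cached).
    """
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    uncached_fraction = 1.0 - cached_fraction
    return uncached_fraction * STANDARD_PER_M + cached_fraction * CACHED_PER_M


no_hits = effective_input_price(0.0)
half_hits = effective_input_price(0.5)
all_hits = effective_input_price(1.0)
```

The blended price falls linearly with the cached share, so the savings grow in direct proportion to how much of each request is static.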

Common Use Cases Where Prompt Caching Helps

Prompt caching is most effective when large prompt segments remain identical across requests. AI applications that fit this pattern include ChatGPT, Cursor, Perplexity AI, and Notion AI.

Retrieval-Augmented Generation (RAG)

RAG systems retrieve documents and inject them into prompts. If the retrieved documents are reused frequently, caching can significantly reduce token costs.

Typical examples include knowledge base assistants, documentation search, and internal support chatbots.

AI Troubleshooting Systems

Enterprise support assistants often include system instructions, operational playbooks, and technical documentation.

These prompts can exceed several thousand tokens and are ideal for caching.

A Realistic Production Prompt Caching Architecture

A common architecture used in production AI systems organizes prompts into static and dynamic sections.

The key idea is simple: Place all large, static prompt components at the beginning of the prompt. This creates a large prefix that can be cached.

Cached Prefix

The following prompt components typically remain identical across requests:

System prompt (large instructions)

Tool schemas

RAG documents

Dynamic Portion

The following components change per request:

User query

Conversation history

Tool outputs
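A minimal sketch of this static-first layout is shown below. The helper name `build_input` and its parameters are hypothetical. The message shape (a list of role/content dictionaries) follows the request example used in this article, and tool outputs are assumed to arrive as part of the conversation history.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_input(
    system_prompt: str,
    tool_schemas: str,
    rag_documents: str,
    history: List[Message],
    user_query: str,
) -> List[Message]:
    """Assemble a prompt with the static segments first and dynamic ones last."""
    # Static prefix: identical across requests, so it is the cacheable part.
    static_prefix = [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": tool_schemas},
        {"role": "system", "content": rag_documents},
    ]
    # Dynamic tail: changes on every request, so it always comes after the prefix.
    dynamic_tail = list(history) + [{"role": "user", "content": user_query}]
    return static_prefix + dynamic_tail


request_a = build_input("SYS", "TOOLS", "DOCS", [], "first question")
request_b = build_input("SYS", "TOOLS", "DOCS", [], "second question")
```

Both requests start with the same three messages and differ only in the final one, which is the property the cached prefix relies on.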

Production Prompt Structure

Example Production AI System

Consider a Kubernetes troubleshooting assistant. Example request structure:

{
  "model": "gpt-5",
  "input": [
    { "role": "system", "content": "You are a senior Kubernetes networking engineer..." },
    { "role": "system", "content": "TOOLS AVAILABLE:\n1. search_k8s_docs(query)..." },
    { "role": "system", "content": "DOCUMENT: CoreDNS runs as a deployment in Kubernetes..." },
    { "role": "user", "content": "How does CoreDNS know which pod IPs belong to a service?" }
  ],
  "max_output_tokens": 200
}
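One way to check whether a request like this actually benefited from caching is to inspect the token usage returned with the response. The sketch below assumes a usage object shaped like the one in OpenAI's Responses API, with `input_tokens` and `input_tokens_details.cached_tokens` fields. Whether a given endpoint returns exactly these fields is an assumption to verify against its documentation. The helper name `cached_fraction` and the sample values are hypothetical.

```python
from typing import Any, Dict


def cached_fraction(usage: Dict[str, Any]) -> float:
    """Return the share of input tokens reported as served from cache.

    Assumes a usage dict with "input_tokens" and a nested
    "input_tokens_details" dict containing "cached_tokens".
    Missing fields are treated as zero.
    """
    input_tokens = usage.get("input_tokens", 0)
    details = usage.get("input_tokens_details") or {}
    cached = details.get("cached_tokens", 0)
    if not input_tokens:
        return 0.0
    return cached / input_tokens


# Made-up values for illustration, not taken from a real response.
sample_usage = {
    "input_tokens": 1000,
    "input_tokens_details": {"cached_tokens": 800},
    "output_tokens": 200,
}

hit_rate = cached_fraction(sample_usage)
```

Logging this ratio per request makes it easier to notice when a change to the prompt layout has broken the shared prefix.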

Component                  Tokens   Cacheable
System instructions        1,500    Yes
Tool schema definitions    1,000    Yes
RAG documentation          3,500    Yes
Conversation history         300    No
User question                 50    No

Total input tokens: 6,350 (6,000 cacheable)

Model output tokens: 200

Cost Comparison

Scenario 1 — Without Prompt Caching

Every request processes the full prompt.

Input cost: 6,350 × $1.25 / 1,000,000 = $0.00794

Output cost: 200 × $10 / 1,000,000 = $0.002

Total cost per request: $0.00994

Scenario 2 — With Prompt Caching

Cached tokens: 6,000

Non-cached tokens: 350
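The arithmetic for both scenarios can be reproduced with a short calculation. This is a sketch that uses only the prices quoted in this article ($1.25 per million standard input tokens, $0.125 per million cached input tokens, $10 per million output tokens). It assumes every cacheable token is a cache hit and that no other charges apply. The function name `request_cost` is hypothetical.

```python
# USD per million tokens, as quoted in this article for GPT-5.
INPUT_PER_M = 1.25
CACHED_INPUT_PER_M = 0.125
OUTPUT_PER_M = 10.0


def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Return the USD cost of one request given its token counts."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached_tokens = input_tokens - cached_tokens
    total_micro_usd = (
        uncached_tokens * INPUT_PER_M
        + cached_tokens * CACHED_INPUT_PER_M
        + output_tokens * OUTPUT_PER_M
    )
    return total_micro_usd / 1_000_000


# Scenario 1: the full 6,350-token prompt is billed at the standard rate.
without_cache = request_cost(6_350, 200)

# Scenario 2: 6,000 of the input tokens are billed at the cached rate.
with_cache = request_cost(6_350, 200, cached_tokens=6_000)

# Relative saving on the whole request, output tokens included.
savings = 1 - with_cache / without_cache

print(f"without caching: ${without_cache:.5f}")
print(f"with caching:    ${with_cache:.5f}")
print(f"saving:          {savings:.0%}")
```

Under these assumptions the input portion falls from about $0.00794 to about $0.00119 per request. The output cost of $0.002 is unchanged, so the overall saving is smaller than the saving on input tokens alone.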

With caching…

