Usage and Cost Optimisation for Foundation Models

PublishedMay 21, 2026ByAlec Vishmidt

(Intro)

A team that treats AI as just another API will either bleed money at scale or restrict usage so heavily that the product loses its value proposition. The way out is intuition for how architectural choices, prompt design, and product decisions cascade into cost — a nuanced understanding of how the formula actually works.

Foundation Model Cost Optimisation ⊹ Blog ⊹ BN Digital — Fig. 0

A note on freshness — as of May 2026. Model names, list prices, and provider feature availability move on a quarterly cadence in this market. The figures and product references in this article reflect what was current at publication. The calculations, ratios, and architectural patterns are durable, but verify list prices against provider pricing pages before sizing your own architecture.

Understanding the Cost Structure

Unlike traditional software, where you might pay a flat subscription fee, AI model pricing operates on consumption — specifically, token consumption. Every word you send to a model, and every word it generates in return, costs money.

Foundation models typically charge based on tokens processed: both input tokens (what you send) and output tokens (what the model generates). Larger, more capable models such as Opus cost considerably more per token than smaller models like Haiku or Sonnet.

Token-Based Pricing Fundamentals

The heart of AI model pricing sits in a simple unit: the token. Tokens are the atomic units of text that models process.

They are not quite words. Typically, about four characters in English, a token might be a word, part of a word, or even a punctuation mark. In English, one token roughly equals four characters or three-quarters of a word — meaning 100 words translate to about 133 tokens.

Why tokens instead of words or characters? Models do not understand words the way humans do. They break text into smaller chunks during processing, and the computational cost scales with token count, not word count.

Token counting varies slightly between providers, but the differences are mostly marginal. A 1,000-word article will generally consume 1,300–1,400 tokens regardless of which model processes it. Tokenisation can also vary by language. Different character sets or more complex scripts may use tokens less efficiently.

Input vs Output Token Costs

Pricing gets interesting here: not all tokens cost the same. Models charge different rates for input and output, and output tokens are more expensive — often three to five times the input rate. For Claude Sonnet 4.6, you might pay around $3 per million input tokens and $15 per million output tokens.

This disparity has a simple explanation: generating text requires significantly more computation than reading it. When a model processes your input, it is essentially reading and understanding. When it generates output, it predicts the next token thousands of times in sequence, and each prediction requires the full model to run.

Model Tier Pricing

AI providers offer multiple models at different price points, typically organised into tiers based on capability. The same applies to the Claude 4 family, with models priced very differently:

Haiku 4.5 is the fastest, most economical option — roughly 15x cheaper than Opus on input. Best suited for high-volume, straightforward tasks like classification, simple extraction, or routing. Around $1 per million input tokens and $5 per million output tokens.
Sonnet 4.6 is the balanced middle tier, offering strong performance at moderate cost. About 5x cheaper than Opus on input ($3 per million input / $15 per million output), it handles most real-world tasks effectively.
Opus 4.6 is the most capable and most expensive option, used for tasks that require maximum reasoning ability, nuance, or creativity. You pay a premium for that performance: $15 per million input tokens and $75 per million output tokens.

Exact pricing varies across providers, but this tiered structure persists. OpenAI offers the GPT-5 family — GPT-5 (flagship), GPT-5 mini (efficient), and GPT-5 nano (lightweight) — alongside legacy GPT-4o and o-series models. Google's Gemini is now on the 2.5 generation with Pro, Flash, and Flash-Lite variants. More capability always equals higher cost.

Hidden Cost Factors

Beyond the headline token prices, several less obvious factors drive actual spending.

Context Window Usage

Every API call processes your entire prompt. If you send 50,000 tokens of context with each request (perhaps a large document plus conversation history), you pay for those 50,000 input tokens every single time. Over thousands of requests, this compounds quickly. This is why context management is so critical.

Long context also comes with premium pricing on some models. Anthropic's Claude Sonnet 4.6 supports up to 1 million tokens of context, but requests exceeding 200,000 tokens are charged at premium rates (roughly 2x input, 1.5x output above that threshold). Context window management becomes a cost optimisation strategy in its own right.

Prompt Caching Economics

When you send the same content repeatedly — a system prompt, documentation, or a large dataset — the provider can cache it. Subsequent requests using that cached content pay considerably less.

Claude offers prompt caching where cached tokens cost about 90% less than regular input tokens on a cache read. A few nuances:

Cache writes cost more than regular input tokens — 1.25x for the standard 5-minute cache, 2x for the extended 1-hour cache.
Cache reads (subsequent uses) get the 90% discount in both tiers.
The 5-minute cache expires after five minutes of inactivity; the 1-hour cache holds for an hour regardless of traffic.
Minimum token thresholds apply to enable caching (1,024 tokens for Sonnet and Opus, lower for Haiku).

OpenAI takes a different approach. It automatically caches prompt prefixes longer than 1,024 tokens, with no manual configuration required. The cache lasts about 5–10 minutes, and cache hits provide roughly 50–75% cost savings depending on model. Google's Gemini offers "context caching" with similar mechanics but requires explicit setup.

Failed Requests and Errors

Most providers do not charge for failed requests at the API level: if your request returns a 500 error or times out, you do not pay. But requests that succeed from the API's perspective do count toward your bill, even if the response is not what you wanted.

If your application makes a request, the model generates 1,000 tokens, your code then rejects the response format and retries, you have paid for those 1,000 output tokens with nothing to show. This includes requests that hit content filters, produce unusable output due to poor prompts, or exceed token limits mid-generation.

Some providers also charge for certain tool calls and features — OpenAI's web search and code interpreter tools, Anthropic's extended thinking and computer use features, or AWS Bedrock's Flow for workflow orchestration. These extra charges can change your cost equation, especially for agentic applications that make many tool calls or require extensive reasoning chains.

Real-World Cost Modelling

Take a document Q&A system where users upload documents and ask questions about them. This is a common pattern — internal knowledge bases, legal document review, research assistants.

Naive Approach

A standard algorithm for how AI-powered logic works on task completion:

Send the entire 10,000-word document (≈13,000 tokens) with each question.
User asks a 5-word question (≈7 tokens).
Model generates a 200-word answer (≈267 tokens).
Using Sonnet at $3 per million input tokens and $15 per million output tokens.
Cost per query: (13,007 × $3 + 267 × $15) / 1,000,000 ≈ $0.043.

That seems reasonable until you scale it. At 100,000 queries per month, you spend $4,300. And this example uses a medium-sized document. Many applications deal with much larger contexts.

Optimised Approach

One alternative is Retrieval-Augmented Generation (RAG), which analyses only relevant sections. You break the document into chunks, convert chunks to embeddings, store them in a vector database, and retrieve only the most relevant chunks when a query comes in.

Use embeddings to retrieve only the relevant 500-word chunk (≈667 tokens).
Same question and answer length.
Cost per query: (674 × $3 + 267 × $15) / 1,000,000 ≈ $0.006.

That is an 85% reduction in per-query costs. Over 100,000 queries, the savings reach $3,700.

Volume Discounts and Enterprise Pricing

Most providers offer volume-based discounts beyond certain thresholds. The numbers are not always published openly, but industry discussions and leaked contracts reveal patterns.

Once you process billions of tokens monthly, negotiated rates can differ significantly from published list prices. Discounts can reach 40–60% off list pricing. For startups, these are not immediately accessible, but planning for scale includes knowing when you might qualify.

Cost per Use Case Economics

Different applications have radically different cost profiles, even when using the same models. A few scenarios illustrate the point.

Chatbots accumulate context over long conversations. Each subsequent turn becomes more expensive unless you implement summarisation or context pruning.
Batch document processing benefits enormously from caching. The key is to apply the same analysis instructions to many documents.
Code generation often produces lengthy outputs — many lines of code. Output token costs dominate.
Classification tasks need minimal output, just a category label. They are cheap per call even with moderate input.

Understanding your usage pattern helps determine the right architecture, predict costs, and identify optimisation points early.

The ROI Perspective

Sometimes higher costs are justified. Focusing purely on cost per token misses the bigger picture. The relevant question is not "how much does this cost?" but "what value does this enable?"

Suppose using Opus instead of Sonnet raises accuracy from 85% to 95% on a critical task. Meanwhile, errors cost customer trust or require expensive human review. If the premium solves the problem, it is worth considering. Cost optimisation is not about minimising spend but about maximising value per dollar spent.

When evaluating AI costs, answer the following:

What does this enable that was previously impossible or impractical?
What is the unit economics — cost per interaction relative to value per interaction?
At what scale does this become profitable or break even?
What is the trajectory — are costs falling as providers compete and optimise?

Core Optimisation Strategies

The difference between expensive and sustainable AI applications rarely comes down to picking the right provider. What makes the real difference is how you use the chosen model.

Model Selection and Routing

The most impactful optimisation is using the smallest model that can accomplish each task effectively. With Claude, several tiers can be applied to different tasks:

Route simple classification or extraction tasks to Haiku.
Route moderate complexity tasks to Sonnet.
Route only the most demanding reasoning or creative tasks to Opus.

This cascading approach can reduce costs by 5–10x while applying the most expensive solution only where it makes a real difference.

The Model Capability-Cost Tradeoff

Different models are tools with varying levels of precision.

Opus is the master craftsman — expensive, but it handles complex, nuanced work beautifully. Haiku is a specialised power tool: fast, economical, excellent at specific tasks, but limited in reasoning depth. Sonnet sits in the middle, a skilled generalist handling most tasks well at a reasonable cost.

Most applications do not need maximum capability for every task.

Task Categorisation Framework

The first step to efficiency is honest task classification. Not everything is complex just because it involves AI.

High-complexity tasks: Opus territory

Opus excels when tasks require deep reasoning, nuanced judgement, or creative synthesis. A few examples:

Writing original long-form content with specific stylistic requirements.
Complex multi-step reasoning problems.
Sophisticated code architecture decisions.
Sensitive content requiring careful ethical judgement.
Ambiguous situations requiring contextual understanding and inference.

These tasks justify premium pricing because failure costs more than the marginal model expense.

Medium-complexity tasks: Sonnet territory

Sonnet handles the majority of production work. It tackles tasks that require solid reasoning and language understanding well, while following more predictable patterns — all without Opus pricing. Sonnet fits:

Generating structured reports from data.
Answering customer questions requiring some inference.
Code generation for standard implementations.
Content editing and improvement.
Moderate-complexity analysis and summarisation.

Most teams find 70–80% of their queries land here. Sonnet is the sweet spot between capability and cost for everyday use.

Low-complexity tasks: Haiku territory

Haiku works for high-volume, straightforward operations — well-defined, pattern-based work:

Classification into predefined categories.
Simple entity extraction.
Sentiment analysis.
Routing and triage decisions.
Format conversion and validation.
Straightforward template filling.

These tasks might represent 60–70% of total request volume in mature applications. Getting them off expensive models compounds savings quickly.

The same logic works for other LLMs. OpenAI, for instance, offers more diversity with dozens of models and a detailed breakdown of which models fit which tasks. Ongoing research aims to optimise it further by proposing new strategies to improve performance without increasing spending.

Routing Architecture Patterns

Routing architecture patterns are strategies for directing AI queries to different models based on the complexity or characteristics of each request.

The core idea: multiple AI models are available at different price points (Haiku 4.5 at $1/million input tokens, Sonnet 4.6 at $3/million, Opus 4.6 at $15/million). Instead of sending every request to the most expensive model, you route queries intelligently based on what they actually need.

There are several ways to arrange this architecturally.

Sequential Cascade (Waterfall)

Start with the cheapest model. Escalate only when needed.

The Sequential Cascade works well when simpler models can handle most cases. A customer support system might find that Haiku successfully handles 70% of inquiries, Sonnet handles 25%, and only 5% require Opus. The blended cost per query is far lower than using Opus for everything.

The challenge is detecting when escalation is needed. How do you know if Haiku's answer is good enough? Several options:

Train a small classifier to evaluate response quality.
Use the model's own probability scores (when available).
Compare responses from multiple models.
Check for specific quality markers (completeness, factual consistency).

Parallel Ensemble

For critical decisions, run multiple models simultaneously and compare outputs.

If Haiku and Sonnet agree on a classification, trust it.
If they disagree, use Opus as a tiebreaker or authoritative judge.

This adds redundancy, but it catches errors that single-model approaches miss. Parallel Ensemble costs more upfront, but it improves accuracy for critical tasks. Worthwhile for high-stakes decisions where mistakes are costly.

Pre-Routing/Classification

Use a fast, cheap classifier upfront to predict query complexity before routing requests to appropriate models. This could be a simple keyword-based system, a specialised small model, or even Haiku itself acting as a router.

Customer inquiries might be classified as "simple FAQ" (Haiku), "product issue" (Sonnet), or "complex complaint" (Opus) before hitting the main processing pipeline. Frontloading the routing decision makes the rest of the pipeline efficient.

Hybrid Decomposition

Break complex tasks into subtasks and assign each to the most appropriate model. A document analysis workflow might use:

Haiku to extract structured metadata.
Sonnet to summarise each section.
Opus to synthesise insights across sections.

Each component uses the minimum necessary model tier. Powerful, but it requires careful orchestration. Clear task boundaries and methods to combine outputs coherently are essential.

Implementing Smart Routing

In theory, routing is straightforward: send simple queries to cheap models and complex ones to expensive models. The challenge comes at execution. How do you determine which queries are simple versus complex in real time? How do you build confidence that Haiku's answer is good enough, or recognise when escalation is necessary?

Smart routing requires mechanisms to evaluate response quality, monitor performance across model tiers, and adapt routing logic based on what you learn. Three options.

Confidence-Based Escalation

When using cascade routing, clear escalation triggers are needed. Several approaches work:

Have the model output a confidence score alongside its answer. If confidence falls below a threshold, escalate.
For structured outputs, validate the response programmatically. If it does not match the expected format or constraints, escalate.
Include explicit uncertainty markers in prompts: "If you are uncertain, respond with 'UNCERTAIN: [reason]'."
Use response length as a proxy. Unusually short responses might indicate the model could not fully address the query.

Performance Monitoring

Track accuracy by model and task type. If Haiku performs at 95% accuracy for a particular classification task, it is probably fine to keep it there. If it drops to 75%, consider upgrading that task to Sonnet.

Other metrics to track include the escalation rate (what percentage requires tier-up), cost per query by final model used, and latency across routing paths. Continuously evaluate whether routing decisions remain optimal as the application evolves.

Dynamic Routing Logic

Routing does not have to be static — it can adapt based on context. Factors that might influence routing decisions:

User tier. Premium users might get Opus by default, while free users get Sonnet.
Task urgency. Time-sensitive requests use faster (and often cheaper) models.
Historical context. If a user's previous questions were complex, subsequent ones might warrant a more capable model.
Resource availability. During peak load, route more aggressively to smaller models to manage capacity.

High-value transactions (enterprise sales, legal review) justify premium models immediately.

Cost Analysis of Routing

The impact is easiest to see with realistic numbers. Processing 1 million requests monthly (≈600 input + 300 output tokens per request) using Opus 4.6 for everything:

1M requests × $0.03/request = $30,000

Routing changes the math. Assume 70% of requests work with Haiku, 25% need Sonnet, 5% require Opus:

700K Haiku 4.5 requests × $0.002 = $1,400.
250K Sonnet 4.6 requests × $0.006 = $1,500.
50K Opus 4.6 requests × $0.03 = $1,500.
Total: $4,400.

An 85% cost reduction with the same quality for most tasks. The FrugalGPT research showed even more impressive opportunities: up to 98% cost reduction while matching premium model quality.

Common Pitfalls

Most teams go through multiple stages of trial and error. The shortcut is avoiding the following from the start:

Over-optimisation. Do not route so aggressively that quality suffers noticeably. User experience matters more than marginal savings.
Routing overhead. Complex routing logic adds latency and maintenance burden. Keep it as simple as possible while achieving the goal.
False economy. Cheaper models may require more API calls (retries, error corrections) than one call to a capable model. Always measure total cost.
Ignoring edge cases. Rare but important cases might need different handling. A 0.1% error rate might be acceptable for casual uses but unacceptable for legal or medical applications.

Prompt Engineering for Efficiency

Every word in a prompt costs money. The system understands "Please kindly consider the following context and provide a comprehensive and detailed analysis" (14 tokens) and "Analyse" (2 tokens) the same way. Use that to your advantage:

Prioritise clear, direct language.
Remove verbose instructions and redundant context.
Minimise the number of examples.
Test whether fewer examples achieve similar results.

Context Management

Context management directly impacts costs. It is coded into the logic of LLMs: every token in your prompt is processed and charged for each API call. A 10,000-token prompt sent 1,000 times costs the same as a 100-token prompt sent 100,000 times. The task is to maximise relevance while minimising volume.

The most obvious tactic is including only the critical context for each request:

For document processing, use retrieval systems (RAG) to send only pertinent chunks rather than entire documents.
Implement sliding windows for long conversations rather than sending the complete history.
Summarise older conversation turns to compress context.

These approaches and others below deserve a closer look.

Sliding Window Approach

For conversational applications, do not send the entire chat history indefinitely. A sliding window keeps recent messages (perhaps the last 5–10 exchanges) in full detail while summarising or dropping older content.

A 50-message conversation with full fidelity might consume 25,000 tokens per request. With a sliding window keeping 10 recent messages and a summary of earlier content, the figure drops to around 3,000 tokens while preserving continuity — an 88% reduction in input tokens for each subsequent message.

The trade-off is potential loss of context from early conversation. Test to find the optimal window size for your use case.

Retrieval-Augmented Generation (RAG)

RAG separates "what the model knows" from "what context it needs right now." A knowledge base might contain millions of tokens, but any single query likely needs only hundreds or thousands. Instead of sending entire knowledge bases with each query, retrieve only relevant chunks.

The implementation involves creating vector embeddings of the content, storing them in a vector database, embedding the user's query, and retrieving the most similar chunks. A 100-page technical manual contains ≈130,000 tokens. Retrieving the three most relevant sections (≈1,500 tokens) provides focused context at a fraction of the cost. Quality does not necessarily suffer, as it depends on retrieval accuracy.

Selective Context Loading

Not every request needs the same context. A customer asking about order status needs their order history, not their browsing behaviour. A product recommendation engine needs purchase history and preferences, not the content of support tickets.

Create context profiles for different request types and construct the context dynamically each time. Load only relevant data and skip the details that do not matter for this specific query. That prevents the "include everything just in case" approach that bloats every request.

Summarisation Techniques

For long conversations where context must be fully considered, summarise older content periodically. This is a middle ground between sending everything and retrieving chunks. After every ten exchanges, compress the oldest five into a summary. That maintains narrative continuity while controlling growth.

An alternative is to generate a summary first. It works for long documents that require careful consideration. A 50,000-token research paper might compress to a 2,000-token summary adequate for most queries. Retrieve specific sections of the full text only when users ask for details. Otherwise, cache the summary and reuse it.

Session State Management

Applications often embed user data in prompts, though it is not the most efficient approach. Instead of embedding full user profiles, preference objects, or configuration data in every request, externalise them. Keep the details in a database and include only relevant fields as concise key-value pairs.

Instructions like this:

"User preferences: {vegetarian: true, allergies: ['peanuts', 'shellfish'], budget: 'moderate', cuisine_likes: ['Italian', 'Thai', 'Mexican']...}."

Or simpler:

"Vegetarian, avoids peanuts/shellfish, moderate budget, likes Italian/Thai/Mexican."

The information is the same, communicated with fewer tokens. The same applies to all structured data.

Measuring Context Efficiency

How efficiency initiatives perform in practice is what matters. The answer is in the metrics.

Track the context relevance ratio: what percentage of prompt tokens directly contribute to the response? Sending 5,000 tokens with only 500 actually relevant to answering the query is an efficiency problem.

Several other parameters help identify optimisation opportunities:

Average input tokens per request.
Context token ratio (context tokens ÷ total input tokens).
Cache hit rate for stable context.

High input token counts and low cache hits suggest poor context management. High context ratios indicate RAG or summarisation might help. Without specific numbers, the work becomes guessing — and "optimising" somewhere wrong.

Caching Strategies

Prompt caching lets you mark portions of a prompt as "cacheable," storing them server-side so subsequent requests for the same content pay only a fraction of the cost for those tokens. The discount can reach about 90% on cache hits.

How Prompt Caching Works

In an API request, certain prompt sections can be designated for caching using cache control breakpoints. The first time these sections are sent (a "cache write"), full price applies. Subsequent requests within the cache lifetime (typically 5 minutes of inactivity) that include identical cached content pay the reduced rate for a "cache read."

Anthropic's implementation is instructive. Cache writes cost 1.25x standard input rate for the 5-minute cache and 2x for the 1-hour cache. Cache reads cost only 0.1x — a 90% discount in both tiers. The breakeven is immediate: two requests with the same content already save money on the 5-minute tier, three on the 1-hour tier.

OpenAI offers automatic caching. Prompts over 1,024 tokens are automatically cached for 5–10 minutes. Prices differ across models. The promise is to reduce latency by up to 80% and input token costs by 50–75% depending on model.

Mechanics vary by provider, but the pattern holds. The cache persists as long as you make requests using it within the timeout window. Each request refreshes the timer, so high-frequency applications can maintain caches indefinitely.

What to Cache

Not all prompts need caching. It works best when the same large chunks of text are sent repeatedly. Prime candidates:

System instructions — if every request includes the same 2,000-token system prompt defining the assistant's role, behaviour, and constraints. Probably the most common use case.
Reference documents — chunks of a retrieved document if multiple users query the same content, or if one user asks multiple questions about the same document.
Few-shot examples — if a prompt includes 10 examples of the task (5,000 tokens) that never change.
Conversation history — in ongoing conversations, the growing history can be cached, with only new messages added as non-cached content.

The pattern to look for is stable content that appears frequently and meets minimum token thresholds.

Minimum Size Requirements

Caching has minimum thresholds — typically 1,024 tokens for Claude, similar for others. Content smaller than this cannot be cached individually. Multiple cache breakpoints can sit in a single prompt:

Cache block 1: System instructions and policies (3,000 tokens).
Cache block 2: Conversation history (5,000 tokens).
Fresh content: New user message (100 tokens).

That allows different sections to be cached with different refresh patterns. System instructions might be cached for hours. Conversation history refreshes every few turns.

Cache Economics

The breakeven point is easy to calculate using Anthropic's pricing. Cached tokens cost $0.30 per million — a 90% discount from $3. Regular input tokens cost $3 per million.

For 10,000 cached tokens:

First request (cache write): 10,000 × $3 / 1M = $0.030.
Subsequent requests (cache read): 10,000 × $0.30 / 1M = $0.003.
Savings per subsequent request: $0.027.

Breakeven happens immediately after the first cached request. By the tenth request, the saving compared to no caching is $0.243. Scaled to thousands of requests, the savings compound quickly.

Caching Patterns

Once the targets for caching are identified, the next decision is how to structure the approach. Options include:

Static cache. System prompts and examples cached once and used across all requests — fit for consistent applications.
Document cache. Cache retrieved documents for the duration of a user's research session. Each document is cached on first access, improving subsequent queries.
Conversation cache. Cache the growing conversation history, adding only new turns as uncached content. Long conversations stay affordable.
Hybrid approach. Cache stable system instructions plus frequently accessed reference material, while dynamically adding uncached query-specific context.

Cache Invalidation Challenges

The hard part is knowing when to invalidate, as caches expire after inactivity periods. For low-frequency applications, the benefit may be limited because caches expire between requests. High-frequency applications maintain caches naturally through regular use.

When the system prompt or reference materials are updated, old caches become stale. Updating the cache as content changes requires paying for a new cache write. Plan updates strategically to minimise churn.

Measuring Cache Performance

Track the cache hit rate — the percentage of requests that successfully use cached content. High hit rates (80% and above) indicate effective caching. Anything below 50% suggests the cache duration is too short or that requests are not hitting the same content patterns. Monitor cache write frequency to understand churn costs.

Output Control

Input costs are predictable — you control what you send. Output costs are variable — the model determines response length. That is why output control mechanisms matter.

Output control focuses on limiting what the model generates. Output tokens typically cost three to five times as much as input tokens, so generating unnecessarily long responses directly inflates costs without adding value in most cases.

In short: do not let the model generate lengthy responses. Limit output length explicitly when appropriate using the `max_tokens` parameter. For structured outputs, request concise formats like JSON rather than verbose prose.

Max Tokens Parameter

The most direct control mechanism is the `maxtokens` parameter in the API call. It sets a hard ceiling on the length of generated output. For a short answer, set `maxtokens` accordingly — perhaps 100 for a brief response, 500 for a moderate explanation, or 2,000 for detailed content.

A quick reference for limits by task:

Classification tasks: 10 tokens (a category label only).
Short summaries: 200–500 tokens.
Code generation: 1,500–2,000 tokens.
Long-form content (in general): 2,000–5,000 tokens.
Essay writing: 4,000+.

Tailor this to each endpoint or task type. Be careful: limits set aggressively low cause responses to get truncated mid-sentence, failing to provide the expected result. The key is understanding the use case requirements and setting appropriate limits with some buffer room.

Prompt-Based Length Control

The model can be instructed about desired length directly in the prompt, beyond the API parameters. Be specific: "Provide a response in 2–3 sentences," "Answer in under 100 words," "Be extremely concise."

Models tend to respect these instructions, though precision is not always perfect. For better calibration, combine prompt instructions with `max_tokens`. The prompt guides the model's natural stopping point; the limit prevents outliers.

For structured outputs, specify what is needed exactly — for example, "Return only the JSON object, no explanation." Instructions like that prevent the model from adding preambles or commentary (such as "Certainly! Here's the JSON response you requested") that waste tokens.

Structured Output Formats

JSON, XML, or other structured formats are often more token-efficient than prose, especially when only data is needed, not explanation. Compare these responses to "What is the user's sentiment?":

Prose: "Based on my analysis of the user's message, I would characterise their sentiment as positive. They express satisfaction with the product and gratitude for the service." That is 32 tokens.
Structured: `{"sentiment": "positive", "confidence": 0.95}`. That is 12 tokens — almost three times shorter and cheaper.

The structured format is 62% more efficient while conveying the same essential information. For high-volume applications, that compounds. When structured data is needed, request it and nothing else: "Return only a JSON object with no preamble or explanation."

Tiered Response Strategies

Consider implementing different response depth tiers based on context:

Minimal for classification and simple queries: category labels, yes/no answers, single values.
Standard for typical Q&A or moderate explanations: 2–3 sentence responses.
Detailed for complex queries or when users explicitly request depth: full explanations with examples.

Route requests to appropriate tiers automatically. A FAQ chatbot defaults to minimal/standard responses, escalating to detailed only when users ask follow-up questions or explicitly request elaboration.

The same logic can be applied based on user preferences, subscription level, or query complexity. Free users receive minimal responses, paid users receive standard responses, and enterprise users receive detailed responses.

For complex tasks, initial responses can be concise with refinement available. Do not ask the model to generate everything at once — use multi-turn interactions strategically:

Generate an outline first (50 tokens).
Expand specific sections as needed (200 tokens each).

For research tasks, generate summaries first, then dive deeper only into relevant sections. Users who want depth can request it. Those satisfied with the brief answer save output costs for both parties. That gives users control and prevents generation of content they do not want. Tokens stop being spent on tangential material.

Stop Sequences

Stop sequences halt generation at natural boundaries. For specific information extraction, a stop sequence like "\n\n" prevents the model from continuing beyond the first complete answer. For templated outputs, stop sequences ensure the model does not add unwanted commentary after completing the template.

That prevents the model from generating beyond what is actually necessary. If the summary stops at 200 tokens with a double newline, generation stops there instead of continuing to `max_tokens` just because the model is capable of providing a longer response.

Output Validation and Retry Logic

When output exceeds expected bounds, retry with stricter instructions. While that does not prevent initial generation, efficient validation reduces waste from regeneration.

For specific formatting, validate immediately and provide clear correction prompts rather than generic "try again" messages that might produce equally lengthy invalid outputs. The retry consumes tokens but prevents generating responses that do not suit.

Track common output issues and refine prompts to prevent them, reducing retry costs.

Measuring Output Efficiency

A specific benchmark is needed to estimate efficiency objectively (and optimise properly).

Calculate the average output tokens per request type. If classification tasks average 150 output tokens but only need 10, the waste is obvious. Investigate why. Are prompts inadvertently encouraging verbosity? Is `max_tokens` set too high?

In addition to averages, monitor the distribution of output lengths. If 90% of requests need 100 tokens but 10% generate 2,000 tokens, those outliers drive disproportionate costs. Find what triggers verbose outputs and address it.

Retry rate (how often you regenerate for length issues) and truncation rate (how often `max_tokens` cuts off responses) also provide insights for better model calibration.

Cost Impact Example

A chatbot handling 1 million queries monthly. Reducing average output from 300 to 150 tokens through better output control (Claude Sonnet pricing):

Before: 300M tokens × $15/1M = $4,500.
After: 150M tokens × $15/1M = $2,250.
Monthly savings: $2,250.
Annual savings: $27,000.

Output control is not a marginal optimisation — it is a significant cost lever.

Batching and Asynchronous Processing

Not every AI request needs an immediate response. That is the logic powering batching and asynchronous processing. Batch processing defers non-urgent work in exchange for substantial discounts, requiring latency tolerance. Separate time-sensitive operations from those that can wait, and the same user experience is preserved while costs drop considerably.

Batch API Offerings

Many foundation model providers, including OpenAI, Google, and Anthropic, offer batch-processing APIs with substantial discounts for requests that can tolerate delayed processing — usually 12–24 hours. The bill is typically 50% off standard pricing.

The mechanism is similar across providers:

Submit a collection of requests together as a batch job.
Receive a job identifier.
Either poll for completion or receive a webhook notification when results are ready.

The system processes these during off-peak periods when computing capacity is underutilised. The economics are simple: accept 24-hour latency, pay half price. For applications that can wait, this is free money.

Ideal Batch Use Cases

The 50% discount on batch APIs is compelling, but only if task completion can be postponed. The scenarios where deferred processing does not affect the value of the work and just makes sense:

1. Data processing and analysis:

Analysing thousands of customer reviews.
Processing support tickets for insights.
Scoring leads in your CRM.
Sentiment analysis across large datasets.

Tasks like these rarely require real-time results, and batching fits the process.

2. Content generation at scale:

Generating product descriptions for entire catalogues.
Creating personalised email variations.
Drafting social media content for the week.
Translating documentation into multiple languages.

The majority of content can be planned. The above work can be scheduled overnight or over weekends.

3. Model evaluation and testing:

Running test suites against prompts.
A/B testing different prompt formulations.
Comparing model outputs across versions.
Validating system changes across hundreds of test cases.

These use cases benefit from batch's higher rate limits as much as cost savings. 100,000 documents can be processed at once without throttling.

4. Periodic reporting:

Daily summaries.
Weekly analytics reports.
Monthly content audits.
Compliance reviews.

Anything with natural time boundaries that does not require instant results falls under "suitable for batching." Reports generated at 2 AM for 8 AM review work fine.

5. Data enrichment:

Categorising documents.
Extracting metadata.
Tagging images.
Generating embeddings for search indexes.
Any bulk transformation of large datasets.

Order and precision matter more than urgency here. The work is retrospective. Results in 24 hours serve the same purpose as results in 1 hour.

Non-Batch Asynchronous Patterns

Even without formal batch APIs, asynchronous processing indirectly reduces costs by improving resource utilisation and reducing waste from timeouts or failed synchronous requests. A few patterns enable processing flexibility.

1. Queue-based processing.

Instead of handling requests synchronously as they arrive, queue them for worker pools to process. That enables sophisticated rate limiting, intelligent prioritisation, and graceful degradation during traffic spikes. Users receive immediate acknowledgement while processing occurs asynchronously.

2. Background jobs.

Separate user-facing operations (must be fast) from background operations (can be slower). If a user uploads a document, acknowledge receipt immediately, then extract and analyse the content asynchronously. The user does not wait, and processing can be optimised for efficiency rather than speed.

3. Scheduled batch windows.

Accumulate similar requests throughout the day and process them together in scheduled windows. Customer feedback submitted during business hours can be analysed in nightly batches, with insights appearing on dashboards by morning.

These patterns do not offer the explicit 50% discount of batch APIs, but they provide better load distribution, which also translates into cost savings.

Hybrid Approaches

Most production applications do not fit neatly into "everything real-time" or "everything batched." Different parts of the system have different latency requirements and cost sensitivities. Often, it makes sense to blend real-time and batch processing.

Take a content moderation system. High-risk content can be processed immediately using fast, economical models for real-time safety. Lower-risk or ambiguous content can be queued for batch processing with more capable models during off-hours. Review queues are prepared for human moderators the next morning.

The same approach applies to e-commerce. A platform can generate personalised recommendations in real time for currently active users while pre-computing tailored offers for inactive segments in nightly batch jobs. When those users return, recommendations are instantly available without real-time generation costs.

The same works for software development. A code analysis tool can provide immediate syntax checking and basic linting in real time while queuing deeper security analysis, architectural reviews, and optimisation suggestions for batch processing.

The efficiency of hybrid approaches is not limited to specific industries or narrow cases. Understand the logic, design the architecture with multiple processing paths, and route work based on urgency and value.

Implementation Considerations

Batch processing is not just an API switch. It requires infrastructure to manage jobs, store results, and communicate status. The potential cost cut justifies the investment, but several operational concerns matter to make batching reliable and maintainable:

Managing expectations. Users need clear communication about the availability of results. "Your analysis will be ready within 24 hours" works for many scenarios but is unacceptable for interactive experiences. Choose the right flows and design user interfaces that acknowledge processing status and notify on completion.
Job monitoring and observability. Batch systems need robust monitoring. Track job submission, processing time, failure rates, and retry attempts. Failed jobs should not silently disappear — implement dead letter queues and alerting.
Result storage and retrieval. Batch results require persistent storage with appropriate retention policies. Design the data model to retrieve results efficiently, handle partial failures, and clean up completed jobs.
Prioritisation mechanisms. Implement priority queues so time-sensitive batch work does not wait behind lower-priority tasks. Premium users, urgent analysis, or business-critical operations might warrant priority processing even within batch systems.

Some of these considerations call for more serious engineering effort, but they are one-time investments that scale across all batch workloads.

Cost-Benefit Analysis

The expected batch discount sounds attractive in theory, but does it justify the infrastructure investment and operational complexity? The answer is yes, and the calculations prove it.

A document processing pipeline that handles 1 million documents per month:

Real-time processing: 1M requests × $0.02 per request = $20,000.
Batch processing: 1M requests × $0.01 per request (50% discount) = $10,000.

$10,000 saved monthly, or $120,000 annually — funding for additional infrastructure, engineering resources, product development, and other initiatives. The question is not whether to batch but which workloads to migrate first.

When Not to Batch

Some workloads demand immediate responses. The reason is not impatience; it is that latency fundamentally breaks the use case. Savings mean nothing if the reason users chose the product has been eliminated. Specific cases where batching does not work:

User-facing interactions: chatbots and conversational AI, real-time customer support, interactive applications, live demonstrations.
Time-sensitive operations: fraud detection, anomaly alerts, real-time content moderation, live trading decisions.
Sequential dependencies: task B that depends on task A's output, multi-step workflows requiring human review between stages, iterative refinement processes.

The rule: if latency directly impacts user experience or business value, pay for real-time. Do not sacrifice core user experience for cost savings on operations that define the product's value proposition.

Advanced Techniques

The core optimisations — routing, caching, context management — get you considerable savings, but mature applications can push further. The techniques below are not first-day optimisations, but they become viable once the fundamentals are in place.

Smart Retry Logic

API calls fail. Models return malformed responses. Outputs hit token limits mid-sentence. With native settings, the system retries everything at full cost, doubling the spend on failures. No one wants to waste tokens on failed requests. The cure is exponential backoff and avoiding unnecessary retries.

Poor retry strategies compound costs by causing redundant processing and potentially degrading the user experience through increased latency. Smart retry logic recognises that different failures require different responses. Some warrant immediate retry with the same model, others call for escalation to a more capable tier, and some should abort entirely. The goal is to recover from failures without unnecessarily multiplying costs.

Understanding Retry Costs

When an API request fails after the model has begun generating output, input tokens and any partial output tokens have already been paid for. An immediate retry of the same request doubles — or triples — the costs for transient failures. Over thousands of requests, the inefficiency becomes a significant cost driver.

Aggressive retry patterns during outages create additional problems. They can overwhelm provider infrastructure, creating cascading failures that extend downtime for everyone.

Exponential Backoff

Exponential backoff is the foundation of intelligent retry logic. Instead of retrying immediately, wait progressively longer between attempts: 1 second, then 2, then 4, then 8, then 16. That gives transient issues time to resolve while preventing the application from hammering the API during outages. Most providers recommend this pattern in their documentation.

Add randomised jitter — random variation in wait times — to prevent thundering herd problems. Without jitter, many clients retry simultaneously after the same failure, creating artificial traffic spikes that prolong outages.

Retry Budget Pattern

Implement a retry budget — a limit on the total number of retries across the application within a given time window. If the retry rate exceeds, say, 5% system-wide, stop automatic retries and alert engineers. That adjustment in logic prevents burning the budget on systemic issues that retries cannot fix.

The pattern serves two purposes: it protects costs during major incidents, and it forces teams to address root causes rather than masking problems with aggressive retries. A sustained high retry rate signals a fundamental issue — API degradation, prompt issues, or architectural problems — that needs investigation, not more retries.

Error Classification

Different errors require different responses. Retries are only one of the options that can actually resolve the problem. To address errors effectively, classify them into three categories.

Retryable errors — transient and likely to succeed on retry:

Rate limits (429).
Service unavailable (503).
Timeouts.
Network failures.

Non-retryable errors — retrying will not help and wastes resources; fix the request instead:

Invalid authentication (401).
Malformed requests (400).
Content policy violations.

Ambiguous errors — might be transient or might indicate request-specific problems; retry cautiously with backoff:

Generic server errors (500).
Token limit exceeded errors.

Implement different retry strategies per error class. Return non-retryable errors to callers immediately rather than wasting time and money on futile retries.

Rate Limit Handling

Rate limit errors (HTTP 429) require a specific approach. Providers often include headers indicating when to retry (Retry-After) or current rate limit status. Respect those signals rather than inventing logic from guesswork.

When rate-limited, pause all requests to that endpoint until the rate limit window resets. Ignoring the limits and continuing to hammer the endpoints wastes money on failed requests — and potentially triggers stricter rate limiting.

Circuit Breaker Pattern

Circuit breakers stop requests to failing endpoints after consecutive failures, giving systems time to recover instead of compounding problems. The pattern mimics the logic of electrical circuit breakers: when things go wrong, cut the flow. Three states govern their behaviour:

Closed: normal operation; requests flow through as usual.
Open: stop making requests for a timeout period after N consecutive failures.
Half-open: allow limited test requests after timeout to check service recovery.

If test requests succeed, close the circuit and resume normal operation. If they fail, reopen the circuit for another timeout period. The pattern protects both costs and system stability.

Request Deduplication

Track in-flight requests and avoid duplicate retries for the same logical operation. If a user clicks "Submit" twice, or if a network hiccup causes a repeated request, deduplicate at the application layer before hitting the API. Maintaining a cache of recent request identifiers allows a check against it before sending.

Use idempotency keys where providers support them. These ensure retried requests do not create duplicate side effects even if processed multiple times server-side. The provider recognises the key and returns the cached result instead of reprocessing.

Timeout Configuration

Set appropriate request timeouts based on expected response times. This matters because:

Long timeouts waste time waiting for failed requests before retrying.
Short timeouts trigger unnecessary retries when requests would have succeeded.

Different operations warrant different timeouts. Complex generation might need 60 seconds, while simple classification should complete in 5–10 seconds. Configure per-endpoint or per-operation-type.

Monitoring and Alerting

Track retry rates, failure rates by error type, and the latency impact of retry logic. These metrics reveal system health and cost efficiency.

High retry rates indicate systemic issues. Investigate rather than tolerate. Maybe prompt issues are causing validation failures, the infrastructure has problems, or rate limit configurations do not match actual load.

Whatever the reason, it is best to alert when retry rates exceed thresholds so teams can respond before the budget impact becomes severe.

Cost Impact Example

An application that makes 100,000 requests daily. Without smart retry logic:

5% initial failure rate = 5,000 failures.
Immediate retry of all failures with 50% success = 2,500 more failures.
Another immediate retry with 50% success = 1,250 more failures.
Total: 8,750 failed requests burning tokens.

With smart retry logic, the final bill looks different:

Classify errors: 60% are non-retryable (malformed requests) — do not retry.
The remaining 2,000 failures use exponential backoff.
90% eventual success = 200 permanent failures.
Total: 2,200 failed requests.

A 75% reduction in wasted failed requests. The monthly savings — roughly $300 in this example — scale with volume. At 1 million requests daily, poor retry logic could cost $3,000–4,000 per month on doomed retries that smart classification would have avoided entirely.

Fine-Tuning Consideration

Fine-tuning means training a foundation model on your specific dataset. The goal is to create a customised version optimised for your use cases. While powerful, fine-tuning has complex cost trade-offs that require careful analysis before implementation.

For highly repetitive, specific tasks, fine-tuning a smaller model might be more economical than repeatedly prompting a larger one — though it entails higher initial investment and greater complexity.

The Fine-Tuning Cost Structure

Fine-tuning is not a simple API call with predictable per-token pricing. It incurs upfront costs and ongoing inference costs that differ from base models:

Training costs — one-time expense to create a custom model that varies by provider and model size. Expect hundreds to low thousands of dollars depending on dataset size and the number of training epochs.
Storage costs — monthly fees to host the fine-tuned model, usually $10–50 per month per model, depending on provider.
Inference costs — per-token costs for using the fine-tuned model. These can be cheaper than base models or comparable in price, depending on the provider's pricing structure.

When Fine-Tuning Makes Financial Sense

Fine-tuning becomes cost-effective when high-volume, repetitive tasks let a smaller fine-tuned model replace a larger base model at comparable quality. Examples:

High-volume classification. Running millions of monthly classification requests through a flagship model like GPT-5 or Opus 4.6 — fine-tuning a smaller model like GPT-5 mini or Haiku 4.5 on specific categories might achieve similar accuracy at a fraction of the cost per request.
Domain-specific formatting. When every prompt includes extensive examples showing the desired output format (consuming thousands of tokens), fine-tuning embeds that knowledge into the model, eliminating those example tokens from every request.
Specialised vocabulary. Technical domains with unique terminology benefit from fine-tuning. Medical, legal, or industry-specific language that base models handle inconsistently becomes more reliable with domain-specific training.
Consistent style or tone. When a specific writing style, brand voice, or communication pattern is needed across millions of generations, fine-tuning is more efficient than including style guides in every prompt.

The pattern is clear: fine-tuning rewards scale and specificity. Look for those in the task before investing in this kind of model calibration.

Cost-Benefit Calculation

A realistic scenario shows whether fine-tuning makes economic sense. 5 million monthly classification requests through a premium model at $0.01 per request is $50,000 monthly.

With the fine-tuning option:

Training cost (one-time): $500.
Storage: $20/month.
Fine-tuned inference: $0.002 per request = $10,000 monthly.
Total first month: $10,520.
Monthly ongoing: $10,020.

Breakeven happens in the first month; annual savings come to approximately $480,000. The assumption is that the fine-tuned model matches quality. If accuracy drops from 95% to 85%, downstream costs from errors might exceed savings.

The Quality Tradeoff

Fine-tuning smaller models rarely achieves the same capability as larger base models on complex tasks. The decision hinges on whether task complexity allows downsizing the model. A quick reference:

Good candidates: classification, entity extraction, formatting standardisation, style adaptation, domain-specific generation with clear patterns.
Poor candidates: complex reasoning, nuanced judgement, creative synthesis, tasks requiring broad world knowledge, ambiguous scenarios needing inference.

Even with suitable cases, the model must be tested extensively. Fine-tune on relevant, clean training data, validate on held-out test sets, and assess whether the quality degradation is acceptable for the use case.

Alternative: Prompt Optimisation First

Before investing in fine-tuning, consider simpler prompt optimisation opportunities. Often, better prompts with few-shot examples produce results similar to fine-tuning — all without upfront costs or a maintenance burden.

If that does not work, move forward confidently. Fine-tuning makes sense when prompt engineering hits diminishing returns and the model size/cost is confirmed as the bottleneck, not prompt quality.

Maintenance Costs

Fine-tuning is not a one-time expense. The models require ongoing maintenance — the "hidden costs" of:

Retraining. As requirements evolve, periodic retraining with updated data is needed. Each retraining incurs training costs again.
Versioning. Managing multiple fine-tuned model versions, A/B testing them, and migrating traffic adds operational complexity.
Monitoring drift. Fine-tuned models can degrade as real-world data diverges from training data. Continuous monitoring and quality checks are essential.
Data pipeline. Infrastructure is needed to collect, label, clean, and version training data. This engineering overhead should not be underestimated.

These costs often exceed initial training expenses and can make fine-tuning economically unsustainable.

Provider-Specific Considerations

Fine-tuning capabilities and economics vary across providers:

OpenAI offers supervised, DPO, and reinforcement fine-tuning across the GPT-5 mini, GPT-5 nano, and legacy GPT-4o lines, with straightforward pricing and tooling.
Anthropic offers fine-tuning for Haiku and Sonnet through Amazon Bedrock and Google Cloud Vertex AI, with broader availability than the early enterprise-only programme.
Google provides fine-tuning for the Gemini 2.5 family (Pro, Flash, Flash-Lite) through Vertex AI, with both supervised and distillation options.
Open-source models — Llama 4, Mistral, Qwen, DeepSeek and others — can be fine-tuned on internal infrastructure, trading API costs for compute and engineering costs.

Evaluate each provider's offerings against specific requirements and volume.

Decision Framework

Fine-tuning decisions should not be made on intuition or technical enthusiasm. Even drafting the basic calculation is not always sufficient. The following questions guide the decision objectively:

Is this high volume enough to justify upfront investment? (Generally 1M+ requests monthly.)
Have prompt engineering approaches been maximised?
Can a smaller fine-tuned model match quality requirements?
Is there sufficient high-quality training data (typically 1,000+ examples minimum)?
Are engineering resources available for maintenance and monitoring?
Are requirements stable enough that retraining will not be frequent?

If most answers are "yes," fine-tuning is worth serious consideration. If several are "no," stick with prompt engineering and base models — for now.

Hybrid Approaches

Fine-tuning is not all or nothing. Use fine-tuned models for high-volume standardised tasks. Reserve base models for complex low-volume scenarios. In practice: a customer service application might use fine-tuned Haiku for the 70% of queries that follow predictable patterns (password resets, order status, common troubleshooting) and escalate unusual cases to premium base Sonnet or Opus.

Monitoring and Analytics

Cost efficacy comes with iterative optimisation. And you cannot optimise what you do not measure. Monitoring and analytics provide visibility into where costs accumulate and where changes yield maximum return. Without instrumentation, the work is blind.

Essential Metrics to Track

1. Token consumption by dimension.

Break down token usage (input and output separately) by user, feature, endpoint, request type, and time period. Identify which parts of the application drive costs. Often, 20% of features consume 80% of the budget.

2. Cost per request type.

Calculate the average cost for different operations: classification, generation, summarisation, etc. Some operations might be 10x more expensive than others. Understanding this helps prioritise optimisation efforts.

3. Model distribution.

When using multiple models, track what percentage of requests go to each. Use this data to set up effective routing. Otherwise, expensive models get overused and spending climbs more than it has to.

4. Latency metrics.

Latency and cost often correlate. Track P50, P95, P99 latencies. Slow requests might indicate inefficient prompts, oversized context, or model selection issues. All are fixable with the right insights.

5. Error and retry rates.

High error rates indicate problems that result in wasted token consumption. Track errors by type to identify patterns rather than relying on hotfixes — immediate but lacking deeper context.

6. Cache hit rates.

When using prompt caching, monitor hit rates. Low rates suggest caches are not properly configured or content changes too frequently. Easy to fix, and the saving improves the experience for all parties.

Implementation Approaches

Tracking metrics requires infrastructure — logging, aggregation, visualisation, and alerting systems. The sophistication needed depends on scale and organisational maturity. Several approaches to rely on.

Logging and tagging

Log every API call with metadata: userid, featurename, requesttype, modelused, tokens_consumed, latency, cost.
Tag requests consistently so the data can be sliced meaningfully.

Real-time dashboards

Build dashboards showing current spend rate, requests per minute, token consumption trends, and error rates.
Use real-time visibility to enable rapid response to anomalies.

Alerting thresholds

Set alerts for cost spikes, increases in error rates, or unusual usage patterns.
Investigate immediately if hourly spend exceeds 2x normal.

User-level tracking

Monitor per-user consumption to observe patterns.
Identify abuse, bugs causing runaway requests, or power users requiring different handling.

Cost Attribution

Attribute costs to business units, products, or features to understand ROI and make informed trade-offs. A recommendation engine might cost $10,000 monthly — but does it drive enough sales to justify that spend? Without attribution, the question has no answer.

Track cost per business outcome: cost per sale, cost per support ticket resolved, cost per content piece generated. That connects AI spend to business value rather than treating it as an abstract infrastructure cost.

A $5,000 monthly chatbot sounds expensive until it is resolving 10,000 support tickets that would otherwise cost $15 each — a $150,000 monthly value for $5,000 spend. The ROI is 30x, making increased investment obvious rather than questionable.

Identifying Optimisation Opportunities

High costs alone do not tell you what to fix — understanding where inefficiency hides and which optimisations will have the greatest impact is what matters. Metrics reveal problems; analytics reveal specific optimisation targets:

Expensive outliers. If 95% of requests cost $0.01 but 5% cost $0.50, investigate those outliers. Are they legitimately complex or inefficiently constructed?
Inefficient features. A feature used 100 times daily, costing $50 per use, needs optimisation more urgently than one used 10,000 times, costing $0.10.
Temporal patterns. Usage spikes at certain times might indicate batch operations that could be rescheduled or automated differently.
Failed request patterns. If specific request types fail frequently, fix root causes rather than paying for repeated failures.

Continuous Improvement Loop

Models evolve, pricing changes, usage patterns shift, and new features introduce costs that were not anticipated. Optimisation is a dynamic process that never ends — though you can pause between turns. A gradual, iterative approach, focused on one thing at a time:

Baseline current performance and costs.
Implement optimisation (caching, routing, prompt refinement).
Measure impact with A/B tests or staged rollouts.
Calculate actual savings and quality impact.
Roll forward if beneficial, roll back if not.
Identify the next optimisation target.

With proper monitoring and prioritisation, this should not be complex.

Benchmarking and Forecasting

Track cost trends over time to understand how efficiency evolves with scale. Pay attention to whether cost per request stays constant or increases as usage grows (increasing costs suggest inefficient scaling).

Forecast future costs based on user growth and feature expansion. If the current trajectory projects $100,000 monthly spend at target scale, proactive optimisation is essential. It is always better to invest engineering time optimising at a lower scale than scramble when bills become unsustainable.

Establish clear baseline metrics and build multiple scenarios: conservative, expected, and aggressive. Stress-test the cost model: what happens if a feature goes viral or cache hit rates drop 20%? Scenario planning reveals vulnerabilities before they become crises.

Example Impact

A practical case to close on. A company tracking detailed analytics discovered that 40% of token consumption came from a single feature — rarely used and with a redundant prompt structure. Optimising that feature alone reduced overall spending by 35%, amounting to approximately $15,000 per month.

Without analytics, they would never have identified this specific target. They might have switched to cheaper models globally or implemented aggressive output limits across all features — and none would have solved the problem.

Balancing Cost and Quality

The goal of sustainable operations is not minimum spend — it is optimal cost for the quality requirements. Chasing the cheapest possible implementation often backfires. A $0.001 request that produces garbage requiring human review costs more than a $0.01 request that gets it right the first time.

Sometimes, spending more on a capable model reduces costs elsewhere by eliminating the need for complex post-processing, multiple attempts, or human review. Using Opus for complex code generation might cost 5x as much per request as Haiku. But if it produces working code 95% of the time versus 60%, debugging cycles, retry attempts, and developer frustration are taken care of.

Test systematically to find where spending can be reduced without compromising outcomes that matter to users. The teams that optimise costs most successfully are not the ones spending the least. They are the ones spending intentionally — matching cost to value, and measuring relentlessly to ensure they are getting what they pay for.

[✳]

Slava Tarasov
AI for Compliance: A Practical Guide for Teams Tired of Manual Reviews
AI/ML•AI Strategy•LLM Integration
Slava Tarasov
AI for Compliance Monitoring vs. Traditional Rule-Based Systems: What Changes
AI/ML•AI Strategy•LLM Integration
BN Digital
Why Retrieval-Augmented Generation (RAG) Is Already Becoming Legacy Architecture
AI/ML•LLM Integration•AI Strategy

Usage and Cost Optimisation for Foundation Models

Understanding the Cost Structure

Token-Based Pricing Fundamentals

Input vs Output Token Costs

Model Tier Pricing

Hidden Cost Factors

Context Window Usage

Prompt Caching Economics

Failed Requests and Errors

Real-World Cost Modelling

Naive Approach

Optimised Approach

Volume Discounts and Enterprise Pricing

Cost per Use Case Economics

The ROI Perspective

Core Optimisation Strategies

Model Selection and Routing

The Model Capability-Cost Tradeoff

Task Categorisation Framework

Routing Architecture Patterns

Sequential Cascade (Waterfall)

Parallel Ensemble

Pre-Routing/Classification

Hybrid Decomposition

Implementing Smart Routing

Confidence-Based Escalation

Performance Monitoring

Dynamic Routing Logic

Cost Analysis of Routing

Common Pitfalls

Prompt Engineering for Efficiency

Context Management

Sliding Window Approach

Retrieval-Augmented Generation (RAG)

Selective Context Loading

Summarisation Techniques

Session State Management

Measuring Context Efficiency

Caching Strategies

How Prompt Caching Works

What to Cache

Minimum Size Requirements

Cache Economics

Caching Patterns

Cache Invalidation Challenges

Measuring Cache Performance

Output Control

Max Tokens Parameter

Prompt-Based Length Control

Structured Output Formats

Tiered Response Strategies

Iterative Refinement

Stop Sequences

Output Validation and Retry Logic

Measuring Output Efficiency

Cost Impact Example

Batching and Asynchronous Processing

Batch API Offerings

Ideal Batch Use Cases

Non-Batch Asynchronous Patterns

Hybrid Approaches

Implementation Considerations

Cost-Benefit Analysis

When Not to Batch

Advanced Techniques

Smart Retry Logic

Understanding Retry Costs

Exponential Backoff

Retry Budget Pattern

Error Classification

Rate Limit Handling

Circuit Breaker Pattern

Request Deduplication

Timeout Configuration

Monitoring and Alerting

Cost Impact Example

Fine-Tuning Consideration

The Fine-Tuning Cost Structure

When Fine-Tuning Makes Financial Sense

Cost-Benefit Calculation