LLM Cost Optimization: How We Cut Reply Generation from $0.011 to $0.0009

When we shipped the first version of AI-generated replies for HelperX, each reply cost us about $0.011 in API spend. That sounds tiny until you multiply by 30 replies per slot per day times 200 active slots: roughly $66 per day, or ~$2,000 per month.

A year later, we're spending $0.0009 per reply — a 12x reduction. Same model providers, similar reply quality, same throughput. The savings came from four optimization layers stacked on top of each other.

This is exactly what each layer does, the order we applied them, and the cost reduction each one produced.

The starting point

The naive implementation looked like this:

async function generateReply(tweet, persona) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 200,
    messages: [{
      role: 'user',
      content: `You are a \({persona.role} with tone level \){persona.tone}.
                Reply to this tweet in 2-3 sentences:
                
                Tweet: "${tweet.text}"
                Author: @\({tweet.author} (\){tweet.followers} followers)
                
                Reply should add value without being promotional.`
    }],
  });
  return response.content[0].text;
}

Sonnet, fresh request every time, full system prompt baked into every call. Cost per reply: ~$0.011 with overhead.

Layer 1: Model routing (40% savings)

The first realization: not every reply needs the smartest model.

A reply to "AI is changing everything" doesn't need Sonnet-level reasoning. A reply to a detailed technical thread might. We built a router that picks the model based on the complexity of the input tweet:

function routeModel(tweet) {
  const complexityScore =
    (tweet.text.length > 200 ? 2 : 0) +
    (tweet.hasNumbers ? 1 : 0) +
    (tweet.questionCount > 0 ? 1 : 0) +
    (tweet.technicalKeywords > 2 ? 2 : 0);

  if (complexityScore >= 4) return 'claude-sonnet-4-6';
  return 'claude-haiku-4-5-20251001';
}

Human-evaluated A/B test on 500 reply pairs:

Haiku replies rated equal or better: 87%
Haiku replies rated noticeably worse: 8%
Haiku replies rated much worse: 5% (almost all on high-complexity input — the router catches these)

Production distribution: 78% Haiku, 22% Sonnet. Per-reply cost dropped from $0.011 to ~$0.00081.

Layer 2: Prompt caching (60% savings on input)

Anthropic's prompt caching lets you mark a portion of your prompt as cacheable. First request pays full input cost; subsequent requests within TTL pay 10% of input cost for the cached portion.

Restructured for cache hits:

const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 200,
  system: [
    {
      type: 'text',
      text: LONG_SYSTEM_INSTRUCTIONS,
      cache_control: { type: 'ephemeral' },
    },
    {
      type: 'text',
      text: PERSONA_TEMPLATES_BLOCK,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{
    role: 'user',
    content: `Persona: \({persona.role}, tone \){persona.tone}.
              Tweet from @\({tweet.author}: "\){tweet.text}"`,
  }],
});

Two cache blocks: a system block (the rules) and a persona templates block. Both are stable across many requests; only the per-tweet user message varies.

Cache hit rate after structuring this way: 94%. Per-reply cost dropped to $0.00067.

Layer 3: Embedding-based deduplication (35% savings)

In an active niche, you'll see the same news event tweeted by 8 different accounts in the same hour. Same topic, slightly different framing. The reply doesn't need to be generated from scratch.

async function generateReplyWithDedup(tweet, persona) {
  const embedding = await embedTweet(tweet.text);

  const cached = await findSimilarReply(embedding, persona.id, {
    similarityThreshold: 0.93,
    maxAgeHours: 6,
  });

  if (cached) {
    return adaptReply(cached.reply, tweet); // light rewrite
  }

  const reply = await llmGenerate(tweet, persona);
  await storeReplyEmbedding(embedding, reply, persona.id);
  return reply;
}

Cache hit rate on similarity: 32%. The adaptReply step uses Haiku for a tiny transformation. Per-reply cost dropped to $0.00050.

Quality impact

30-day A/B test:

Reply engagement rate: unchanged
Reply-rated quality: unchanged
Detection rate: unchanged

Layer 4: Streaming and adaptive max_tokens (15% savings)

Streaming with early termination

const stream = await anthropic.messages.stream({ model, messages, max_tokens: 200 });

let reply = '';
let consecutiveSpaces = 0;
for await (const event of stream) {
  if (event.type === 'content_block_delta') {
    const delta = event.delta.text;
    reply += delta;
    
    if (reply.length > 40 && /[.!?]\s*$/.test(reply)) {
      consecutiveSpaces++;
      if (consecutiveSpaces > 2) {
        await stream.controller.abort();
        break;
      }
    } else {
      consecutiveSpaces = 0;
    }
  }
}

Saves about 12% of output tokens.

Adaptive max_tokens

function estimateMaxTokens(tweet, persona) {
  const base = 80;
  const tweetBoost = tweet.text.length > 150 ? 40 : 0;
  const personaBoost = persona.verbosity === 'high' ? 40 : 0;
  return Math.min(220, base + tweetBoost + personaBoost);
}

Combined Layer 4 savings: ~15% on output cost = roughly 10% on total per-reply cost.

The actual production number

The blended production cost lands at $0.00088 per reply after accounting for retries, failures, and edge cases. Down from $0.011 starting point — a 12x reduction.

Cost summary

Layer	Per-reply cost	Reduction
Naive Sonnet, no caching	$0.0110	—
Model routing	$0.00081	13.6x
Prompt caching (94% hit rate)	$0.00067	16.4x
Embedding deduplication (32% hit)	$0.00050	22x
Streaming + adaptive max_tokens	$0.00045	24.4x
Production overhead	$0.00088	12.5x

What didn't work

1. Self-hosted open-source models. Throughput unpredictable, quality worse on short-form, total cost not competitive with Haiku below ~100M tokens/day.

2. Pre-generating reply pools. Replies sounded canned, detection went up.

3. Cross-provider routing. Quality differences across providers were noticeable. Sticking with one provider eliminated integration bugs.

4. Aggressive temperature reduction. Lower temp made replies feel mechanical; engagement dropped 18%.

Key takeaways

Model routing is the first and biggest lever — 78% of traffic moves down a tier with no quality loss.
Prompt caching needs to be designed in, not bolted on.
Embedding-based dedup is underrated. Many "different" requests are near-duplicates.
Streaming with early termination is a small but free win.
Adaptive max_tokens doesn't save direct cost but improves quality on short outputs.
Open-source models aren't worth it below ~100M tokens/day.
Document the production overhead — naive theoretical numbers are 30-50% off the actual production cost.

12x cost reduction is what it looks like when four small wins compound.

HelperX uses all four layers in production. Bring your own LLM API key. Free 30-day trial.

LLM Cost Optimization: How We Cut Reply Generation from $0.011 to $0.0009

The starting point

Layer 1: Model routing (40% savings)

Layer 2: Prompt caching (60% savings on input)

Layer 3: Embedding-based deduplication (35% savings)

Quality impact

Layer 4: Streaming and adaptive max_tokens (15% savings)

Streaming with early termination

Adaptive max_tokens

The actual production number

Cost summary

What didn't work

Key takeaways

Comments

More from this blog

Scheduling 10,000 Actions Across Time Zones: Work-Time Windows in Node.js

Server-Side Rate Caps You Can't Bypass: Why Client Trust Is a Security Bug

Building a Persona Engine: How We Make AI Sound Like Different People

Rate Limiting Done Right: Respecting X's Anti-Abuse Stack

Command Palette

The starting point

Layer 1: Model routing (40% savings)

Layer 2: Prompt caching (60% savings on input)

Layer 3: Embedding-based deduplication (35% savings)

Quality impact

Layer 4: Streaming and adaptive max_tokens (15% savings)

Streaming with early termination

Adaptive max_tokens

The actual production number

Cost summary

What didn't work

Key takeaways

Comments

More from this blog