LLM Cost Optimization: How We Cut Reply Generation from $0.011 to $0.0009


When we shipped the first version of AI-generated replies for HelperX, each reply cost us about $0.011 in API spend. That sounds tiny until you multiply by 30 replies per slot per day times 200 active slots: roughly $66 per day, or ~$2,000 per month.
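The arithmetic behind that figure, as a quick sketch. The helper name `estimateSpend` and the 30-day month are assumptions for illustration; the per-reply cost, replies per slot, and slot count are the ones quoted above.

```javascript
// Back-of-the-envelope spend estimate. `estimateSpend` is an illustrative
// helper, not part of any real codebase.
function estimateSpend({ costPerReply, repliesPerSlotPerDay, activeSlots, daysPerMonth = 30 }) {
  const repliesPerDay = repliesPerSlotPerDay * activeSlots;
  const daily = repliesPerDay * costPerReply;
  return { repliesPerDay, daily, monthly: daily * daysPerMonth };
}

const before = estimateSpend({ costPerReply: 0.011, repliesPerSlotPerDay: 30, activeSlots: 200 });
console.log(before.repliesPerDay);       // 6000
console.log(before.daily.toFixed(2));    // 66.00
console.log(before.monthly.toFixed(0));  // 1980
```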

A year later, we're spending $0.0009 per reply — a 12x reduction. Same model providers, similar reply quality, same throughput. The savings came from four optimization layers stacked on top of each other.

Here is what each layer does, the order in which we applied them, and the cost reduction each one produced.

The starting point

The naive implementation looked like this:

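import Anthropic from '@anthropic-ai/sdk';

// The client reads ANTHROPIC_API_KEY from the environment by default.
const anthropic = new Anthropic();

// Naive version: one Sonnet call per tweet, with every instruction inlined
// in the user message so nothing can be reused across requests.
async function generateReply(tweet, persona) {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-6',
    max_tokens: 200,
    messages: [{
      role: 'user',
      content: `You are a ${persona.role} with tone level ${persona.tone}.
                Reply to this tweet in 2-3 sentences:
                
                Tweet: "${tweet.text}"
                Author: @${tweet.author} (${tweet.followers} followers)
                
                Reply should add value without being promotional.`
    }],
  });
  return response.content[0].text;
}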

Sonnet, fresh request every time, full system prompt baked into every call. Cost per reply: ~$0.011 with overhead.

Layer 1: Model routing (40% savings)

The first realization: not every reply needs the smartest model.

A reply to "AI is changing everything" doesn't need Sonnet-level reasoning. A reply to a detailed technical thread might. We built a router that picks the model based on the complexity of the input tweet:

function routeModel(tweet) {
  const complexityScore =
    (tweet.text.length > 200 ? 2 : 0) +
    (tweet.hasNumbers ? 1 : 0) +
    (tweet.questionCount > 0 ? 1 : 0) +
    (tweet.technicalKeywords > 2 ? 2 : 0);

  if (complexityScore >= 4) return 'claude-sonnet-4-6';
  return 'claude-haiku-4-5-20251001';
}
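
The router assumes each tweet object already carries `hasNumbers`, `questionCount`, and `technicalKeywords`. A minimal sketch of deriving those fields from raw text follows; `extractFeatures` and the keyword list are hypothetical, and `routeModel` is repeated only so the block runs on its own.

```javascript
// Hypothetical keyword list; a real deployment would tune this per niche.
const TECHNICAL_KEYWORDS = ['api', 'latency', 'gpu', 'database', 'kubernetes', 'compiler', 'throughput'];

// Derive the fields that routeModel reads from the raw tweet text.
function extractFeatures(text) {
  const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return {
    text,
    hasNumbers: /\d/.test(text),
    questionCount: (text.match(/\?/g) ?? []).length,
    technicalKeywords: words.filter((w) => TECHNICAL_KEYWORDS.includes(w)).length,
  };
}

// Same scoring as the router above, repeated so this block is self-contained.
function routeModel(tweet) {
  const complexityScore =
    (tweet.text.length > 200 ? 2 : 0) +
    (tweet.hasNumbers ? 1 : 0) +
    (tweet.questionCount > 0 ? 1 : 0) +
    (tweet.technicalKeywords > 2 ? 2 : 0);
  return complexityScore >= 4 ? 'claude-sonnet-4-6' : 'claude-haiku-4-5-20251001';
}

const simple = extractFeatures('AI is changing everything');
const technical = extractFeatures(
  'Our API latency went from 120ms to 45ms after moving the database off shared GPU nodes. ' +
  'Has anyone measured throughput on Kubernetes with the same setup? Curious how the numbers compare ' +
  'once autoscaling kicks in under sustained production load.'
);

console.log(routeModel(simple));     // claude-haiku-4-5-20251001
console.log(routeModel(technical));  // claude-sonnet-4-6
```

Word-level matching like this is crude; it is only meant to show where the router's inputs could come from.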

Human-evaluated A/B test on 500 reply pairs:

  • Haiku replies rated equal or better: 87%

  • Haiku replies rated noticeably worse: 8%

  • Haiku replies rated much worse: 5% (almost all on high-complexity input — the router catches these)

Production distribution: 78% Haiku, 22% Sonnet. Per-reply cost dropped from $0.011 to ~$0.00081.

Layer 2: Prompt caching (60% savings on input)

Anthropic's prompt caching lets you mark a prefix of your prompt as cacheable. The first request writes the cache and pays a premium over the normal input price (25% for the default 5-minute TTL); subsequent requests within the TTL pay 10% of the input price for the cached portion.

Restructured for cache hits:

const response = await anthropic.messages.create({
  model: 'claude-haiku-4-5-20251001',
  max_tokens: 200,
  system: [
    {
      type: 'text',
      text: LONG_SYSTEM_INSTRUCTIONS,
      cache_control: { type: 'ephemeral' },
    },
    {
      type: 'text',
      text: PERSONA_TEMPLATES_BLOCK,
      cache_control: { type: 'ephemeral' },
    },
  ],
  messages: [{
    role: 'user',
    content: `Persona: ${persona.role}, tone ${persona.tone}.
              Tweet from @${tweet.author}: "${tweet.text}"`,
  }],
});

Two cache blocks: a system block (the rules) and a persona templates block. Both are stable across many requests; only the per-tweet user message varies.

Cache hit rate after structuring this way: 94%. Per-reply cost dropped to $0.00067.
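
A rough way to see how hit rate translates into input cost. The token counts below are hypothetical, `expectedInputTokens` is an illustrative helper, and the multipliers (0.1x for cache reads, 1.25x for cache writes on the default 5-minute TTL) reflect Anthropic's published pricing as we understand it; check current pricing before relying on them.

```javascript
// Expected input cost per request, expressed in "base-price token" units.
// cachedTokens: size of the stable prefix (hypothetical).
// dynamicTokens: size of the per-tweet message (hypothetical).
// hitRate: fraction of requests that read the prefix from cache.
function expectedInputTokens({
  cachedTokens,
  dynamicTokens,
  hitRate,
  readMultiplier = 0.1,    // assumed cache-read price relative to base
  writeMultiplier = 1.25,  // assumed cache-write price relative to base
}) {
  const prefixCost = cachedTokens * (hitRate * readMultiplier + (1 - hitRate) * writeMultiplier);
  return prefixCost + dynamicTokens; // the per-tweet part always pays full price
}

const uncached = 1500 + 100;
const cached = expectedInputTokens({ cachedTokens: 1500, dynamicTokens: 100, hitRate: 0.94 });

console.log(cached.toFixed(1));                   // 353.5
console.log((1 - cached / uncached).toFixed(2));  // 0.78
```

The saving depends heavily on how much of the prompt sits in the cached prefix, so these example numbers will not match any particular production workload.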

Layer 3: Embedding-based deduplication (35% savings)

In an active niche, you'll see the same news event tweeted by 8 different accounts in the same hour. Same topic, slightly different framing. The reply doesn't need to be generated from scratch.

async function generateReplyWithDedup(tweet, persona) {
  const embedding = await embedTweet(tweet.text);

  const cached = await findSimilarReply(embedding, persona.id, {
    similarityThreshold: 0.93,
    maxAgeHours: 6,
  });

  if (cached) {
    return adaptReply(cached.reply, tweet); // light rewrite
  }

  const reply = await llmGenerate(tweet, persona);
  await storeReplyEmbedding(embedding, reply, persona.id);
  return reply;
}

Cache hit rate on similarity: 32%. The adaptReply step uses Haiku for a tiny transformation. Per-reply cost dropped to $0.00050.
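
The lookup helpers are not shown above. A minimal in-memory sketch of what `findSimilarReply` and `storeReplyEmbedding` could look like, assuming embeddings are plain number arrays and using cosine similarity; a production system would more likely use a vector database, and the optional `now` parameter is added here only to make the age check testable.

```javascript
// In-memory stand-in for a vector store.
// Each entry: { embedding, reply, personaId, createdAt }.
const store = [];

function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function storeReplyEmbedding(embedding, reply, personaId, now = Date.now()) {
  store.push({ embedding, reply, personaId, createdAt: now });
}

// Returns the most similar stored reply for this persona that is recent
// enough and at or above the threshold, or null if there is none.
function findSimilarReply(embedding, personaId, { similarityThreshold, maxAgeHours }, now = Date.now()) {
  const maxAgeMs = maxAgeHours * 60 * 60 * 1000;
  let best = null;
  for (const entry of store) {
    if (entry.personaId !== personaId) continue;
    if (now - entry.createdAt > maxAgeMs) continue;
    const similarity = cosineSimilarity(embedding, entry.embedding);
    if (similarity >= similarityThreshold && (!best || similarity > best.similarity)) {
      best = { reply: entry.reply, similarity };
    }
  }
  return best;
}

// Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
const hour = 60 * 60 * 1000;
const t0 = Date.UTC(2025, 0, 1);
const opts = { similarityThreshold: 0.93, maxAgeHours: 6 };
storeReplyEmbedding([1, 0, 0], 'Stored reply', 'persona-a', t0);

console.log(findSimilarReply([0.98, 0.1, 0], 'persona-a', opts, t0 + hour) !== null); // true: near-duplicate
console.log(findSimilarReply([0.5, 0.5, 0.5], 'persona-a', opts, t0 + hour));         // null: too different
console.log(findSimilarReply([0.98, 0.1, 0], 'persona-a', opts, t0 + 7 * hour));      // null: older than 6 hours
console.log(findSimilarReply([0.98, 0.1, 0], 'persona-b', opts, t0 + hour));          // null: different persona
```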

Quality impact

30-day A/B test:

  • Reply engagement rate: unchanged

  • Human-rated reply quality: unchanged

  • Detection rate: unchanged

Layer 4: Streaming and adaptive max_tokens (15% savings)

Streaming with early termination

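// `anthropic`, `model`, and `messages` are set up as in the earlier examples.
const stream = await anthropic.messages.stream({ model, messages, max_tokens: 200 });

let reply = '';
// Counts consecutive text deltas after which the reply ends on sentence punctuation.
let sentenceEndStreak = 0;
for await (const event of stream) {
  // Only text deltas carry reply text; ignore every other event type.
  if (event.type === 'content_block_delta' && event.delta.type === 'text_delta') {
    reply += event.delta.text;

    if (reply.length > 40 && /[.!?]\s*$/.test(reply)) {
      sentenceEndStreak++;
      if (sentenceEndStreak > 2) {
        // The reply has looked finished for three deltas in a row: stop generating.
        stream.controller.abort();
        break;
      }
    } else {
      sentenceEndStreak = 0;
    }
  }
}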

Saves about 12% of output tokens.
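
The stopping rule can be exercised without calling the API by replaying text deltas through the same condition. This is a sketch only: `replayWithEarlyStop` is a hypothetical helper, and the chunk boundaries below are invented rather than captured from a real stream.

```javascript
// Replays text deltas through the same rule as the streaming loop: stop once
// the accumulated reply is longer than `minLength` characters and has ended on
// sentence punctuation for more than `maxStreak` consecutive deltas.
function replayWithEarlyStop(deltas, { minLength = 40, maxStreak = 2 } = {}) {
  let reply = '';
  let streak = 0;
  let consumed = 0;
  for (const delta of deltas) {
    reply += delta;
    consumed++;
    if (reply.length > minLength && /[.!?]\s*$/.test(reply)) {
      streak++;
      if (streak > maxStreak) break;
    } else {
      streak = 0;
    }
  }
  return { reply, consumed, stoppedEarly: consumed < deltas.length };
}

const deltas = [
  'Interesting point about caching.',          // ends a sentence, but the reply is still under 40 chars
  ' The hit rate matters more than the TTL.',  // streak 1
  ' ',                                         // whitespace after punctuation: streak 2
  '\n',                                        // streak 3, rule fires here
  ' Also, here is a tangent nobody asked for.',
];

const result = replayWithEarlyStop(deltas);
console.log(result.consumed);      // 4
console.log(result.stoppedEarly);  // true
```

In this replay the rule only fires because whitespace-only deltas follow the sentence end, so it is worth checking the heuristic against how your model actually chunks its output.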

Adaptive max_tokens

function estimateMaxTokens(tweet, persona) {
  const base = 80;
  const tweetBoost = tweet.text.length > 150 ? 40 : 0;
  const personaBoost = persona.verbosity === 'high' ? 40 : 0;
  return Math.min(220, base + tweetBoost + personaBoost);
}
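
A few sample inputs make the budget tiers concrete. The tweet and persona objects below are made up, and the function is copied from above so the block runs on its own.

```javascript
// Copied from the function above so this block is self-contained.
function estimateMaxTokens(tweet, persona) {
  const base = 80;
  const tweetBoost = tweet.text.length > 150 ? 40 : 0;
  const personaBoost = persona.verbosity === 'high' ? 40 : 0;
  return Math.min(220, base + tweetBoost + personaBoost);
}

const shortTweet = { text: 'Shipping day.' };
const longTweet = { text: 'x'.repeat(180) }; // stand-in for a 180-character tweet

console.log(estimateMaxTokens(shortTweet, { verbosity: 'low' }));  // 80
console.log(estimateMaxTokens(longTweet, { verbosity: 'low' }));   // 120
console.log(estimateMaxTokens(longTweet, { verbosity: 'high' }));  // 160
```

With these boosts the largest possible budget is 160, so the 220 cap only matters if more boosts are added later. The result would be passed as `max_tokens` in the request.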

Combined Layer 4 savings: ~15% on output cost = roughly 10% on total per-reply cost.
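
The step from 15% to 10% follows from how much of the per-reply cost is output. A small worked sketch, using only the figures quoted in this article; `impliedOutputShare` is an illustrative helper name.

```javascript
// totalSavings = outputShare * outputSavings,
// so outputShare = totalSavings / outputSavings.
function impliedOutputShare(outputSavings, totalSavings) {
  return totalSavings / outputSavings;
}

// 15% saved on output showing up as 10% saved overall implies that output
// tokens account for roughly two thirds of the per-reply cost at this stage.
const share = impliedOutputShare(0.15, 0.10);
console.log(share.toFixed(2));  // 0.67

// Applying the 10% total saving to the $0.00050 per-reply cost after Layer 3.
const afterLayer4 = 0.00050 * (1 - 0.10);
console.log(afterLayer4.toFixed(5));  // 0.00045
```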

The actual production number

The blended production cost lands at $0.00088 per reply after accounting for retries, failures, and edge cases. That is down from the $0.011 starting point, a roughly 12x reduction.

Cost summary

| Layer | Per-reply cost | Reduction vs. naive |
| --- | --- | --- |
| Naive Sonnet, no caching | $0.0110 | |
| Model routing | $0.00081 | 13.6x |
| Prompt caching (94% hit rate) | $0.00067 | 16.4x |
| Embedding deduplication (32% hit) | $0.00050 | 22x |
| Streaming + adaptive max_tokens | $0.00045 | 24.4x |
| Production overhead | $0.00088 | 12.5x |
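
The reduction factors are each stage's cost divided into the naive baseline. A short sketch that recomputes them from the per-reply costs listed above, plus the gap between the theoretical and production numbers; the variable names are illustrative.

```javascript
const BASELINE = 0.011;

// Per-reply costs as listed in the cost summary.
const stages = [
  ['Model routing', 0.00081],
  ['Prompt caching', 0.00067],
  ['Embedding deduplication', 0.00050],
  ['Streaming + adaptive max_tokens', 0.00045],
  ['Production overhead', 0.00088],
];

// Reduction relative to the naive baseline, rounded to one decimal place.
const rows = stages.map(([name, cost]) => ({
  name,
  vsBaseline: (BASELINE / cost).toFixed(1) + 'x',
}));
console.table(rows); // 13.6x, 16.4x, 22.0x, 24.4x, 12.5x

// Share of the real production cost that the theoretical number misses.
const overheadGapPct = Math.round((1 - 0.00045 / 0.00088) * 100);
console.log(overheadGapPct); // 49
```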

What didn't work

1. Self-hosted open-source models. Throughput unpredictable, quality worse on short-form, total cost not competitive with Haiku below ~100M tokens/day.

2. Pre-generating reply pools. Replies sounded canned, detection went up.

3. Cross-provider routing. Quality differences across providers were noticeable. Sticking with one provider eliminated integration bugs.

4. Aggressive temperature reduction. Lower temp made replies feel mechanical; engagement dropped 18%.

Key takeaways

  1. Model routing is the first and biggest lever — 78% of traffic moves down a tier with no quality loss.

  2. Prompt caching needs to be designed in, not bolted on.

  3. Embedding-based dedup is underrated. Many "different" requests are near-duplicates.

  4. Streaming with early termination is a small but free win.

  5. Adaptive max_tokens doesn't save direct cost but improves quality on short outputs.

  6. Open-source models aren't worth it below ~100M tokens/day.

  7. Document the production overhead — naive theoretical numbers are 30-50% off the actual production cost.

12x cost reduction is what it looks like when four small wins compound.


HelperX uses all four layers in production.