Optimize Gateway Pricing With 3-Tier Cost Calculation

by Alex Johnson

When you're running an LLM gateway, especially one that handles a lot of traffic and integrates with providers like Gemini, Vertex AI, or Anthropic, you'll quickly run into a common challenge: cost calculation. It's no longer a simple matter of counting input and output tokens. Many modern LLM providers offer prompt caching, which significantly changes how they bill you. If your gateway uses a single price for all input tokens, you're going to end up with discrepancies between what you think you're being charged and what the provider actually bills you. This is particularly true when you have a high cache hit rate – if 70% or more of your input tokens are served from the cache, a single-rate estimate can drift far from the actual invoice, because cached input tokens are usually billed at a fraction of the standard rate. To get things right, we need a more nuanced approach: a 3-tier pricing structure that accounts for standard input, cached input, and output. This article will break down why this is crucial and how to implement it.

Understanding the Nuances of Prompt Caching and Billing

The core issue stems from how prompt caching works. When a request comes in, the provider checks whether the prompt (or a large prefix of it) has already been processed and is still in its cache. If it has, the provider can reuse that work: the request completes faster and, crucially, the cached portion of the input is billed at a discounted rate, because no full computation was needed for those tokens. However, if your billing system treats all input tokens the same, regardless of whether they were cached or not, you're over-counting the cost of those cached requests. This is where the need for a 3-tier system becomes apparent. We need to differentiate between tokens that required full processing by the provider (standard input) and those whose processing was reused from the cache (cached input). Then, of course, there are the output tokens, which represent the actual generated response and are typically billed at their own, usually higher, rate.
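To make the arithmetic concrete, here's a minimal sketch of the 3-tier split in Python. The function name and the per-million-token rate parameters are illustrative rather than part of any SDK; the actual rates come from your provider's pricing page.

```python
def three_tier_cost(
    total_prompt: int,
    cached_prompt: int,
    output_tokens: int,
    input_rate: float,   # USD per 1M standard (uncached) input tokens
    cached_rate: float,  # USD per 1M cached input tokens
    output_rate: float,  # USD per 1M output tokens
) -> float:
    """Price a request by billing cached and uncached input separately."""
    uncached_prompt = total_prompt - cached_prompt
    return (
        uncached_prompt * input_rate
        + cached_prompt * cached_rate
        + output_tokens * output_rate
    ) / 1_000_000
```

Expressing the rates per million tokens mirrors how most providers publish their pricing, which keeps your rate table easy to audit against the official numbers.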

Defining Your Token Groups for Accurate Billing

To implement this 3-tier system, we first need to clearly define the different types of tokens involved in a request. Think of it as categorizing the information flow. We'll use the following definitions:

  • total_prompt (or prompt_tokens): This is the entire input you send to the LLM provider for a given request. It's the sum of everything the model needs to process or consider.
  • cached_prompt (or cached_tokens): This is the portion of your total_prompt that was successfully retrieved from the provider's cache. If there's no caching involved for this specific request, this value will be zero. A crucial constraint here is that cached_prompt can never be more than total_prompt, and it must always be zero or a positive number. We clamp it to ensure 0 <= cached_prompt <= total_prompt.
  • output_tokens (or completion_tokens): This refers to the tokens that make up the LLM's generated response. For some providers, like Gemini or Vertex AI, this might not be a single, straightforward number. You might need to sum the tokens from the actual answer (candidates_token_count) and any internal reasoning tokens the model produced (reported separately, e.g. as thoughts_token_count) to arrive at the true output figure; the sketch after this list shows one way to handle both this summation and the clamping.
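
Here's a rough sketch of how those three groups might be normalized from a provider's usage metadata before pricing. The field names (prompt_tokens, cached_tokens, candidates_tokens, thinking_tokens) and the TokenGroups container are illustrative placeholders for whatever your provider actually reports, not a specific SDK's schema.

```python
from dataclasses import dataclass


@dataclass
class TokenGroups:
    total_prompt: int
    cached_prompt: int
    output_tokens: int


def normalize_usage(
    prompt_tokens: int,
    cached_tokens: int | None,
    candidates_tokens: int,
    thinking_tokens: int = 0,
) -> TokenGroups:
    """Clamp cached tokens into [0, total_prompt] and fold any internal
    reasoning tokens into the output count, so every provider maps onto
    the same three billing groups."""
    total_prompt = max(prompt_tokens, 0)
    cached_prompt = min(max(cached_tokens or 0, 0), total_prompt)
    output_tokens = max(candidates_tokens, 0) + max(thinking_tokens, 0)
    return TokenGroups(total_prompt, cached_prompt, output_tokens)
```

The clamping enforces the constraint from the list above: even if a provider reports odd numbers or a field is missing, cached_prompt can never go negative or exceed total_prompt.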