AI API Pricing Comparison
This page compares pricing structures, not just headline model names. The right choice depends on input size, output length, cache reuse, latency needs, and the cost of a wrong answer.
How providers usually differ
| Provider | Pricing pattern to watch | Good fit | Budget risk |
|---|---|---|---|
| OpenAI | Separate input, cached input, and output token prices across model sizes. | General product assistants, coding, routing, and production workflows. | Premium reasoning models can become expensive when output length is not capped. |
| Anthropic (Claude) | Distinct model tiers, each with separate cached input pricing. | Writing, analysis, long document work, and high-quality support replies. | Long answers and document-heavy prompts need careful output and context limits. |
| Google (Gemini) | Paid-tier prices can depend on modality and context-length tier for some models. | Multimodal apps, long-context workflows, and cost-sensitive flash use cases. | Long-context tiering and non-text features can make simple estimates incomplete. |
| DeepSeek | Cache-hit input can be much cheaper than cache-miss input. | High-volume chat, coding assistance, and repeated-context workloads. | Mixing cache hit and cache miss assumptions can produce unrealistic budgets. |
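The budget risk in the DeepSeek row can be made concrete by blending the two input prices with a measured cache hit rate, rather than assuming all hits or all misses. The following is a minimal sketch: the function name `blended_input_price` and the example prices are illustrative assumptions, not published rates.

```python
def blended_input_price(hit_price: float, miss_price: float, hit_rate: float) -> float:
    """Average input price per million tokens for a given cache hit rate.

    Prices are in USD per million tokens. hit_rate is the fraction of
    input tokens served from cache, between 0.0 and 1.0.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_price + (1.0 - hit_rate) * miss_price


# Hypothetical prices for illustration only, not real provider rates.
for rate in (0.0, 0.6, 1.0):
    price = blended_input_price(hit_price=0.10, miss_price=1.00, hit_rate=rate)
    print(f"hit rate {rate:.0%}: ${price:.2f} per million input tokens")
```

With these assumed prices, the effective input rate ranges from $1.00 (no hits) to $0.10 (all hits), so a budget built on either extreme can be far off if the real hit rate sits in between.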
Input vs output vs cached input
Input tokens are the prompt and context you send. Cached input tokens are repeated context that the provider may bill at a lower rate. Output tokens are generated by the model. In many real applications, output tokens dominate the bill: they are usually priced higher per token than input, and responses can run long when no maximum output length is set.
| Cost driver | Product example | Control lever |
|---|---|---|
| Large input | Summarizing long support tickets or documents. | Trim context, summarize history, retrieve fewer chunks. |
| Large output | Generating long reports, emails, or code files. | Set a maximum response length, ask for an outline first, and generate follow-up sections only when needed. |
| Repeated context | Same policy, docs, or system prompt sent on every call. | Use provider caching where supported and track cache hit rate separately. |
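The three cost drivers above combine into a single per-request estimate. This is a minimal sketch, assuming prices are quoted in USD per million tokens and that cached tokens are counted separately from uncached input tokens. The `Prices` class, the `request_cost` function, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """Token prices in USD per million tokens (hypothetical values)."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def request_cost(prices: Prices, input_tokens: int, cached_tokens: int,
                 output_tokens: int) -> float:
    """Estimated cost in USD of one request.

    input_tokens counts only uncached input; cached_tokens counts input
    billed at the cached rate. The two are assumed not to overlap.
    """
    total = (
        input_tokens * prices.input_per_m
        + cached_tokens * prices.cached_input_per_m
        + output_tokens * prices.output_per_m
    )
    return total / 1_000_000


# Illustrative prices, not real provider rates.
example = Prices(input_per_m=1.00, cached_input_per_m=0.10, output_per_m=4.00)
cost = request_cost(example, input_tokens=900, cached_tokens=2_000, output_tokens=350)
print(f"${cost:.4f} per request")
```

Keeping cached and uncached tokens as separate arguments makes it easier to track the cache hit rate on its own, as the table suggests.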
Model selection guidance
- For routing, classification, and short extraction tasks, start with the lowest-cost model that passes quality checks.
- For customer support, compare total conversation cost, not only cost per message. A better model can reduce follow-up turns.
- For coding and agents, budget multiple calls per user action because planning, tool calls, retries, and final responses may all use tokens.
- For long documents, test the exact context size. A model with cheap input can still be costly if it produces long output.
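The point about coding and agent workloads can be sketched as a per-action budget that sums every model call behind one user action. This is a rough illustration under stated assumptions: `action_cost`, the token counts per step, the prices, and the flat `retry_rate` overhead are all hypothetical.

```python
from typing import Sequence, Tuple


def action_cost(calls: Sequence[Tuple[int, int]], input_price: float,
                output_price: float, retry_rate: float = 0.0) -> float:
    """Estimated cost in USD of one user action made of several model calls.

    calls holds (input_tokens, output_tokens) for each call.
    Prices are in USD per million tokens. retry_rate is an assumed
    average fraction of extra work caused by retries (0.1 means 10%).
    """
    input_tokens = sum(i for i, _ in calls)
    output_tokens = sum(o for _, o in calls)
    base = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return base * (1.0 + retry_rate)


# Hypothetical agent action: planning, one tool call, final response.
steps = [(1_200, 200), (1_500, 100), (2_000, 400)]
single_call = action_cost(steps[-1:], input_price=1.00, output_price=4.00)
full_action = action_cost(steps, input_price=1.00, output_price=4.00, retry_rate=0.10)
print(f"final response only: ${single_call:.5f}")
print(f"whole action:        ${full_action:.5f}")
```

Budgeting only the final response would understate the cost of the action in this example, because the planning and tool-call steps also consume tokens.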
Example: support assistant choice
Suppose a support assistant handles 40,000 messages per month, each with 900 input tokens and 350 output tokens. That is 36 million input tokens and 14 million output tokens per month. Because output is usually priced higher per token, a model with cheaper output may beat a model with cheaper input even though the input volume is larger. If you add a 2,000-token policy prompt to every call, that adds another 80 million input tokens per month, so cached input pricing becomes important.
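The arithmetic for this example can be worked through in a few lines. The message volume and token counts come from the scenario above. The two models and all prices are hypothetical, chosen only to show how the comparison can flip, and the cached case assumes every policy-prompt token is billed at the cached rate.

```python
MESSAGES = 40_000       # monthly messages, from the example
INPUT_TOKENS = 900      # per message
OUTPUT_TOKENS = 350     # per message
POLICY_TOKENS = 2_000   # repeated policy prompt per message


def monthly_cost(input_price, output_price, policy_price=None):
    """Monthly cost in USD. Prices are USD per million tokens.

    policy_price is the rate applied to the repeated policy prompt;
    None means the policy prompt is not sent at all.
    """
    per_message = INPUT_TOKENS * input_price + OUTPUT_TOKENS * output_price
    if policy_price is not None:
        per_message += POLICY_TOKENS * policy_price
    return MESSAGES * per_message / 1_000_000


# Hypothetical prices: model A has cheaper input, model B has cheaper output.
print(f"A, no policy prompt: ${monthly_cost(0.50, 4.00):.2f}")        # 74.00
print(f"B, no policy prompt: ${monthly_cost(1.00, 2.00):.2f}")        # 64.00
print(f"B, policy uncached:  ${monthly_cost(1.00, 2.00, 1.00):.2f}")  # 144.00
print(f"B, policy cached:    ${monthly_cost(1.00, 2.00, 0.10):.2f}")  # 72.00
```

Under these assumed prices, model B is cheaper despite its higher input price, and the uncached policy prompt more than doubles its monthly bill.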
FAQ
Should I always choose the cheapest model?
No. Choose the cheapest model that meets the task quality bar. Cheap failed answers can increase retries, support tickets, or human review cost.
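One way to compare models on this basis is to add the expected cost of a failed answer to the per-call price. This is a simplified sketch: `expected_cost_per_answer`, the success rates, and the flat `failure_cost` (standing in for a retry, a support ticket, or human review) are assumptions for illustration, not measured values.

```python
def expected_cost_per_answer(call_cost: float, success_rate: float,
                             failure_cost: float) -> float:
    """Expected cost in USD of one answer, including the cost of failures.

    success_rate is the fraction of answers that pass the quality bar.
    failure_cost is the assumed average cost of handling a failed answer.
    """
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    return call_cost + (1.0 - success_rate) * failure_cost


# Hypothetical numbers: a cheap model that fails more often versus a
# pricier model that fails less often, with $0.50 per failed answer.
cheap = expected_cost_per_answer(call_cost=0.001, success_rate=0.80, failure_cost=0.50)
better = expected_cost_per_answer(call_cost=0.004, success_rate=0.97, failure_cost=0.50)
print(f"cheap model:  ${cheap:.3f} per answer")
print(f"better model: ${better:.3f} per answer")
```

In this illustration the model with the higher per-call price comes out cheaper per answer once failures are counted, which is why the quality bar matters more than the headline token price.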
Why not rank all models from best to worst?
Because workloads differ. A model that is best for long writing may not be best for routing, code review, or short extraction.
Does this include every possible provider charge?
No. It focuses on token pricing. Charges for audio, grounding, storage, batch or priority processing, and fine-tuning, as well as account-specific discounts, vary by provider.