Summary
Estimating AWS Bedrock costs starts with three inputs: request volume, average input tokens, and average output tokens. Use the free CountTokens API to get exact token counts before writing production code. It accepts your real prompts and returns model-specific counts at no charge. The bigger risk is what the token formula misses: Knowledge Base infrastructure, CloudWatch logging, and agent token amplification together often add 20 to 35 percent to the inference bill. Our workload-based Bedrock Cost Calculator builds a baseline estimate from the numbers you already know, like users, documents, or images per month, without requiring you to work backward from requests per minute.
The AWS Bedrock pricing page shows clean per-token numbers. Most teams use those numbers to build a budget and move forward. Then the bill arrives and it's 1.5 to 2 times what they projected.
The overrun almost never comes from a surprise fee. It comes from costs that were predictable but not modeled: input tokens that are larger than expected, output tokens that run long, and infrastructure charges that don't show up on the inference line at all. This guide covers how to estimate accurately before you spend a dollar on inference.
Every Bedrock text inference request produces a cost you can calculate in advance. The formula is:
(input tokens ÷ 1,000 × input rate) + (output tokens ÷ 1,000 × output rate) = cost per request
Multiply that by your monthly request volume to get a monthly estimate.
As a concrete example: Claude Sonnet 4.6 costs $0.003 per 1,000 input tokens and $0.015 per 1,000 output tokens. A request with 2,000 input tokens and 500 output tokens costs $0.006 + $0.0075 = $0.0135. At 100,000 requests per month, that is $1,350 per month before any infrastructure costs.
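Here is that arithmetic as a small Python helper, using the example rates above (illustrative numbers, not live pricing):

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     input_rate: float, output_rate: float) -> float:
    """Per-request cost; rates are USD per 1,000 tokens."""
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate

# Example rates quoted above for Claude Sonnet 4.6 (on-demand, us-east-1).
per_request = cost_per_request(2_000, 500, input_rate=0.003, output_rate=0.015)
print(f"${per_request:.4f} per request, ${per_request * 100_000:,.2f} per month")
# -> $0.0135 per request, $1,350.00 per month
```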
The formula is simple. The hard part is knowing what to put into it.
A token is roughly 4 characters or three-quarters of an English word. A 1,000-word document contains approximately 1,300 tokens. Those rules of thumb are useful for quick estimates, but they are approximations. Token counts are model-specific because different models use different tokenization strategies.
AWS provides two ways to get exact counts before sending a single inference request.
The console Tokenizer is the fastest option if you are not yet writing code. Open the Amazon Bedrock console, go to Playgrounds, click Test, and then click the Tokenizer button in the top navigation bar. Paste your prompt, select your model, and the tool returns an exact count. It costs nothing.
The CountTokens API works programmatically. It accepts the same input formats as InvokeModel and Converse, so the count it returns matches exactly what you would be charged during inference. Calling it requires the bedrock:CountTokens IAM permission. There is no charge.
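A minimal sketch of a programmatic count with boto3. The operation name, Converse-style input shape, and inputTokens response field follow the CountTokens documentation as we understand it, and the model ID is a placeholder; verify both against the current SDK before relying on them:

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Count tokens for a Converse-format request without paying for inference.
# The "input" payload mirrors what you would later send to Converse.
response = bedrock_runtime.count_tokens(
    modelId="anthropic.claude-sonnet-4-6",  # placeholder; copy the exact ID from the console
    input={
        "converse": {
            "system": [{"text": "You are a contracts analyst."}],
            "messages": [
                {"role": "user", "content": [{"text": "Summarize the attached clause."}]}
            ],
        }
    },
)
print(response["inputTokens"])  # exact billable input token count
```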
Use the CountTokens API on a representative sample of real prompts before you build your estimate. The gap between assumed token counts and actual token counts is where most budgets go wrong.
1. Request volume. Monthly request counts are what you multiply everything else by. For a customer-facing chat application, estimate this as projected monthly active users × conversations per user × messages per conversation, since each message turn is a separate model request. For a document processing pipeline, it is the number of documents you expect to process.
2. Average input tokens per request. Input tokens include everything you send to the model: your system prompt, any context or documents, and the user's message. System prompts are often larger than teams expect. A detailed system prompt can run 500 to 1,500 tokens, and that cost repeats on every single request. Use the CountTokens API on your actual prompts, not a word count estimate.
3. Average output tokens per request. Output tokens are what the model generates in response. This is harder to predict before you have a working prompt, but you can bracket it. A yes/no classification or a simple label produces 10 to 50 tokens. A structured extraction from a document produces 200 to 600 tokens. A full document summary or a conversational response can run 500 to 2,000 tokens. Output tokens typically cost 3 to 5 times more than input tokens for the same model. They drive the majority of most inference bills.
4. Model selection. Model choice changes the cost per token, the context window available, and whether the model supports vision. Amazon Nova Micro at $0.000035 per 1,000 input tokens and Claude Opus 4.7 at $0.005 per 1,000 input tokens represent a 140x price difference within the same platform. Choosing the right model for each task is often the highest-leverage cost decision you will make.
5. Output token ceiling. Set a max_tokens parameter (or your model's equivalent) on every request. Without one, verbose models can generate far more tokens than your use case requires, and you pay for every token they produce. The sketch after this list ties all five inputs into a single monthly estimate.
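The sketch below folds the five inputs into one estimator. The figures are the example numbers from this guide, and the max_tokens clamp is a simplification (it caps the average rather than modeling the output distribution):

```python
from dataclasses import dataclass

@dataclass
class WorkloadEstimate:
    monthly_requests: int   # 1. request volume
    avg_input_tokens: int   # 2. measured with CountTokens, not guessed
    avg_output_tokens: int   # 3. bracketed by task type
    input_rate: float       # 4. model choice: USD per 1,000 input tokens
    output_rate: float      #    USD per 1,000 output tokens
    max_tokens: int = 4096  # 5. output ceiling set on every request

    def monthly_cost(self) -> float:
        out = min(self.avg_output_tokens, self.max_tokens)  # ceiling caps output spend
        per_request = (self.avg_input_tokens / 1000) * self.input_rate \
            + (out / 1000) * self.output_rate
        return per_request * self.monthly_requests

# The chat example from earlier: 100K requests at 2,000 in / 500 out.
chat = WorkloadEstimate(100_000, 2_000, 500, input_rate=0.003, output_rate=0.015)
print(f"${chat.monthly_cost():,.2f} per month before infrastructure")  # $1,350.00
```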
The inference formula above covers model costs. Most bills on production workloads include additional charges that the formula does not capture.
Knowledge Bases and vector storage. If your application uses Retrieval Augmented Generation (RAG), you need a vector store. Bedrock Knowledge Bases historically provisioned OpenSearch Serverless in the background, which requires a minimum of 2 OCUs (OpenSearch Compute Units) for a production-grade setup. At roughly $0.24 per OCU-hour, that floor costs approximately $350 per month (2 OCUs × $0.24 × 730 hours) regardless of how much traffic you have. If you are building new Knowledge Bases, Amazon S3 Vectors, launched in December 2025, is up to 90% cheaper and should be your default unless you have a specific reason to use OpenSearch.
CloudWatch logging. Bedrock's model invocation logging can capture full prompts and responses in CloudWatch. It is easy to enable and easy to forget about. CloudWatch charges $0.50 per GB of log data ingested, plus ongoing storage, so a high-volume application with long system prompts accumulates logging costs that scale directly with traffic.
Adjacent infrastructure. VPC endpoints, KMS encryption, S3 storage for batch jobs, and data transfer charges each add a small amount independently. Together they consistently run 20 to 35 percent of the inference bill on production deployments. They are worth including in your estimate even before you know the exact figures.
The formula above works for direct inference: one request in, one response out. Agents do not work that way.
A single user query sent to an agent triggers multiple internal model calls. The agent thinks about the request, decides whether to use a tool, calls the tool, processes the result, and then generates a response. Depending on task complexity and tool availability, a single user question can consume 4 to 10 times the tokens you would expect from the prompt and response alone.
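The multiplier gets large because each internal step typically re-sends the accumulated context (system prompt, prior reasoning, tool results) as fresh input tokens. A toy model of that growth, with step counts and sizes as purely illustrative assumptions:

```python
def agent_tokens(base_prompt=2_000, steps=5, tool_result=800, step_output=300):
    """Toy model: every agent step re-sends the growing context as input."""
    context = base_prompt
    total_in = total_out = 0
    for _ in range(steps):
        total_in += context                    # full context billed again each call
        total_out += step_output               # reasoning or tool call emitted
        context += step_output + tool_result   # context grows for the next step
    return total_in, total_out

tin, tout = agent_tokens()
print(tin, tout)  # 21000 1500 -- roughly 10x the input of a single direct call
```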
If your application uses Amazon Bedrock AgentCore, there are also infrastructure costs separate from inference: the Runtime environment, session Memory, Gateway for tool access, and Observability through CloudWatch. Those costs do not appear in the Bedrock inference calculator. The AgentCore Cost Calculator models them alongside inference in a single estimate.
The AWS Pricing Calculator requires requests per minute and compute hours. Most teams planning a project do not know those numbers yet; the natural planning unit is users, conversations, documents, or images.
The Tech 42 AWS Bedrock Cost Calculator starts from workload types. Enter the number of monthly active users and their conversation patterns for a chat application. Enter document counts and page lengths for a document processing pipeline. Enter image volumes and dimensions for a vision workload. The calculator converts those inputs into token estimates, applies current model pricing, and shows a live comparison across all available models.
You can also add your own model if you are evaluating one not listed. The output includes a downloadable cost report you can share with stakeholders.
Two cost areas the Bedrock calculator does not model: batch inference (a flat 50% discount for asynchronous workloads) and prompt caching (up to 90% reduction on repeated input context). Both are worth evaluating once you have a baseline estimate.
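Rough arithmetic for both levers against the $1,350 baseline from earlier. The 75 percent cache-hit share and the 90 percent cached-token discount are assumptions for bracketing, and real caching adds details this ignores (such as cache-write surcharges on some models):

```python
baseline = 1_350.00                              # monthly inference from the example
input_share = (2_000 / 1000 * 0.003) / 0.0135    # input is ~44% of each request's cost

# Batch inference: flat 50% discount on asynchronous workloads.
batch = baseline * 0.50                          # ~$675 per month

# Prompt caching: assume 75% of input tokens (system prompt, shared context)
# are cache hits billed at ~10% of the normal input rate.
cached = baseline * input_share * 0.75 * 0.10
fresh = baseline * input_share * 0.25
output = baseline * (1 - input_share)            # output tokens are never cached
print(f"batch ~ ${batch:,.0f}/mo, caching ~ ${cached + fresh + output:,.0f}/mo")
# -> batch ~ $675/mo, caching ~ $945/mo
```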
A pre-build cost estimate does not need to be exact. It needs to be close enough to make infrastructure and budget decisions without a major revision six weeks in.
Start with representative prompts. Count actual tokens using the CountTokens API or the console Tokenizer. Build the inference formula from real counts rather than word-count approximations. Add a 25 to 35 percent buffer for adjacent infrastructure. If your application uses agents, multiply the inference estimate by 4 to 6 before that buffer, and by more if the workload involves long tool chains.
That process will get you within range. Once you have a working prototype with real traffic, CloudWatch's InputTokenCount and OutputTokenCount metrics will give you the actual numbers to validate against.
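Once you have traffic, pulling those metrics takes a few lines of boto3. The AWS/Bedrock namespace, ModelId dimension, and metric names match Bedrock's published CloudWatch metrics; the model ID is a placeholder:

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
end = datetime.now(timezone.utc)

def weekly_token_total(metric_name: str, model_id: str) -> float:
    """Sum a Bedrock token metric (InputTokenCount / OutputTokenCount) over 7 days."""
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/Bedrock",
        MetricName=metric_name,
        Dimensions=[{"Name": "ModelId", "Value": model_id}],
        StartTime=end - timedelta(days=7),
        EndTime=end,
        Period=86_400,            # one datapoint per day
        Statistics=["Sum"],
    )
    return sum(dp["Sum"] for dp in resp["Datapoints"])

model = "anthropic.claude-sonnet-4-6"  # placeholder; use your deployed model's ID
print(weekly_token_total("InputTokenCount", model),
      weekly_token_total("OutputTokenCount", model))
```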
If you want help sizing a Bedrock deployment before you commit to building it, Tech 42's AgentCore Accelerate Program includes architecture review and cost modeling as part of the two-week engagement. AWS funding programs often cover part or all of the cost for eligible projects.
Pricing data reflects AWS Bedrock on-demand rates for us-east-1 as of May 2026. AWS pricing changes without notice. Always confirm current rates at aws.amazon.com/bedrock/pricing before making infrastructure or budget decisions.