AI Infrastructure

Published March 2026

CognitiveLimiter: Real-Time AI Cost Control at Scale

When 14 AI agents can make unlimited LLM requests across hundreds of businesses, cost control isn't a nice-to-have — it's a critical safety system. Here's how we prevent runaway AI spending without throttling legitimate usage.

The Problem: Agents With Credit Cards

Here's the nightmare scenario every AI platform faces: an agent enters a loop, generates 10,000 requests in an hour, and racks up $5,000 in API costs. Or a prompt injection tricks an agent into making expensive model calls. Or a single customer's chatbot goes viral and consumes the entire monthly budget in a day.

When you're running multi-tenant AI infrastructure — where hundreds of businesses share the same platform — one company's runaway costs can impact everyone. CognitiveLimiter exists to make this impossible.

Architecture: Pre-Flight Validation, Not Post-Hoc Billing

The key design decision: CognitiveLimiter validates every request BEFORE it reaches the LLM. It's a gate, not a meter. If a request would exceed any limit, it's blocked before any tokens are consumed.

This is the opposite of how most platforms handle AI costs. Most track usage and send a bill. We prevent the usage from happening in the first place.

The Six Validation Layers

Every LLM request passes through six checks. All six must pass, or the request is blocked:

Layer 1: Per-Request Cost Ceiling

To estimate cost, CognitiveLimiter uses tiktoken to count the exact input tokens, then calculates the maximum possible cost (input tokens plus max output tokens, priced at the selected model's rates). If this exceeds the per-request ceiling, the request is blocked immediately.

Default ceiling: $0.50 per request (starter tier). Higher tiers get higher ceilings. Agent-specific overrides exist — Marcus (Growth Intelligence) gets $1.00/request because his analytical tasks are inherently more expensive.
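As a minimal sketch of the Layer 1 check (the pricing values and function names here are illustrative, not the platform's real rate table), the worst-case cost calculation looks like:

```python
# Illustrative per-token pricing: (input $/token, output $/token).
# The real system loads rates from a verified cost table.
MODEL_PRICING = {
    "gpt-4o": (2.50e-6, 10.00e-6),
    "gpt-4o-mini": (0.15e-6, 0.60e-6),
}

def max_request_cost(model: str, input_tokens: int, max_output_tokens: int) -> float:
    """Worst case: assume every one of max_output_tokens is generated."""
    in_rate, out_rate = MODEL_PRICING[model]
    return input_tokens * in_rate + max_output_tokens * out_rate

def passes_request_ceiling(model: str, input_tokens: int,
                           max_output_tokens: int, ceiling: float = 0.50) -> bool:
    """Layer 1: block before any tokens are consumed if the worst case
    could exceed the per-request ceiling."""
    return max_request_cost(model, input_tokens, max_output_tokens) <= ceiling
```

Because the check uses the worst case, it can only over-block, never under-block.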

Layer 2: Rate Limiting

Redis-based sliding window rate limits per company, per agent. Default: 30 requests per minute for starter tier. This prevents loops and abuse. Per-session limits (50 requests per session) add a second boundary for chatbot-style interactions.
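An in-memory stand-in for the sliding window makes the mechanics concrete (production counters live in a Redis sorted set so every app instance shares the same view; this single-process sketch is only illustrative):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """In-memory sketch of the sliding window, keyed per (company, agent)."""

    def __init__(self, max_requests: int = 30, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits: dict = {}

    def allow(self, company: str, agent: str, now: float = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self._hits.setdefault((company, agent), deque())
        while q and q[0] <= now - self.window:  # evict hits outside the window
            q.popleft()
        if len(q) >= self.max_requests:
            return False                        # over the limit: block
        q.append(now)
        return True
```

The same pattern maps onto Redis with ZREMRANGEBYSCORE to evict old timestamps, ZCARD to count, and ZADD to record the hit.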

Layer 3: Per-Session Cost Cap

Individual chat sessions are capped at $25.00. This prevents a single conversation from consuming a disproportionate share of the company's budget — whether from a legitimate power user or a bad actor trying to drain tokens through the chat widget.

Layer 4: Daily Budget

Each company has a daily AI spend limit tracked in Redis. When 90% is consumed, a warning fires. At 95%, a critical alert. At 100%, all AI requests are blocked until the next day. Starter tier default: $5/day. Professional tier gets significantly more.

Layer 5: Monthly Budget

Same concept as daily, but monthly. This is the hard ceiling that prevents surprise bills. Starter tier default: $20/month. Budget thresholds can be customized per company through the company_ai_budgets table — set automatically at provisioning based on subscription tier.
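The daily and monthly layers share the same accounting shape. A sketch, using the 90%/95% alert thresholds described above (production counters live in Redis and reset at the day or month boundary; class and method names here are hypothetical):

```python
class BudgetTracker:
    """Sketch of budget accounting with warning, critical, and hard-block levels."""
    WARN, CRITICAL = 0.90, 0.95

    def __init__(self, limit: float):
        self.limit = limit
        self.spent = 0.0

    def try_spend(self, estimated_cost: float):
        """Returns (allowed, alert_level)."""
        if self.spent + estimated_cost > self.limit:
            return False, "blocked"       # 100%: hard block until reset
        self.spent += estimated_cost
        frac = self.spent / self.limit
        if frac >= self.CRITICAL:
            return True, "critical"       # 95%: critical alert
        if frac >= self.WARN:
            return True, "warning"        # 90%: warning alert
        return True, None
```

One instance per company per window (daily and monthly) gives both layers with no extra logic.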

Layer 6: Spike Detection

CognitiveLimiter calculates the rolling average cost per request. If a single request would cost more than 3x the average, it triggers a spike alert. This catches prompt injection attacks that try to force expensive model usage, misconfigured agents that suddenly start making premium requests, and bugs that inflate context size.
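A minimal sketch of the rolling-average check (the 3x multiplier is from the text; the window size is an assumption for illustration):

```python
from collections import deque

class SpikeDetector:
    """Flag any request whose estimated cost exceeds 3x the recent average."""

    def __init__(self, multiplier: float = 3.0, window: int = 100):
        self.multiplier = multiplier
        self._costs = deque(maxlen=window)

    def is_spike(self, estimated_cost: float) -> bool:
        if not self._costs:
            return False                  # no baseline yet: nothing to compare
        avg = sum(self._costs) / len(self._costs)
        return estimated_cost > self.multiplier * avg

    def record(self, actual_cost: float) -> None:
        self._costs.append(actual_cost)   # update the rolling baseline
```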

Agent-Specific Overrides

Not all agents are equal. A customer service bot making quick responses needs different limits than a growth intelligence agent running deep analytics:

| Agent | Override | Why |
| --- | --- | --- |
| Sarah (Customer Service) | 100K context, 30 req/min | Needs full conversation history for context |
| Marcus (Growth) | $200/day, $1/request | Analytical tasks are expensive but high-value |
| Jake (Inventory) | 20K context, $0.25/request | Should be fast and cheap — inventory extraction |
| Carrot (SDR) | $150/day, 20 req/min | High volume outreach, needs daily budget headroom |
| Beet (Analytics) | 80K context, 8K output | Generates long analytical reports |
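One plausible shape for this config is a per-agent override map, where any key an agent doesn't override falls back to the company's subscription-tier default (the structure and key names here are hypothetical; the values come from the table above):

```python
# Hypothetical per-agent override config; unlisted keys fall back to tier defaults.
AGENT_OVERRIDES = {
    "sarah":  {"max_context_tokens": 100_000, "requests_per_minute": 30},
    "marcus": {"daily_budget": 200.00, "per_request_ceiling": 1.00},
    "jake":   {"max_context_tokens": 20_000, "per_request_ceiling": 0.25},
    "carrot": {"daily_budget": 150.00, "requests_per_minute": 20},
    "beet":   {"max_context_tokens": 80_000, "max_output_tokens": 8_000},
}

def effective_limit(agent: str, key: str, tier_default):
    """Agent-specific override wins; otherwise the tier default applies."""
    return AGENT_OVERRIDES.get(agent, {}).get(key, tier_default)
```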

Fail-Closed Design: When Redis Goes Down

CognitiveLimiter stores all counters in Redis. So what happens when Redis is unavailable?

Most systems fail open — “if we can't check the limit, let it through.” We made the opposite choice: CognitiveLimiter fails closed after a 60-second grace period.

  • First 60 seconds of Redis outage: Requests are allowed (grace period to handle brief blips)
  • After 60 seconds: All AI requests are BLOCKED. A critical alert fires. The system refuses to make LLM calls it can't track.

This is aggressive, and it means a Redis outage temporarily disables AI features. We accept this tradeoff. A 5-minute Redis outage that blocks AI is far less painful than an untracked hour of AI spending that generates thousands in surprise costs.
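The grace-period logic is small enough to sketch in full (how the outage is detected, e.g. via connection errors from the Redis client, is assumed here; class and method names are illustrative):

```python
class FailClosedGate:
    """Allow requests for a 60-second grace period during a Redis outage,
    then block everything until Redis recovers."""
    GRACE_SECONDS = 60.0

    def __init__(self):
        self._outage_started = None

    def redis_healthy(self) -> None:
        self._outage_started = None       # outage over: resume normal checks

    def redis_unavailable(self, now: float) -> None:
        if self._outage_started is None:
            self._outage_started = now    # mark when the outage began

    def allow(self, now: float) -> bool:
        if self._outage_started is None:
            return True                   # Redis up: the six layers decide
        # Grace period covers brief blips; after that, fail closed.
        return (now - self._outage_started) < self.GRACE_SECONDS
```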

Pre-Flight Cost Estimation

Before any request, CognitiveLimiter calculates the maximum possible cost using tiktoken for accurate token counting (not the len(text)//4 estimation you see in most codebases). It checks provider-specific pricing from a verified cost table that maps every model to its current per-token rates.

The estimation is deliberately conservative — it assumes max output tokens will be used. This means the pre-flight check might block a request that would have actually been cheaper, but it never allows a request that exceeds the budget.

Integration with SmartRouter

CognitiveLimiter and SmartRouter work together in a specific order:

  1. SmartRouter selects the optimal model (best provider for the task type)
  2. CognitiveLimiter validates the cost (would this exceed any limit at this model's rate?)
  3. If validation fails, SmartRouter can try a cheaper model from the fallback chain
  4. If no model passes validation, the request is blocked with a clear error

This means CognitiveLimiter doesn't just block expensive requests — it pushes the system toward cheaper alternatives when budgets are tight. A company near their daily limit will automatically get routed to cheaper models rather than being cut off entirely.
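The cascade above reduces to a short loop: walk the fallback chain from the preferred model down and take the first one that passes validation (a sketch; model names are illustrative and `validate` stands in for the six-layer check):

```python
def route_with_budget(validate, fallback_chain):
    """Return the first model in the chain that passes cost validation."""
    for model in fallback_chain:
        if validate(model):
            return model                  # cheapest-acceptable cascade
    raise RuntimeError("blocked: no model in the chain fits the remaining budget")
```

Under a tight budget, `validate` rejects the premium models and the loop degrades gracefully to the cheaper ones before giving up.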

Multi-Tenant Budget Isolation

Every company gets its own budget counters in Redis, keyed by company ID. Company A consuming their entire daily budget has zero impact on Company B. The limits are set per subscription tier at provisioning:

| Limit | Starter ($89/mo) | Builder ($199/mo) | Professional ($499/mo) |
| --- | --- | --- | --- |
| Per-request ceiling | $0.50 | $1.00 | $2.00 |
| Daily budget | $5 | $25 | $100 |
| Monthly budget | $20 | $100 | $500 |
| Requests/minute | 30 | 60 | 120 |

Alert Thresholds

CognitiveLimiter fires alerts at specific budget consumption levels:

  • 90% consumed: Warning alert — “Company X has used 90% of their daily AI budget”
  • 95% consumed: Critical alert — “Company X is about to hit their daily ceiling”
  • 100% consumed: Hard block — all AI requests denied until reset
  • 3x average spike: Anomaly alert — unusual spending pattern detected

These alerts feed into ADA (our AI coordinator) and the super-admin dashboard, so both AI and human operators can respond quickly.

What We Learned

  1. Validate before, not after. Post-hoc billing doesn't prevent damage. By the time you send a bill for $5,000 in runaway AI costs, the customer has already churned. Gate every request.
  2. Fail closed, not open. When you can't verify the budget, don't allow the spend. The 60-second grace period handles brief Redis blips. Anything longer is a real problem that deserves attention.
  3. Agent-specific limits matter. A one-size-fits-all budget treats a $0.01 classification the same as a $1.00 analytical report. Different agents have fundamentally different cost profiles.
  4. Spike detection catches bugs. The 3x multiplier has caught more agent bugs than security threats. When an agent's cost suddenly triples, it's usually a code regression that inflated the context window, not an attack.
  5. Integration with routing creates a cascade. CognitiveLimiter + SmartRouter together create graceful degradation: as budgets tighten, the system automatically shifts to cheaper models instead of cutting off service entirely.

The Foundation of Trust

CognitiveLimiter is the component that lets us promise “no surprise bills” to every business on the platform. Without it, multi-tenant AI is a liability. With it, every company gets predictable AI costs that match their subscription tier.

It's not glamorous — nobody buys software because of its cost control system. But it's the reason 14 AI agents can operate autonomously across hundreds of businesses without a single human approving every LLM request. That autonomy is what makes Solid# infrastructure, not just software.
