AI Security

Published March 2026

PromptGuard: 5-Layer Defense Against Prompt Injection in Production AI

When 14 AI agents serve hundreds of businesses, every user message is a potential attack vector. Here's how we built a security stack that blocks prompt injection, jailbreaks, and data exfiltration — in 16 milliseconds, without degrading the user experience.

The Threat Model: Multi-Tenant Makes Everything Harder

Prompt injection is a known problem. But most discussions assume a single application with a single system prompt. Our threat model is worse: hundreds of businesses, each with custom system prompts, custom knowledge bases, custom agent configurations, and custom AI credentials.

A successful attack doesn't just compromise one application. It could:

  • Extract a company's system prompt — revealing their AI configuration, business logic, and competitive strategy
  • Hijack an agent's role — making a customer service bot act as a code executor or data exporter
  • Exfiltrate cross-tenant data — if the attack escalates privileges beyond the company boundary
  • Drain AI budgets — by forcing expensive model calls through crafted prompts

PromptGuard is designed for this reality. It runs on every AI interaction, before SmartRouter selects a model and before CognitiveLimiter checks the budget. Security is the first gate, not an afterthought.

Request Security Pipeline

User Input → PromptGuard → SmartRouter → CognitiveLimiter → LLM Provider → Output Scan → User

PromptGuard scans input before processing AND output before returning. Security bookends the entire pipeline.

The 5 Security Layers

Defense-in-depth is the only approach that works against a motivated attacker. Each layer catches what the previous one missed.

Layer 1: Input Sanitization (~2ms)

Before any analysis, the raw input is cleaned. Null bytes, control characters, and dangerous delimiter sequences are neutralized. Inputs exceeding 10,000 characters are truncated.

This layer doesn't detect attacks — it removes the weapons. Delimiter injection attacks that try to break out of the user context (using sequences like <|system|> or triple backticks) are defused before they reach pattern matching.
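A minimal sketch of this layer in Python. The specific delimiter sequences and the control-character range here are illustrative assumptions; only the 10,000-character limit comes from the article:

```python
import re

MAX_INPUT_CHARS = 10_000  # truncation limit from the article

# Hypothetical delimiter sequences to neutralize; the production list is not public.
DANGEROUS_DELIMITERS = re.compile(r"<\|[a-z_]+\|>|`{3}|\[/?(INST|SYS)\]", re.IGNORECASE)
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")  # keep \t, \n, \r

def sanitize(raw: str) -> str:
    """Layer 1: remove the weapons, don't detect the attack."""
    text = raw[:MAX_INPUT_CHARS]                 # truncate oversized input
    text = CONTROL_CHARS.sub("", text)           # strip null bytes / control chars
    text = DANGEROUS_DELIMITERS.sub(" ", text)   # defuse delimiter injection
    return text
```

Because this runs before pattern matching, the later layers can assume a bounded, delimiter-free input.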

Layer 2: Pattern Detection (~5ms)

41 compiled regex patterns scan for known attack signatures across six threat categories:

  • Direct Injection (HIGH, 12 patterns): "Ignore all previous instructions," system tag injection, format delimiter abuse
  • Role Hijacking (HIGH, 8 patterns): "You are now a [role]," "pretend to be," "switch to [mode]"
  • Jailbreak Attempts (HIGH, 8 patterns): DAN prompts, "developer mode," "bypass safety," "remove restrictions"
  • Data Exfiltration (MEDIUM, 6 patterns): "Show your system prompt," "what are your instructions," "repeat your rules"
  • Delimiter Attacks (MEDIUM, 5 patterns): Markdown fence injection, token boundary abuse, separator attacks
  • Encoding Attacks (MEDIUM, ~6 patterns): Base64 obfuscation, eval/exec injection, hex/unicode/URL encoding tricks

All patterns are pre-compiled at service initialization — there's no regex compilation at request time. Case-insensitive matching catches variants. Each detection records up to three matched examples for the audit trail.
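The compile-once, scan-many shape looks roughly like this in Python. The patterns shown are a small illustrative subset, not the real 41:

```python
import re

# Illustrative signatures only; the production set has 41 patterns across six categories.
PATTERNS = {
    "direct_injection":  [r"ignore (all )?previous instructions", r"<\|system\|>"],
    "role_hijacking":    [r"you are now a", r"pretend to be"],
    "jailbreak":         [r"developer mode", r"bypass safety"],
    "data_exfiltration": [r"show (me )?your system prompt", r"what are your instructions"],
}

# Pre-compile once at service initialization; case-insensitive to catch variants.
COMPILED = {
    cat: [re.compile(p, re.IGNORECASE) for p in pats]
    for cat, pats in PATTERNS.items()
}

def detect(text: str, max_examples: int = 3) -> dict:
    """Return matched categories with up to three examples each for the audit trail."""
    hits = {}
    for cat, regexes in COMPILED.items():
        examples = []
        for rx in regexes:
            examples.extend(m.group(0) for m in rx.finditer(text))
            if len(examples) >= max_examples:
                break
        if examples:
            hits[cat] = examples[:max_examples]
    return hits
```

Capping the recorded examples keeps audit entries small even when an attacker floods the input with repeated trigger phrases.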

Layer 3: Heuristic Analysis (~1ms)

Pattern matching catches known attacks. Heuristics catch suspicious behavior that doesn't match a specific pattern:

  • Special character ratio: If more than 10% of the input is delimiter characters (<>[]{}|~\), something is likely wrong. Normal business messages don't look like that.
  • Context stuffing: Inputs over 10,000 characters get flagged. Legitimate customer messages are rarely that long. Prompt flooding attacks are.
  • Repetition analysis: If a message has more than 10 words but fewer than 30% are unique, it's likely a repetitive extraction attempt.
  • Instruction-like starts: Messages that begin with “you must,” “you will,” “always,” or “never” in the first 100 characters are flagged. Customers ask questions; attackers give instructions.
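The four heuristics above can be sketched in Python using the thresholds just quoted; the function and flag names are hypothetical:

```python
INSTRUCTION_STARTS = ("you must", "you will", "always", "never")

def heuristic_flags(text: str) -> list:
    """Layer 3: flag suspicious shape, not known signatures."""
    flags = []
    specials = sum(text.count(c) for c in "<>[]{}|~\\")
    if text and specials / len(text) > 0.10:          # delimiter-heavy input
        flags.append("special_char_ratio")
    if len(text) > 10_000:                            # context stuffing
        flags.append("context_stuffing")
    words = text.lower().split()
    if len(words) > 10 and len(set(words)) / len(words) < 0.30:
        flags.append("low_uniqueness")                # repetitive extraction attempt
    head = text[:100].lower().lstrip()
    if any(head.startswith(s) for s in INSTRUCTION_STARTS):
        flags.append("instruction_like_start")
    return flags
```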

Layer 4: ML-Based Detection (~150ms, Async)

This is the semantic layer — it catches attacks that are phrased naturally enough to bypass pattern matching. It runs asynchronously and only when Layers 1-3 don't find a HIGH or CRITICAL threat.

Two-stage pipeline:

  1. Embedding similarity (fast): The input is embedded and compared against a database of 26 known attack patterns using cosine similarity. High similarity to a known attack triggers an immediate block. Moderate similarity escalates to the next stage.
  2. LLM classification (semantic): A fast, cheap model analyzes the input with context about the agent type and recent conversation history. It returns a structured judgment: is this an injection attempt? What type? How confident?

The ML layer is deliberately optional. If embeddings fail or the classifier times out, the request proceeds with only pattern-based protection. Security should degrade gracefully, not break the product.
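A rough Python sketch of the two-stage flow with graceful degradation. The similarity thresholds and the `classify` callable are assumptions (the article doesn't publish cutoffs or model details); only the stage ordering and the fail-open behavior come from the description:

```python
import math

def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

BLOCK_THRESHOLD = 0.90      # assumed cutoffs, not published values
ESCALATE_THRESHOLD = 0.75

def ml_verdict(input_vec, attack_vecs, classify) -> str:
    """Stage 1: embedding similarity against known attacks.
    Stage 2: cheap LLM classification, only on moderate similarity.
    Fails open: on any error, pattern-based protection still applies."""
    try:
        best = max(cosine_similarity(input_vec, v) for v in attack_vecs)
        if best >= BLOCK_THRESHOLD:
            return "block"                              # near-identical to a known attack
        if best >= ESCALATE_THRESHOLD:
            return "block" if classify() else "allow"   # stage 2: semantic judgment
        return "allow"
    except Exception:
        return "allow"                                  # degrade gracefully
```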

Layer 5: Output Validation (~8ms)

The four layers above protect the input. Layer 5 protects the output. Even if an attack gets through, this layer prevents the damage from reaching the user.

Canary Token Detection

Every company's system prompt contains a unique invisible marker — a canary token. It's a short string that serves no purpose except detection: if the AI model ever includes this token in its response, it means the system prompt has been exposed.

Canary tokens are:

  • Unique per company. No two businesses share a canary. An attack against Company A doesn't reveal Company B's token.
  • Invisible to users. The token is embedded in a security instruction the model is told never to reveal.
  • Irrefutable proof. If a canary appears in output, the system prompt was leaked. No ambiguity, no false positive.

When a canary is detected, the entire response is replaced with a safe fallback message, the incident is logged at CRITICAL severity, and an admin alert fires.

System Prompt Leak Detection

Beyond canary tokens, output scanning checks for phrases that indicate the model is revealing its instructions: “my system prompt is,” “I was instructed to,” “here are my instructions.” Matched content is redacted before reaching the user.
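Both output checks can be sketched together in Python. The canary format, the exact leak-phrase list, and the severity label on a redaction are illustrative assumptions; the replace-on-canary and redact-on-phrase behavior follows the text above:

```python
import re
import secrets

LEAK_PHRASES = re.compile(
    r"my system prompt is|i was instructed to|here are my instructions",
    re.IGNORECASE,
)
FALLBACK = "I'm sorry, I can't help with that request."

def make_canary(company_id: str) -> str:
    """Unique per-company marker embedded in the system prompt."""
    return f"CANARY-{company_id}-{secrets.token_hex(8)}"

def validate_output(response: str, canary: str):
    """Layer 5: returns (safe_text, severity)."""
    if canary in response:
        # System prompt leaked: replace the whole response, never redact partially.
        return FALLBACK, "CRITICAL"
    if LEAK_PHRASES.search(response):
        return LEAK_PHRASES.sub("[redacted]", response), "HIGH"
    return response, "NONE"
```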

Threat Response: Five Severity Levels

Level      Meaning                       Action                             Blocked?
NONE       Clean input                   Continue normally                  No
LOW        Suspicious pattern            Log and flag for review            No
MEDIUM     Likely injection              Log, flag, block if strict mode    Configurable
HIGH       Confirmed injection           Block immediately                  Yes
CRITICAL   Active attack / data exfil    Block + admin alert + rate limit   Yes

Each company can configure its sensitivity level: strict (blocks MEDIUM and above), balanced (blocks HIGH and above, the default), or permissive (blocks only CRITICAL). Companies can also whitelist specific phrases that trigger false positives in their industry.
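A possible shape for that per-company policy in Python (names are hypothetical; the mode-to-floor mapping follows the sentence above):

```python
LEVELS = ["NONE", "LOW", "MEDIUM", "HIGH", "CRITICAL"]
BLOCK_FLOOR = {"strict": "MEDIUM", "balanced": "HIGH", "permissive": "CRITICAL"}

def should_block(threat_level: str, mode: str = "balanced",
                 matched_text: str = "", whitelist: tuple = ()) -> bool:
    """Per-company policy: block at or above the mode's floor, unless whitelisted."""
    if any(phrase.lower() in matched_text.lower() for phrase in whitelist):
        return False  # industry-specific phrase the tenant has approved
    return LEVELS.index(threat_level) >= LEVELS.index(BLOCK_FLOOR[mode])
```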

Multi-Tenant Security Isolation

Every PromptGuard operation is scoped by company ID:

  • Audit trails are per-company. Company A can see their security events. They cannot see Company B's. Every logged detection includes the company ID, threat level, agent type, conversation ID, and matched pattern.
  • Canary tokens are per-company. Even if an attacker extracts one company's canary, it reveals nothing about other tenants.
  • Configuration is per-company. Sensitivity levels, whitelisted phrases, and alert recipients are all tenant-specific.
  • Analytics are per-company. Block rates, threat distribution, and trend data are filtered by company ID in every query.
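A minimal Python sketch of that scoping rule; the field names follow the audit-trail fields listed above, while the types and query shape are assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SecurityEvent:
    company_id: str
    threat_level: str
    agent_type: str
    conversation_id: str
    matched_pattern: str

def events_for(company_id: str, log: list) -> list:
    """Tenant isolation: every audit query is filtered by company ID."""
    return [e for e in log if e.company_id == company_id]
```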

Performance: Security Without Latency

The biggest objection to AI security is latency. If your security layer adds 500ms to every response, customers notice. Here's our budget:

Layer                   Latency   Mode               When it runs
1. Sanitization         ~2ms      Synchronous        Always
2. Pattern Detection    ~5ms      Synchronous        Always
3. Heuristic Analysis   ~1ms      Synchronous        Always
4. ML Detection         ~150ms    Async (optional)   Only if Layers 1-3 are inconclusive
5. Output Validation    ~8ms      Synchronous        Always (on the response)

Synchronous overhead: ~16ms. The user doesn't notice. The ML layer runs asynchronously and only when needed — it doesn't block the response path for clearly safe or clearly malicious inputs.
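The sync/async split can be sketched with asyncio. The detection logic inside is deliberately toy; only the structure mirrors the description, and the function names are hypothetical:

```python
import asyncio

def sync_layers(text: str) -> str:
    """Layers 1-3 compressed into one call: returns a threat level (toy logic)."""
    t = text.lower()
    if "ignore previous instructions" in t:
        return "HIGH"                         # pattern hit: block immediately
    if t.startswith(("you must", "you will")):
        return "MEDIUM"                       # heuristic hit: inconclusive
    return "NONE"

async def guard(text: str) -> bool:
    """Returns True if the request may proceed. ML never blocks the sync path."""
    level = sync_layers(text)                 # ~16 ms synchronous budget
    if level in ("HIGH", "CRITICAL"):
        return False                          # clearly malicious: no ML needed
    if level == "MEDIUM":
        asyncio.create_task(audit_ml(text))   # ~150 ms, off the response path
    return True

async def audit_ml(text: str) -> None:
    await asyncio.sleep(0)                    # stand-in for embedding + LLM stages
```

Clearly safe and clearly malicious inputs never pay the ML cost; only the ambiguous middle does, and it pays it off the response path.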

What We Learned

  1. Defense-in-depth is the only strategy. No single layer catches everything. Pattern matching misses creative phrasing. ML misses novel attack structures. Heuristics miss targeted attacks. Together, they cover each other's blind spots.
  2. Output scanning catches what input scanning misses. Some attacks are impossible to detect in the input because they rely on the model's behavior. Canary tokens are the last line of defense — they prove the attack succeeded even if you didn't see it coming.
  3. Multi-tenancy multiplies the attack surface. Single-tenant apps worry about their one system prompt. We worry about hundreds, each with different content, different agents, and different sensitivity requirements. Per-company canary tokens, per-company audit trails, and per-company configuration aren't optional — they're the whole point.
  4. Security must be faster than the thing it protects. An LLM response takes 1-3 seconds. If your security layer adds another second, it's doubling the perceived latency. Keep synchronous checks under 20ms. Push expensive analysis to async.
  5. False positives are worse than false negatives. A blocked legitimate message frustrates a customer. A missed injection gets logged and can be addressed. We default to “balanced” mode (block HIGH and above) because overly aggressive filtering degrades the product more than occasional attacks. Companies that need strict mode can enable it.
  6. The ML layer fails gracefully. If the embedding service is down or the classifier times out, the request proceeds with pattern-based protection only. A security system that breaks the product when it fails is worse than no security at all.

Security Is Infrastructure, Not a Feature

PromptGuard isn't a checkbox or a compliance requirement. It's the reason we can run 14 AI agents across hundreds of businesses and let them interact with untrusted user input without worrying about prompt injection leaking Company A's data into Company B's response.

The pattern is applicable to any multi-tenant AI system: sanitize first, pattern-match second, analyze behavior third, classify semantically fourth, validate output fifth. The specifics change. The layered approach doesn't.

PromptGuard runs before SmartRouter (model selection) and CognitiveLimiter (cost control). Together with persistent memory, they form the AI infrastructure layer that makes autonomous agents safe to deploy at scale.
