The Problem: AI Amnesia
A business owner tells their AI assistant: “We want to hit 50 customers by Q3. Our biggest blocker is lead conversion from our website.”
Next week, the same owner opens a new conversation: “How should we spend our marketing budget?”
Without persistent memory, the AI gives a generic marketing answer. With persistent memory, it says: “Since you're targeting 50 customers by Q3 and lead conversion is the bottleneck, I'd prioritize the website first — maybe a landing page A/B test — before spending on ads that drive traffic to a page that isn't converting.”
That's the difference between a tool and an assistant. At Solid#, 14 AI agents serve businesses across 52 industries. Every one of them needs to remember what happened last time. And the model that remembers might not be the model that had the original conversation.
Architecture: Extract, Store, Inject
The core insight is simple: separate the memory from the model. The AI model that had the conversation doesn't store the memory. A different, cheaper model extracts the facts. The facts live in a cache. And whatever model handles the next conversation gets those facts injected into its context.
Three phases, three different concerns:
Memory Data Flow
1. Extract
After each conversation, Haiku extracts 0-3 facts worth remembering. Async — user never waits. Cost: ~$0.0004.
2. Store
Facts go to Redis (hot, <1ms reads) and PostgreSQL (cold, durable). Automatic backfill on cache miss.
3. Inject
Next conversation: facts injected into system prompt BEFORE all other context. Zero inference cost. Any model.
The key constraint: the user never waits for memory. Extraction happens after the response is sent. Retrieval is a Redis hash lookup, not an LLM call. The memory system is invisible in terms of latency.
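The three phases above can be sketched end to end. This is a minimal illustration with in-memory stand-ins for Redis, PostgreSQL, and the extraction model; all function names and data shapes here are assumptions, not the production API:

```python
# In-memory stand-ins for Redis (hot), PostgreSQL (cold), and Haiku extraction.
redis_cache = {}   # {(company_id, agent): {topic: fact}}
pg_rows = []       # durable rows, one per memory

def extract_facts(conversation: str) -> dict:
    """Stand-in for the async Haiku call: returns 0-3 {topic: fact} pairs."""
    facts = {}
    if "50 customers" in conversation:
        facts["goals"] = "Wants 50 customers by Q3"
    return facts

def store_facts(company_id: str, agent: str, facts: dict) -> None:
    """Write-through: hot cache for reads, durable rows for recovery."""
    redis_cache.setdefault((company_id, agent), {}).update(facts)
    for topic, fact in facts.items():
        pg_rows.append({"company_id": company_id, "agent": agent,
                        "topic": topic, "fact": fact})

def memory_block(company_id: str, agent: str) -> str:
    """Injected at the top of the system prompt; a plain cache read, no LLM."""
    facts = redis_cache.get((company_id, agent), {})
    return "\n".join(f"- [{topic}] {fact}" for topic, fact in facts.items())
```

In production the `store_facts` step would run on a background worker after the response is sent, which is what keeps the memory system invisible in terms of latency.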
Why Haiku Extracts the Facts
We use Claude Haiku for extraction — the smallest, fastest, cheapest model in our stack. This might seem counterintuitive. Shouldn't the “smartest” model decide what's worth remembering?
No, and here's why:
- Fact extraction is a structured output task. “Read this conversation and return 0-3 JSON objects with topic and fact.” Haiku excels at this. You don't need Opus to extract that “customer wants 50 clients by Q3.”
- Cost makes it viable at scale. Extraction costs ~$0.0004 per conversation. At 30 conversations per company per month across 1,000 companies, total extraction cost is $12/month. That's less than 0.02% of the cheapest subscription tier.
- Speed lets it be async. Haiku responds in under 500ms. The extraction task fires on a background worker after the conversation is committed to the database. The user sees their response immediately; the memory extraction happens silently in the background.
- Fallback chain. If Anthropic is down, the system falls back to GPT-4o Mini for extraction. If both are down, the conversation just doesn't get extracted. No memory is better than a failed chat.
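The structured-output extraction with a fallback chain can be sketched as follows. The prompt wording and the callable-model interface are illustrative assumptions, not the production code:

```python
import json

def extract_with_fallback(conversation: str, models: list) -> list:
    """Try each extraction model in order (e.g. Haiku first, GPT-4o Mini
    second); if every provider fails, extract nothing at all."""
    prompt = ("Read this conversation and return a JSON array of 0-3 objects, "
              "each with 'topic' and 'fact' keys. Return [] if nothing is "
              "worth remembering.\n\n" + conversation)
    for call in models:
        try:
            facts = json.loads(call(prompt))
            if isinstance(facts, list) and len(facts) <= 3:
                return facts
        except Exception:
            continue   # provider down or malformed JSON: try the next model
    return []          # no memory is better than a failed chat
```

Because the return value is just a list of small JSON objects, a missed extraction degrades quality for one conversation and nothing else.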
Two-Layer Storage: Redis + PostgreSQL
Memories live in two places simultaneously, each optimized for a different access pattern:
| Aspect | Redis (Hot Path) | PostgreSQL (Cold Path) |
|---|---|---|
| Read latency | <1ms | ~50ms |
| Data structure | Hash: topic → fact | Row per memory with metadata |
| TTL | 30 days (auto-expire) | Permanent |
| Survives restart | No (volatile) | Yes (durable) |
| Metadata | Just the fact text | Confidence, reinforcement count, timestamps, decay |
| When used | Every conversation (primary) | Cold start + Redis miss (fallback) |
On cache miss (Redis restart, new deployment, expired TTL), the system queries PostgreSQL, returns the top memories by recency and reinforcement, and backfills Redis. The next request is back to sub-millisecond. This happens transparently — no code change, no configuration, no manual intervention.
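The read-through backfill is a handful of lines. A minimal sketch, assuming a dict stands in for Redis and `query_pg` returns rows already ordered by recency and reinforcement:

```python
def load_memories(company_id, agent, cache, query_pg):
    """Read-through cache: a Redis hit is the hot path; on a miss, pull the
    top rows from PostgreSQL and backfill the cache (TTL omitted here)."""
    key = (company_id, agent)
    if key in cache:
        return cache[key]                      # sub-millisecond hot path
    rows = query_pg(company_id, agent)         # cold path, ~50ms
    facts = {row["topic"]: row["fact"] for row in rows}
    cache[key] = facts                         # backfill for the next request
    return facts
```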
If Redis is completely unavailable, conversations continue without memory injection. Memory is a quality enhancement, not a dependency. The chat still works; it just doesn't remember.
Memory Injection: Layer 0.5
This is the part that makes it feel like the AI “just knows.” At the start of every conversation, before the AI model sees the knowledge base, brand voice, or custom instructions — it sees the persistent memory.
We call this Layer 0.5 because it comes before everything else in the system prompt. The model reads it first and it colors every response.
The injection isn't a raw dump. It's structured:
- Topic-match boosting. Memories whose topics match keywords in the user's current message get injected first. If the user asks about “marketing budget,” the memory about their Q3 customer target rises to the top.
- Open items section. Commitments the user made (“I'll upload the catalog by Friday”) get their own section with an instruction: “Proactively follow up on these.” This makes the AI ask “How's that catalog upload going?” naturally.
- Tier-aware capping. A Starter plan injects up to 20 memories. Enterprise injects up to 200. The cap prevents prompt bloat while creating a genuine feature differentiation between tiers.
The critical insight: zero inference cost on retrieval. There's no embedding lookup, no semantic search, no LLM call to decide which memories are relevant. It's a Redis hash read + keyword matching. The “AI” part of memory is entirely in the extraction phase. Retrieval is pure engineering.
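Topic-match boosting plus tier-aware capping is pure dictionary and string work. A hedged sketch (the scoring details are assumptions; the point is that no embeddings or LLM calls are involved):

```python
def rank_for_injection(memories, user_message, cap):
    """Order memories for injection: topic-keyword matches first, then by
    reinforcement count; truncate at the tier cap (20 Starter, 200 Enterprise)."""
    words = set(user_message.lower().split())
    def score(memory):
        topic_match = any(tok in words
                          for tok in memory["topic"].lower().split("_"))
        return (topic_match, memory.get("reinforcement", 0))
    return sorted(memories, key=score, reverse=True)[:cap]
```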
Confidence Scoring and Reinforcement
Not all memories are equal. A fact mentioned once in passing is less important than a goal repeated across five conversations. The system tracks this:
- Initial confidence: 0.8 (new fact, probably relevant)
- Each reinforcement: +0.1 (capped at 1.0). When Haiku extracts the same fact again from a later conversation, the existing memory's confidence increases.
- Reinforcement count. A memory mentioned in 7 conversations has a count of 7. This is used for cap enforcement — when the tier limit is reached, least-reinforced memories get evicted first.
- Pinned memories. When a user says “always remember this” or “don't forget” — the memory gets locked at confidence 1.0 with zero decay. Pinned memories are never evicted by the cap. The user has explicit control over what the AI considers permanent.
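The reinforcement and eviction rules above fit in a few lines. A sketch, assuming memories are plain dicts (field names are illustrative):

```python
def reinforce(memory, pin=False):
    """Re-extracted fact bumps confidence by 0.1 (capped at 1.0);
    'always remember this' pins it at 1.0 permanently."""
    if pin:
        memory.update(confidence=1.0, pinned=True)
    else:
        memory["confidence"] = min(1.0, memory["confidence"] + 0.1)
        memory["reinforcement"] = memory.get("reinforcement", 0) + 1
    return memory

def enforce_cap(memories, cap):
    """Over the tier cap? Evict least-reinforced first; never evict pinned."""
    pinned = [m for m in memories if m.get("pinned")]
    rest = sorted((m for m in memories if not m.get("pinned")),
                  key=lambda m: m.get("reinforcement", 0), reverse=True)
    return pinned + rest[:max(0, cap - len(pinned))]
```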
Time-Based Decay
Memories that aren't reinforced gradually lose confidence over time, down to a fixed floor.
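The exact formula isn't reproduced here, but given the 30% floor and the roughly six-month fade described in this section, one plausible linear shape looks like this (the rate and floor constants are assumptions):

```python
def decayed_confidence(confidence, months_unreinforced, rate=0.085, floor=0.3):
    """Linear confidence decay toward a floor; any reinforcement resets the
    clock. Constants chosen so 0.8 reaches the floor in about six months."""
    return max(floor, confidence - rate * months_unreinforced)
```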
This means a fact that was important 6 months ago but never mentioned again gradually fades to 30% confidence. It won't be evicted entirely (the floor prevents that), but it'll lose priority to more recently reinforced memories. If the user mentions it again, confidence jumps back up.
This mirrors how human memory works — frequently referenced information stays sharp, while rarely-used facts become hazy but recoverable.
Shared vs. Agent-Specific Memory
When a business owner tells ADA “we have 5 employees and we're targeting the residential market” — should only ADA know this, or should Sarah (customer service), Marcus (growth), and Devon (operations) all know it too?
The extraction model makes this call. Facts about the company itself (“we have 5 employees”) are flagged as shared and stored in a company-wide memory space. Facts specific to an agent interaction (“I prefer Sarah to send follow-ups by email”) stay agent-specific.
| Memory Type | Example | Who Sees It |
|---|---|---|
| Shared | “Company has 5 employees, targets residential market” | All 14 agents |
| Shared | “Got first online sale on March 15” (win) | All 14 agents |
| Agent-specific | “Prefers email over SMS for follow-ups” | Sarah only |
| Agent-specific | “Wants weekly revenue reports on Mondays” | Taylor (analytics) only |
At injection time, each agent reads both their own memories and the company-wide shared memories. If the same topic appears in both, the agent-specific version wins — it's more targeted. This creates a layered memory: company-wide context that every agent shares, plus specialized knowledge per agent.
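The layered merge is a one-liner in spirit. A minimal sketch, assuming both memory spaces are topic-keyed dicts:

```python
def memories_for_agent(shared: dict, agent_specific: dict) -> dict:
    """Layered view: company-wide facts plus the agent's own. On a topic
    collision the agent-specific version wins, because it's more targeted."""
    merged = dict(shared)
    merged.update(agent_specific)
    return merged
```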
Memory vs. Knowledge Base: Two Systems, One Context
This is the hardest problem in the system, and it's where most AI memory implementations fall apart: what happens when persistent memory and the knowledge base disagree?
Solid# has two separate knowledge systems:
| Aspect | Persistent Memory | Knowledge Base (KB) |
|---|---|---|
| Source | Extracted from conversations by Haiku | Uploaded by user or generated from templates |
| Update frequency | Every conversation | Manual or scheduled |
| Injection layer | Layer 0.5 (earliest) | Layer 3.5 (later) |
| Decay | Yes (confidence fades over time) | No (permanent until changed) |
| Scale | 20-200 entries (tier-capped) | Thousands of entries |
The resolution is architectural: memory is injected first, KB is injected later. Memory provides the recent, conversational context (“this customer cares about lead conversion”). KB provides the comprehensive reference (“here are our 15 services with pricing”). When the same topic appears in both, the model sees the memory version first (more recent, more personalized) and the KB version second (more comprehensive, more authoritative). The model naturally synthesizes both.
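The layer ordering itself can be made explicit in the prompt assembler. A sketch, assuming each layer is a block of text keyed by its numeric position (the layer numbers 0.5 and 3.5 come from the post; the rest are illustrative):

```python
def assemble_system_prompt(layers: dict) -> str:
    """Join prompt layers in numeric order: memory at 0.5 always lands
    before the knowledge base at 3.5, however the dict was built."""
    return "\n\n".join(text for _, text in sorted(layers.items()) if text)
```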
Memories can also reference their source KB entry via a foreign key — creating a trace from “I learned this from a conversation” back to “and here's the official KB entry on this topic.” This is important for audit trails and for future consolidation where frequently-mentioned topics in memory could suggest KB gaps.
Eight Memory Categories
Every extracted fact is tagged with a category that determines how it's used:
| Category | Description | Example |
|---|---|---|
| Goals | Business targets and timelines | "Wants 50 customers by Q3" |
| Blockers | Pain points and obstacles | "Struggling with lead conversion from website" |
| Preferences | How they like things done | "Always send invoices on Fridays" |
| Operations | Business details and processes | "Runs plumbing company with 5 employees" |
| Team | People and roles | "Jake handles scheduling, Maria does estimates" |
| Strategy | Decisions and direction | "Shifting from residential to commercial" |
| Wins | Milestones and achievements | "Got first online sale on March 15" |
| Open Items | Commitments and follow-ups | "Will upload product catalog by Friday" |
Open Items get special treatment. When the AI sees an open item in memory, it doesn't just know about it passively — it's instructed to proactively follow up. This is what makes the AI feel like it's paying attention: “Hey, how's that catalog upload going? You mentioned last week you'd have it ready by Friday.”
Multi-Tenant Memory Isolation
Every memory operation is scoped by company ID. Company A's memories are invisible to Company B — at every layer:
- Redis keys are namespaced: `xmem:{company_id}:{agent}`. No wildcard queries. No cross-company reads.
- PostgreSQL queries always filter by company_id. The foreign key to the companies table enforces referential integrity.
- Extraction tasks receive company_id from authenticated controllers. Never from user input.
- Cap enforcement is per-company, per-agent. One company hitting their cap doesn't affect another.
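The key discipline can be enforced in one small helper. A sketch using the `xmem:{company_id}:{agent}` pattern from above (the validation details are assumptions):

```python
def memory_key(company_id: str, agent: str) -> str:
    """Build the tenant-scoped Redis key. company_id must come from the
    authenticated session, never from user input."""
    if not company_id or ":" in company_id:
        raise ValueError("invalid company_id")   # no delimiter smuggling
    return f"xmem:{company_id}:{agent}"
```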
The Cost Math: Why This Works
Per-Company Monthly Cost

| Component | Monthly cost per company |
|---|---|
| Extraction (Haiku, ~$0.0004 × ~30 conversations) | ~$0.012 |
| Retrieval (Redis hash read, no LLM call) | $0 |
| Injection (system prompt section, no extra call) | $0 |
This is why we use Haiku for extraction and Redis for reads. The entire memory system costs a penny per company per month. Memory caps exist for prompt quality (don't bloat the context window) and tier differentiation (Enterprise gets deeper memory) — not for cost control.
What We Learned
- Separate extraction from the conversation model. The model that had the conversation is expensive and already done. Use a cheap model to extract facts asynchronously. The user never knows or waits.
- Zero-cost retrieval changes everything. If memory retrieval requires an LLM call (embedding search, semantic ranking), you'll think twice about using it on every conversation. Make retrieval a cache lookup and you can use it everywhere, unconditionally.
- Injection order matters more than injection content. Placing memory at Layer 0.5 (before KB, before brand voice) means the model reads it first. This primes every response with personal context without any explicit instruction to “use the memories.”
- Open items create the “wow” moment. When the AI asks “how'd that catalog upload go?” — business owners stop and say “wait, you remembered?” This is the single most impactful memory category. It's not the goals or preferences. It's the follow-ups.
- Confidence scoring prevents memory bloat naturally. Without reinforcement, old memories fade. Frequently-discussed topics stay sharp. You don't need manual cleanup or a “memory management UI” — the system does what human memory does.
- Shared memory unifies the agent team. When Sarah (customer service) knows the company just hit a milestone because Marcus (growth) had that conversation last week, it feels like the agents are a real team, not isolated tools.
Memory Is What Makes Agents Feel Real
The technical pattern is straightforward: cheap extraction, fast caching, structured injection. There's no neural memory network, no vector database, no retrieval-augmented generation. It's a Haiku call, a Redis hash, and a well-placed system prompt section.
But the user experience is transformative. An AI that remembers your goals, follows up on your commitments, and shares context across agents doesn't feel like software anymore. It feels like a team.
That's what separates AI infrastructure from AI features. The feature is memory. The infrastructure is a two-layer cache with confidence scoring, multi-tenant isolation, tier-aware capping, and cross-model compatibility — running at a penny per company per month.