Implementing a Semantic Cache for AI Responses in a Mobile App
A regular cache works by exact key match. "How to add transaction?" and "How do I add a new transaction?" are different strings, so they count as different requests and cost two API calls. A semantic cache works by meaning: both questions get the same cached answer because their embeddings are close in vector space.
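To make the contrast concrete, here is a minimal sketch in Python. The three-dimensional vectors are made-up stand-ins for real embeddings (actual models return hundreds or thousands of dimensions), and the cached answer text is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings, chosen by hand for illustration only.
q1 = [0.80, 0.55, 0.10]  # "How to add transaction?"
q2 = [0.78, 0.58, 0.12]  # "How do I add a new transaction?"
q3 = [0.05, 0.20, 0.95]  # "How do I delete my account?"

# Exact-match cache: the rephrased question misses because the strings differ.
exact_cache = {"How to add transaction?": "Tap + on the home screen."}
exact_hit = exact_cache.get("How do I add a new transaction?")  # None

# Semantic comparison: the paraphrase scores close to 1, the unrelated question does not.
sim_close = cosine_similarity(q1, q2)
sim_far = cosine_similarity(q1, q3)
```

With these toy vectors the paraphrase pair scores well above any reasonable threshold, while the unrelated question scores far below it.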
Semantic Cache Architecture
Flow: user request → generate embedding → search for the nearest neighbor in the vector store → if cosine similarity exceeds the threshold, return the cached response → otherwise call the LLM → save the embedding and response to the cache.
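The flow can be sketched with an in-memory list standing in for the vector store. This is an illustration under assumptions, not a production design: the names SemanticCache, answer, embed and call_llm are hypothetical, the embedding function and the LLM call are passed in as plain callables, and the linear scan would be replaced by an index query in Redis or pgvector.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """In-memory stand-in for a vector store (linear scan, no persistence)."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def nearest(self, vec: Vector) -> Optional[str]:
        """Return the cached response of the closest entry, if it beats the threshold."""
        best_score, best_response = -1.0, None
        for stored_vec, response in self.entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def add(self, vec: Vector, response: str) -> None:
        self.entries.append((vec, response))

def answer(query: str, cache: SemanticCache,
           embed: Callable[[str], Vector],
           call_llm: Callable[[str], str]) -> str:
    vec = embed(query)               # 1. generate embedding
    cached = cache.nearest(vec)      # 2. nearest-neighbor search
    if cached is not None:
        return cached                # 3. cache hit: no LLM call
    response = call_llm(query)       # 4. cache miss: call the LLM
    cache.add(vec, response)         # 5. save embedding + response
    return response
```

The embedding is computed once per request and reused for both the lookup and the write, which matters because on a cache hit the embedding call is the only remaining network cost.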
Use Redis + RediSearch for small volumes (vector similarity search is built in). Use pgvector if PostgreSQL is already in the stack. Managed services such as Pinecone or Weaviate suit millions of records.
The threshold is the critical parameter. At 0.85 the cache is too aggressive: questions with different meanings get the same answer. At 0.97 it barely ever hits. The optimal range for most domains is 0.90–0.95, tuned on real queries, since similarity scores vary with the embedding model.
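One way to tune the threshold is to label a sample of query pairs as "same meaning" or "different meaning" and measure both error types at each candidate value. The sketch below assumes such a labeled sample exists; the similarity scores shown are invented for illustration and are not measurements.

```python
from typing import List, Tuple

def rate(scores: List[float], threshold: float) -> float:
    """Fraction of scores that would count as a cache hit at this threshold."""
    if not scores:
        return 0.0
    return sum(s > threshold for s in scores) / len(scores)

def evaluate_threshold(pairs: List[Tuple[float, bool]], threshold: float):
    """Return (hit_rate, false_hit_rate) for labeled (similarity, same_meaning) pairs."""
    same = [s for s, same_meaning in pairs if same_meaning]
    diff = [s for s, same_meaning in pairs if not same_meaning]
    return rate(same, threshold), rate(diff, threshold)

# Hypothetical labeled sample: (cosine similarity, do the two queries mean the same?)
labeled_pairs = [
    (0.98, True), (0.96, True), (0.93, True), (0.91, True), (0.89, True),
    (0.94, False), (0.88, False), (0.86, False), (0.80, False), (0.60, False),
]

results = {t: evaluate_threshold(labeled_pairs, t) for t in (0.85, 0.92, 0.97)}
for t, (hit_rate, false_hit_rate) in results.items():
    print(f"threshold={t:.2f} hit_rate={hit_rate:.2f} false_hit_rate={false_hit_rate:.2f}")
```

On this invented sample the low threshold serves every paraphrase but also many wrong answers, and the high threshold serves no wrong answers but almost nothing else. The hit rate over same-meaning pairs is also the figure to track in monitoring once the cache is live.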
Invalidation and TTL
Invalidate the semantic cache whenever the system prompt or the base model changes, because old answers may not match the new behavior. Set a TTL of 7–30 days for stable FAQ-like questions. Caching does not apply to time-bound or user-specific questions ("what's my balance?"): identify these with a classifier or keywords and send them straight to the LLM.
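These three rules can be sketched as small helpers. Everything here is an assumption made for illustration: the key prefix "semcache", the keyword list, and the choice of the 7-day lower bound as the default TTL. Deriving the cache namespace from a hash of the model name and system prompt means a change to either one starts a fresh namespace, so stale entries are never read and can simply expire.

```python
import hashlib
import re
import time
from typing import Optional

TTL_SECONDS = 7 * 24 * 3600  # assumed default: lower end of the 7-30 day range

# Hypothetical keyword list; a real app would build this from its own query logs
# or replace it with a classifier.
TIME_BOUND_KEYWORDS = {"balance", "today", "now", "current"}

def cache_namespace(system_prompt: str, model_name: str) -> str:
    """Key prefix that changes whenever the prompt or the model changes."""
    payload = f"{model_name}\n{system_prompt}".encode("utf-8")
    return "semcache:" + hashlib.sha256(payload).hexdigest()[:16]

def is_cacheable(query: str) -> bool:
    """Keyword check on whole words, so 'know' does not match 'now'."""
    words = set(re.findall(r"[a-z']+", query.lower()))
    return not (words & TIME_BOUND_KEYWORDS)

def is_expired(stored_at: float, now: Optional[float] = None,
               ttl: float = TTL_SECONDS) -> bool:
    """True once an entry is older than the TTL (timestamps in seconds)."""
    current = time.time() if now is None else now
    return current - stored_at > ttl
```

With a store that supports native key expiry, such as Redis, the TTL would normally be set on write rather than checked on read; the explicit check is shown only to keep the sketch self-contained.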
Timeline Estimates
A basic semantic cache on Redis + OpenAI Embeddings takes 2–3 days. With threshold tuning on real data and hit rate monitoring, plan for 3–5 days.







