AI Integration: Chatbots, RAG, Semantic Search, Recommendations
Most "AI chatbots" on websites are just a wrapper around GPT-4o with a system prompt like "you are an assistant for company X", with no access to the company's real data. A user asks about a specific pricing plan, and the bot hallucinates a price; asks about order status, and gets generic phrases. That's not AI integration, it's an expensive FAQ.
RAG as the Foundation of a Useful Chatbot
Retrieval-Augmented Generation is the standard approach when the model must answer based on real company documents. The pipeline: user query → retrieve relevant fragments from a vector database → insert the fragments into the prompt context → model generates the answer.
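The pipeline can be sketched end to end. Here retrieve() is a toy word-overlap retriever standing in for real vector search, and the final model call is omitted; the documents and query are invented for illustration.

```python
# Minimal RAG flow sketch. retrieve() is a stand-in for real vector
# search; the actual LLM call would receive the built prompt.

DOCS = [
    "The Pro plan costs $49/month and includes priority support.",
    "Orders ship within 2 business days from the warehouse.",
    "Refunds are processed within 14 days of the return request.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retrieval: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(DOCS, key=lambda d: -len(q_words & set(d.lower().split())))
    return scored[:k]

def build_prompt(query: str) -> str:
    """Insert retrieved fragments into the model context."""
    context = "\n".join(f"- {frag}" for frag in retrieve(query))
    return (
        "Answer using ONLY the context below.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

print(build_prompt("How much does the Pro plan cost?"))
```

In production the retriever is replaced by a vector-database query, but the shape of the flow stays the same.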
Implementation details that determine quality:
Chunking strategy. You can't just split a document into blind 500-token chunks: if you cut through the middle of a paragraph, meaning is lost. A recursive text splitter with 10–15% overlap is the minimum. For structured documents (contracts, manuals), use a semantic splitter that follows section boundaries.
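A minimal sketch of windowed chunking with overlap, counting words instead of tokens for simplicity. Production splitters (e.g. LangChain's RecursiveCharacterTextSplitter) additionally prefer paragraph and sentence boundaries:

```python
# Sliding-window chunker: neighbouring chunks share `overlap` words.
# 60/500 ≈ 12%, inside the 10–15% range mentioned above.

def chunk(text: str, size: int = 500, overlap: int = 60) -> list[str]:
    words = text.split()
    step = size - overlap
    return [
        " ".join(words[i:i + size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

A 1000-word document yields three chunks, and the last 60 words of each chunk reappear at the start of the next, so no sentence is lost at a boundary.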
Embedding model. text-embedding-3-large from OpenAI, or intfloat/multilingual-e5-large for Russian-language content. Search quality depends directly on the model: the difference between ada-002 and e5-large on Russian text is noticeable.
Vector database. pgvector for projects where PostgreSQL already exists: install the extension, add a column of type vector(1536), create an HNSW index. For large volumes (10M+ documents), Qdrant or Weaviate. On a project with a knowledge base of 80,000 support articles, pgvector with an HNSW index gave p95 search latency of 12 ms, which is sufficient.
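The pgvector setup described above amounts to a few statements; table and column names here are illustrative:

```sql
-- Assumed schema: an `articles` table that already exists.
CREATE EXTENSION IF NOT EXISTS vector;

ALTER TABLE articles ADD COLUMN embedding vector(1536);

-- HNSW index for cosine distance
CREATE INDEX ON articles USING hnsw (embedding vector_cosine_ops);

-- Nearest neighbours for a query embedding passed as $1
SELECT id, title FROM articles ORDER BY embedding <=> $1 LIMIT 20;
```

The `<=>` operator is pgvector's cosine-distance operator; it must match the operator class used when creating the index.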
Hybrid search. Vector search alone is poor at exact matches (article numbers, names, abbreviations); full-text search alone doesn't understand meaning. Combine them via RRF (Reciprocal Rank Fusion): vector search plus BM25, with the two result lists merged by rank. In Qdrant this is called sparse-dense hybrid search.
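RRF itself is a few lines: each document scores 1/(k + rank) in every list it appears in, and the sums are sorted. A sketch with invented document IDs:

```python
# Reciprocal Rank Fusion: merge a dense (vector) ranking with a
# sparse (BM25) ranking. k=60 is the commonly used smoothing constant.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]  # hypothetical dense results
bm25_hits = ["doc1", "doc9", "doc3"]    # hypothetical sparse results
print(rrf([vector_hits, bm25_hits]))
```

Documents that appear high in both lists (doc1, doc3) rise to the top; one-list-only hits sink.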
Reranking. After the initial search, run the top-20 candidates through a cross-encoder model (cross-encoder/ms-marco-MiniLM-L-6-v2) for precise reranking. This adds 50–100 ms but significantly improves relevance.
Semantic Search on a Website
The search "red running shoes" should find products with description "red athletic sneakers for running", even if the words don't match. Regular LIKE search can't do this.
Architecture: when a product or article is added, automatically generate its embedding and save it in pgvector. At search time, embed the query and find nearest neighbors by cosine similarity. An HNSW index on 100,000 vectors builds in 2–3 minutes and takes ~400 MB of memory for 1536-dimensional vectors.
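Brute-force nearest-neighbor search is what the HNSW index approximates; the core is plain cosine similarity. The catalog and its 3-dimensional "embeddings" below are invented for illustration (real ones are 1536-dimensional):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical product embeddings
catalog = {
    "red athletic sneakers": [0.9, 0.1, 0.3],
    "blue dress shoes":      [0.1, 0.9, 0.2],
}

def search(query_vec: list[float]) -> str:
    """Return the catalog item nearest to the query embedding."""
    return max(catalog, key=lambda name: cosine(query_vec, catalog[name]))
```

A query embedding close to the sneakers vector, e.g. for "red running shoes", resolves to the sneakers entry even though no words match.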
Recommendation Systems
Collaborative filtering ("users similar to you bought X") requires interaction history, at least several months of data. For a new product, a content-based approach works from day one: take the embedding of the currently viewed product and find similar items by vector distance.
Hybrid model: content-based for new users, collaborative for those with history; the usual switching threshold is 10–20 interactions. LightFM can combine both approaches in one model.
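The switch itself is trivial; the value is in the two recommenders behind it. A sketch with a hypothetical 15-interaction threshold and stub recommenders:

```python
# Cold-start switch between content-based and collaborative
# recommendations. Threshold and both recommenders are illustrative stubs.

INTERACTION_THRESHOLD = 15

def content_based(item_vec: list[float]) -> str:
    return f"items near vector {item_vec}"           # stub

def collaborative(history: list[str]) -> str:
    return f"items from {len(history)} interactions"  # stub

def recommend(user_history: list[str], current_item_vec: list[float]) -> str:
    if len(user_history) < INTERACTION_THRESHOLD:
        return content_based(current_item_vec)  # cold start
    return collaborative(user_history)
```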
For e-commerce with real traffic — A/B testing of recommendations is mandatory. CTR and conversion rate on recommended products — key metrics, not model accuracy.
Response Streaming
The user shouldn't wait for the model to generate the entire response. Server-Sent Events for token streaming is the standard: the OpenAI SDK supports stream: true and returns an AsyncIterator. On the frontend, use the Vercel AI SDK or a custom EventSource handler.
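On the wire, SSE is just text frames: each chunk becomes a `data:` line followed by a blank line. A sketch of the framing, with a plain iterator standing in for the model's token stream:

```python
# SSE framing for token streaming. The token iterator stands in for
# the model stream; [DONE] is the sentinel convention OpenAI uses.

def sse_frames(tokens):
    for token in tokens:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"

frames = list(sse_frames(["Hel", "lo"]))
```

The frontend's EventSource (or the Vercel AI SDK) parses these frames back into tokens as they arrive.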
A typical mistake is streaming to the frontend via WebSocket instead of SSE. For a unidirectional stream, SSE is simpler and more reliable.
Agent Orchestration
A simple chatbot answers questions. An agent can perform actions: create a ticket, check order status, book a time. LangChain or LangGraph for orchestrating chains of tool calls. Vercel AI SDK (useChat + tools) for Next.js projects — integration in a few lines.
The main difficulty with agents is reliability: the model sometimes calls the wrong tool or passes incorrect parameters. Validate each tool's input with Zod schemas, and use structured outputs for deterministic JSON.
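The validate-before-execute idea looks like this, sketched here in Python with a manual schema check (Zod plays this role in TypeScript); the tool name and fields are illustrative:

```python
# Reject malformed tool calls before they reach the real API.
from dataclasses import dataclass

@dataclass
class CreateTicketArgs:
    subject: str
    priority: str

ALLOWED_PRIORITIES = {"low", "normal", "high"}

def validate_create_ticket(raw: dict) -> CreateTicketArgs:
    if not isinstance(raw.get("subject"), str) or not raw["subject"]:
        raise ValueError("subject must be a non-empty string")
    if raw.get("priority") not in ALLOWED_PRIORITIES:
        raise ValueError(f"priority must be one of {ALLOWED_PRIORITIES}")
    return CreateTicketArgs(subject=raw["subject"], priority=raw["priority"])
```

A rejected call can be fed back to the model as an error message, which usually prompts it to retry with corrected parameters.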
Work Process
We start with a data audit: what exists, in what format, and how current it is. There's no point building RAG on outdated documentation. A prototype ships in 1–2 weeks with quality metrics (retrieval precision, hallucination rate via LLM-as-judge), followed by iterations on quality: chunking, embedding model, reranking.
Monitoring in production: LangSmith or Langfuse for tracing call chains, request logging for manual quality audit.
Timeline
RAG chatbot with indexing of existing knowledge base: 3–6 weeks. Semantic search on top of existing catalog: 2–4 weeks. Recommendation system with A/B testing: 6–10 weeks. Multi-agent system with tools and integrations: from 8 weeks.