Implementing RAG (Retrieval-Augmented Generation) for an AI Bot in a Mobile Application
RAG solves a specific problem: the model doesn't know your product, your documentation, or your internal regulations. Fine-tuning is expensive and slow to update; RAG is cheaper, more current, and more transparent. The user asks a question → the system retrieves relevant documentation fragments → passes them into the model's context → the model answers based on real data.
RAG System Components and Where They Live
RAG is not a single function but a pipeline of stages:
Ingestion (loading and indexing):
- Split documents into chunks (chunking)
- Create embeddings for each chunk
- Store them in a vector DB

Retrieval (search):
- Embed the user query
- Run vector search (cosine similarity / ANN)
- Rerank results (optional)

Generation:
- Build the prompt with retrieved context
- Call the LLM
- Postprocess the answer
On mobile, all of Ingestion and most of Retrieval are server-side tasks. The client makes an API request and gets an answer with sources.
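A minimal sketch of that client-side request, using OkHttp on Android. The endpoint path and the JSON shape (`question`, `answer`, `sources`) are assumptions for illustration, not a fixed contract:

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONObject

val client = OkHttpClient()

fun ask(question: String): JSONObject {
    val body = JSONObject().put("question", question).toString()
        .toRequestBody("application/json".toMediaType())
    val request = Request.Builder()
        .url("https://api.example.com/v1/assistant/ask") // hypothetical endpoint
        .post(body)
        .build()
    client.newCall(request).execute().use { response ->
        // Assumed response shape: { "answer": "...", "sources": [ { "title": "...", "url": "..." } ] }
        return JSONObject(response.body!!.string())
    }
}
```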
Chunking: The Most Underestimated Stage
RAG quality is determined by chunk quality. Bad chunking kills accuracy regardless of the model.
Fixed-size chunking (every 500 characters): don't. It breaks sentences and loses paragraph context.
Semantic chunking splits on semantic boundaries (headers, paragraphs, sentences) and works out of the box for Markdown and HTML. On Java/Kotlin, LangChain4j ships a recursive splitter (DocumentSplitters.recursive, the counterpart of LangChain's RecursiveCharacterTextSplitter) that falls back through delimiters such as ["\n\n", "\n", ". "], which is the correct approach.
Overlap of 10–20% between chunks: the last 50–100 tokens of the previous chunk are included at the start of the next one. This preserves context across boundaries.
Optimal chunk size depends on the document type: technical docs, 300–500 tokens; legal texts, 500–800 tokens; FAQ, one chunk per Q&A pair.
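For illustration, a minimal recursive splitter with overlap in plain Kotlin (no library). Sizes here are in characters for brevity; production code should count tokens with the embedding model's tokenizer:

```kotlin
// Split on paragraph boundaries first, fall back to lines, then sentences,
// and carry a tail of the previous chunk into the next one for overlap.
fun chunk(text: String, maxSize: Int = 1500, overlap: Int = 200): List<String> {
    val delimiters = listOf("\n\n", "\n", ". ")

    fun split(piece: String, level: Int): List<String> {
        if (piece.length <= maxSize) return listOf(piece)
        if (level >= delimiters.size) return piece.chunked(maxSize) // last resort: hard cut
        val parts = piece.split(delimiters[level])
        // Greedily pack parts back together while they still fit in maxSize
        val out = mutableListOf<String>()
        var current = StringBuilder()
        for (part in parts) {
            if (current.isNotEmpty() && current.length + part.length > maxSize) {
                out += split(current.toString(), level + 1)
                current = StringBuilder()
            }
            if (current.isNotEmpty()) current.append(delimiters[level])
            current.append(part)
        }
        if (current.isNotEmpty()) out += split(current.toString(), level + 1)
        return out
    }

    val chunks = split(text.trim(), 0)
    // Prepend the tail of the previous chunk to preserve context on boundaries
    return chunks.mapIndexed { i, c ->
        if (i == 0) c else chunks[i - 1].takeLast(overlap) + c
    }
}
```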
Embeddings: Model Choice
| Model | Dimensions | Context (tokens) | Cost | Best For |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | 8192 | Cheap | General content |
| text-embedding-3-large | 3072 | 8192 | Medium | Technical docs |
| nomic-embed-text | 768 | 8192 | Free (self-host) | Private data |
| multilingual-e5-large | 1024 | 512 | Free (self-host) | Multilingual |
For a mobile app with sensitive data, use a self-hosted model: OpenAI Embeddings sends your documents to OpenAI's servers.
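A sketch of calling a self-hosted embedding model, assuming nomic-embed-text is served via Ollama, whose POST /api/embeddings endpoint takes a model name and a prompt and returns a vector; adapt it to your own serving stack:

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONObject

// Returns the embedding vector for one chunk of text.
fun embed(client: OkHttpClient, text: String): FloatArray {
    val body = JSONObject()
        .put("model", "nomic-embed-text")
        .put("prompt", text)
        .toString()
        .toRequestBody("application/json".toMediaType())
    val request = Request.Builder()
        .url("http://localhost:11434/api/embeddings") // default Ollama port
        .post(body)
        .build()
    client.newCall(request).execute().use { response ->
        val vector = JSONObject(response.body!!.string()).getJSONArray("embedding")
        return FloatArray(vector.length()) { i -> vector.getDouble(i).toFloat() }
    }
}
```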
Retrieval: What Really Affects Quality
Hybrid search: combining vector search with keyword search (BM25-style ranking) gives better results than vectors alone. pgvector plus pg_trgm let you do this in PostgreSQL without separate infrastructure.
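A sketch of what the combination can look like over JDBC. The chunks table, column names, and the 0.7/0.3 weights are illustrative assumptions:

```kotlin
import java.sql.Connection

// Blend pgvector cosine similarity with pg_trgm keyword similarity.
fun hybridSearch(conn: Connection, queryText: String, queryVec: String, k: Int = 20): List<String> {
    val sql = """
        SELECT content,
               0.7 * (1 - (embedding <=> ?::vector))   -- vector: cosine similarity
             + 0.3 * similarity(content, ?)            -- keyword: trigram similarity
               AS score
        FROM chunks
        ORDER BY score DESC
        LIMIT ?
    """.trimIndent()
    conn.prepareStatement(sql).use { st ->
        st.setString(1, queryVec)  // pgvector text form, e.g. "[0.1, -0.2, ...]"
        st.setString(2, queryText)
        st.setInt(3, k)
        val rs = st.executeQuery()
        return buildList { while (rs.next()) add(rs.getString("content")) }
    }
}
```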
Reranking: after the vector search, take the top-20 results, run them through a cross-encoder model (cross-encoder/ms-marco-MiniLM-L-6-v2), and return the top-5. This significantly improves relevance. The Cohere Rerank API is an option if you don't want to self-host a model.
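A sketch of the reranking step, assuming a small self-hosted scoring service wraps the cross-encoder behind a hypothetical POST /rerank endpoint that returns one relevance score per (query, document) pair:

```kotlin
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject

// Score candidates against the query and keep the best topK.
fun rerank(client: OkHttpClient, query: String, candidates: List<String>, topK: Int = 5): List<String> {
    val body = JSONObject()
        .put("query", query)
        .put("documents", JSONArray(candidates))
        .toString()
        .toRequestBody("application/json".toMediaType())
    val request = Request.Builder()
        .url("http://reranker.internal/rerank") // hypothetical internal service
        .post(body)
        .build()
    client.newCall(request).execute().use { response ->
        // Assumed response shape: { "scores": [0.91, 0.12, ...] }, aligned with candidates
        val scores = JSONObject(response.body!!.string()).getJSONArray("scores")
        return candidates.indices
            .sortedByDescending { scores.getDouble(it) }
            .take(topK)
            .map { candidates[it] }
    }
}
```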
Metadata filtering: if documents carry metadata (date, section, language, document type), filter on it before the vector search. Searching among 10 thousand relevant chunks instead of a million is both faster and more accurate.
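In SQL terms this is just ordinary WHERE clauses ahead of the ANN ordering (column names are illustrative):

```kotlin
// Narrow the candidate set with metadata before the vector ordering.
val filteredSearch = """
    SELECT content
    FROM chunks
    WHERE lang = ?                                     -- e.g. 'en'
      AND doc_type = ?                                 -- e.g. 'faq'
      AND updated_at > now() - interval '1 year'
    ORDER BY embedding <=> ?::vector
    LIMIT 20
""".trimIndent()
```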
Building the Prompt with Context
System: You are a product assistant. Answer ONLY based on the provided context.
If the answer is not in the context, say so directly.

Context:
[Chunk 1]: <text>
[Chunk 2]: <text>
[Chunk 3]: <text>

User: How do I set up two-factor authentication?
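Assembling that prompt from retrieved chunks is straightforward; a minimal sketch (the function name and message split are illustrative):

```kotlin
// Build (system, user) messages from retrieved chunks and the question.
fun buildPrompt(question: String, chunks: List<String>): Pair<String, String> {
    val context = chunks
        .mapIndexed { i, c -> "[Chunk ${i + 1}]: $c" }
        .joinToString("\n")
    val system = "You are a product assistant. Answer ONLY based on the provided context.\n" +
        "If the answer is not in the context, say so directly.\n\n" +
        "Context:\n$context"
    return system to question // pass as (system message, user message) to your LLM API
}
```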
Citing sources is good practice. On mobile, show the list of chunks/documents under the answer so the user can verify where the information came from. This reduces hallucination risk and builds trust.
Mobile UI for RAG Bot
Answer rendering specifics:
- Stream via SSE so the answer appears gradually
- Sources under the answer (a collapsible list)
- A "Searching knowledge base" indicator during Retrieval (100–300 ms)
- A "Didn't find an answer" button to escalate to a human operator
Flutter: flutter_markdown for answer rendering plus a custom widget for sources. iOS: UILabel with NSAttributedString, or UITextView / WKWebView for Markdown. Android: Markwon, the best Markdown renderer for RecyclerView.
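For the streaming piece on Android, a sketch using OkHttp's okhttp-sse module; the endpoint is hypothetical, and each SSE event is assumed to carry a text fragment of the answer:

```kotlin
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.Response
import okhttp3.sse.EventSource
import okhttp3.sse.EventSourceListener
import okhttp3.sse.EventSources

// Append each streamed fragment to the chat bubble via onToken.
fun streamAnswer(client: OkHttpClient, question: String, onToken: (String) -> Unit) {
    val request = Request.Builder()
        .url("https://api.example.com/v1/assistant/stream?q=$question") // hypothetical; URL-encode in real code
        .build()
    EventSources.createFactory(client).newEventSource(request, object : EventSourceListener() {
        override fun onEvent(eventSource: EventSource, id: String?, type: String?, data: String) {
            onToken(data) // dispatch to the UI thread before touching views
        }
        override fun onFailure(eventSource: EventSource, t: Throwable?, response: Response?) {
            // Show a retry affordance; SSE drops are common on mobile networks
        }
    })
}
```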
Stages and Timeline
Audit the document corpus → design the indexing schema → choose a vector DB → implement the ingestion pipeline → configure hybrid search + reranking → integrate with the LLM → build the mobile chat UI with sources → evaluate quality (RAGAS or manual review) → iterate on prompts and chunking.
A basic RAG bot over simple documentation takes 3–5 weeks. A production system with hybrid search, reranking, multilingual support, and quality evaluation takes 8–12 weeks.