RAG (Retrieval-Augmented Generation) for AI Bot in Mobile App

NOVASOLUTIONS.TECHNOLOGY is engaged in the development, support, and maintenance of iOS, Android, and PWA mobile applications. We have extensive experience and expertise in publishing mobile applications in popular marketplaces such as Google Play, the App Store, Amazon, AppGallery, and others.
Development and support of all types of mobile applications:
Information and entertainment mobile applications
News apps, games, reference guides, online catalogs, weather apps, fitness and health apps, travel apps, educational apps, social networks and messengers, quizzes, blogs and podcasts, forums, aggregators
E-commerce mobile applications
Online stores, B2B apps, marketplaces, online exchanges, cashback services, dropshipping platforms, loyalty programs, food and goods delivery, payment systems.
Business process management mobile applications
CRM systems, ERP systems, project management, sales team tools, financial management, production management, logistics and delivery management, HR management, data monitoring systems
Electronic services mobile applications
Classified ads platforms, online schools, online cinemas, electronic service platforms, cashback platforms, video hosting, thematic portals, online booking and scheduling platforms, online trading platforms

These are just some of the types of mobile applications we work with; each may have its own features and functionality, tailored to the specific needs and goals of the client.


Implementing RAG (Retrieval-Augmented Generation) for AI Bot in a Mobile Application

RAG solves a specific problem: the model doesn't know your product, your documentation, or your internal regulations. Fine-tuning is expensive and slow to update; RAG is cheaper, more current, and more transparent. The user asks a question → the system retrieves relevant documentation fragments → passes them into the model's context → the model answers based on real data.

RAG System Components and Where They Live

RAG is not a single function but a pipeline of stages:

Ingestion (loading and indexing):

  1. Split documents into chunks (chunking)
  2. Create an embedding for each chunk
  3. Store the embeddings in a vector database

Retrieval (search):

  1. Embed the user's query
  2. Run vector search (cosine similarity / ANN)
  3. Rerank the results (optional)

Generation:

  1. Build the prompt with retrieved context
  2. Call the LLM
  3. Postprocess the answer
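The three stages can be sketched end-to-end in a few lines. This is a toy illustration, not a production implementation: `embed()` is a character-frequency stand-in for a real embedding model, the list `index` stands in for a vector database, and no LLM is actually called.

```python
import math

# Toy embedding: normalized character-frequency vector
# (stand-in for a real embedding model).
def embed(text: str) -> list[float]:
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# --- Ingestion: chunk, embed, store ---
docs = [
    "Two-factor authentication is enabled in Settings > Security.",
    "Refunds are processed within 14 days of purchase.",
]
index = [(d, embed(d)) for d in docs]  # in production: a vector DB

# --- Retrieval: embed the query, rank by similarity ---
query = "How do I enable two-factor authentication?"
q = embed(query)
top = max(index, key=lambda item: cosine(q, item[1]))

# --- Generation: build a prompt with the retrieved context ---
prompt = f"Context:\n{top[0]}\n\nUser: {query}"
print(top[0])
```

Swapping in a real embedding model and a real vector store changes the components but not the shape of the pipeline.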

On mobile, the entire Ingestion stage and most of Retrieval are server-side tasks. The client makes an API request and receives an answer with sources.

Chunking: The Most Underestimated Stage

RAG quality is determined by chunk quality. Bad chunking kills accuracy regardless of the model.

Fixed-size chunking (e.g., every 500 characters) is a bad idea: it breaks sentences and loses paragraph context.

Semantic chunking splits text at semantic boundaries (headers, paragraphs, sentences) and works out of the box for Markdown and HTML. Recursive splitting that cascades through delimiters like ["\n\n", "\n", ". "] is the correct approach; on Java/Kotlin, the LangChain4j library provides this via DocumentSplitters.recursive.

Overlap: keep 10–20% overlap between chunks, so the last 50–100 tokens of the previous chunk are included at the start of the next. This preserves context across chunk boundaries.

Optimal chunk size depends on the document type: 300–500 tokens for technical docs, 500–800 tokens for legal texts, and one chunk per question-and-answer pair for FAQs.
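A minimal sketch of paragraph-first splitting with overlap, in pure Python with no library. For simplicity the budget is counted in characters, while real splitters usually count tokens; the function name and limits are illustrative.

```python
def split_text(text: str, max_size: int = 500, overlap: int = 80) -> list[str]:
    """Split on paragraph boundaries first; hard-cut oversized paragraphs
    with overlapping windows so context survives the cut."""
    chunks: list[str] = []
    for para in text.split("\n\n"):          # semantic boundary: paragraph
        para = para.strip()
        if not para:
            continue
        if len(para) <= max_size:
            chunks.append(para)              # paragraph fits: keep it whole
            continue
        start = 0
        while start < len(para):             # oversized paragraph: hard cuts
            chunks.append(para[start:start + max_size])
            start += max_size - overlap      # step back by `overlap` chars
    return chunks

doc = "First paragraph about setup.\n\n" + "x" * 1200
chunks = split_text(doc, max_size=500, overlap=100)
print(len(chunks))  # → 4
```

A production splitter would add sentence-level fallback between the paragraph and hard-cut levels, as the recursive splitters mentioned above do.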

Embeddings: Model Choice

| Model | Dimension | Context (tokens) | Cost | Best For |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | 1536 | 8192 | Cheap | General content |
| text-embedding-3-large | 3072 | 8192 | Medium | Technical docs |
| nomic-embed-text | 768 | 8192 | Free (self-hosted) | Private data |
| multilingual-e5-large | 1024 | 512 | Free (self-hosted) | Multilingual |

For a mobile app with sensitive data, use a self-hosted model: OpenAI Embeddings sends your documents to OpenAI's servers.
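Dimension also drives storage cost: each vector stored as float32 takes dimension × 4 bytes. A quick back-of-the-envelope estimate (the helper name is illustrative):

```python
def index_size_bytes(n_chunks: int, dim: int, bytes_per_float: int = 4) -> int:
    """Raw float32 vector storage, excluding index overhead and metadata."""
    return n_chunks * dim * bytes_per_float

# One million chunks embedded with a 1536-dimensional model:
gb = index_size_bytes(1_000_000, 1536) / 1024**3
print(round(gb, 2))  # → 5.72 (GB of raw vectors)
```

This is one reason the 768-dimensional nomic-embed-text can be attractive: half the dimension means half the raw vector storage and cheaper similarity computation.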

Retrieval: What Really Affects Quality

Hybrid search: combining vector search with BM25-style keyword search gives better results than vectors alone. pgvector plus PostgreSQL full-text search (or pg_trgm) lets you do this inside PostgreSQL without separate infrastructure.
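A common way to merge the two ranked result lists is Reciprocal Rank Fusion (RRF). A minimal sketch over two hypothetical lists of chunk IDs:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
    k=60 is the conventional damping constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_top = ["c12", "c07", "c33"]   # e.g., from pgvector (cosine)
keyword_top = ["c07", "c91", "c12"]  # e.g., from full-text search
print(rrf([vector_top, keyword_top]))  # → ['c07', 'c12', 'c91', 'c33']
```

RRF needs only ranks, not raw scores, which sidesteps the problem that cosine similarities and keyword-relevance scores live on incompatible scales.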

Reranking: after vector search, take the top-20 results, run them through a cross-encoder model (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2), and return the top-5. This significantly improves relevance. Cohere's Rerank API is an option if you don't want to self-host a model.
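The top-20 → rerank → top-5 pattern looks like this. To keep the sketch self-contained, `score()` below is a crude word-overlap stand-in; in production it would be a cross-encoder scoring each (query, candidate) pair.

```python
def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    """Re-score retriever candidates and keep the best top_n.
    score() is a word-overlap stand-in for a cross-encoder call."""
    def score(text: str) -> float:
        q_words = set(query.lower().split())
        t_words = set(text.lower().split())
        return len(q_words & t_words) / (len(q_words) or 1)
    return sorted(candidates, key=score, reverse=True)[:top_n]

# top-20 candidates from vector search (mostly noise here on purpose)
candidates = [f"chunk about topic {i}" for i in range(20)]
candidates.append("reset your password in account settings")
best = rerank("how to reset password", candidates, top_n=5)
print(best[0])  # → reset your password in account settings
```

The point of the pattern: the cheap retriever casts a wide net, and the expensive scorer only ever sees 20 candidates, not the whole corpus.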

Metadata filtering: if documents carry metadata (date, section, language, document type), filter on it before vector search. Searching among 10 thousand relevant chunks instead of a million is both faster and more accurate.

Prompt Formation with Context

System: You are a product assistant. Answer ONLY based on the provided context.
If the answer is not in the context, say so directly.

Context:
[Chunk 1]: <text>
[Chunk 2]: <text>
[Chunk 3]: <text>

User: How to set up two-factor authentication?

Citing sources is good practice. On mobile, show the list of chunks/documents under the answer so the user can verify where the information comes from. This reduces hallucination risk and increases trust.
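Assembling a prompt of the shape shown above is straightforward; the only real design decision is the context budget. A sketch (function name and character budget are illustrative; real systems usually budget in tokens):

```python
def build_prompt(question: str, chunks: list[str],
                 max_context_chars: int = 6000) -> str:
    """Assemble system + numbered context + user question,
    dropping lowest-ranked chunks once the budget is exceeded."""
    system = (
        "You are a product assistant. Answer ONLY based on the provided context.\n"
        "If the answer is not in the context, say so directly.\n"
    )
    lines, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        if used + len(chunk) > max_context_chars:
            break  # chunks arrive ranked, so we keep the most relevant ones
        lines.append(f"[Chunk {i}]: {chunk}")
        used += len(chunk)
    return f"{system}\nContext:\n" + "\n".join(lines) + f"\n\nUser: {question}"

p = build_prompt("How to set up two-factor authentication?",
                 ["2FA lives in Settings > Security.",
                  "Use an authenticator app."])
print(p.count("[Chunk"))  # → 2
```

Keeping the chunk numbering stable ([Chunk 1], [Chunk 2], ...) also makes it easy to map the model's answer back to the source list shown in the UI.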

Mobile UI for RAG Bot

Answer rendering specifics:

  • Stream via SSE, so the answer appears gradually
  • Sources under the answer (collapsible list)
  • A "Searching knowledge base" indicator during retrieval (typically 100–300 ms)
  • A "Didn't find an answer" button for escalation to a human operator

Flutter: flutter_markdown for answer rendering, plus a custom widget for sources. iOS: UILabel with NSAttributedString or UITextView; WKWebView for full Markdown. Android: Markwon, the best Markdown renderer for RecyclerView.
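Whatever the client platform, the SSE stream itself is plain text: events are `data:` lines terminated by a blank line. A minimal parser of that wire format (stdlib Python for illustration; the `[DONE]` sentinel is a convention used by some APIs, not a universal rule):

```python
def parse_sse(raw: str) -> list[str]:
    """Collect the data payloads of each SSE event
    (a blank line marks the event boundary)."""
    events, buffer = [], []
    for line in raw.splitlines():
        if line.startswith("data:"):
            buffer.append(line[len("data:"):].strip())
        elif line == "" and buffer:
            events.append("\n".join(buffer))
            buffer = []
    if buffer:                     # flush a trailing, unterminated event
        events.append("\n".join(buffer))
    return events

raw = "data: Hel\n\ndata: lo!\n\ndata: [DONE]\n\n"
tokens = [e for e in parse_sse(raw) if e != "[DONE]"]
print("".join(tokens))  # → Hello!
```

On the client, each decoded event appends to the visible answer, which is what produces the "typing" effect during generation.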

Stages and Timeline

Audit the document corpus → design the indexing schema → choose a vector DB → implement the ingestion pipeline → configure hybrid search + reranking → integrate with the LLM → build the mobile chat UI with sources → evaluate quality (RAGAS or manual review) → iterate on prompts and chunking.

A basic RAG bot over simple documentation takes 3–5 weeks. A production system with hybrid search, reranking, multilingual support, and quality evaluation takes 8–12 weeks.