LLM (ChatGPT/Claude) Integration in Mobile Chatbot
Direct calls to the OpenAI API from a mobile app work for prototypes but kill production: a key embedded in the APK is compromised within hours. Correct architecture always puts a proxy server between the app and the LLM. This isn't over-engineering; it's a requirement.
Architecture: What Must Be on the Server
The backend performs tasks that can't be shifted to the client:
- Storing API keys for OpenAI/Anthropic
- Rate limiting per user — without it, one active user burns the monthly budget
- Dialog history — LLMs are stateless; each request must include prior messages
- Moderation — OpenAI's omni-moderation-latest or custom checks before sending to the model
- Caching identical requests (frequently asked questions)
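Per-user rate limiting is the piece most often skipped in prototypes. A minimal sketch, assuming a single server process (the class name, quota, and window length are illustrative; a real deployment would back this with Redis or similar):

```typescript
// Fixed-window rate limiter keyed by user ID.
// Hypothetical sketch: limit and window values are placeholders.
type Window = { count: number; resetAt: number };

class RateLimiter {
  private windows = new Map<string, Window>();
  constructor(private limit: number, private windowMs: number) {}

  // Returns true if the request is allowed, false if the user is over quota.
  allow(userId: string, now: number = Date.now()): boolean {
    const w = this.windows.get(userId);
    if (!w || now >= w.resetAt) {
      // No window yet, or the old one expired: start a fresh one
      this.windows.set(userId, { count: 1, resetAt: now + this.windowMs });
      return true;
    }
    if (w.count >= this.limit) return false;
    w.count += 1;
    return true;
  }
}

// e.g. 20 LLM requests per user per minute
const limiter = new RateLimiter(20, 60_000);
```

Checking `limiter.allow(userId)` before forwarding to the LLM is what keeps one active user from burning the monthly budget.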
Dialog history is the costliest aspect. Each additional exchange grows context, hence request cost. For a support bot, storing full history is unnecessary: keep the last 10–20 messages plus system prompt.
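The trimming strategy above can be sketched as a pure function over the OpenAI-style message shape (the function name and cutoff are illustrative):

```typescript
// Keep the system prompt plus only the most recent messages.
// Message shape follows the OpenAI chat format ({ role, content }).
type Message = { role: 'system' | 'user' | 'assistant'; content: string };

function trimHistory(history: Message[], maxMessages: number): Message[] {
  const system = history.filter((m) => m.role === 'system');
  const rest = history.filter((m) => m.role !== 'system');
  // slice(-n) keeps the last n user/assistant messages
  return [...system, ...rest.slice(-maxMessages)];
}
```

Applied before every request, this caps context growth: cost per request stays roughly flat instead of growing with conversation length.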
Streaming on Mobile Client
Users won't wait 5–10 seconds for a full response. You need streaming: the server sends tokens as generated via Server-Sent Events (SSE) or WebSocket; the client displays them in real-time.
The OpenAI API supports SSE via the stream: true parameter. On the server:

// Required SSE response headers
res.setHeader('Content-Type', 'text/event-stream');
res.setHeader('Cache-Control', 'no-cache');

const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages: conversationHistory,
  stream: true,
});

// Forward each token to the client as a separate SSE event
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content;
  if (delta) {
    res.write(`data: ${JSON.stringify({ token: delta })}\n\n`);
  }
}

res.write('data: [DONE]\n\n');
res.end();
On Android, the client reads SSE via OkHttp's EventSource:

val request = Request.Builder()
    .url("$baseUrl/chat/stream")
    .header("Accept", "text/event-stream")
    .post(body)
    .build()

val listener = object : EventSourceListener() {
    override fun onEvent(eventSource: EventSource, id: String?, type: String?, data: String) {
        if (data == "[DONE]") return
        // Each event carries one token; append it to the visible message
        val token = Json.decodeFromString<TokenEvent>(data).token
        viewModel.appendToken(token)
    }
}

EventSources.createFactory(okHttpClient).newEventSource(request, listener)
On iOS, read the SSE stream line-by-line with URLSession's bytes(for:) and its AsyncSequence; note that a plain dataTaskPublisher delivers the response only once complete, so it can't render tokens incrementally.
System Prompt: The Primary Behavior Control Tool
Bot quality is 80% determined by system prompt, not the choice between GPT-4o and Claude. Common mistakes:
An overly generic prompt. "You are a helpful store assistant" leaves the model too much latitude: it starts reasoning about unrelated topics and hallucinating non-existent promotions.
No knowledge domain limits. Explicitly write: "Answer only questions about Company X products. If off-topic, politely decline."
No response format specified. For mobile chat apps, long paragraphs are unwieldy — ask the model for brief answers, lists only when needed.
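A hypothetical prompt applying all three rules (the company name and policies are placeholders, not a recommended wording):

```typescript
// Sketch of a system prompt with domain limits and a response format.
const SYSTEM_PROMPT = `
You are the support assistant for Company X's mobile app.
Answer only questions about Company X products, orders, and delivery.
If the question is off-topic, politely decline and offer help with Company X topics.
Keep answers under three sentences; use a list only when enumerating options.
Never invent promotions, prices, or policies not present in the provided context.
`.trim();
```

Each line maps to one of the mistakes above: the first two bound the knowledge domain, the next sets the refusal behavior, the last two fix the response format and curb hallucinated promotions.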
Anthropic's Claude via the Messages API works similarly, but it doesn't accept system inside the messages array: the system prompt is a separate top-level parameter. Claude also holds its role better under jailbreak attempts, which matters for public-facing bots.
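The request-body difference can be shown side by side (model names and the max_tokens value are illustrative):

```typescript
// Anthropic Messages API: system prompt is a top-level field.
const anthropicBody = {
  model: 'claude-3-5-sonnet-latest',
  max_tokens: 1024,
  system: 'You are the support assistant for Company X.',
  messages: [{ role: 'user', content: 'Where is my order?' }],
};

// OpenAI Chat Completions: system prompt is the first message in the array.
const openaiBody = {
  model: 'gpt-4o',
  messages: [
    { role: 'system', content: 'You are the support assistant for Company X.' },
    { role: 'user', content: 'Where is my order?' },
  ],
};
```

A proxy server that supports both providers needs to perform exactly this translation when switching models.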
Function Calling (Tool Use)
For bots that should take action (create order, check status, find product), you need function calling. The model returns JSON with function name and parameters, not text. The server executes the function and returns results for the model to formulate a response.
tools = [{
"type": "function",
"function": {
"name": "get_order_status",
"description": "Get order status by number",
"parameters": {
"type": "object",
"properties": {
"order_id": {"type": "string", "description": "Order number"}
},
"required": ["order_id"]
}
}
}]
This enables bots that actually perform tasks, not just answer questions.
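The server side of that round trip can be sketched as follows, assuming the OpenAI tool_calls shape; lookupOrder is a hypothetical stand-in for a real database or CRM query:

```typescript
// Dispatch the model's tool calls to local handlers and package the results
// as `tool` messages for the follow-up request.
type ToolCall = { id: string; function: { name: string; arguments: string } };

// Placeholder for a real order lookup (hypothetical helper).
function lookupOrder(orderId: string) {
  return { order_id: orderId, status: 'shipped' };
}

const handlers: Record<string, (args: any) => string> = {
  get_order_status: ({ order_id }) => JSON.stringify(lookupOrder(order_id)),
};

function executeToolCalls(toolCalls: ToolCall[]) {
  // Each result is appended to the history as a `tool` message so the model
  // can turn it into a natural-language reply on the next request.
  return toolCalls.map((call) => ({
    role: 'tool' as const,
    tool_call_id: call.id,
    content: handlers[call.function.name](JSON.parse(call.function.arguments)),
  }));
}
```

Note that arguments arrive as a JSON string the model generated, so a production handler should validate them before touching real data.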
Model Selection
| Model | Context | Speed | Use Case |
|---|---|---|---|
| GPT-4o | 128K | Medium | Complex scenarios, long documents |
| GPT-4o mini | 128K | Fast | FAQ, simple queries |
| Claude 3.5 Haiku | 200K | Very fast | Bulk chats, streaming |
| Claude 3.5 Sonnet | 200K | Medium | Quality answers, tool use |
For mobile support chatbots, GPT-4o mini or Claude 3.5 Haiku offer the best speed-to-cost ratio.
Development Process
Designing architecture: use cases, tools (functions), history storage.
Developing backend: API proxy, rate limiting, context storage.
System prompt: testing edge cases, staying on topic.
Mobile client: SSE/WebSocket for streaming, "typing..." animations.
Load testing and tuning limits before launch.
Timeline Estimates
Basic LLM chatbot plus mobile client — 3–5 days. With function calling, history, rate limiting, moderation, dialog analytics — 2–4 weeks.