AI Chatbot Implementation in Mobile Applications
Integrating GPT-4o or Claude into a mobile chat isn't a matter of plugging in an SDK and calling it done. The real complexity starts after the first working request: managing conversation context, rendering streaming generation without UI jank, handling poor network conditions, and storing chat history between sessions without data leaks.
Conversation Context Management
All LLM APIs are stateless: each request to OpenAI, Anthropic, GigaChat, or YandexGPT must resend the full conversation history. That means storage and context truncation are your job. With a naive implementation, token cost grows 3–4x after 20 messages, and with a 128k context a response can take 30+ seconds.
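The cost growth is easy to see with back-of-the-envelope arithmetic. A sketch, assuming a flat ~100 tokens per message (real counts vary widely):

```swift
// Rough illustration: with stateless APIs each request resends the whole
// history, so total tokens sent grow quadratically with the number of turns.
let tokensPerMessage = 100  // assumed average; real counts vary

func tokensSent(forTurns turns: Int) -> Int {
    // The request for turn i carries all i-1 previous messages plus the new one.
    (1...turns).reduce(0) { $0 + $1 * tokensPerMessage }
}
```

Over 5 turns this sends ~1,500 tokens total; over 20 turns, ~21,000 — and the per-request cost of turn 20 alone (2,000 tokens) is 4x that of turn 5 (500 tokens), which is where the 3–4x figure comes from.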
A practical solution is a sliding window with summarization:
class ConversationManager {
    private var messages: [ChatMessage] = []
    private let summaryThreshold = 15   // compress once history grows past this
    private let keepRecent = 10         // recent messages kept verbatim
    private let llmClient: LLMClient    // injected; performs the summarize call

    init(llmClient: LLMClient) {
        self.llmClient = llmClient
    }

    func addMessage(_ message: ChatMessage) {
        messages.append(message)
        if messages.count > summaryThreshold {
            // Note: in production, make this type an actor so the compression
            // task can't race with concurrent addMessage calls.
            Task { await compressSummary() }
        }
    }

    private func compressSummary() async {
        // Summarize the oldest messages with a separate LLM request,
        // then replace them with a single summary message.
        let toCompress = Array(messages.prefix(messages.count - keepRecent))
        guard let summary = try? await llmClient.summarize(messages: toCompress) else { return }
        messages = [ChatMessage(role: .system, content: "Context: \(summary)")] +
                   Array(messages.suffix(keepRecent))
    }
}
The system prompt is a separate concern: it must always remain the first message. When compressing context, don't touch it.
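An alternative to counting messages is trimming by token budget while keeping the system prompt pinned. A minimal sketch — `Msg` and the ~4-characters-per-token estimator are simplifying assumptions, not a real tokenizer:

```swift
struct Msg { let role: String; let content: String }

// Hypothetical rough estimator: ~4 characters per token for English text.
func estimateTokens(_ m: Msg) -> Int { max(1, m.content.count / 4) }

// Drop the oldest non-system messages until the history fits the budget.
// The system prompt (index 0) is never touched.
func trimmed(_ history: [Msg], budget: Int) -> [Msg] {
    guard let system = history.first, system.role == "system" else { return history }
    var rest = Array(history.dropFirst())
    var total = estimateTokens(system) + rest.map(estimateTokens).reduce(0, +)
    while total > budget, !rest.isEmpty {
        total -= estimateTokens(rest.removeFirst())
    }
    return [system] + rest
}
```

This degrades gracefully: the window shrinks exactly as much as needed, rather than in fixed message-count jumps.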
Streaming Generation and UI
Users shouldn't stare at a spinner waiting for the full response. Streaming via SSE is the standard for modern LLM APIs. On iOS:
// Update the SwiftUI view through @Published
class ChatViewModel: ObservableObject {
    @Published var streamingText = ""

    func streamResponse(for prompt: String) {
        streamingText = ""
        Task {
            // Append each chunk on the main actor so SwiftUI re-renders.
            for try await chunk in llmClient.stream(prompt: prompt) {
                await MainActor.run {
                    streamingText += chunk
                }
            }
        }
    }
}
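The `llmClient.stream` call can be backed by `URLSession.bytes(for:)`. A sketch following OpenAI's SSE conventions (`data:` frames, `[DONE]` sentinel) — the endpoint, model name, and JSON shape are assumptions to adapt to your provider and backend proxy:

```swift
import Foundation

// Minimal SSE reader for a chat-completions-style streaming endpoint.
func streamChat(prompt: String, apiKey: String) -> AsyncThrowingStream<String, Error> {
    AsyncThrowingStream { continuation in
        let task = Task {
            do {
                var request = URLRequest(url: URL(string: "https://api.openai.com/v1/chat/completions")!)
                request.httpMethod = "POST"
                request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
                request.setValue("application/json", forHTTPHeaderField: "Content-Type")
                let body: [String: Any] = [
                    "model": "gpt-4o",
                    "stream": true,
                    "messages": [["role": "user", "content": prompt]]
                ]
                request.httpBody = try JSONSerialization.data(withJSONObject: body)

                let (bytes, _) = try await URLSession.shared.bytes(for: request)
                for try await line in bytes.lines {
                    // SSE frames look like: data: {"choices":[{"delta":{"content":"Hi"}}]}
                    guard line.hasPrefix("data: "), line != "data: [DONE]" else { continue }
                    let payload = Data(line.dropFirst(6).utf8)
                    if let obj = try? JSONSerialization.jsonObject(with: payload) as? [String: Any],
                       let choices = obj["choices"] as? [[String: Any]],
                       let delta = choices.first?["delta"] as? [String: Any],
                       let text = delta["content"] as? String {
                        continuation.yield(text)
                    }
                }
                continuation.finish()
            } catch {
                continuation.finish(throwing: error)
            }
        }
        continuation.onTermination = { _ in task.cancel() }
    }
}
```

Cancelling the consuming task tears down the underlying request via `onTermination`, which matters when the user navigates away mid-generation.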
On Android with Compose, expose a StateFlow&lt;String&gt; and collect it via collectAsState(). A common mistake is calling notifyDataSetChanged() or recreating the RecyclerView adapter on every chunk, which causes visible flicker. Update only the last message's text, not the entire list.
Offline Mode and Local Models
For basic scenarios (an FAQ bot, data formatting), consider on-device models. Apple's FoundationModels framework (iOS 26+) gives access to the local Apple Intelligence language model without a network round trip. Google ML Kit on Android provides SmartReply and EntityExtraction offline.
For more complex cases: llama.cpp via Metal on iOS or NNAPI on Android runs Llama 3 8B quantized to int4 directly on device. On an iPhone 15 Pro, generation speed is ~15 tokens/sec — acceptable for auxiliary features.
Chat History Storage
Chat history is personal data. Use SQLite/Core Data with encryption via SQLCipher or iOS Data Protection. Don't store it in UserDefaults — that's an unencrypted plist that ends up in device backups. On Android, use Room, with EncryptedSharedPreferences holding the encryption keys.
Cleanup strategy: auto-delete conversations older than N days, plus explicit deletion on user request — a GDPR/CCPA requirement.
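The retention part can be sketched as a cutoff computation plus one DELETE on launch or in a background task. The table and column names (`conversations`, `updated_at`) are hypothetical:

```swift
import Foundation

// Retention sketch: everything older than the cutoff gets deleted.
let retentionDays = 30

func cutoffDate(days: Int = retentionDays) -> Date {
    Date().addingTimeInterval(-TimeInterval(days) * 86_400)
}

// With SQLCipher/SQLite, run:
//   DELETE FROM conversations WHERE updated_at < ?;
// binding cutoffDate().timeIntervalSince1970. For an explicit GDPR/CCPA
// "delete my data" request, drop all rows and wipe the key material as well.
```

Deleting rows alone is not enough for the explicit-deletion path if the encryption key would still let a backup be decrypted — hence the note about key material.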
Common Production Issues
Repeating answers. The model sometimes gets stuck looping on a pattern. Setting presence_penalty: 0.6 and frequency_penalty: 0.3 reduces the probability. If it loops anyway, add client-side detection: if the last 3 bot messages share more than 60% identical n-grams, reset the context.
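The n-gram check above can be sketched with word trigrams; the 60% threshold comes from the text, while the trigram choice and the intersection-over-smallest-set metric are assumptions to tune:

```swift
// Loop-detection heuristic: if recent bot replies share more than 60%
// of their word trigrams, treat the model as looping and reset context.
func trigrams(_ text: String) -> Set<[String]> {
    let words = text.lowercased().split(separator: " ").map(String.init)
    guard words.count >= 3 else { return [] }
    return Set((0...(words.count - 3)).map { Array(words[$0..<($0 + 3)]) })
}

func looksLooped(lastReplies: [String], threshold: Double = 0.6) -> Bool {
    guard lastReplies.count >= 2 else { return false }
    let sets = lastReplies.map(trigrams)
    guard let smallest = sets.min(by: { $0.count < $1.count }),
          !smallest.isEmpty else { return false }
    // Trigrams present in every recent reply, relative to the shortest one.
    let shared = sets.dropFirst().reduce(sets[0]) { $0.intersection($1) }
    return Double(shared.count) / Double(smallest.count) > threshold
}
```

Run it on the assistant's messages only; user messages legitimately repeat ("yes", "ok") and would trigger false positives.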
Timeout on poor networks. LLM generation can run long. URLSession's default request timeout is 60 seconds — too short for long streamed responses. Set timeoutIntervalForResource to 120 and show a "thinking..." progress indicator if the first chunk hasn't arrived within 5 seconds.
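The timeout setup is a few lines of session configuration. A sketch with the values suggested above; note that for streams, `timeoutIntervalForRequest` governs idle time between chunks, while `timeoutIntervalForResource` caps the whole transfer:

```swift
import Foundation

// Session configuration for streamed LLM responses (values from the text; tune).
let config = URLSessionConfiguration.default
config.timeoutIntervalForRequest = 60    // max idle time between chunks
config.timeoutIntervalForResource = 120  // hard cap for the entire response
config.waitsForConnectivity = true       // don't fail instantly on flaky radio
let session = URLSession(configuration: config)
```

`waitsForConnectivity` is worth enabling for mobile: it defers the request until the radio is back instead of failing immediately in a tunnel or elevator.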
Moderation. Running user input through the OpenAI Moderation API before sending it to the model is a must for consumer apps. One POST /v1/moderations costs less than handling an App Store review complaint.
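The moderation call itself is a single request. A sketch against OpenAI's `/v1/moderations` endpoint — in a real app, route it through your backend proxy rather than shipping the API key in the client:

```swift
import Foundation

// Pre-send moderation check: returns true if the input should be blocked.
func isFlagged(_ input: String, apiKey: String) async throws -> Bool {
    var request = URLRequest(url: URL(string: "https://api.openai.com/v1/moderations")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.httpBody = try JSONSerialization.data(withJSONObject: ["input": input])

    let (data, _) = try await URLSession.shared.data(for: request)
    let obj = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    let results = obj?["results"] as? [[String: Any]]
    // The API returns a per-input `flagged` boolean plus per-category scores.
    return results?.first?["flagged"] as? Bool ?? false
}
```

Fail closed or open deliberately: the `?? false` default here lets messages through if the moderation call itself fails, which is a product decision, not a technical one.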
Implementation Process
1. Design the architecture: choose the LLM provider, on-device vs cloud, and the authorization scheme.
2. Build a backend proxy with rate limiting.
3. Implement ConversationManager with context management.
4. Build the chat UI: streaming, bubble layout, typing indicator.
5. Add encrypted chat history.
6. Test edge cases: network loss mid-generation, very long responses, parallel requests.
Timeline Guidelines
A simple chat with one LLM provider and no history — 5–7 days. A full-featured chatbot with history, context compression, offline mode, and moderation — 3–5 weeks.







