YandexGPT Integration in Mobile Applications
YandexGPT shines where Russian-language semantics matter: customer support, form autofill, content generation. OpenAI models excel at English; YandexGPT's advantage is specifically with Russian-speaking audiences, especially regional queries and domain-specific terminology.
The integration task looks simple at first: POST to https://llm.api.cloud.yandex.net/foundationModels/v1/completion with an IAM token and the text. In practice it is a chain of non-trivial decisions.
Authorization: IAM Token vs API Key
The first failure point is authorization. An IAM token lives 12 hours; an API key is permanent but less secure. You can't ship a service key inside the mobile app: it will be extracted from the APK in ten minutes with apktool. The correct scheme: the mobile client authenticates against your backend, and the backend holds the IAM token and proxies requests to Yandex Cloud.
```swift
// iOS: request through your own proxy, never directly to Yandex Cloud
struct YGPTRequest: Encodable {
    let prompt: String
    let maxTokens: Int
    let temperature: Double
}

// Shape of your proxy's response (your backend defines this contract)
struct CompletionResponse: Decodable {
    let text: String
}

func sendToYandexGPT(prompt: String, authToken: String) async throws -> String {
    let url = URL(string: "https://api.yourapp.com/ai/complete")!
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    request.setValue("Bearer \(authToken)", forHTTPHeaderField: "Authorization")
    request.httpBody = try JSONEncoder().encode(YGPTRequest(
        prompt: prompt,
        maxTokens: 500,
        temperature: 0.7
    ))
    let (data, _) = try await URLSession.shared.data(for: request)
    return try JSONDecoder().decode(CompletionResponse.self, from: data).text
}
```
On Android, the same pattern works via Retrofit with an OkHttp interceptor that injects the token.
Streaming Generation (stream: true)
The stream: true mode in the Yandex Foundation Models API returns the response in chunks, like ChatGPT. For mobile UX this matters: the user sees text as it is generated instead of staring at a spinner for 3–5 seconds.
On iOS, handle Server-Sent Events via URLSessionDataDelegate:
```swift
// Streaming via URLSessionDataDelegate; parseYGPTChunk is a helper that
// extracts the text from one SSE "data:" payload
class StreamingDelegate: NSObject, URLSessionDataDelegate {
    var onChunk: (String) -> Void
    private var buffer = Data()

    init(onChunk: @escaping (String) -> Void) {
        self.onChunk = onChunk
    }

    func urlSession(_ session: URLSession,
                    dataTask: URLSessionDataTask,
                    didReceive data: Data) {
        buffer.append(data)
        // SSE parsing: look for "data:" lines
        guard let text = String(data: buffer, encoding: .utf8) else { return }
        var lines = text.components(separatedBy: "\n")
        // Keep the last (possibly incomplete) line in the buffer; re-parsing
        // the whole buffer on every callback would emit duplicate chunks
        let remainder = lines.removeLast()
        buffer = Data(remainder.utf8)
        for line in lines where line.hasPrefix("data: ") {
            let json = String(line.dropFirst(6))
            if let chunk = parseYGPTChunk(json) {
                DispatchQueue.main.async { self.onChunk(chunk) }
            }
        }
    }
}
```
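The parseYGPTChunk helper is referenced but not defined above. A minimal sketch, assuming each chunk payload mirrors the documented completion response shape (result.alternatives[0].message.text); verify the exact field names against the current Foundation Models API reference, and check whether streamed alternatives carry a delta or the accumulated text so far, since that determines how the UI appends:

```swift
import Foundation

// Assumed chunk shape, mirroring the synchronous completion response:
// {"result":{"alternatives":[{"message":{"role":"assistant","text":"..."}}]}}
struct YGPTChunk: Decodable {
    struct Result: Decodable {
        struct Alternative: Decodable {
            struct Message: Decodable { let text: String }
            let message: Message
        }
        let alternatives: [Alternative]
    }
    let result: Result
}

func parseYGPTChunk(_ json: String) -> String? {
    guard let data = json.data(using: .utf8),
          let chunk = try? JSONDecoder().decode(YGPTChunk.self, from: data)
    else { return nil }
    return chunk.result.alternatives.first?.message.text
}
```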
On Android, use OkHttp with EventSource (the okhttp-sse artifact) or parse the stream line by line with a BufferedReader.
Models and Parameters
Yandex provides several options: yandexgpt-lite is fast and cheap, yandexgpt is the full version, yandexgpt-32k handles long contexts. For most mobile scenarios (chat hints, autofill), yandexgpt-lite suffices and is noticeably faster.
| Model | Context | Response Speed | Use Case |
|---|---|---|---|
| yandexgpt-lite | 8k tokens | ~1–2 sec | Suggestions, summaries |
| yandexgpt | 8k tokens | ~3–5 sec | Complex tasks |
| yandexgpt-32k | 32k tokens | ~8–15 sec | Long documents |
The temperature parameter ranges from 0 to 1: 0.2–0.4 for deterministic answers (an FAQ bot), 0.7–0.9 for creative text.
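On the backend side, model choice and these parameters come together in the body of the request to the completion endpoint. A sketch of that body, assuming the documented Foundation Models format (a modelUri containing your folder ID, plus a completionOptions block); double-check field names against the current API reference:

```json
{
  "modelUri": "gpt://<folder-id>/yandexgpt-lite",
  "completionOptions": {
    "stream": false,
    "temperature": 0.4,
    "maxTokens": "500"
  },
  "messages": [
    { "role": "system", "text": "You are a support assistant." },
    { "role": "user", "text": "How do I reset my password?" }
  ]
}
```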
Caching and Limits
Yandex Cloud charges by tokens. Cache repeated queries on mobile client — typical pattern for FAQ or onboarding. Simple LRU cache of 100 entries in memory cuts costs on repeated sessions.
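The LRU cache mentioned above fits in a few lines. A minimal in-memory sketch (the type and its API are illustrative, not from any SDK; the array-backed recency order is O(n) per operation, which is fine at 100 entries):

```swift
import Foundation

// Minimal in-memory LRU cache for prompt -> completion pairs.
final class LRUCache<Key: Hashable, Value> {
    private let capacity: Int
    private var storage: [Key: Value] = [:]
    private var order: [Key] = []   // least recently used first

    init(capacity: Int = 100) { self.capacity = capacity }

    func value(for key: Key) -> Value? {
        guard let value = storage[key] else { return nil }
        touch(key)
        return value
    }

    func insert(_ value: Value, for key: Key) {
        if storage[key] == nil && storage.count >= capacity {
            // Evict the least recently used entry
            let evicted = order.removeFirst()
            storage[evicted] = nil
        }
        storage[key] = value
        touch(key)
    }

    private func touch(_ key: Key) {
        order.removeAll { $0 == key }
        order.append(key)
    }
}
```

Check the cache before calling sendToYandexGPT, and insert the answer after a successful response; for FAQ-style traffic, normalizing the prompt (trimming, lowercasing) before lookup raises the hit rate.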
The Yandex Foundation Models rate limit is 10 RPS per folder by default. Under peak load (many concurrent users), you need a queue on the backend, not direct calls from devices.
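One way to keep the proxy under the folder limit is a token bucket in front of the outgoing calls. A sketch (any server stack works; shown in Swift for consistency with the other snippets, and a production proxy would queue or shed load rather than just refuse):

```swift
import Foundation

// Token bucket: refills `rate` tokens per second up to `burst`.
final class TokenBucket {
    private let rate: Double      // tokens per second (10 for the default folder limit)
    private let burst: Double
    private var tokens: Double
    private var last: Date
    private let lock = NSLock()

    init(rate: Double = 10, burst: Double = 10) {
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.last = Date()
    }

    /// Returns true if a request may proceed now; false means back off or enqueue.
    func tryAcquire(now: Date = Date()) -> Bool {
        lock.lock(); defer { lock.unlock() }
        tokens = min(burst, tokens + now.timeIntervalSince(last) * rate)
        last = now
        guard tokens >= 1 else { return false }
        tokens -= 1
        return true
    }
}
```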
Implementation Process
1. Audit scenarios: where exactly the LLM is needed (support, text generation, request classification).
2. Choose the model and mode (synchronous / stream).
3. Build the proxy service on the backend with IAM token management.
4. Integrate into the mobile app with UI for streaming generation.
5. Test answer quality on real user queries and tune the system prompt.
Timeline Guidelines
Basic integration through a proxy without streaming: 2–3 days. A full chat interface with streaming generation, caching, and error handling: 5–8 days.