AI Long Text Summarization in Mobile Applications
Long text summarization hits one constraint immediately: the model's context window. GPT-4o accepts 128K tokens (roughly 100K words); Claude 3 accepts 200K. That sounds like plenty, but a 200-page legal contract, a technical report, or a book can all exceed the limit. And even when the text fits, a long context is expensive and slows the response.
Strategies for Different Text Lengths
Direct summarization works for texts up to 50–80K tokens: send the entire text in one request and ask for a summary. Simple and cheap to implement. The limitation is token cost and latency (the model processes a large context more slowly).
Map-Reduce — for texts exceeding the context window. Split into chunks → summarize each → summarize the summaries:

```python
import asyncio

async def map_reduce_summarize(text: str, chunk_size: int = 4000) -> str:
    chunks = split_text(text, chunk_size)
    # Map: summarize each chunk in parallel
    chunk_summaries = await asyncio.gather(*[
        summarize_chunk(chunk) for chunk in chunks
    ])
    # Reduce: summarize the summaries
    combined = "\n\n".join(chunk_summaries)
    if count_tokens(combined) > chunk_size:
        # Still too big: recurse on the combined summaries
        return await map_reduce_summarize(combined, chunk_size)
    return await summarize_final(combined)
```
asyncio.gather fires the API requests for all chunks in parallel: for 10 chunks, the total time is close to that of a single chunk.
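The Map-Reduce code above assumes a split_text helper. A minimal stdlib sketch (the function name and the ~4-characters-per-token heuristic are assumptions; a real implementation would count tokens exactly with tiktoken and prefer semantic boundaries):

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: 1 token is about 4 characters of English text.
    return len(text) // 4

def split_text(text: str, chunk_size: int) -> list[str]:
    chunks, current = [], []
    current_tokens = 0
    # Split on paragraph boundaries so chunks stay semantically whole.
    for para in text.split("\n\n"):
        para_tokens = approx_tokens(para)
        if current and current_tokens + para_tokens > chunk_size:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += para_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A chunk can exceed chunk_size if a single paragraph does; production code would fall back to sentence-level splitting in that case.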
Refine — summarize the first chunk, then refine the summary with each subsequent chunk, enriching it sequentially. Higher quality than Map-Reduce for coherent narratives, but slower, since the requests are sequential.
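A sketch of the Refine loop; summarize_chunk and refine_summary are hypothetical stand-ins for real LLM API calls (replaced here by placeholders so the shape of the sequential fold is visible):

```python
import asyncio

# Hypothetical LLM calls — placeholders for real API requests.
async def summarize_chunk(chunk: str) -> str:
    return f"summary({chunk[:20]}...)"

async def refine_summary(existing: str, new_text: str) -> str:
    return f"{existing} + refined({new_text[:20]}...)"

async def refine_summarize(chunks: list[str]) -> str:
    # Seed with the first chunk, then fold in each next one sequentially.
    summary = await summarize_chunk(chunks[0])
    for chunk in chunks[1:]:
        summary = await refine_summary(existing=summary, new_text=chunk)
    return summary
```

Unlike Map-Reduce, each call depends on the previous result, so there is nothing to parallelize.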
Managing Prompt Size and Token Count
A major mistake is not counting tokens before sending. tiktoken (Python) or gpt-tokenizer (JS) give an exact count:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

async def summarize(text: str) -> str:
    token_count = len(enc.encode(text))
    if token_count < 100_000:
        return await direct_summarize(text)
    elif token_count < 500_000:
        return await map_reduce_summarize(text, chunk_size=8000)
    else:
        return await map_reduce_summarize(text, chunk_size=4000)
```
Different summary types require different prompts:
- Executive summary (for management): 3–5 sentences, only key decisions and numbers
- Detailed narrative: structured list with subheadings
- Key points list: bullets without flowing text
- Question-and-answer: "what is this document and what should be done"
On mobile, offer the user a choice of summary type before starting.
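The type selection maps naturally to a prompt table. A minimal sketch; the type keys and prompt wording are illustrative, not a fixed API:

```python
# Illustrative prompts per summary type; names and wording are assumptions.
SUMMARY_PROMPTS = {
    "executive": "Summarize in 3-5 sentences for management: only key decisions and numbers.",
    "detailed": "Produce a structured summary with subheadings.",
    "key_points": "List the key points as short bullets, no flowing text.",
    "qa": "Answer: what is this document and what should be done?",
}

def build_prompt(summary_type: str, text: str) -> str:
    # The selected instruction is prepended to the document text.
    return f"{SUMMARY_PROMPTS[summary_type]}\n\n---\n\n{text}"
```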
Summarization Progress on Mobile
Summarizing a 100-page document takes 15–60 seconds; without a progress indicator that is bad UX. The backend sends events via SSE:
```
event: progress
data: {"step": "chunking", "total_chunks": 12, "completed": 0}

event: progress
data: {"step": "summarizing", "total_chunks": 12, "completed": 4}

event: result
data: {"summary": "...", "word_count": 450}
```
On the mobile client, show a progress bar with a step description and animated text like "Processing pages 1–25...".
Streaming the final summary also matters: the user sees the text appear gradually instead of waiting several seconds for the complete response.
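Whatever framework serves the stream, each update is one SSE frame: an event line, a data line, and a blank separator. A minimal framework-agnostic formatter matching the wire format above:

```python
import json

def sse_event(event: str, data: dict) -> str:
    # One frame of the SSE wire format: event line, data line, blank line.
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"
```

A FastAPI or aiohttp handler would yield these frames from an async generator as each chunk finishes.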
Long Document Specifics
Lost in the Middle. Research shows LLMs process information from the middle of a long context worse than from the beginning and the end. With Map-Reduce this is not a problem, since each chunk gets its own context; with direct summarization it is a limitation worth knowing.
Duplication in the summary. With Map-Reduce, the final summarization pass may repeat similar points from different chunks. State it explicitly in the prompt: "Merge similar points, don't repeat one idea twice".
Structured output. For legal and financial documents, a summary in JSON with fixed fields (parties, obligations, deadlines, key_figures) is more reliable than free text. Use OpenAI's response_format: {"type": "json_object"} or Anthropic's structured outputs.
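Even with JSON mode, the reply should be validated before anything downstream trusts it. A sketch of that check; the field names follow the example above, and parse_structured_summary is a hypothetical helper:

```python
import json

# Field names follow the article's example schema for legal documents.
REQUIRED_FIELDS = {"parties", "obligations", "deadlines", "key_figures"}

def parse_structured_summary(raw: str) -> dict:
    # Validate the model's JSON reply before trusting it downstream.
    summary = json.loads(raw)
    missing = REQUIRED_FIELDS - summary.keys()
    if missing:
        raise ValueError(f"summary missing fields: {sorted(missing)}")
    return summary
```

On a validation failure the server can retry the request rather than ship a broken summary to the client.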
Caching
Summarizing a document costs money, so cache results by content hash + summary type. Redis with a 7–30 day TTL is the standard approach. If the document changes, invalidate the cache by document_id.
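A minimal sketch of the cache key; the summary:... key format and TTL value are assumptions within the ranges named above:

```python
import hashlib

SUMMARY_TTL_SECONDS = 14 * 24 * 3600  # within the article's 7-30 day range

def summary_cache_key(content: bytes, summary_type: str) -> str:
    # Hash the raw content so identical documents share one cache entry,
    # and include the summary type so each variant is cached separately.
    digest = hashlib.sha256(content).hexdigest()
    return f"summary:{digest}:{summary_type}"

# With redis-py the store would be roughly:
#   redis.setex(summary_cache_key(doc, "executive"), SUMMARY_TTL_SECONDS, result)
```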
Implementation Timeline
Determine strategy for different document sizes → server pipeline with Map-Reduce → streaming API with progress → mobile UI with summary type selection → caching → test quality on real documents.
Basic summarization (up to 100K tokens) with mobile UI — 1–2 weeks. Full pipeline with Map-Reduce, streaming, caching, multiple summary types — 3–5 weeks.