Document Upload and Indexing for RAG in Mobile Applications
A user attaches a PDF from Files.app or the gallery, hits Upload, and within seconds can ask questions about the document. Behind those seconds lies a pipeline: upload, parsing, chunking, embedding creation, and a write to the vector DB. Each stage has its own bottlenecks.
File Upload from Mobile: Technical Details
Android. ActivityResultContracts.GetContent() with "application/pdf" or "*/*" is the right approach on Android 13+. You get a content:// Uri. For a server upload, you need an InputStream:
```kotlin
val uri: Uri = // from ActivityResult
val bytes = contentResolver.openInputStream(uri)?.use { it.readBytes() }
    ?: throw IOException("Failed to open file")

// Multipart upload via OkHttp
val requestBody = MultipartBody.Builder()
    .setType(MultipartBody.FORM)
    .addFormDataPart(
        "file", filename,
        bytes.toRequestBody("application/octet-stream".toMediaType())
    )
    .build()
```
For large files (50+ MB), use a chunked upload. Don't read the entire file into a ByteArray at once: on devices with 2 GB RAM this causes an OutOfMemoryError. Stream the InputStream directly through OkHttp by overriding RequestBody.writeTo.
iOS. UIDocumentPickerViewController with UTType.pdf, UTType.plainText, etc. You get a file:// URL. For the upload:
```swift
let data = try Data(contentsOf: fileURL)

// For large files, upload from a file or stream instead
var request = URLRequest(url: uploadEndpoint)
request.httpMethod = "POST"
let (_, response) = try await URLSession.shared.upload(for: request, from: data)
```
Data(contentsOf:) for files over 20 MB is a bad idea on iOS. Use URLSession.shared.uploadTask(with:fromFile:) instead: it reads the file in chunks without loading it all into memory.
Upload Progress. URLSession exposes task progress on iOS; with OkHttp on Android, wrap RequestBody.writeTo in a counting sink. A progress bar during document upload is mandatory: the user must understand what's happening with their 10 MB file.
Server Pipeline: From File to Vectors
After receiving the file, the backend runs an async pipeline. The synchronous response to the client is {"job_id": "abc123", "status": "processing"}; the client polls for status or receives a push.
```python
# FastAPI + Celery task
@app.post("/api/documents")
async def upload_document(file: UploadFile, user_id: str = Depends(get_user_id)):
    # Save file
    file_path = save_to_storage(await file.read(), file.filename)
    # Start async processing
    job = process_document.delay(file_path, user_id, file.content_type)
    return {"job_id": job.id, "status": "processing"}

@celery.task
def process_document(file_path: str, user_id: str, content_type: str):
    # 1. Parsing
    text = extract_text(file_path, content_type)
    # 2. Chunking
    chunks = split_into_chunks(text, chunk_size=500, overlap=50)
    # 3. Embeddings batch
    embeddings = create_embeddings_batch(chunks)
    # 4. Upsert to vector DB
    upsert_to_vector_store(chunks, embeddings, user_id)
    # 5. Update status
    update_document_status(file_path, "completed")
```
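The split_into_chunks helper in the task above can be sketched as a sliding window; a minimal version, counting chunk_size and overlap in words for simplicity (a production version would count tokens of the embedding model):

```python
def split_into_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Sliding-window chunking over whitespace-separated words.

    Consecutive chunks share `overlap` words so that sentences cut at
    a boundary still appear intact in at least one chunk.
    """
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```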
Document Parsing
| Format | Tool | Notes |
|---|---|---|
| PDF (text) | PyMuPDF (fitz) | Fast, preserves structure |
| PDF (scanned) | Tesseract + pdf2image | Slow, needs OCR |
| DOCX | python-docx | Without images |
| TXT / MD | Native | Trivial |
| HTML | BeautifulSoup | Need tag cleanup |
| XLSX | openpyxl | Tables → text per row |
PyMuPDF is the best choice for PDF: roughly 10x faster than PyPDF2, handles Cyrillic correctly, and preserves font information (useful for heading detection).
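The extract_text step from the pipeline can dispatch on content type along the lines of the table above. A sketch with only the trivial text path implemented; the commented entries are placeholders a real implementation would back with PyMuPDF and python-docx:

```python
from pathlib import Path

def parse_plain_text(path: str) -> str:
    # TXT / MD: no parsing needed, just decode
    return Path(path).read_text(encoding="utf-8", errors="replace")

# MIME type -> parser function
PARSERS = {
    "text/plain": parse_plain_text,
    "text/markdown": parse_plain_text,
    # "application/pdf": parse_pdf,   # PyMuPDF: fitz.open(path), page.get_text()
    # "application/vnd.openxmlformats-officedocument.wordprocessingml.document":
    #     parse_docx,                 # python-docx
}

def extract_text(file_path: str, content_type: str) -> str:
    try:
        parser = PARSERS[content_type]
    except KeyError:
        raise ValueError(f"Unsupported content type: {content_type}")
    return parser(file_path)
```

Rejecting unknown types explicitly keeps a bad upload from silently producing an empty, unsearchable document.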
Showing Indexing Status on Mobile
While the document is being processed, show progress. Two options:

Polling. Every 2–3 seconds, request /api/documents/{job_id}/status. Simple and works everywhere. Downside: extra requests.

WebSocket / SSE. The client subscribes to events for the job_id. The backend sends updates: {"step": "chunking", "progress": 0.3} → {"step": "embedding", "progress": 0.7} → {"step": "completed"}. Better UX, but harder to implement on the client in background mode.
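The polling option can be sketched like this (pure Python for illustration; fetch_status stands in for the HTTP call to /api/documents/{job_id}/status, and sleep is injectable so the loop is testable):

```python
import time

def poll_until_done(fetch_status, interval: float = 2.0,
                    timeout: float = 300.0, sleep=time.sleep):
    """Poll fetch_status() until it reports a terminal step.

    fetch_status is assumed to return a dict like
    {"step": "chunking", "progress": 0.3}; "completed" and "failed"
    are treated as terminal states.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status.get("step") in ("completed", "failed"):
            return status
        sleep(interval)
    raise TimeoutError("indexing status polling timed out")
```

The timeout matters: without it, a crashed worker leaves the client spinning forever on "processing".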
After indexing completes, notify the user and refresh the document list. Documents are stored with metadata: filename, size, upload date, chunk count, status.
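That metadata can be modeled as a small record; a sketch of the fields listed above (field names are illustrative, not a fixed schema):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DocumentMeta:
    document_id: str
    user_id: str
    filename: str
    size_bytes: int
    chunk_count: int = 0
    status: str = "processing"  # processing | completed | failed
    uploaded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
```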
Managing User Documents
The user must be able to delete documents. Deletion means:
- Delete the file from storage
- Delete all chunks from the vector DB (by user_id + document_id)
- Update the record in the relational DB
In Pinecone: index.delete(filter={"document_id": "xyz"}, namespace=user_id). In pgvector: DELETE FROM documents WHERE document_id = $1 AND user_id = $2.
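Since deletion touches all three stores, the ordering is worth pinning down. A sketch with the store operations injected as callables (hypothetical names, so the same sequence works for Pinecone or pgvector): chunks go first, so a failure partway through never leaves orphaned vectors behind, and a retry simply re-runs the idempotent sequence.

```python
def delete_document(document_id: str, user_id: str,
                    delete_vectors, delete_file, delete_record):
    """Remove a document from the vector DB, file storage, and relational DB.

    Order matters: vectors first (orphaned chunks would still surface
    in search results), then the file, then the metadata record.
    """
    delete_vectors(document_id=document_id, user_id=user_id)
    delete_file(document_id=document_id)
    delete_record(document_id=document_id, user_id=user_id)
```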
Implementation Timeline
File upload with progress on mobile → backend pipeline with task queue → format parsing → chunking and embeddings → vector DB upsert → status API and push notifications → document management → testing on real files.
An MVP with PDF + TXT, basic chunking, and pgvector: 3–4 weeks. The full pipeline with OCR, multiple formats, an async queue, and WebSocket status: 6–8 weeks.