Building RAG with FAISS for Local Vector Search
FAISS (Facebook AI Similarity Search) is a vector search library from Meta, not a traditional vector database: it runs in-process, in memory or on disk, with no network layer at all. That makes it ideal for embedding directly into applications, for offline scenarios, and for environments where calling an external service is unacceptable.
FAISS Index Types
| Index | Method | Search Speed | Accuracy | RAM | Application |
|---|---|---|---|---|---|
| IndexFlatL2 | Brute force | Slow | 100% | High | Testing, small corpus |
| IndexFlatIP | Brute force (inner product) | Slow | 100% | High | Testing (cosine via L2 normalization) |
| IndexIVFFlat | IVF clustering | Fast | 95–99% | Medium | 100K–10M vectors |
| IndexHNSW | HNSW graph | Fast | 98–99% | Medium | 10K–100M vectors |
| IndexIVFPQ | IVF + Product Quantization | Very fast | 85–95% | Low | >10M vectors |
| IndexIVFSQ8 | IVF + Scalar Quantization | Fast | 90–97% | Low | Speed/memory balance |
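Conceptually, the brute-force indexes (IndexFlatL2/IndexFlatIP) just compute the distance from the query to every stored vector and keep the k smallest, which is why they are exact but slow. A plain NumPy sketch of that exhaustive search, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 64)).astype(np.float32)  # 1000 vectors, dim 64
query = rng.standard_normal(64).astype(np.float32)

# Exhaustive L2 search: what IndexFlatL2 does internally
distances = np.linalg.norm(corpus - query, axis=1)
top_k = np.argsort(distances)[:5]  # indices of the 5 nearest vectors
```

The IVF and HNSW variants trade a little recall for speed by visiting only a subset of the corpus per query.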
Creating an Index and Indexing
```python
import pickle

import faiss
import numpy as np
from openai import OpenAI

openai_client = OpenAI()


def build_faiss_index(texts: list[str], dimension: int = 1536) -> tuple:
    """Creates an HNSW FAISS index and the corresponding text list."""
    # Get embeddings in batches
    embeddings = []
    batch_size = 100
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        embeddings.extend(e.embedding for e in response.data)

    # Convert to float32 -- FAISS only accepts float32 arrays
    vectors = np.array(embeddings, dtype=np.float32)

    # Normalize so L2 distance ranks results the same as cosine similarity
    faiss.normalize_L2(vectors)

    # Create HNSW index with M=16 neighbors per node
    index = faiss.IndexHNSWFlat(dimension, 16)
    index.hnsw.efConstruction = 200  # build-time quality/speed trade-off
    index.add(vectors)
    return index, texts


# Saving to disk
def save_index(index, texts, path_prefix: str):
    faiss.write_index(index, f"{path_prefix}.index")
    with open(f"{path_prefix}_texts.pkl", "wb") as f:
        pickle.dump(texts, f)


# Loading
def load_index(path_prefix: str) -> tuple:
    index = faiss.read_index(f"{path_prefix}.index")
    with open(f"{path_prefix}_texts.pkl", "rb") as f:
        texts = pickle.load(f)
    return index, texts
```
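The `normalize_L2` call is what lets an L2 (or inner-product) index return cosine-ranked results: once every vector is scaled to unit length, the inner product of two vectors equals their cosine similarity. A quick NumPy check on two synthetic vectors:

```python
import numpy as np

a = np.array([3.0, 4.0], dtype=np.float32)
b = np.array([1.0, 2.0], dtype=np.float32)

# Cosine similarity of the raw vectors
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After L2 normalization (what faiss.normalize_L2 does in place)
a_n = a / np.linalg.norm(a)
b_n = b / np.linalg.norm(b)
inner = a_n @ b_n  # plain inner product of the unit vectors

assert abs(cosine - inner) < 1e-6
```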
Search and RAG Answer
```python
def faiss_rag_answer(
    question: str,
    index: faiss.Index,
    texts: list[str],
    top_k: int = 5,
) -> str:
    # Embed the question with the same model used at indexing time
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question,
    ).data[0].embedding
    query_vector = np.array([query_embedding], dtype=np.float32)
    faiss.normalize_L2(query_vector)

    # Search: returns (distances, indices), each of shape (1, top_k)
    distances, indices = index.search(query_vector, top_k)

    # Extract texts; FAISS pads with -1 when fewer than top_k hits exist
    context_texts = [texts[i] for i in indices[0] if i >= 0]
    context = "\n\n---\n\n".join(context_texts)

    # Generate the answer
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer strictly based on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```
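Because the stored vectors and the query are both L2-normalized, the distances returned by `index.search` can be converted back to cosine similarities (useful for thresholding weak matches): for unit vectors, squared L2 distance satisfies d² = 2 − 2·cos, and FAISS's L2 indexes report squared distances. A NumPy check on synthetic unit vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.standard_normal(8)
q /= np.linalg.norm(q)  # unit-length query
v = rng.standard_normal(8)
v /= np.linalg.norm(v)  # unit-length document vector

d2 = np.sum((q - v) ** 2)      # squared L2 distance, as an L2 index reports
cos = q @ v                    # cosine similarity of unit vectors
similarity = 1.0 - d2 / 2.0    # recover cosine from the distance
assert abs(similarity - cos) < 1e-9
```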
GPU Acceleration of FAISS
FAISS can move an index to GPU (with the `faiss-gpu` build) for roughly 10–100× search acceleration. Note that GPU support covers flat and IVF-family indexes; HNSW indexes are not supported on GPU and stay on CPU.

```python
# Move a CPU index to GPU (requires the faiss-gpu package)
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)  # device 0

# Search on GPU -- same API as the CPU index
distances, indices = gpu_index.search(query_vectors, top_k)
```
When to Choose FAISS
FAISS is preferred for:
- Offline processing (no network access)
- Embedding into existing Python applications
- Very high throughput on batch search (>10K QPS)
- Research tasks and experimentation
FAISS is undesirable for:
- Workloads needing real-time updates (most FAISS index types have no efficient in-place delete or update; rebuilding is the norm)
- Multiple services accessing one index (no network API)
- Metadata filtering needed (no built-in payload filtering)
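The metadata limitation is commonly worked around by over-fetching from FAISS (e.g. 4× the desired `top_k`) and post-filtering the hits against a parallel metadata list kept alongside `texts`. A minimal sketch with a hypothetical `metadata` structure and a stand-in for the FAISS result:

```python
def post_filter(indices, metadata, predicate, top_k):
    """Keep only hits whose metadata passes `predicate`, up to top_k."""
    hits = []
    for i in indices:
        if i >= 0 and predicate(metadata[i]):  # i == -1 means "no hit"
            hits.append(i)
        if len(hits) == top_k:
            break
    return hits

# Over-fetched candidate indices (stand-in for index.search output)
metadata = [{"lang": "en"}, {"lang": "de"}, {"lang": "en"}, {"lang": "en"}]
raw = [1, 3, 0, 2]
result = post_filter(raw, metadata, lambda m: m["lang"] == "en", top_k=2)
# result == [3, 0]
```

This keeps FAISS as a pure vector index; when filters are highly selective, over-fetching degrades and a database with native payload filtering is a better fit.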
Timelines
- Developing FAISS RAG pipeline: 3–7 days
- Index optimization and testing: 2–4 days
- Total: 1–2 weeks