RAG Development with FAISS for Local Vector Search

We design and deploy artificial intelligence systems, from prototype to production-ready solution. Our team combines expertise in machine learning, data engineering, and MLOps to take AI out of the lab and into real business.
Building RAG with FAISS for Local Vector Search

FAISS (Facebook AI Similarity Search) is a vector search library from Meta, not a traditional database. It is a high-performance, in-process search engine that runs in memory (with optional on-disk storage) and involves no network round-trips. That makes it ideal for embedding directly into applications, for offline scenarios, and for environments where calling external services is not an option.

FAISS Index Types

| Index | Method | Search speed | Accuracy | RAM | Typical use |
|---|---|---|---|---|---|
| IndexFlatL2 | Brute force (L2) | Slow | 100% | High | Testing, small corpora |
| IndexFlatIP | Brute force (inner product) | Slow | 100% | High | Cosine similarity via L2-normalized vectors |
| IndexIVFFlat | IVF clustering | Fast | 95–99% | Medium | 100K–10M vectors |
| IndexHNSWFlat | HNSW graph | Fast | 98–99% | Medium | 10K–100M vectors |
| IndexIVFPQ | IVF + product quantization | Very fast | 85–95% | Low | >10M vectors |
| IndexIVFScalarQuantizer (SQ8) | IVF + scalar quantization | Fast | 90–97% | Low | Speed/memory balance |

Creating an Index and Indexing

import faiss
import numpy as np
import pickle
from openai import OpenAI

openai_client = OpenAI()

def build_faiss_index(texts: list[str], dimension: int = 1536) -> tuple:
    """Creates a FAISS index and corresponding text list"""

    # Get embeddings in batches
    embeddings = []
    batch_size = 100
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-3-small",
            input=batch,
        )
        batch_embeddings = [e.embedding for e in response.data]
        embeddings.extend(batch_embeddings)

    # Convert to numpy float32
    vectors = np.array(embeddings, dtype=np.float32)

    # Normalize to unit length: on unit vectors, L2 distance is monotonic in
    # cosine similarity (d^2 = 2 - 2*cos), so L2 search ranks like cosine search
    faiss.normalize_L2(vectors)

    # Create HNSW index: M=16 neighbors per graph node, no training step needed
    index = faiss.IndexHNSWFlat(dimension, 16)
    index.hnsw.efConstruction = 200  # build-time quality/speed trade-off
    index.add(vectors)

    return index, texts

# Saving to disk
def save_index(index, texts, path_prefix: str):
    faiss.write_index(index, f"{path_prefix}.index")
    with open(f"{path_prefix}_texts.pkl", "wb") as f:
        pickle.dump(texts, f)

# Loading
def load_index(path_prefix: str) -> tuple:
    index = faiss.read_index(f"{path_prefix}.index")
    with open(f"{path_prefix}_texts.pkl", "rb") as f:
        texts = pickle.load(f)
    return index, texts

Search and RAG Answer

def faiss_rag_answer(
    question: str,
    index: faiss.Index,
    texts: list[str],
    top_k: int = 5
) -> str:
    # Embed the question
    query_embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=question,
    ).data[0].embedding

    query_vector = np.array([query_embedding], dtype=np.float32)
    faiss.normalize_L2(query_vector)

    # Search
    distances, indices = index.search(query_vector, top_k)

    # Extract texts
    context_texts = [texts[i] for i in indices[0] if i >= 0]
    context = "\n\n---\n\n".join(context_texts)

    # Generate answer
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer strictly based on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}
        ],
        temperature=0,
    )
    return response.choices[0].message.content

GPU Acceleration of FAISS

FAISS can move supported indexes to the GPU, which typically yields a 10–100× speedup on large batched searches. This requires the faiss-gpu build, and graph-based indexes such as HNSW remain CPU-only; GPU support covers the flat and IVF families:

# Move the index to GPU 0 (requires faiss-gpu; flat/IVF indexes only)
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)

# Search on GPU; results match the CPU index
distances, indices = gpu_index.search(query_vectors, top_k)

When to Choose FAISS

FAISS is preferred for:

  • Offline processing (no network access)
  • Embedding into existing Python applications
  • Very high throughput on batch search (>10K QPS)
  • Research tasks and experimentation

FAISS is a poor fit when:

  • Real-time updates are needed (FAISS has limited support for deletions and partial updates)
  • Multiple services must share one index (there is no network API)
  • Metadata filtering is required (no built-in payload filtering)

Timelines

  • Developing FAISS RAG pipeline: 3–7 days
  • Index optimization and testing: 2–4 days
  • Total: 1–2 weeks