Embedding Model Selection and Setup for RAG
An embedding model is one of the most critical components of a RAG system: retrieval quality depends directly on how well the model represents texts in vector space. Switching the embedding model often yields a larger recall improvement than tuning chunking or search parameters.
Categories of Embedding Models
Proprietary API Models:

- text-embedding-3-large (OpenAI, dim=3072): best quality on most MTEB benchmarks
- text-embedding-3-small (OpenAI, dim=1536): good price-to-quality ratio
- embed-v3 (Cohere): strong on retrieval tasks, supports the input_type parameter
Open Models (self-hosted):

- BAAI/bge-m3 (dim=1024): multilingual, supports dense+sparse+ColBERT
- BAAI/bge-large-en-v1.5 (dim=1024): best for English
- intfloat/multilingual-e5-large (dim=1024): good for Russian
- nomic-ai/nomic-embed-text-v1.5 (dim=768): matryoshka (truncatable dimensions)
MTEB Benchmark: Model Comparison
On Retrieval tasks (BEIR benchmark, averaged nDCG@10):
| Model | nDCG@10 (BEIR avg) | Dim | Max tokens | Type |
|---|---|---|---|---|
| text-embedding-3-large | 54.9 | 3072 | 8191 | API |
| text-embedding-3-small | 51.7 | 1536 | 8191 | API |
| cohere embed-v3 | 55.0 | 1024 | 512 | API |
| BAAI/bge-m3 | 54.0 | 1024 | 8192 | Open |
| intfloat/e5-mistral-7b | 56.9 | 4096 | 32768 | Open |
| nomic-embed-text-v1.5 | 53.5 | 768 | 8192 | Open |
For Russian-language tasks, the picture is different — we recommend testing on your own domain.
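nDCG@10, the metric used in the table above, rewards putting relevant documents near the top of the ranking. A minimal, dependency-free sketch of how it is computed from per-position relevance labels:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    def dcg(rels):
        # Gain at rank i is discounted by log2(i + 2) (ranks are 1-based)
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; pushing relevant docs down lowers the score
print(ndcg_at_k([1, 1, 0, 0]))        # 1.0
print(ndcg_at_k([0, 1, 0, 1]) < 1.0)  # True
```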
Setting up Cohere Embed v3 with input_type
Cohere embed-v3 requires specifying input_type — this is important for retrieval:
```python
import cohere

co = cohere.Client(api_key="...")

def embed_documents(texts: list[str]) -> list[list[float]]:
    """For document indexing."""
    response = co.embed(
        texts=texts,
        model="embed-multilingual-v3.0",
        input_type="search_document",  # For documents during indexing
    )
    return response.embeddings

def embed_query(query: str) -> list[float]:
    """For search queries."""
    response = co.embed(
        texts=[query],
        model="embed-multilingual-v3.0",
        input_type="search_query",  # Asymmetric model: different types
    )
    return response.embeddings[0]
```
Using the correct input_type increases recall by 8–15% — the model is trained on asymmetric retrieval.
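Once documents and queries are embedded with their respective input_type, retrieval itself is a nearest-neighbor search over the vectors. A minimal sketch with numpy, using toy 3-dimensional vectors in place of real model output:

```python
import numpy as np

def top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3) -> list[int]:
    """Rank documents by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity of each document to the query
    return np.argsort(-scores)[:k].tolist()

# Toy embeddings standing in for real model output
docs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])
print(top_k(query, docs, k=2))  # → [0, 2]
```

In production this brute-force scan is replaced by an ANN index (FAISS, HNSW, or the vector database's own), but the similarity measure is the same.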
Self-hosted BGE-M3
BGE-M3 is the most universal open-source model: it supports dense, sparse (SPLADE), and ColBERT-style multi-vector retrieval from a single model:
```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel(
    "BAAI/bge-m3",
    use_fp16=True,  # Memory savings
    device="cuda",
)

# Dense embeddings (for standard ANN search)
dense_embeddings = model.encode(
    texts,
    batch_size=32,
    max_length=8192,
    return_dense=True,
    return_sparse=False,
    return_colbert_vecs=False,
)["dense_vecs"]

# Sparse embeddings (for BM25-like search)
sparse_embeddings = model.encode(
    texts,
    return_dense=False,
    return_sparse=True,
)["lexical_weights"]  # dict {token: weight}

# Hybrid retrieval score
def compute_bge_m3_score(query_dense, doc_dense, query_sparse, doc_sparse,
                         alpha=0.5) -> float:
    dense_score = np.dot(query_dense, doc_dense)
    sparse_score = sum(
        query_sparse.get(token, 0) * doc_sparse.get(token, 0)
        for token in query_sparse
    )
    return alpha * dense_score + (1 - alpha) * sparse_score
```
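A toy check of the hybrid scoring, redefining the scorer with hand-made vectors so it runs without the model (the sparse dicts map tokens to weights, as BGE-M3's lexical_weights do):

```python
import numpy as np

def compute_bge_m3_score(query_dense, doc_dense, query_sparse, doc_sparse,
                         alpha=0.5) -> float:
    # Weighted sum of dense dot product and sparse lexical overlap
    dense_score = np.dot(query_dense, doc_dense)
    sparse_score = sum(query_sparse.get(t, 0) * doc_sparse.get(t, 0)
                       for t in query_sparse)
    return alpha * dense_score + (1 - alpha) * sparse_score

q_dense = np.array([0.6, 0.8])
d_dense = np.array([0.8, 0.6])
q_sparse = {"rag": 0.9, "embedding": 0.5}
d_sparse = {"rag": 0.7, "vector": 0.3}

# dense = 0.96, sparse = 0.63 (only "rag" overlaps), blended 50/50
score = compute_bge_m3_score(q_dense, d_dense, q_sparse, d_sparse, alpha=0.5)
print(round(score, 3))  # → 0.795
```

The alpha weight is worth tuning on your own queries: keyword-heavy domains usually benefit from a lower alpha (more sparse weight).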
Choosing Dimensionality: Matryoshka Embeddings
Nomic Embed and several other models support matryoshka embeddings — you can use the first N dimensions without retraining:
```python
from openai import OpenAI

client = OpenAI()

# text-embedding-3-large with reduced dimensionality:
# 3072 → 1536 at request time, no retraining (officially supported by OpenAI)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=1536,  # Reduce dimensionality
)
```
This allows you to reduce vector database RAM requirements by 2× with minimal (2–5%) quality loss.
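For self-hosted matryoshka models the same trick is a local operation: keep the first N dimensions and re-normalize. A sketch with numpy, using random vectors in place of real model output:

```python
import numpy as np

def truncate_matryoshka(embs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and L2-normalize the result."""
    truncated = embs[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Stand-in for a batch of 768-dim nomic-embed-text-v1.5 vectors
full = np.random.default_rng(0).normal(size=(4, 768))
small = truncate_matryoshka(full, 256)
print(small.shape)                                      # (4, 256)
print(np.allclose(np.linalg.norm(small, axis=1), 1.0))  # True
```

Re-normalization matters: cosine similarity assumes unit-length vectors, and truncation alone leaves the norms below 1.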
Practical Embedding Model Selection
- If data is confidential / must stay on-premise: BGE-M3 or e5-mistral-7b (self-hosted).
- If you need the best Russian-language support: test BGE-M3, multilingual-e5-large, and text-embedding-3-large on your domain. There is no universal winner.
- If minimal latency is required: text-embedding-3-small (API) or nomic-embed-text-v1.5 (small, self-hosted).
- If you need hybrid sparse+dense without running two models: BGE-M3 is the only open-source model in this list with native support for both modes.
Evaluation on Your Domain
```python
from ragas import evaluate
from ragas.metrics import context_recall, context_precision

# Compare two models on your dataset. build_retriever and add_contexts are
# placeholders for your own pipeline: RAGAS evaluates a dataset that already
# contains the retrieved contexts for each question.
for model_name in ["text-embedding-3-small", "text-embedding-3-large"]:
    retriever = build_retriever(model_name)
    dataset = add_contexts(test_dataset, retriever)  # attach retrieved chunks
    scores = evaluate(dataset, metrics=[context_recall, context_precision])
    print(f"{model_name}: recall={scores['context_recall']:.3f}, "
          f"precision={scores['context_precision']:.3f}")
```
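If you want a dependency-free sanity check before reaching for RAGAS, retrieval recall@k reduces to set overlap between retrieved and gold chunk ids. A minimal sketch with hypothetical ids:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# Hypothetical chunk ids for one test query
retrieved = ["c12", "c07", "c33", "c01", "c90"]
relevant = {"c07", "c01", "c55"}
print(recall_at_k(retrieved, relevant, k=5))  # 2 of 3 relevant found
```

Averaged over a few hundred annotated queries, this single number is often enough to pick between two embedding models.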
Timeline
- Setting up embedding model and indexing: 2–5 days
- Comparative testing of 2–3 models: 3–5 days
- Total: 1–2 weeks