Embedding Model Selection and Setup for RAG


An embedding model is one of the most critical components of a RAG system. Retrieval quality directly depends on how well the model represents texts in vector space. Switching the embedding model can yield a greater recall improvement than optimizing chunking or search parameters.

Categories of Embedding Models

Proprietary API Models:

  • text-embedding-3-large (OpenAI, dim=3072): best quality on most MTEB benchmarks
  • text-embedding-3-small (OpenAI, dim=1536): good price-to-quality ratio
  • embed-v3 (Cohere): strong on retrieval tasks, supports input_type parameter

Open Models (self-hosted):

  • BAAI/bge-m3 (dim=1024): multilingual, supports dense+sparse+colbert
  • BAAI/bge-large-en-v1.5 (dim=1024): best for English
  • intfloat/multilingual-e5-large (dim=1024): good for Russian
  • nomic-ai/nomic-embed-text-v1.5 (dim=768): matryoshka embeddings (truncatable dimensions)
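One practical detail for the E5 family: these models are trained with instruction prefixes, and omitting them degrades retrieval quality. A minimal helper for applying them (the function name is ours, for illustration):

```python
def e5_format(texts: list[str], kind: str) -> list[str]:
    """E5-family models expect a task prefix on every input:
    'query: ' for search queries, 'passage: ' for indexed documents."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {t}" for t in texts]

print(e5_format(["How do I reset my password?"], "query"))
# → ['query: How do I reset my password?']
```

Pass the prefixed strings to the model as-is; the prefix is part of the input text, not a separate parameter.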

MTEB Benchmark: Model Comparison

On Retrieval tasks (BEIR benchmark, averaged nDCG@10):

Model                       nDCG@10 (BEIR avg)   Dim    Max tokens   Type
text-embedding-3-large      54.9                 3072   8191         API
text-embedding-3-small      51.7                 1536   8191         API
cohere embed-v3             55.0                 1024   512          API
BAAI/bge-m3                 54.0                 1024   8192         Open
intfloat/e5-mistral-7b      56.9                 4096   32768        Open
nomic-embed-text-v1.5       53.5                 768    8192         Open

For Russian-language tasks, the picture is different — we recommend testing on your own domain.
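The nDCG@10 metric used in the table can be computed directly from graded relevance labels of the top-ranked results; a minimal sketch:

```python
import math

def dcg_at_k(relevances: list[float], k: int = 10) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

print(ndcg_at_k([3, 2, 1, 0]))  # → 1.0 (perfect ranking)
```

Benchmark scores average this value over many queries and datasets, so treat the table as a rough ordering, not an absolute quality guarantee for your domain.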

Setting up Cohere Embed v3 with input_type

Cohere embed-v3 requires the input_type parameter, which tells the model whether it is embedding a document or a search query — this matters for retrieval:

import cohere

co = cohere.Client(api_key="...")

def embed_documents(texts: list[str]) -> list[list[float]]:
    """For document indexing"""
    response = co.embed(
        texts=texts,
        model="embed-multilingual-v3.0",
        input_type="search_document",  # For documents during indexing
    )
    return response.embeddings

def embed_query(query: str) -> list[float]:
    """For search queries"""
    response = co.embed(
        texts=[query],
        model="embed-multilingual-v3.0",
        input_type="search_query",  # Asymmetric model — different types
    )
    return response.embeddings[0]

Using the correct input_type increases recall by 8–15% — the model is trained on asymmetric retrieval.

Self-hosted BGE-M3

BGE-M3 is the most universal open-source model: it supports dense, sparse (SPLADE-style lexical weights), and ColBERT-style multi-vector retrieval from a single model:

import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel(
    "BAAI/bge-m3",
    use_fp16=True,   # Memory savings
    device="cuda",
)

# Dense embeddings (for standard ANN search)
dense_embeddings = model.encode(
    texts,
    batch_size=32,
    max_length=8192,
    return_dense=True,
    return_sparse=False,
    return_colbert_vecs=False,
)["dense_vecs"]

# Sparse embeddings (for BM25-like search)
sparse_embeddings = model.encode(
    texts,
    return_dense=False,
    return_sparse=True,
)["lexical_weights"]  # dict {token: weight}

# Hybrid retrieval score
def compute_bge_m3_score(query_dense, doc_dense, query_sparse, doc_sparse,
                          alpha=0.5) -> float:
    dense_score = np.dot(query_dense, doc_dense)
    sparse_score = sum(
        query_sparse.get(token, 0) * doc_sparse.get(token, 0)
        for token in query_sparse
    )
    return alpha * dense_score + (1 - alpha) * sparse_score

Choosing Dimensionality: Matryoshka Embeddings

Nomic Embed and several other models support matryoshka embeddings — you can use the first N dimensions without retraining:

from openai import OpenAI

client = OpenAI()

# text-embedding-3-large with reduced dimensionality
# 3072 → 1536 via the dimensions parameter (officially supported by OpenAI)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=1536,  # Reduce dimensionality
)

This allows you to reduce vector database RAM requirements by 2× with minimal (2–5%) quality loss.
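For models without a server-side dimensions parameter, matryoshka truncation can be done client-side: keep the first N components and re-normalize so cosine similarity still behaves. A sketch, assuming unit-normalized input vectors:

```python
import numpy as np

def truncate_matryoshka(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize to unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Example: 4 random unit vectors of dim 768 → dim 256
full = np.random.randn(4, 768).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)
short = truncate_matryoshka(full, 256)
print(short.shape)  # → (4, 256)
```

Note that this only preserves quality for models explicitly trained with a matryoshka objective (such as nomic-embed-text-v1.5); truncating an ordinary embedding model this way will hurt retrieval badly.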

Practical Embedding Model Selection

If data is confidential / on-premise: BGE-M3 or E5-mistral-7b (self-hosted).

If you need the best Russian language support: test BGE-M3, multilingual-e5-large, and text-embedding-3-large on your domain. There is no universal winner.

If minimal latency is required: text-embedding-3-small (API) or nomic-embed-text-v1.5 (small, self-hosted).

If you need hybrid sparse+dense without two models: BGE-M3 — the only open-source model with native support for both modes.
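These rules of thumb can be condensed into a small decision helper (purely illustrative — the function and its priority order are our simplification of the guidance above):

```python
def pick_embedding_model(on_premise: bool, need_hybrid: bool,
                         low_latency: bool) -> str:
    """Map deployment constraints to a starting-point model choice."""
    if need_hybrid:
        return "BAAI/bge-m3"            # only open model with native dense+sparse
    if on_premise:
        return "BAAI/bge-m3"            # self-hosted, multilingual
    if low_latency:
        return "text-embedding-3-small" # cheap, fast API
    return "text-embedding-3-large"     # default: strongest API model

print(pick_embedding_model(on_premise=False, need_hybrid=False,
                           low_latency=True))
# → text-embedding-3-small
```

Whatever this returns, still validate the shortlist on your own domain, as the next section shows.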

Evaluation on Your Domain

from ragas import evaluate
from ragas.metrics import context_recall, context_precision

# Compare two embedding models on your own test set.
# build_retriever() and build_eval_dataset() are your own helpers: the
# dataset must contain questions, retrieved contexts, and ground truth.
for model_name in ["text-embedding-3-small", "text-embedding-3-large"]:
    retriever = build_retriever(model_name)
    eval_dataset = build_eval_dataset(retriever, test_questions)
    scores = evaluate(eval_dataset, metrics=[context_recall, context_precision])
    print(f"{model_name}: recall={scores['context_recall']:.3f}, "
          f"precision={scores['context_precision']:.3f}")

Timeline

  • Setting up embedding model and indexing: 2–5 days
  • Comparative testing of 2–3 models: 3–5 days
  • Total: 1–2 weeks