Embedding Model Selection and Setup for RAG
An embedding model is one of the most critical components of a RAG system: retrieval quality depends directly on how well the model represents texts in vector space. Switching the embedding model often yields a larger recall improvement than tuning chunking or search parameters.
Categories of Embedding Models
Proprietary API Models:

- text-embedding-3-large (OpenAI, dim=3072): best quality on most MTEB benchmarks
- text-embedding-3-small (OpenAI, dim=1536): good price-to-quality ratio
- embed-v3 (Cohere): strong on retrieval tasks, supports the input_type parameter
Open Models (self-hosted):

- BAAI/bge-m3 (dim=1024): multilingual, supports dense+sparse+ColBERT
- BAAI/bge-large-en-v1.5 (dim=1024): best for English
- intfloat/multilingual-e5-large (dim=1024): good for Russian
- nomic-ai/nomic-embed-text-v1.5 (dim=768): matryoshka (truncatable dimensions)
MTEB Benchmark: Model Comparison
On Retrieval tasks (BEIR benchmark, averaged nDCG@10):
| Model | nDCG@10 (BEIR avg) | Dim | Max tokens | Type |
|---|---|---|---|---|
| text-embedding-3-large | 54.9 | 3072 | 8191 | API |
| text-embedding-3-small | 51.7 | 1536 | 8191 | API |
| cohere embed-v3 | 55.0 | 1024 | 512 | API |
| BAAI/bge-m3 | 54.0 | 1024 | 8192 | Open |
| intfloat/e5-mistral-7b | 56.9 | 4096 | 32768 | Open |
| nomic-embed-text-v1.5 | 53.5 | 768 | 8192 | Open |
For Russian-language tasks, the picture is different — we recommend testing on your own domain.
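nDCG@10, the metric used in the table above, rewards putting relevant documents near the top of the ranking. A minimal, dependency-free sketch of how it is computed from per-position relevance labels:

```python
import math

def ndcg_at_k(relevances: list[float], k: int = 10) -> float:
    """nDCG@k: DCG of the actual ranking divided by DCG of the ideal ranking."""
    def dcg(rels):
        # Gain at rank i is discounted by log2(i + 2) (ranks are 1-based)
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# A perfect ranking scores 1.0; pushing relevant docs down lowers the score
print(ndcg_at_k([1, 1, 0, 0]))        # 1.0
print(ndcg_at_k([0, 1, 0, 1]) < 1.0)  # True
```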
Setting up Cohere Embed v3 with input_type
Cohere embed-v3 requires specifying input_type — this is important for retrieval:
```python
import cohere

co = cohere.Client(api_key="...")

def embed_documents(texts: list[str]) -> list[list[float]]:
    """For document indexing."""
    response = co.embed(
        texts=texts,
        model="embed-multilingual-v3.0",
        input_type="search_document",  # For documents during indexing
    )
    return response.embeddings

def embed_query(query: str) -> list[float]:
    """For search queries."""
    response = co.embed(
        texts=[query],
        model="embed-multilingual-v3.0",
        input_type="search_query",  # Asymmetric model: different types
    )
    return response.embeddings[0]
```
Using the correct input_type increases recall by 8–15% — the model is trained on asymmetric retrieval.
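Once documents and queries are embedded with their respective input_type, retrieval itself is a nearest-neighbor search over the vectors. A minimal sketch with numpy, using toy 3-dimensional vectors in place of real model output:

```python
import numpy as np

def top_k(query_emb: np.ndarray, doc_embs: np.ndarray, k: int = 3) -> list[int]:
    """Rank documents by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = d @ q  # cosine similarity of each document to the query
    return np.argsort(-scores)[:k].tolist()

# Toy embeddings standing in for real model output
docs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])
print(top_k(query, docs, k=2))  # → [0, 2]
```

In production this brute-force scan is replaced by an ANN index (FAISS, HNSW, or the vector database's own), but the similarity measure is the same.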
Self-hosted BGE-M3
BGE-M3 is the most universal open-source model: it supports dense, sparse (SPLADE), and ColBERT-style multi-vector retrieval from a single model:
```python
import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel(
    "BAAI/bge-m3",
    use_fp16=True,  # Memory savings
    device="cuda",
)

# Dense embeddings (for standard ANN search)
dense_embeddings = model.encode(
    texts,
    batch_size=32,
    max_length=8192,
    return_dense=True,
    return_sparse=False,
    return_colbert_vecs=False,
)["dense_vecs"]

# Sparse embeddings (for BM25-like search)
sparse_embeddings = model.encode(
    texts,
    return_dense=False,
    return_sparse=True,
)["lexical_weights"]  # dict {token: weight}

# Hybrid retrieval score
def compute_bge_m3_score(query_dense, doc_dense, query_sparse, doc_sparse,
                         alpha=0.5) -> float:
    dense_score = np.dot(query_dense, doc_dense)
    sparse_score = sum(
        query_sparse.get(token, 0) * doc_sparse.get(token, 0)
        for token in query_sparse
    )
    return alpha * dense_score + (1 - alpha) * sparse_score
```
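A toy check of the hybrid scoring, redefining the scorer with hand-made vectors so it runs without the model (the sparse dicts map tokens to weights, as BGE-M3's lexical_weights do):

```python
import numpy as np

def compute_bge_m3_score(query_dense, doc_dense, query_sparse, doc_sparse,
                         alpha=0.5) -> float:
    # Weighted sum of dense dot product and sparse lexical overlap
    dense_score = np.dot(query_dense, doc_dense)
    sparse_score = sum(query_sparse.get(t, 0) * doc_sparse.get(t, 0)
                       for t in query_sparse)
    return alpha * dense_score + (1 - alpha) * sparse_score

q_dense = np.array([0.6, 0.8])
d_dense = np.array([0.8, 0.6])
q_sparse = {"rag": 0.9, "embedding": 0.5}
d_sparse = {"rag": 0.7, "vector": 0.3}

# dense = 0.96, sparse = 0.63 (only "rag" overlaps), blended 50/50
score = compute_bge_m3_score(q_dense, d_dense, q_sparse, d_sparse, alpha=0.5)
print(round(score, 3))  # → 0.795
```

The alpha weight is worth tuning on your own queries: keyword-heavy domains usually benefit from a lower alpha (more sparse weight).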
Choosing Dimensionality: Matryoshka Embeddings
Nomic Embed and several other models support matryoshka embeddings — you can use the first N dimensions without retraining:
```python
from openai import OpenAI

client = OpenAI()

# text-embedding-3-large with reduced dimensionality:
# 3072 → 1536 at request time, no retraining (officially supported by OpenAI)
response = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=1536,  # Reduce dimensionality
)
```
This allows you to reduce vector database RAM requirements by 2× with minimal (2–5%) quality loss.
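For self-hosted matryoshka models the same trick is a local operation: keep the first N dimensions and re-normalize. A sketch with numpy, using random vectors in place of real model output:

```python
import numpy as np

def truncate_matryoshka(embs: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions and L2-normalize the result."""
    truncated = embs[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms

# Stand-in for a batch of 768-dim nomic-embed-text-v1.5 vectors
full = np.random.default_rng(0).normal(size=(4, 768))
small = truncate_matryoshka(full, 256)
print(small.shape)                                      # (4, 256)
print(np.allclose(np.linalg.norm(small, axis=1), 1.0))  # True
```

Re-normalization matters: cosine similarity assumes unit-length vectors, and truncation alone leaves the norms below 1.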
Practical Embedding Model Selection
- If data is confidential / must stay on-premise: BGE-M3 or e5-mistral-7b (self-hosted).
- If you need the best Russian-language support: test BGE-M3, multilingual-e5-large, and text-embedding-3-large on your domain. There is no universal winner.
- If minimal latency is required: text-embedding-3-small (API) or nomic-embed-text-v1.5 (small, self-hosted).
- If you need hybrid sparse+dense without running two models: BGE-M3 is the only open-source model in this list with native support for both modes.
Evaluation on Your Domain
```python
from ragas import evaluate
from ragas.metrics import context_recall, context_precision

# Compare two models on your dataset. build_retriever and add_contexts are
# placeholders for your own pipeline: RAGAS evaluates a dataset that already
# contains the retrieved contexts for each question.
for model_name in ["text-embedding-3-small", "text-embedding-3-large"]:
    retriever = build_retriever(model_name)
    dataset = add_contexts(test_dataset, retriever)  # attach retrieved chunks
    scores = evaluate(dataset, metrics=[context_recall, context_precision])
    print(f"{model_name}: recall={scores['context_recall']:.3f}, "
          f"precision={scores['context_precision']:.3f}")
```
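If you want a dependency-free sanity check before reaching for RAGAS, retrieval recall@k reduces to set overlap between retrieved and gold chunk ids. A minimal sketch with hypothetical ids:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# Hypothetical chunk ids for one test query
retrieved = ["c12", "c07", "c33", "c01", "c90"]
relevant = {"c07", "c01", "c55"}
print(recall_at_k(retrieved, relevant, k=5))  # 2 of 3 relevant found
```

Averaged over a few hundred annotated queries, this single number is often enough to pick between two embedding models.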
Timeline
- Setting up embedding model and indexing: 2–5 days
- Comparative testing of 2–3 models: 3–5 days
- Total: 1–2 weeks