Implementing Automatic Product Matching Across Different Suppliers
Matching is the task of establishing correspondence between items from different suppliers that describe one physical product. Unlike deduplication (eliminating obvious duplicates in one catalog), matching works with initially different nomenclature systems. Supplier A calls a product "Smartphone Samsung S24 256GB", supplier B calls it "SAMSUNG Galaxy S24 (256 GB) Black SM-S921B". It's one product, but without matching, the system creates two cards.
Matching Methods
Methods are used sequentially — from strict to soft:
1. Exact Identifiers
- EAN/GTIN — most reliable, covers 40–60% of electronics
- Manufacturer part number (MPN) + brand — additional 20–30%
- ISBN for books, ASIN for Amazon-compatible catalogs
2. Structured Attributes
- Brand + model + key characteristics (capacity, color, size)
- Works for standardized categories (electronics, clothing)
3. Fuzzy Text
- Jaro-Winkler / Levenshtein on normalized names
- TF-IDF + cosine similarity on descriptions
- Covers non-standardized categories
4. Vector Matching (ML)
- Embeddings via sentence-transformers or OpenAI API
- Effective for products with complex descriptions
Data Schema
-- Matching table
CREATE TABLE product_matches (
id BIGSERIAL PRIMARY KEY,
master_id BIGINT NOT NULL REFERENCES products(id),
supplier_id INT NOT NULL REFERENCES suppliers(id),
supplier_sku VARCHAR(255) NOT NULL,
match_method VARCHAR(30) NOT NULL, -- 'gtin', 'mpn_brand', 'fuzzy', 'ml', 'manual'
confidence FLOAT, -- 0.0–1.0
status VARCHAR(20) DEFAULT 'active', -- 'active', 'rejected', 'pending_review'
created_at TIMESTAMP DEFAULT NOW(),
UNIQUE(supplier_id, supplier_sku)
);
CREATE INDEX idx_matches_master ON product_matches(master_id);
CREATE INDEX idx_matches_confidence ON product_matches(confidence) WHERE status = 'pending_review';
Matching Pipeline
class ProductMatchingPipeline
{
private array $matchers = [];
public function __construct(
private GtinMatcher $gtinMatcher,
private MpnBrandMatcher $mpnBrandMatcher,
private FuzzyMatcher $fuzzyMatcher,
private VectorMatcher $vectorMatcher,
) {
$this->matchers = [
['matcher' => $gtinMatcher, 'threshold' => 1.0, 'auto_accept' => true],
['matcher' => $mpnBrandMatcher, 'threshold' => 1.0, 'auto_accept' => true],
['matcher' => $fuzzyMatcher, 'threshold' => 0.90, 'auto_accept' => true],
['matcher' => $vectorMatcher, 'threshold' => 0.85, 'auto_accept' => false],
];
}
public function match(SupplierProductDTO $dto): MatchResult
{
foreach ($this->matchers as $config) {
$result = $config['matcher']->find($dto);
if (!$result) continue;
if ($result->confidence >= $config['threshold'] && $config['auto_accept']) {
return new MatchResult(
masterProductId: $result->productId,
confidence: $result->confidence,
method: $result->method,
status: 'active',
);
}
if ($result->confidence >= 0.70) {
// Send to manual review queue
return new MatchResult(
masterProductId: $result->productId,
confidence: $result->confidence,
method: $result->method,
status: 'pending_review',
);
}
}
// Not found — create new master product
return new MatchResult(masterProductId: null, confidence: 0, method: 'none', status: 'new');
}
}
GTIN Matcher
class GtinMatcher
{
public function find(SupplierProductDTO $dto): ?MatchCandidate
{
if (!$dto->barcode) return null;
$normalized = $this->normalizeGtin($dto->barcode);
// Search in known fingerprints
$fingerprint = ProductFingerprint::where('type', 'gtin')
->where('value', $normalized)
->first();
if (!$fingerprint) return null;
return new MatchCandidate(
productId: $fingerprint->product_id,
confidence: 1.0,
method: 'gtin',
);
}
private function normalizeGtin(string $raw): string
{
$digits = preg_replace('/\D/', '', $raw);
// EAN-8 → EAN-13
if (strlen($digits) === 8) {
$digits = str_pad($digits, 13, '0', STR_PAD_LEFT);
}
return $digits;
}
}
Vector Matcher via OpenAI Embeddings
class VectorMatcher
{
public function find(SupplierProductDTO $dto): ?MatchCandidate
{
$text = $this->buildText($dto);
// Get embedding for new product
$vector = $this->openai->embeddings()->create([
'model' => 'text-embedding-3-small',
'input' => $text,
])->embeddings[0]->embedding;
// Find nearest neighbor in pgvector
$result = DB::selectOne("
SELECT product_id, 1 - (embedding <=> :vec) AS similarity
FROM product_embeddings
WHERE 1 - (embedding <=> :vec) > 0.80
ORDER BY embedding <=> :vec
LIMIT 1
", ['vec' => '[' . implode(',', $vector) . ']']);
if (!$result) return null;
return new MatchCandidate(
productId: $result->product_id,
confidence: (float) $result->similarity,
method: 'vector',
);
}
private function buildText(SupplierProductDTO $dto): string
{
return implode(' ', array_filter([
$dto->brand,
$dto->name,
$dto->sku,
implode(' ', array_values($dto->attributes)),
]));
}
}
PostgreSQL requires the pgvector extension:
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE product_embeddings (
product_id BIGINT PRIMARY KEY REFERENCES products(id),
embedding vector(1536), -- OpenAI text-embedding-3-small
updated_at TIMESTAMP
);
CREATE INDEX idx_embeddings_cosine ON product_embeddings
USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
Manual Review Interface
Products with pending_review status go to moderator queue. Interface shows:
- Left — supplier product (name, SKU, photo)
- Right — catalog candidate with match percentage
- Buttons: Confirm, Reject, Find Another
- Hotkeys for speed (→ accept, ← reject)
An experienced moderator processes 100–150 pairs per hour.
Feedback Loop for Model Improvement
Each moderator decision is a training example:
class MatchFeedbackService
{
public function recordDecision(int $matchId, string $decision, int $userId): void
{
$match = ProductMatch::findOrFail($matchId);
$match->update([
'status' => $decision === 'accept' ? 'active' : 'rejected',
'reviewed_by' => $userId,
]);
// Save for retraining
MatchTrainingExample::create([
'supplier_product_data' => $match->supplierProduct->toArray(),
'master_product_id' => $match->master_id,
'label' => $decision === 'accept' ? 1 : 0,
'confidence_was' => $match->confidence,
]);
// If rejected — create new master product
if ($decision === 'reject') {
$this->createNewMaster($match->supplierProduct);
}
}
}
Performance
With 100,000+ position catalogs, matching can't run by iterating all pairs. Optimizations:
- Blocking: first filter candidates by brand/category, then match only within block
- Batch embeddings: request vectors in batches of 100 from OpenAI
- pgvector IVFFlat index: approximate nearest neighbor in milliseconds
Timeline
- GtinMatcher + MpnBrandMatcher + FuzzyMatcher: 2 days
- VectorMatcher + pgvector: 2 days
- Pipeline + manual review queue + interface: 2–3 days
- Feedback loop + metrics: 1 day
Total: 7–8 business days.







