Implementing Automatic Product Supplier Catalog Mapping
Supplier sends price with 50,000 items. Each has SKU, name and category in supplier system. Need to understand: which are already in site catalog, which need to be created, which are duplicates under different SKU. This is matching task (mapping), solved by combination of deterministic rules and fuzzy search.
Matching Levels
Mapping works layer by layer — from exact to approximate:
- Exact SKU match — most reliable way
- EAN/barcode match — if supplier provides it
- Normalized name match — after cleanup from extra characters
- Fuzzy match (Jaro-Winkler or Levenshtein) — for everything else
- Manual matching — for what wasn't found automatically
class ProductMatcher
{
/** @return MatchResult */
public function match(SupplierProduct $sp): MatchResult
{
// Level 1: exact SKU
if ($p = Product::where('sku', $sp->article)->first()) {
return MatchResult::exact($p->id, 'sku');
}
// Level 2: EAN
if ($sp->ean && $p = Product::where('ean', $sp->ean)->first()) {
return MatchResult::exact($p->id, 'ean');
}
// Level 3: normalized name
$normalized = $this->normalize($sp->name);
if ($p = Product::where('name_normalized', $normalized)->first()) {
return MatchResult::exact($p->id, 'name_normalized');
}
// Level 4: fuzzy
$candidate = $this->fuzzySearch($normalized);
if ($candidate && $candidate->score >= 0.88) {
return MatchResult::fuzzy($candidate->id, $candidate->score);
}
return MatchResult::unmatched();
}
}
Name Normalization
Before comparing, bring strings to uniform format:
private function normalize(string $name): string
{
$name = mb_strtolower($name);
$name = preg_replace('/[\s\-\_\/]+/', ' ', $name); // spaces
$name = preg_replace('/[^\p{L}\p{N}\s]/u', '', $name); // special chars
$name = preg_replace('/\b(art|code|ref|no)\b\.?\s*/iu', '', $name); // service words
return trim($name);
}
Normalized value stored in separate indexed field name_normalized — don't recalculate on each match.
Fuzzy Search via PostgreSQL
For fuzzy search in PostgreSQL connect pg_trgm extension:
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX products_name_trgm_idx ON products USING gin (name_normalized gin_trgm_ops);
Query for similar:
SELECT id, name_normalized,
similarity(name_normalized, :query) AS score
FROM products
WHERE similarity(name_normalized, :query) > 0.7
ORDER BY score DESC
LIMIT 5;
In PHP via Eloquent:
private function fuzzySearch(string $query): ?object
{
return DB::selectOne(
"SELECT id, similarity(name_normalized, ?) AS score
FROM products
WHERE similarity(name_normalized, ?) > 0.7
ORDER BY score DESC
LIMIT 1",
[$query, $query]
);
}
Storing Mapping
CREATE TABLE supplier_product_mapping (
id serial PRIMARY KEY,
supplier_id int NOT NULL,
supplier_sku varchar(100) NOT NULL,
product_id int REFERENCES products(id),
match_type varchar(20), -- exact_sku | exact_ean | fuzzy | manual | new
match_score float, -- for fuzzy
confirmed boolean DEFAULT false,
confirmed_by int, -- user_id
confirmed_at timestamptz,
created_at timestamptz DEFAULT now(),
UNIQUE (supplier_id, supplier_sku)
);
Confirmed mappings (confirmed = true) used directly. Unconfirmed fuzzy mappings require operator review.
Unmatched Items Processing Workflow
MatchResult::unmatched()
└─> check: similar product exists but score < 0.88?
├─> YES → create mapping (confirmed=false) + notify operator
└─> NO → create draft product or skip
Operator in admin sees list of confirmed=false with suggested candidates and "Accept" / "Reject" / "Find Another" buttons.
Supplier Category Mapping
Supplier uses own categories — need to map to site category tree:
CREATE TABLE supplier_category_mapping (
supplier_id int,
supplier_category varchar(200),
site_category_id int REFERENCES categories(id),
created_at timestamptz DEFAULT now(),
PRIMARY KEY (supplier_id, supplier_category)
);
After initial manual category mapping, new arriving products automatically go to correct site category.
Duplicate Detection
Separate task — find items under different supplier SKUs but same product in reality:
class DuplicateDetector
{
public function findDuplicatePairs(int $supplierId): array
{
// Find items with same EAN from one supplier
return DB::select(
"SELECT a.supplier_sku AS sku_a, b.supplier_sku AS sku_b, a.ean
FROM supplier_products a
JOIN supplier_products b ON a.ean = b.ean AND a.supplier_sku < b.supplier_sku
WHERE a.supplier_id = ? AND b.supplier_id = ?",
[$supplierId, $supplierId]
);
}
}
Performance with Large Catalogs
With 100,000+ items, full fuzzy pass is too slow. Optimizations:
- First batch exact match query (single SQL with
IN (...)) - Fuzzy search only for remaining unmatched
- Cache
sku → product_idmapping in Redis before processing - Partition processing into Queue chunks
// Pre-load known mappings into memory
$knownMappings = SupplierProductMapping::where([
'supplier_id' => $supplierId,
'confirmed' => true,
])->pluck('product_id', 'supplier_sku')->all();
// Later — O(1) lookup by SKU
Implementation Timeline
- Exact match by SKU/EAN, mapping storage, new drafts — 3 days
- Normalization + fuzzy via pg_trgm + confirmation queue — +2 days
- Category mapping, duplicate detector, admin UI for review — +2–3 days







