Implementing Product Deduplication When Importing From Multiple Sources
Deduplication is the most complex part of multi-supplier imports. Suppliers describe the same product differently: different SKUs, different names, different barcodes, or none at all. A naive "compare by name" approach produces 30–40% false positives and misses an equal number of real duplicates.
Duplicate Identification Strategies
Deduplication is structured sequentially: hard matches first, then fuzzy.
1. Exact match by GTIN/EAN/UPC
2. Exact match by manufacturer part number (MPN) + brand
3. Normalized name + brand
4. Fuzzy text match
5. Manual linking via interface
Each subsequent level is less reliable than the one before it, so its matches need either manual verification or a confidence threshold.
Data Normalization Before Comparison
Data must be standardized before comparison:
class ProductNormalizer
{
    public function normalizeName(string $name): string
    {
        $name = mb_strtolower($name);
        $name = preg_replace('/\s+/', ' ', $name);
        $name = trim($name);

        // Remove units in parentheses: "Cable (1m)" → "Cable 1m"
        $name = preg_replace('/\((\d+\s*[а-яa-z]+)\)/u', '$1', $name);

        // Normalize numeric values: "64 GB" → "64gb"
        $name = preg_replace('/(\d+)\s*(gb|tb|mb|мб|гб|тб)/ui', '$1$2', $name);
        $name = preg_replace('/(\d+)\s*(мгц|ghz|mhz)/ui', '$1$2', $name);

        // Stop words for tech products
        $stopWords = ['новый', 'оригинал', 'original', 'retail', 'box', 'версия'];
        foreach ($stopWords as $word) {
            $name = preg_replace('/\b' . preg_quote($word, '/') . '\b/ui', '', $name);
        }

        return trim(preg_replace('/\s+/', ' ', $name));
    }

    public function normalizeBarcode(string $barcode): string
    {
        // Convert to EAN-13: remove leading zeros, pad to 13 characters
        $barcode = preg_replace('/\D/', '', $barcode);
        $barcode = ltrim($barcode, '0');
        return str_pad($barcode, 13, '0', STR_PAD_LEFT);
    }

    public function normalizeBrand(string $brand): string
    {
        $map = [
            'самсунг' => 'samsung',
            'сяоми' => 'xiaomi',
            'эппл' => 'apple',
            'lg' => 'lg',
            'l.g.' => 'lg',
        ];
        $key = mb_strtolower(trim($brand));
        return $map[$key] ?? $key;
    }
}
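normalizeBarcode() pads whatever digits it receives, so a mistyped or placeholder code still becomes a 13-digit string. Below is a minimal sketch of a check-digit test that could run before a barcode is trusted as an exact match. The function name isValidGtin13 is hypothetical and not part of the class above; it assumes the zero-padded 13-digit form that normalizeBarcode() returns.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: verifies the GS1 check digit of a 13-digit code.
 * Expects zero-padded input such as the output of normalizeBarcode().
 */
function isValidGtin13(string $code): bool
{
    // Exactly 13 digits, and not the all-zero string produced by padding an empty value
    if (!preg_match('/^\d{13}$/', $code) || trim($code, '0') === '') {
        return false;
    }

    $sum = 0;
    for ($i = 0; $i < 12; $i++) {
        // Counting from the left, odd positions weigh 1 and even positions weigh 3
        $weight = ($i % 2 === 0) ? 1 : 3;
        $sum += (int) $code[$i] * $weight;
    }

    // The 13th digit must bring the weighted sum up to a multiple of 10
    $checkDigit = (10 - $sum % 10) % 10;
    return $checkDigit === (int) $code[12];
}

var_dump(isValidGtin13('4006381333931')); // bool(true)
var_dump(isValidGtin13('4006381333932')); // bool(false), last digit altered
```

A code that fails this test is probably safer to ignore than to use as a barcode fingerprint, so the product falls through to the weaker matching levels.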
Fingerprint Approach
Instead of comparing on the fly, compute a "fingerprint" during import and compare fingerprints:
class ProductFingerprint
{
    public function __construct(private ProductNormalizer $normalizer) {}

    public function compute(SupplierProductDTO $dto): array
    {
        $prints = [];

        // Fingerprint 1: barcode (most reliable)
        if ($dto->barcode) {
            $barcode = $this->normalizer->normalizeBarcode($dto->barcode);
            // Skip values without significant digits ("N/A", "-", "0000"): they would
            // otherwise share one fingerprint and merge unrelated products
            if (trim($barcode, '0') !== '') {
                $prints['barcode'] = 'bc:' . $barcode;
            }
        }

        // Fingerprint 2: SKU + brand (reliable only when the SKU holds the manufacturer part number)
        if ($dto->sku && $dto->brand) {
            $prints['sku_brand'] = 'sb:' . $this->normalizer->normalizeBrand($dto->brand)
                . ':' . mb_strtolower(trim($dto->sku));
        }

        // Fingerprint 3: normalized name + brand (skipped when the name is empty)
        if ($dto->brand && $dto->name) {
            $prints['name_brand'] = 'nb:' . $this->normalizer->normalizeBrand($dto->brand)
                . ':' . $this->normalizer->normalizeName($dto->name);
        }

        return $prints;
    }
}
Fingerprints table:
CREATE TABLE product_fingerprints (
    id BIGSERIAL PRIMARY KEY,
    product_id BIGINT REFERENCES products(id) ON DELETE CASCADE,
    type VARCHAR(20) NOT NULL, -- 'barcode', 'sku_brand', 'name_brand'
    value VARCHAR(500) NOT NULL,
    UNIQUE(type, value)
);
CREATE INDEX idx_fingerprints_value ON product_fingerprints(value);
Deduplication Algorithm During Import
class DeduplicationService
{
    // $this->fingerprint is the injected fingerprint calculator. ProductFingerprint below is
    // the ORM model for the product_fingerprints table; because both share a class name,
    // they must live in different namespaces or one must be aliased on import.
    public function findOrCreateProduct(SupplierProductDTO $dto): Product
    {
        $prints = $this->fingerprint->compute($dto);

        // Search by fingerprints in order of reliability
        foreach (['barcode', 'sku_brand', 'name_brand'] as $type) {
            if (!isset($prints[$type])) continue;

            $existing = ProductFingerprint::where('type', $type)
                ->where('value', $prints[$type])
                ->first();

            if ($existing) {
                // Add new fingerprints to the found product
                $this->mergeFingerprints($existing->product_id, $prints, $type);
                return $existing->product;
            }
        }

        // No exact fingerprint matched: fall back to fuzzy matching
        if ($candidate = $this->fuzzyMatch($dto)) {
            // Log every fuzzy candidate for review; only a score at or above
            // the threshold is treated as a duplicate
            $this->logFuzzyMatch($dto, $candidate);
            if ($candidate['score'] >= 0.92) {
                return $candidate['product'];
            }
        }

        // Create new product
        return $this->createNewProduct($dto, $prints);
    }
}
Fuzzy Matching
For fuzzy comparison, use the Jaro-Winkler algorithm or TF-IDF with cosine similarity. For PHP projects, yiisoft/strings is convenient, or implement your own:
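class FuzzyMatcher
{
    public function jaroWinkler(string $a, string $b): float
    {
        // Split once into characters; mb_substr() inside the loops costs O(n) per call
        $aChars = mb_str_split($a);
        $bChars = mb_str_split($b);
        $aLen = count($aChars);
        $bLen = count($bChars);
        if ($aLen === 0 || $bLen === 0) return 0.0;

        // Characters count as matching only within this distance of each other
        $maxDist = max(0, intdiv(max($aLen, $bLen), 2) - 1);
        $matches = 0;
        $aMatched = [];
        $bMatched = [];
        for ($i = 0; $i < $aLen; $i++) {
            $start = max(0, $i - $maxDist);
            $end = min($i + $maxDist + 1, $bLen);
            for ($j = $start; $j < $end; $j++) {
                if (!isset($bMatched[$j]) && $aChars[$i] === $bChars[$j]) {
                    $aMatched[$i] = true;
                    $bMatched[$j] = true;
                    $matches++;
                    break;
                }
            }
        }
        if ($matches === 0) return 0.0;

        // Transpositions: matched characters that appear in a different order
        ksort($bMatched);
        $bOrder = array_keys($bMatched);
        $k = 0;
        $outOfOrder = 0;
        foreach (array_keys($aMatched) as $i) {
            if ($aChars[$i] !== $bChars[$bOrder[$k++]]) $outOfOrder++;
        }
        $transpositions = $outOfOrder / 2;

        // Jaro similarity
        $jaro = ($matches / $aLen + $matches / $bLen
            + ($matches - $transpositions) / $matches) / 3;

        // Winkler prefix bonus: up to 4 common leading characters, scaling factor 0.1
        $prefix = 0;
        for ($i = 0; $i < min(4, $aLen, $bLen); $i++) {
            if ($aChars[$i] === $bChars[$i]) $prefix++;
            else break;
        }
        return $jaro + $prefix * 0.1 * (1 - $jaro);
    }
}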
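The fuzzyMatch() step called during import is not shown in the text. One possible shape, sketched here under assumptions: score the incoming normalized name against a pre-filtered candidate list (for example, products of the same brand) and keep the best result above the review threshold. The function name bestFuzzyCandidate, its parameters, and the returned array shape are hypothetical. The scorer is passed in as a callable, so a Jaro-Winkler implementation can be plugged in; the demo uses PHP's built-in similar_text() only as a stand-in.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: picks the best-scoring candidate for a normalized name.
 *
 * @param string   $name       normalized name of the incoming product
 * @param array    $candidates map of product id => normalized name
 * @param callable $similarity function(string, string): float in the range 0.0..1.0
 * @param float    $minScore   candidates scoring below this are ignored
 * @return array{product_id: int, score: float}|null
 */
function bestFuzzyCandidate(
    string $name,
    array $candidates,
    callable $similarity,
    float $minScore = 0.75
): ?array {
    $best = null;
    foreach ($candidates as $productId => $candidateName) {
        $score = (float) $similarity($name, $candidateName);
        if ($score >= $minScore && ($best === null || $score > $best['score'])) {
            $best = ['product_id' => (int) $productId, 'score' => $score];
        }
    }
    return $best;
}

// Stand-in scorer for the demo: similar_text() percentage scaled to 0..1.
$scorer = static function (string $a, string $b): float {
    similar_text($a, $b, $percent);
    return $percent / 100;
};

$match = bestFuzzyCandidate(
    'samsung galaxy s21 128gb',
    [10 => 'samsung galaxy s21 128gb black', 11 => 'samsung galaxy a52 128gb'],
    $scorer
);
var_dump($match);
```

Restricting the candidate list before scoring matters for the processing-time budget: comparing every incoming product against the whole catalogue grows linearly with catalogue size.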
Manual Review Queue
Products in the "gray zone" (score from 0.75 up to, but not including, 0.92) go to manual moderation:
CREATE TABLE dedup_review_queue (
    id BIGSERIAL PRIMARY KEY,
    new_dto JSONB NOT NULL,
    candidate_id BIGINT REFERENCES products(id),
    score FLOAT,
    match_type VARCHAR(20), -- 'fuzzy_name', 'fuzzy_sku'
    status VARCHAR(20) DEFAULT 'pending', -- 'pending','merged','rejected'
    reviewed_by INT REFERENCES users(id),
    created_at TIMESTAMP DEFAULT NOW()
);
The moderation interface shows two products side by side with highlighted matches — the operator confirms or rejects the merge with one click.
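The three score bands implied by the text (auto-merge at 0.92 and above, review from 0.75, new product below that) can be sketched as a small routing function. The function name, constant names, and outcome labels are hypothetical; only the two thresholds come from the article.

```php
<?php
declare(strict_types=1);

// Thresholds from the text: 0.92 for auto-merge, 0.75 as the lower edge of the gray zone.
const AUTO_MERGE_SCORE = 0.92;
const REVIEW_SCORE = 0.75;

/**
 * Hypothetical helper: maps a fuzzy score to one of three outcomes.
 * Returns 'merge', 'review' or 'create'.
 */
function routeFuzzyMatch(float $score): string
{
    if ($score >= AUTO_MERGE_SCORE) {
        return 'merge';   // confident duplicate: link to the existing product
    }
    if ($score >= REVIEW_SCORE) {
        return 'review';  // gray zone: insert a row into dedup_review_queue
    }
    return 'create';      // too dissimilar: create a new product
}

foreach ([0.97, 0.92, 0.80, 0.60] as $score) {
    echo $score, ' => ', routeFuzzyMatch($score), PHP_EOL;
}
```

Keeping both thresholds in one place makes it easier to tune them against the precision and manual-queue targets.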
Quality Metrics
Track these:
| Metric | Target |
|---|---|
| Precision (share of correct merges) | > 98% for barcode, > 95% for sku_brand |
| Recall (share of found duplicates) | > 85% |
| Share of products in manual queue | < 5% of imports |
| Processing time per product | < 50 ms |
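Precision and recall in the table can be computed from an audited sample of import decisions. This is a minimal sketch, assuming each audited decision has already been classified as a correct merge, a wrong merge, or a missed duplicate. The function name and the sample numbers are illustrative, not measured results.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: precision and recall from audited merge decisions.
 *
 * @param int $correctMerges    merges confirmed as true duplicates (true positives)
 * @param int $wrongMerges      merges of products that were actually different (false positives)
 * @param int $missedDuplicates duplicates imported as new products (false negatives)
 * @return array{precision: ?float, recall: ?float} null when the denominator is zero
 */
function dedupQuality(int $correctMerges, int $wrongMerges, int $missedDuplicates): array
{
    $merged = $correctMerges + $wrongMerges;
    $actualDuplicates = $correctMerges + $missedDuplicates;

    return [
        'precision' => $merged > 0 ? $correctMerges / $merged : null,
        'recall'    => $actualDuplicates > 0 ? $correctMerges / $actualDuplicates : null,
    ];
}

// Illustrative audit sample: 490 correct merges, 10 wrong merges, 60 missed duplicates.
$q = dedupQuality(490, 10, 60);
printf("precision=%.3f recall=%.3f\n", $q['precision'], $q['recall']);
// precision=0.980 recall=0.891
```

Computing the two figures separately per fingerprint type (barcode, sku_brand, name_brand, fuzzy) shows which matching level is responsible when a target is missed.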
Timeline
- Normalizer + fingerprint scheme + exact match: 2 days
- Fuzzy matching (Jaro-Winkler): 1 day
- Manual moderation queue + interface: 2 days
- Metrics + decision logging: 1 day
Total: 5–6 business days.







