Product Deduplication on Import from Multiple Sources

Implementing Product Deduplication When Importing From Multiple Sources

Deduplication is the most complex part of multi-supplier imports. Suppliers describe the same product differently: different SKUs, different names, different barcodes, or none at all. A naive "compare by name" approach produces 30–40% false positives and misses an equal number of real duplicates.

Duplicate Identification Strategies

Deduplication is structured sequentially: hard matches first, then fuzzy.

1. Exact match by GTIN/EAN/UPC
2. Exact match by manufacturer part number (MPN) + brand
3. Normalized name + brand
4. Fuzzy text match
5. Manual linking via interface

Each level is less reliable than the one before it, so matches from the lower levels need either manual verification or a conservative similarity threshold.

Data Normalization Before Comparison

Data must be standardized before comparison:

class ProductNormalizer
{
    public function normalizeName(string $name): string
    {
        $name = mb_strtolower($name);
        $name = preg_replace('/\s+/', ' ', $name);
        $name = trim($name);

        // Drop the parentheses around units: "cable (1m)" → "cable 1m"
        $name = preg_replace('/\((\d+\s*[а-яa-z]+)\)/u', '$1', $name);

        // Normalize numeric values: "64 GB" → "64gb"
        $name = preg_replace('/(\d+)\s*(gb|tb|mb|мб|гб|тб)/ui', '$1$2', $name);
        $name = preg_replace('/(\d+)\s*(мгц|ghz|mhz)/ui', '$1$2', $name);

        // Stop words for tech products
        $stopWords = ['новый', 'оригинал', 'original', 'retail', 'box', 'версия'];
        foreach ($stopWords as $word) {
            $name = preg_replace('/\b' . preg_quote($word, '/') . '\b/ui', '', $name);
        }

        return trim(preg_replace('/\s+/', ' ', $name));
    }

    public function normalizeBarcode(string $barcode): string
    {
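        // Canonical 13-digit form: keep digits only, strip leading zeros, left-pad with zeros
        // (a 12-digit UPC-A becomes its EAN-13 equivalent). The check digit is not validated here.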
        $barcode = preg_replace('/\D/', '', $barcode);
        $barcode = ltrim($barcode, '0');
        return str_pad($barcode, 13, '0', STR_PAD_LEFT);
    }

    public function normalizeBrand(string $brand): string
    {
        $map = [
            'самсунг' => 'samsung',
            'сяоми'   => 'xiaomi',
            'эппл'    => 'apple',
            'lg'      => 'lg',
            'l.g.'    => 'lg',
        ];
        $key = mb_strtolower(trim($brand));
        return $map[$key] ?? $key;
    }
}
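
normalizeBarcode reshapes the digits but does not check that they form a valid code, so a mistyped barcode still produces a "reliable" fingerprint. One possible safeguard is to verify the GS1 check digit before trusting the value. The function below is a minimal sketch; isValidGtin is a hypothetical name and is not part of the original class.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of ProductNormalizer): verifies the GS1
 * mod-10 check digit of an 8-, 12-, 13- or 14-digit code.
 */
function isValidGtin(string $digits): bool
{
    if (!preg_match('/^(\d{8}|\d{12}|\d{13}|\d{14})$/', $digits)) {
        return false;
    }

    $body  = substr($digits, 0, -1);
    $check = (int) substr($digits, -1);

    // Weights alternate 3, 1, 3, ... starting from the rightmost body digit
    $sum = 0;
    $weight = 3;
    for ($i = strlen($body) - 1; $i >= 0; $i--) {
        $sum += (int) $body[$i] * $weight;
        $weight = 4 - $weight; // toggles between 3 and 1
    }

    return (10 - $sum % 10) % 10 === $check;
}

var_dump(isValidGtin('4006381333931')); // valid EAN-13
var_dump(isValidGtin('4006381333932')); // wrong check digit
```

Because the weights are counted from the right, left-padding with zeros does not change the result, so the 13-digit output of normalizeBarcode can be passed in directly.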

Fingerprint Approach

Instead of comparing on the fly, compute a "fingerprint" during import and compare fingerprints:

class ProductFingerprint
{
    public function __construct(private ProductNormalizer $normalizer) {}

    public function compute(SupplierProductDTO $dto): array
    {
        $prints = [];

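        // Fingerprint 1: barcode (most reliable)
        if ($dto->barcode) {
            $barcode = $this->normalizer->normalizeBarcode($dto->barcode);
            // Skip values with no non-zero digit ("N/A", "0000"): they would all collide on one fingerprint
            if (trim($barcode, '0') !== '') {
                $prints['barcode'] = 'bc:' . $barcode;
            }
        }

        // Fingerprint 2: manufacturer part number + brand.
        // Assumes $dto->sku carries the manufacturer's part number; a supplier's internal SKU will not match across suppliers.
        if ($dto->sku && $dto->brand) {
            $prints['sku_brand'] = 'sb:' . $this->normalizer->normalizeBrand($dto->brand)
                . ':' . mb_strtolower(trim($dto->sku));
        }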

        // Fingerprint 3: normalized name + brand
        if ($dto->brand) {
            $prints['name_brand'] = 'nb:' . $this->normalizer->normalizeBrand($dto->brand)
                . ':' . $this->normalizer->normalizeName($dto->name);
        }

        return $prints;
    }
}

Fingerprints table:

CREATE TABLE product_fingerprints (
    id          BIGSERIAL PRIMARY KEY,
    product_id  BIGINT NOT NULL REFERENCES products(id) ON DELETE CASCADE,
    type        VARCHAR(20) NOT NULL,  -- 'barcode', 'sku_brand', 'name_brand'
    value       VARCHAR(500) NOT NULL,
    UNIQUE(type, value)  -- its index already serves lookups by (type, value)
);
-- Speeds up cascade deletes and listing all fingerprints of one product
CREATE INDEX idx_fingerprints_product ON product_fingerprints(product_id);
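
The deduplication service calls mergeFingerprints without showing its body. One way to write it is a single multi-row INSERT that relies on the UNIQUE(type, value) constraint and skips pairs that are already stored. This is a sketch under assumptions: buildFingerprintInsert is a hypothetical helper, and ON CONFLICT requires PostgreSQL 9.5 or later.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: builds one INSERT that attaches fingerprints to a
 * product and ignores any (type, value) pair that already exists.
 *
 * @param array<string,string> $prints type => value, as returned by compute()
 * @return array{0: string, 1: array} SQL text and positional parameters
 */
function buildFingerprintInsert(int $productId, array $prints): array
{
    if ($prints === []) {
        throw new InvalidArgumentException('No fingerprints to store');
    }

    $rows = [];
    $params = [];
    foreach ($prints as $type => $value) {
        $rows[] = '(?, ?, ?)';
        array_push($params, $productId, $type, $value);
    }

    $sql = 'INSERT INTO product_fingerprints (product_id, type, value) VALUES '
        . implode(', ', $rows)
        . ' ON CONFLICT (type, value) DO NOTHING';

    return [$sql, $params];
}

[$sql, $params] = buildFingerprintInsert(42, [
    'barcode'   => 'bc:4006381333931',
    'sku_brand' => 'sb:samsung:sm-a546e',
]);
// With PDO: $pdo->prepare($sql)->execute($params);
```

DO NOTHING keeps the first owner of a fingerprint. If the same fingerprint already points to a different product, that is a conflict worth logging, not something to overwrite silently.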

Deduplication Algorithm During Import

class DeduplicationService
{
    public function __construct(private ProductFingerprint $fingerprint) {}

    public function findOrCreateProduct(SupplierProductDTO $dto): Product
    {
        $prints = $this->fingerprint->compute($dto);

        // Search by fingerprints in order of reliability
        foreach (['barcode', 'sku_brand', 'name_brand'] as $type) {
            if (!isset($prints[$type])) continue;

            // ProductFingerprintRecord (assumed name) is the Eloquent model for product_fingerprints;
            // it needs a name different from the ProductFingerprint service class above
            $existing = ProductFingerprintRecord::where('type', $type)
                ->where('value', $prints[$type])
                ->first();

            if ($existing) {
                // Add new fingerprints to the found product
                $this->mergeFingerprints($existing->product_id, $prints, $type);
                return $existing->product;
            }
        }

        // No exact fingerprint match: fall back to fuzzy matching
        if ($candidate = $this->fuzzyMatch($dto)) {
            // Log every fuzzy candidate for review; merge automatically only at score >= 0.92
            $this->logFuzzyMatch($dto, $candidate);
            if ($candidate['score'] >= 0.92) {
                return $candidate['product'];
            }
        }

        // Create new product
        return $this->createNewProduct($dto, $prints);
    }
}

Fuzzy Matching

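For fuzzy comparison, use the Jaro-Winkler algorithm or TF-IDF with cosine similarity. PHP's built-in levenshtein() and similar_text() compare bytes rather than characters, which distorts scores for Cyrillic names, so either take a multibyte-aware library or implement the comparison yourself: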

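class FuzzyMatcher
{
    public function jaroWinkler(string $a, string $b): float
    {
        // Split into characters once: mb_substr() inside nested loops is slow
        $ca = mb_str_split($a);
        $cb = mb_str_split($b);
        $lenA = count($ca);
        $lenB = count($cb);
        if ($lenA === 0 || $lenB === 0) return 0.0;

        // Matching window; clamped at 0 so one-character strings still compare
        $maxDist = max(0, intdiv(max($lenA, $lenB), 2) - 1);
        $aMatched = [];
        $bMatched = [];

        for ($i = 0; $i < $lenA; $i++) {
            $start = max(0, $i - $maxDist);
            $end   = min($i + $maxDist + 1, $lenB);

            for ($j = $start; $j < $end; $j++) {
                if (!isset($bMatched[$j]) && $ca[$i] === $cb[$j]) {
                    $aMatched[$i] = $ca[$i];
                    $bMatched[$j] = $cb[$j];
                    break;
                }
            }
        }

        $matches = count($aMatched);
        if ($matches === 0) return 0.0;

        // Transpositions: matched characters that appear in a different order
        ksort($bMatched);
        $bSeq = array_values($bMatched);
        $outOfOrder = 0;
        foreach (array_values($aMatched) as $k => $char) {
            if ($char !== $bSeq[$k]) $outOfOrder++;
        }
        $transpositions = intdiv($outOfOrder, 2);

        // Jaro similarity
        $jaro = ($matches / $lenA + $matches / $lenB
            + ($matches - $transpositions) / $matches) / 3;

        // Winkler prefix bonus: up to 4 common leading characters, scaling factor 0.1
        $prefix = 0;
        for ($i = 0; $i < min(4, $lenA, $lenB); $i++) {
            if ($ca[$i] === $cb[$i]) $prefix++;
            else break;
        }

        return $jaro + $prefix * 0.1 * (1 - $jaro);
    }
}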
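
The TF-IDF alternative mentioned above can be sketched in a few functions. Unlike Jaro-Winkler, it ignores word order, which helps when suppliers list the same attributes in a different sequence. All function names here are illustrative, the IDF uses a common smoothed form, and the three-name corpus is made up for the example.

```php
<?php
declare(strict_types=1);

/** Splits an already normalized name into word tokens. */
function tokens(string $name): array
{
    return preg_split('/\s+/u', trim($name), -1, PREG_SPLIT_NO_EMPTY);
}

/** Smoothed inverse document frequency for every token in the corpus. */
function inverseDocumentFrequency(array $corpus): array
{
    $df = [];
    foreach ($corpus as $name) {
        foreach (array_unique(tokens($name)) as $t) {
            $df[$t] = ($df[$t] ?? 0) + 1;
        }
    }
    $n = count($corpus);
    $idf = [];
    foreach ($df as $t => $count) {
        $idf[$t] = log((1 + $n) / (1 + $count)) + 1.0;
    }
    return $idf;
}

/** TF-IDF weights for one name; tokens unseen in the corpus get IDF 1.0. */
function tfidfVector(string $name, array $idf): array
{
    $vec = [];
    foreach (array_count_values(tokens($name)) as $t => $tf) {
        $vec[$t] = $tf * ($idf[$t] ?? 1.0);
    }
    return $vec;
}

/** Cosine of the angle between two sparse vectors (token => weight). */
function cosineSimilarity(array $u, array $v): float
{
    $dot = 0.0;
    foreach ($u as $t => $w) {
        $dot += $w * ($v[$t] ?? 0.0);
    }
    $norm = sqrt(array_sum(array_map(fn($w) => $w * $w, $u)))
          * sqrt(array_sum(array_map(fn($w) => $w * $w, $v)));
    return $norm > 0.0 ? $dot / $norm : 0.0;
}

$corpus = [
    'samsung galaxy a54 128gb black',
    'samsung galaxy a54 256gb black',
    'xiaomi redmi note 12 128gb blue',
];
$idf = inverseDocumentFrequency($corpus);
$query = tfidfVector('galaxy a54 samsung 128gb black', $idf);

$same  = cosineSimilarity($query, tfidfVector($corpus[0], $idf)); // same words, other order
$other = cosineSimilarity($query, tfidfVector($corpus[2], $idf)); // shares only "128gb"
```

In a real catalog the IDF table would be computed over all product names and cached, since recomputing it per import row would not fit a per-product time budget.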

Manual Review Queue

Products in the "gray zone" (score 0.75–0.92) go to manual moderation:

CREATE TABLE dedup_review_queue (
    id              BIGSERIAL PRIMARY KEY,
    new_dto         JSONB NOT NULL,
    candidate_id    BIGINT REFERENCES products(id),
    score           FLOAT,
    match_type      VARCHAR(20),  -- 'fuzzy_name', 'fuzzy_sku'
    status          VARCHAR(20) DEFAULT 'pending',  -- 'pending','merged','rejected'
    reviewed_by     INT REFERENCES users(id),
    created_at      TIMESTAMP DEFAULT NOW()
);

The moderation interface shows the two products side by side with the matching fields highlighted, and the operator confirms or rejects the merge with one click.
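
The routing between automatic merge, manual review and a new product can be kept in one small function so that the thresholds live in a single place. This is a sketch: routeFuzzyMatch and buildReviewRow are hypothetical names, the thresholds are the ones quoted in this article, and the supplier record is assumed to be available as a plain array.

```php
<?php
declare(strict_types=1);

// >= 0.92 merges automatically, 0.75 up to 0.92 is the gray zone,
// anything lower is treated as a new product.
const AUTO_MERGE_THRESHOLD = 0.92;
const REVIEW_THRESHOLD     = 0.75;

/** Maps a fuzzy score to 'merge', 'review' or 'create'. */
function routeFuzzyMatch(float $score): string
{
    if ($score >= AUTO_MERGE_THRESHOLD) return 'merge';
    if ($score >= REVIEW_THRESHOLD) return 'review';
    return 'create';
}

/** Builds a row for dedup_review_queue from a supplier record and a candidate. */
function buildReviewRow(array $dto, int $candidateId, float $score, string $matchType): array
{
    return [
        'new_dto'      => json_encode($dto, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR),
        'candidate_id' => $candidateId,
        'score'        => round($score, 4),
        'match_type'   => $matchType,
        'status'       => 'pending',
    ];
}

$row = buildReviewRow(['name' => 'galaxy a54 128gb', 'brand' => 'samsung'], 17, 0.8312, 'fuzzy_name');
```

A record routed to 'review' still needs a product to attach its offer to. Whether it gets a temporary new product or waits for the operator's decision is a design choice this sketch leaves open.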

Quality Metrics

Track these:

Metric                                  Target
Precision (share of correct merges)     > 98% for barcode, > 95% for sku_brand
Recall (share of duplicates found)      > 85%
Share of products in manual queue       < 5% of imported products
Processing time per product             < 50 ms
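
Precision and recall can be computed from an audited sample of merge decisions. The sketch below assumes the three counts come from such an audit; the function name and the numbers in the usage example are illustrative and not measurements.

```php
<?php
declare(strict_types=1);

/**
 * @param int $truePositives  merges confirmed as correct
 * @param int $falsePositives merges that joined different products
 * @param int $falseNegatives real duplicates that were left unmerged
 * @return array{precision: ?float, recall: ?float} null where undefined
 */
function dedupQuality(int $truePositives, int $falsePositives, int $falseNegatives): array
{
    $merged = $truePositives + $falsePositives;  // everything the system merged
    $actual = $truePositives + $falseNegatives;  // everything it should have merged

    return [
        'precision' => $merged > 0 ? $truePositives / $merged : null,
        'recall'    => $actual > 0 ? $truePositives / $actual : null,
    ];
}

// Illustrative counts only
$q = dedupQuality(truePositives: 970, falsePositives: 10, falseNegatives: 120);
printf("precision=%.3f recall=%.3f\n", $q['precision'], $q['recall']);
```

Computing the figures separately per match type (barcode, sku_brand, name_brand, fuzzy) makes it possible to compare them against the per-type targets in the table.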

Timeline

  • Normalizer + fingerprint scheme + exact match: 2 days
  • Fuzzy matching (Jaro-Winkler): 1 day
  • Manual moderation queue + interface: 2 days
  • Metrics + decision logging: 1 day

Total: 5–6 business days.