Implementing Deduplication of Scraped Data
Scraping multiple sources inevitably leads to duplicates: one product is present on the manufacturer's website, in three distributor catalogs, and on a marketplace. Naive comparison by URL or name works poorly — smarter approaches are needed.
Deduplication Levels
Level 1 — Exact Match. By a normalized key: SKU, EAN/GTIN, manufacturer part number. The most reliable approach; works wherever a unique identifier exists.
import re

def normalize_sku(raw_sku: str) -> str:
    # remove spaces, hyphens, underscores, slashes; convert to uppercase
    return re.sub(r'[\s\-_/]', '', raw_sku).upper()
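Once keys are normalized, exact dedup reduces to keeping one record per key. A minimal sketch (the `sku` field and the keep-first policy are illustrative):

```python
import re

def normalize_sku(raw_sku: str) -> str:
    # same normalization as above: strip separators, uppercase
    return re.sub(r'[\s\-_/]', '', raw_sku).upper()

def dedup_by_sku(records: list[dict]) -> list[dict]:
    seen: dict[str, dict] = {}
    for rec in records:
        # keep the first record seen for each normalized key
        seen.setdefault(normalize_sku(rec["sku"]), rec)
    return list(seen.values())
```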
Level 2 — Content Hashing. For content (articles, descriptions) — normalize the text and compute a hash.
import hashlib

def content_hash(text: str) -> str:
    normalized = ' '.join(text.lower().split())  # lowercase, collapse whitespace
    return hashlib.sha256(normalized.encode()).hexdigest()
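Two texts that differ only in case or whitespace then collapse to the same hash:

```python
import hashlib

def content_hash(text: str) -> str:
    normalized = ' '.join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

# case and whitespace differences do not produce distinct hashes
a = content_hash("Great   Product\n Description")
b = content_hash("great product description")
# a == b
```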
Level 3 — Fuzzy Matching. For products without an explicit SKU — compare names using Levenshtein distance or Token Sort/Token Set Ratio algorithms.
from rapidfuzz import fuzz, process

def find_duplicate(new_title: str, existing_titles: list[str], threshold=85):
    # extractOne returns a (match, score, index) tuple, or None if there are no choices
    result = process.extractOne(
        new_title,
        existing_titles,
        scorer=fuzz.token_sort_ratio
    )
    if result and result[1] >= threshold:
        return result[0]
    return None
token_sort_ratio sorts words before comparison — works well with word reordering in product names.
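The same idea can be approximated with the standard library alone (sort the tokens, then compare), using difflib's SequenceMatcher as a stand-in for rapidfuzz's scorer:

```python
from difflib import SequenceMatcher

def token_sort_ratio(a: str, b: str) -> float:
    # sort tokens so word order does not matter, then compare the joined strings
    sa = ' '.join(sorted(a.lower().split()))
    sb = ' '.join(sorted(b.lower().split()))
    return SequenceMatcher(None, sa, sb).ratio() * 100

token_sort_ratio("iPhone 15 Pro Max", "Max Pro iPhone 15")  # 100.0
```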
Level 4 — Vector Similarity. For semantically meaningful texts — embeddings via sentence-transformers and cosine similarity.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')

def are_similar(text1: str, text2: str, threshold=0.92) -> bool:
    embeddings = model.encode([text1, text2])
    cosine_sim = np.dot(embeddings[0], embeddings[1]) / (
        np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1])
    )
    return float(cosine_sim) >= threshold
For large volumes — index the embeddings in pgvector (a PostgreSQL extension) or Milvus for approximate nearest-neighbor search.
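A hedged sketch of what the pgvector route can look like (table and column names are illustrative, `:query_embedding` stands in for a bound parameter, and the vector dimension must match the embedding model — 384 for MiniLM-L12-v2):

```sql
-- Install the extension and store embeddings alongside products
CREATE EXTENSION IF NOT EXISTS vector;
ALTER TABLE products ADD COLUMN embedding vector(384);

-- Approximate index on cosine distance
CREATE INDEX ON products USING hnsw (embedding vector_cosine_ops);

-- Nearest neighbors of a given embedding (<=> is cosine distance)
SELECT id, title, 1 - (embedding <=> :query_embedding) AS cosine_sim
FROM products
ORDER BY embedding <=> :query_embedding
LIMIT 10;
```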
Performance
All-pairs comparison is infeasible for millions of records. Strategies:
- MinHash + LSH (Locality Sensitive Hashing) — fast candidate finding for duplicates in large text sets
- Blocking — first filter by exact attributes (category, price range), then fuzzy comparison only within blocks
- PostgreSQL indexes — pg_trgm for fuzzy string search with similarity() and the % operator
-- Install extension
CREATE EXTENSION pg_trgm;
CREATE INDEX ON products USING GIN (title gin_trgm_ops);
-- Find similar titles
SELECT id, title, similarity(title, 'Iphone 15 pro max 256') AS sim
FROM products
WHERE title % 'Iphone 15 pro max 256'
ORDER BY sim DESC
LIMIT 10;
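MinHash + LSH can be sketched in pure Python. This is illustrative only — the hash family, shingle size, and band parameters are assumptions, and in production a dedicated library (e.g. datasketch) would do this job:

```python
import hashlib
from collections import defaultdict

def shingles(text: str, k: int = 3) -> set[str]:
    # character k-grams of the whitespace-normalized, lowercased text
    t = ' '.join(text.lower().split())
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def minhash(shingle_set: set[str], num_perm: int = 64) -> list[int]:
    # one seeded hash per "permutation"; the signature keeps each minimum
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(
                s.encode(), digest_size=8, salt=seed.to_bytes(8, 'big')
            ).digest(), 'big')
            for s in shingle_set
        ))
    return sig

def lsh_candidates(signatures: dict[str, list[int]], bands: int = 16) -> set[tuple]:
    # split signatures into bands; items sharing any band bucket become candidates
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for members in buckets.values():
        ordered = sorted(members)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Only candidate pairs then go through the expensive fuzzy comparison, instead of all N² pairs.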
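Blocking can be sketched the same way — field names and the price-bucket key are illustrative, and difflib's SequenceMatcher stands in for a proper fuzzy scorer:

```python
from difflib import SequenceMatcher
from collections import defaultdict

def block_key(record: dict) -> tuple:
    # coarse exact attributes: category plus a 100-unit price bucket
    return (record["category"], int(record["price"] // 100))

def candidate_pairs(records: list[dict], threshold: float = 0.85):
    blocks = defaultdict(list)
    for rec in records:
        blocks[block_key(rec)].append(rec)
    # fuzzy comparison runs only inside each block, never across blocks
    for group in blocks.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                a = group[i]["title"].lower()
                b = group[j]["title"].lower()
                if SequenceMatcher(None, a, b).ratio() >= threshold:
                    yield group[i]["id"], group[j]["id"]
```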
Duplicate Management
Found duplicates are not deleted automatically. The system forms groups of candidates with a computed match score. The final decision is either automatic (when the score exceeds 95%) or made through a manual review interface.
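The grouping step can be sketched with a small union-find: pairs above the auto-merge threshold are merged into clusters, the rest go to the review queue. The 0.95 threshold mirrors the text; the data shapes are illustrative:

```python
from collections import defaultdict

class DSU:
    """Union-find over arbitrary hashable ids."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def group_candidates(scored_pairs, auto_threshold=0.95):
    dsu = DSU()
    review = []
    for a, b, score in scored_pairs:
        if score > auto_threshold:
            dsu.union(a, b)               # confident match: merge automatically
        else:
            review.append((a, b, score))  # borderline: manual review queue
    clusters = defaultdict(list)
    for item in dsu.parent:
        clusters[dsu.find(item)].append(item)
    return list(clusters.values()), review
```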
Timeline for implementing multi-level deduplication system: 4–7 business days.