Implementing a Content Audit Before Migration
A content audit is an inventory and analysis of all materials on the current site before they are transferred to a new platform. It clarifies what to migrate, what to update, and what to delete.
Audit Tasks
- Compile a complete list of the site's URLs
- Identify outdated and duplicate content
- Collect SEO metrics for each page
- Find broken links and missing metadata
- Prioritize content for migration
Site Crawling for Inventory
Screaming Frog is the standard tool for this audit.
Configuration:
- Configuration → Spider → Crawl all subdomains
- Enable JavaScript rendering (needed for SPA sites)
- Export: All tabs → Save as CSV
The result is a CSV with the fields URL, Title, Meta Description, H1, Status Code, Indexability, Word Count, Inlinks, Outlinks.
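The export can be inspected with pandas for a first pass. A minimal sketch — the file name `internal_all.csv` is an assumption, and the block writes a tiny sample CSV first so it is self-contained (the column headers match Screaming Frog's export convention):

```python
import pandas as pd

# Tiny sample standing in for a real Screaming Frog "Internal" export;
# the column names mirror the real CSV headers.
sample = pd.DataFrame({
    'Address': ['https://company.com/', 'https://company.com/old', 'https://company.com/blog'],
    'Status Code': [200, 404, 200],
    'Indexability': ['Indexable', 'Non-Indexable', 'Indexable'],
})
sample.to_csv('internal_all.csv', index=False)

df = pd.read_csv('internal_all.csv')
status_counts = df['Status Code'].value_counts()
non_indexable = df[df['Indexability'] != 'Indexable']
print(status_counts.to_dict())  # {200: 2, 404: 1}
print(f"Non-indexable: {len(non_indexable)} of {len(df)} URLs")
```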
A Python crawler for automating the same inventory:

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class ContentAuditSpider(scrapy.Spider):
    name = 'content_audit'
    start_urls = ['https://company.com']
    allowed_domains = ['company.com']  # keep the crawl on our own site
    custom_settings = {
        'DEPTH_LIMIT': 10,
        'DOWNLOAD_DELAY': 0.5,
        'FEEDS': {'audit_results.csv': {'format': 'csv'}},
    }

    def parse(self, response):
        yield {
            'url': response.url,
            'status': response.status,
            'title': response.css('title::text').get(''),
            'h1': response.css('h1::text').get(''),
            'meta_description': response.css('meta[name="description"]::attr(content)').get(''),
            'canonical': response.css('link[rel="canonical"]::attr(href)').get(''),
            'robots': response.css('meta[name="robots"]::attr(content)').get('all'),
            'word_count': len(' '.join(response.css('main *::text').getall()).split()),
            'internal_links': len(response.css('a[href^="/"]')),
            'images_without_alt': len(response.css('img:not([alt])')),
            'last_modified': response.headers.get('Last-Modified', b'').decode(),
        }
        # Follow every link on the page; allowed_domains filters out external URLs
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(ContentAuditSpider)
    process.start()
```
Analyzing SEO Data
Export from Google Search Console:
- Performance → Pages: clicks, impressions, CTR, position
- Coverage: indexed / not indexed pages
- URL Inspection: status of specific URLs
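The same page-level data can be pulled programmatically through the Search Console API instead of the UI export. A hedged sketch that only builds the request body (the date range is illustrative; an authenticated `service` from `google-api-python-client` is assumed for the actual call):

```python
def build_gsc_query(start_date, end_date, row_limit=25000):
    """Build the request body for searchanalytics().query() in the GSC API."""
    return {
        'startDate': start_date,
        'endDate': end_date,
        'dimensions': ['page'],  # one row per URL
        'rowLimit': row_limit,   # the API caps a single request at 25 000 rows
    }

body = build_gsc_query('2024-01-01', '2024-03-31')

# With an authenticated client (assumed):
# service = build('searchconsole', 'v1', credentials=creds)
# rows = service.searchanalytics().query(
#     siteUrl='https://company.com/', body=body).execute()['rows']
```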
Joining the crawl data with GSC data in Python:

```python
import pandas as pd

crawl_data = pd.read_csv('audit_results.csv')
gsc_data = pd.read_csv('gsc_pages.csv')  # GSC export; rename its page column to 'url' if needed

merged = crawl_data.merge(gsc_data, on='url', how='left')
merged['has_seo_value'] = merged['clicks'] > 0  # pages with organic traffic
```
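In practice a merge on `url` often fails silently because the crawler and GSC format URLs differently (trailing slash, host casing, tracking parameters). A sketch of normalizing both sides before joining — the exact rules to apply are a judgment call per site:

```python
from urllib.parse import urlsplit

def normalize_url(url: str) -> str:
    """Canonicalize a URL for joining crawl and GSC data:
    lowercase the host, drop query string and fragment, strip the trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip('/') or '/'
    return f"{parts.scheme}://{parts.netloc.lower()}{path}"

print(normalize_url('https://Company.com/blog/?utm_source=x'))
# -> https://company.com/blog

# Apply to both frames before merging:
# crawl_data['url'] = crawl_data['url'].map(normalize_url)
# gsc_data['url'] = gsc_data['url'].map(normalize_url)
```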
Content Classification
Each page gets a label:
| Decision | Criteria |
|---|---|
| Transfer | Clicks > 0, unique content, relevant |
| Update during transfer | Content outdated but has SEO value |
| Consolidate | Duplicate pages on same topic |
| Delete + redirect | No traffic, duplicate, thin content |
| Don't transfer | Test pages, archive, service URLs |
```python
def classify_page(row):
    if row.get('noindex') or row['status'] != 200:
        return 'skip'
    if row.get('clicks', 0) > 100 or row.get('inlinks', 0) > 5:
        return 'migrate_priority_high'
    if row.get('word_count', 0) < 100:
        return 'review_thin_content'
    if row.get('clicks', 0) > 0:
        return 'migrate'
    return 'archive'
```
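Applied row by row with `DataFrame.apply`, the function yields a decision column that can then be aggregated. A self-contained sketch on illustrative rows (the function is repeated here so the block runs on its own):

```python
import pandas as pd

def classify_page(row):
    # Same logic as classify_page above
    if row.get('noindex') or row['status'] != 200:
        return 'skip'
    if row.get('clicks', 0) > 100 or row.get('inlinks', 0) > 5:
        return 'migrate_priority_high'
    if row.get('word_count', 0) < 100:
        return 'review_thin_content'
    if row.get('clicks', 0) > 0:
        return 'migrate'
    return 'archive'

merged = pd.DataFrame([
    {'status': 200, 'clicks': 250, 'inlinks': 2, 'word_count': 800, 'noindex': False},
    {'status': 404, 'clicks': 0,   'inlinks': 0, 'word_count': 0,   'noindex': False},
    {'status': 200, 'clicks': 3,   'inlinks': 1, 'word_count': 50,  'noindex': False},
])
merged['decision'] = merged.apply(classify_page, axis=1)
print(merged['decision'].tolist())
# -> ['migrate_priority_high', 'skip', 'review_thin_content']
```

Note that the thin-content check fires before the traffic check, so a 50-word page with a few clicks lands in review rather than straight migration.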
Media File Analysis
```shell
# Count files by extension in the uploads directory
find /var/www/uploads -type f | awk -F. '{print $NF}' | sort | uniq -c

# Files not referenced from content (potential garbage)
# Step 1: extract all image URLs from the DB
#   (-N skips the column header; your_database is a placeholder)
mysql -N -e "SELECT DISTINCT image_url FROM posts WHERE image_url IS NOT NULL" your_database > used_files.txt

# Step 2: reduce URLs to bare file names, then diff against the files on disk
awk -F/ '{print $NF}' used_files.txt | sort > used_names.txt
comm -23 <(ls /var/www/uploads | sort) used_names.txt
```
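The same orphan check can be done in Python, which makes the URL-to-filename reduction explicit (all paths and names here are illustrative):

```python
from pathlib import Path
from urllib.parse import urlsplit

def find_orphans(disk_files, used_urls):
    """Files present on disk but never referenced from content.
    The DB may store full URLs, so reduce each one to its bare file name."""
    used_names = {Path(urlsplit(u).path).name for u in used_urls}
    return sorted(set(disk_files) - used_names)

orphans = find_orphans(
    disk_files=['logo.png', 'old-banner.jpg', 'team.jpg'],
    used_urls=['https://company.com/uploads/logo.png',
               'https://company.com/uploads/team.jpg'],
)
print(orphans)  # ['old-banner.jpg']
```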
SEO Metadata Inventory
```python
# Find pages without a meta description
missing_meta = merged[merged['meta_description'].isna() | (merged['meta_description'] == '')]
print(f"Without meta description: {len(missing_meta)} pages")

# Find duplicate titles
duplicate_titles = merged[merged.duplicated(subset='title', keep=False)]
print(f"Duplicate titles: {len(duplicate_titles)} pages")

# Export tasks for copywriters
missing_meta[['url', 'title', 'h1']].to_csv('tasks_add_meta.csv', index=False)
```
Final Audit Report
Report structure:
- Summary statistics (total URLs, status codes, distribution by page type)
- SEO health (% of pages with a meta description, H1, canonical)
- Technical issues (broken links, error pages)
- Page list by decision (table with URL and action)
- Recommendations for transfer priorities
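The summary section can be generated straight from the merged DataFrame. A minimal sketch with illustrative columns and rows:

```python
import pandas as pd

merged = pd.DataFrame([
    {'status': 200, 'meta_description': 'About us', 'decision': 'migrate'},
    {'status': 200, 'meta_description': '',         'decision': 'migrate_priority_high'},
    {'status': 404, 'meta_description': '',         'decision': 'skip'},
])

summary = {
    'total_urls': len(merged),
    'status_counts': merged['status'].value_counts().to_dict(),
    # Share of pages that have a non-empty meta description
    'pct_with_meta': round(100 * (merged['meta_description'] != '').mean(), 1),
    'by_decision': merged['decision'].value_counts().to_dict(),
}
print(summary)
```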
Execution Time
Auditing a site of up to 1,000 pages, including classification and the final report, takes 3–5 working days.