Existing content audit before website migration

Implementing Content Audit Before Migration

A content audit is the inventory and analysis of all materials on the current site before transferring them to a new platform. It shows what to transfer, what to update, and what to delete.

Audit Tasks

  • Compile a complete list of URLs on the site
  • Identify outdated and duplicate content
  • Determine SEO metrics for each page
  • Find broken links and missing metadata
  • Prioritize content for transfer

Site Crawling for Inventory

Screaming Frog is the standard tool for this audit:

Configuration:
- Configuration → Spider → Crawl all subdomains
- Enable: JS rendering (for SPA)
- Export: All tabs → Save as CSV

Result: a CSV with the fields URL, Title, Meta Description, H1, Status Code, Indexability, Word Count, Inlinks, Outlinks.
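The exported CSV can be checked in pandas before going further. A minimal sketch; the column names ('Address', 'Status Code', 'Meta Description 1') follow Screaming Frog's default export layout and should be verified against your actual file, the data here is a toy stand-in:

```python
import pandas as pd

# Toy stand-in for the Screaming Frog export; real data comes from read_csv
df = pd.DataFrame({
    'Address': ['https://company.com/', 'https://company.com/old', 'https://company.com/blog'],
    'Status Code': [200, 404, 200],
    'Meta Description 1': ['Home page', '', None],
})

# Pages returning errors
broken = df[df['Status Code'] >= 400]

# Pages with an empty or missing meta description
no_meta = df[df['Meta Description 1'].isna() | (df['Meta Description 1'] == '')]

print(len(broken), len(no_meta))  # -> 1 2
```

The same two filters reappear below on the crawler's own output; running them on the Screaming Frog export first gives a quick sanity check of the crawl.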

Python crawler for automation:

import scrapy
from scrapy.crawler import CrawlerProcess

class ContentAuditSpider(scrapy.Spider):
    name = 'content_audit'
    allowed_domains = ['company.com']  # keep the crawl on our own site
    start_urls = ['https://company.com']
    custom_settings = {
        'DEPTH_LIMIT': 10,
        'DOWNLOAD_DELAY': 0.5,
        'FEEDS': {'audit_results.csv': {'format': 'csv'}}
    }

    def parse(self, response):
        yield {
            'url': response.url,
            'status': response.status,
            'title': response.css('title::text').get(''),
            'h1': response.css('h1::text').get(''),
            'meta_description': response.css('meta[name="description"]::attr(content)').get(''),
            'canonical': response.css('link[rel="canonical"]::attr(href)').get(''),
            'robots': response.css('meta[name="robots"]::attr(content)').get('all'),
            'word_count': len(' '.join(response.css('main *::text').getall()).split()),
            'internal_links': len(response.css('a[href^="/"]')),
            'images_without_alt': len(response.css('img:not([alt])')),
            'last_modified': response.headers.get('Last-Modified', b'').decode()
        }

        # follow every discovered link; allowed_domains filters external ones
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)

if __name__ == '__main__':
    process = CrawlerProcess()
    process.crawl(ContentAuditSpider)
    process.start()

Analyzing SEO Data

Export from Google Search Console:

  • Performance → Pages: clicks, impressions, CTR, position
  • Coverage: indexed / not indexed pages
  • URL Inspection: status of specific URLs

Joining the crawl data with GSC data in Python:

import pandas as pd

crawl_data = pd.read_csv('audit_results.csv')
gsc_data = pd.read_csv('gsc_pages.csv')  # Performance → Pages export from GSC

# the GSC export capitalizes its columns (e.g. 'Top pages', 'Clicks');
# rename them to match the crawl data before merging
gsc_data = gsc_data.rename(columns={'Top pages': 'url', 'Clicks': 'clicks'})

merged = crawl_data.merge(gsc_data, on='url', how='left')
merged['clicks'] = merged['clicks'].fillna(0)   # pages absent from GSC have no traffic
merged['has_seo_value'] = merged['clicks'] > 0  # pages with traffic
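URLs in the two sources often differ only in host casing, a trailing slash, or a fragment, which makes the left join silently miss rows. A small normalizer of my own (not part of either tool) can be mapped over both `url` columns before merging:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Lowercase scheme and host, drop fragments and trailing slashes."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip('/') or '/'
    # urlsplit already lowercases the scheme; lowercase the host here
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ''))

print(normalize_url('HTTPS://Company.com/Blog/'))  # -> https://company.com/Blog
```

Usage: `crawl_data['url'] = crawl_data['url'].map(normalize_url)`, and the same for the GSC frame, before calling `merge`.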

Content Classification

Each page gets a label:

Decision               | Criteria
Transfer               | Clicks > 0, unique content, relevant
Update during transfer | Content outdated but has SEO value
Consolidate            | Duplicate pages on same topic
Delete + redirect      | No traffic, duplicate, thin content
Don't transfer         | Test pages, archive, service URLs

A helper for bulk classification (the thresholds are examples; tune them to the project):

def classify_page(row):
    # pages blocked from indexing or returning errors are not transferred
    if row.get('noindex') or row['status'] != 200:
        return 'skip'
    # traffic or many internal links → high-priority transfer
    if row.get('clicks', 0) > 100 or row.get('inlinks', 0) > 5:
        return 'migrate_priority_high'
    # under 100 words → thin content, review before transferring
    if row.get('word_count', 0) < 100:
        return 'review_thin_content'
    if row.get('clicks', 0) > 0:
        return 'migrate'
    return 'archive'

Media File Analysis

# Count media files on the server by extension
find /var/www/uploads -type f | awk -F. '{print $NF}' | sort | uniq -c

# Files not referenced from content (potential garbage)
# Step 1: extract all image paths from the DB (-N drops the column header),
# reduce them to basenames so they match the directory listing
mysql -N -e "SELECT DISTINCT image_url FROM posts WHERE image_url IS NOT NULL" \
  | sed 's|.*/||' | sort > used_files.txt

# Step 2: compare with files on disk (comm needs both lists sorted)
comm -23 <(ls /var/www/uploads/ | sort) used_files.txt
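The same comparison can be done in Python without the shell plumbing, and it also surfaces the reverse problem: files referenced in the DB that are missing on disk. A sketch where `db_urls` and `disk_files` stand in for the DB export and the directory listing:

```python
def diff_media(db_urls, disk_files):
    """Return (orphaned files on disk, referenced-but-missing files)."""
    used = {u.rsplit('/', 1)[-1] for u in db_urls}  # reduce URLs to basenames
    disk = set(disk_files)
    return sorted(disk - used), sorted(used - disk)

# Toy example: image URLs from the posts table vs. files in uploads/
orphaned, missing = diff_media(
    ['https://company.com/uploads/a.jpg', '/uploads/b.png'],
    ['a.jpg', 'b.png', 'c.gif'],
)
print(orphaned, missing)  # -> ['c.gif'] []
```

In a real audit, `disk_files` would come from `os.listdir('/var/www/uploads')` and `db_urls` from the SQL query above; the "missing" list points at broken images that should be fixed before migration.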

SEO Metadata Inventory

# Find pages without meta description
missing_meta = merged[merged['meta_description'].isna() | (merged['meta_description'] == '')]
print(f"Without meta description: {len(missing_meta)} pages")

# Find duplicate Titles
duplicate_titles = merged[merged.duplicated(subset='title', keep=False)]
print(f"Duplicate Titles: {len(duplicate_titles)} pages")

# Export tasks for copywriters
missing_meta[['url', 'title', 'h1']].to_csv('tasks_add_meta.csv', index=False)

Final Audit Report

Report structure:

  1. Summary statistics (total URLs, statuses, distribution by type)
  2. SEO health (% pages with meta description, H1, canonical)
  3. Technical issues (broken links, error pages)
  4. Page list by decision (table with URL and action)
  5. Recommendations for transfer priorities
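The summary statistics (item 1) fall straight out of the merged DataFrame. A sketch with toy data, where the `decision` column is assumed to hold the output of the classification step above:

```python
import pandas as pd

# Toy stand-in for the merged audit DataFrame
df = pd.DataFrame({
    'url': ['/a', '/b', '/c', '/d'],
    'status': [200, 200, 404, 301],
    'decision': ['migrate', 'archive', 'skip', 'skip'],
})

summary = {
    'total_urls': len(df),
    'by_status': df['status'].value_counts().to_dict(),
    'by_decision': df['decision'].value_counts().to_dict(),
}
print(summary)
```

The `by_decision` counts map directly onto the page list in item 4, and `by_status` feeds the technical-issues section.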

Execution Time

An audit of a site with up to 1,000 pages, including classification and the final report, takes 3–5 working days.