Optimizing Crawl Budget for Your Website
Crawl budget is the number of URLs Googlebot is willing and able to crawl on your site per unit of time. On large sites, poor budget allocation means important pages go uncrawled (and therefore unindexed) while the bot wastes requests on useless URLs.
What Consumes Crawl Budget
- URLs with sorting and filtering parameters (`?sort=price&color=red`)
- Pagination in endless combinations
- Duplicate pages (with and without trailing slash, http/https)
- Pages with session parameters (`?session_id=abc123`)
- Technical pages (cart, account, search)
- Pages with UTM tags
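A quick way to spot these patterns in a URL list (for example, one exported from a crawler) is a small classifier. A sketch using the parameter names and paths from the list above; extend both sets for your own site:

```python
from urllib.parse import urlsplit, parse_qsl

# Parameters and paths that typically waste crawl budget
# (names taken from the list above; purely illustrative)
WASTEFUL_PARAMS = {"sort", "color", "session_id", "utm_source", "utm_medium",
                   "utm_campaign", "ref"}
WASTEFUL_PATHS = ("/cart/", "/checkout/", "/account/", "/search")

def wastes_budget(url: str) -> bool:
    """Return True if the URL likely burns crawl budget for no SEO value."""
    parts = urlsplit(url)
    if parts.path.startswith(WASTEFUL_PATHS):
        return True
    params = {name for name, _ in parse_qsl(parts.query)}
    return bool(params & WASTEFUL_PARAMS)

print(wastes_budget("https://site.com/catalog/shoes?sort=price&color=red"))  # True
print(wastes_budget("https://site.com/catalog/shoes"))                       # False
```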
Analyzing Current Budget
Google Search Console → Settings → Crawl Stats shows:
- Average number of requests per day
- Average download time
- Responses by type (successful, redirects, 404)
Tools for deeper analysis: Screaming Frog and the server access logs:
```bash
# Analyze access.log: what Googlebot crawls
grep "Googlebot" /var/log/nginx/access.log | \
  awk '{print $7}' | sort | uniq -c | sort -rn | head -50

# Find parameters in URLs the bot crawls (mask each value with =X)
grep "Googlebot" /var/log/nginx/access.log | \
  grep "?" | awk '{print $7}' | \
  sed 's/=[^&]*/=X/g' | sort | uniq -c | sort -rn | head -30
```
robots.txt: Blocking Unnecessary URLs
```
User-agent: *
Disallow: /search?
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Disallow: /admin/
Disallow: /*?session_id=
Disallow: /*?utm_source=
Disallow: /*?utm_medium=
Disallow: /*?ref=
Disallow: /wp-json/
Disallow: /wp-admin/
Disallow: /*.pdf$

# Allow important files
Allow: /sitemap.xml
Allow: /robots.txt
```
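Before deploying, rules like these can be sanity-checked with Python's stdlib `urllib.robotparser`. One caveat: the stdlib parser does plain prefix matching only, so wildcard rules such as `/*?utm_source=` are not evaluated the way Googlebot evaluates them; test those against Google's own tooling instead.

```python
from urllib.robotparser import RobotFileParser

# Prefix-only subset of the rules above (stdlib parser limitation)
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Googlebot", "https://site.com/cart/item-1"))    # False
print(rp.can_fetch("Googlebot", "https://site.com/catalog/shoes"))  # True
```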
Canonical for Duplicate Content
```html
<!-- Page with filter → canonical to base -->
<!-- /catalog/shoes?color=red&size=42 -->
<link rel="canonical" href="https://site.com/catalog/shoes">

<!-- /catalog/shoes/ (trailing slash) → canonical without -->
<link rel="canonical" href="https://site.com/catalog/shoes">

<!-- UTM parameters → canonical to clean URL -->
<link rel="canonical" href="https://site.com/articles/post-title">
```
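The canonical target in these examples can be computed programmatically. A minimal sketch, assuming the set of parameters to strip (tracking tags plus site-specific filters like `color`, `size`, `sort`) is known for your site:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Parameters that never change content on this hypothetical site
STRIP = {"utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content",
         "ref", "session_id", "sort", "color", "size"}

def canonical(url: str) -> str:
    """Build the canonical URL: strip listed params, drop trailing slash."""
    s = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(s.query, keep_blank_values=True)
            if k not in STRIP]
    path = s.path if s.path == "/" else s.path.rstrip("/")
    return urlunsplit((s.scheme, s.netloc, path, urlencode(kept), ""))

print(canonical("https://site.com/catalog/shoes?color=red&size=42"))
# https://site.com/catalog/shoes
print(canonical("https://site.com/catalog/shoes/"))
# https://site.com/catalog/shoes
```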
URL Parameter Setup in GSC
Google's URL Parameters tool (Search Console → Legacy tools) was retired in 2022, so parameter handling now relies on canonical tags and robots.txt rules. The classification logic still applies: sort each URL parameter into one of three groups:
- Changes content → index (category, page)
- Doesn't change content → don't crawl (utm_source, ref, sid)
- Sorting/filtering → canonical to the base URL
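With the tool gone, this classification is typically kept as a plain lookup in your own code and enforced via canonicals and robots.txt. A sketch with hypothetical parameter names:

```python
# Action per parameter, following the three groups above (example names)
PARAM_ACTIONS = {
    "category": "index",       # changes content
    "page": "index",
    "utm_source": "block",     # never changes content
    "ref": "block",
    "sid": "block",
    "sort": "canonicalize",    # sorting/filtering -> canonical to base URL
    "color": "canonicalize",
}

def classify(param: str) -> str:
    # Unknown parameters are safest treated as content-changing until reviewed
    return PARAM_ACTIONS.get(param, "index")

print(classify("utm_source"))  # block
print(classify("sort"))        # canonicalize
```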
```nginx
# nginx: 301-redirect URLs carrying UTM parameters to the clean URL.
# Caveat: this drops the entire query string. That is fine when UTM tags
# arrive alone; if they can appear alongside meaningful parameters,
# rebuild $args instead of discarding them.
if ($args ~* "(^|&)utm_") {
    rewrite ^(.*)$ $1? permanent;
}
```
Sitemap.xml Optimization
The sitemap should contain only important, indexable URLs:
```python
from xml.sax.saxutils import escape

def generate_optimized_sitemap(db):
    pages = db.query("""
        SELECT url, updated_at, priority
        FROM pages
        WHERE status = 'published'
          AND noindex = false
          AND updated_at > NOW() - INTERVAL '2 years'
        ORDER BY priority DESC, updated_at DESC
    """)

    xml = ['<?xml version="1.0" encoding="UTF-8"?>',
           '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">']
    for page in pages:
        xml.extend([
            '  <url>',
            f'    <loc>{escape(page["url"])}</loc>',
            f'    <lastmod>{page["updated_at"].strftime("%Y-%m-%d")}</lastmod>',
            f'    <priority>{page["priority"]:.1f}</priority>',
            '  </url>',
        ])
    xml.append('</urlset>')
    return '\n'.join(xml)
```
Don't add to the sitemap: pages with noindex, 404s, redirects, or pages without content.
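That exclusion rule can be enforced as a filter before generation. A minimal sketch over hypothetical page records (field names are assumptions):

```python
def sitemap_eligible(page: dict) -> bool:
    """Keep only indexable, resolving pages: no noindex, 404s, or redirects."""
    return (page.get("status_code") == 200
            and not page.get("noindex", False)
            and bool(page.get("content_length", 0)))

pages = [
    {"url": "/a", "status_code": 200, "noindex": False, "content_length": 5120},
    {"url": "/b", "status_code": 301, "noindex": False, "content_length": 0},
    {"url": "/c", "status_code": 200, "noindex": True,  "content_length": 900},
]
print([p["url"] for p in pages if sitemap_eligible(p)])  # ['/a']
```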
Controlling Crawl Speed
GSC → Settings → Crawl rate used to let you ask Google to crawl slower (useful for overloaded servers), but Google deprecated this setting in January 2024; the rate is now adjusted automatically, and Googlebot backs off when the server answers 500/503/429. Crawling still can't be sped up on demand: Google determines the rate itself.

For Yandex, the robots.txt directive Crawl-delay (Yandex now recommends the crawl-speed setting in Yandex.Webmaster instead):

```
User-agent: Yandex
Crawl-delay: 2
```
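A well-behaved crawler can read this directive with the stdlib parser and throttle itself (e.g. `time.sleep(delay)` between requests):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("User-agent: Yandex\nCrawl-delay: 2".splitlines())

# Seconds between requests for the matching agent; None when no
# Crawl-delay rule applies (no "User-agent: *" group exists here)
print(rp.crawl_delay("Yandex"))     # 2
print(rp.crawl_delay("Googlebot"))  # None
```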
Timeline
Crawl budget audit and optimization (robots.txt, canonical, sitemap) — 1–2 business days.