Technical SEO Audit · 2025-07-02 · 8 min read

Robots.txt Best Practices Guide for SEO Teams

Learn robots.txt best practices to control crawler access, protect sensitive pages, and optimize your crawl budget effectively.


Auditite Team

robots.txt · technical SEO · crawl control · Googlebot

Understanding Robots.txt and Its Role in SEO

The robots.txt file is a simple text file placed in your website’s root directory that tells search engine crawlers which parts of your site they can and cannot access. Despite its simplicity, misconfiguring robots.txt is one of the most common technical SEO mistakes — and one of the most damaging.

A single misplaced directive can block your entire site from being indexed, while an overly permissive file can waste crawl budget on pages that should never appear in search results.

How Robots.txt Works

When a search engine crawler arrives at your domain, the first thing it requests is https://yourdomain.com/robots.txt. Based on the directives it finds, the crawler decides which URLs it is allowed to visit.

Key directives include:

  • User-agent — Specifies which crawler the rules apply to (e.g., Googlebot, Bingbot, or * for all)
  • Disallow — Blocks the specified path from being crawled
  • Allow — Explicitly permits crawling of a path (useful for overriding broader Disallow rules)
  • Sitemap — Points crawlers to your XML sitemap
  • Crawl-delay — Tells certain bots to wait between requests (not respected by Googlebot)
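
Put together, a minimal robots.txt using these directives might look like the following (the domain and paths are placeholders):

```text
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Sitemap: https://yourdomain.com/sitemap.xml

User-agent: Bingbot
Crawl-delay: 5
```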

Important Limitations

Robots.txt is a public request, not a security measure. Well-behaved bots follow it, but malicious bots ignore it entirely. Never use robots.txt to hide sensitive information: anyone can read the file, so listing private paths actually advertises them. Use server-side authentication instead.

Also, blocking a page with robots.txt does not prevent it from appearing in search results. If other sites link to a blocked page, Google may still index the URL (without content). To fully prevent indexation, use a noindex meta tag or X-Robots-Tag header, and leave the page crawlable so Googlebot can actually see the directive.
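
A noindex directive goes in the page's HTML head; the page must remain crawlable so the tag can be seen:

```html
<meta name="robots" content="noindex">
```

The same directive can also be sent as an X-Robots-Tag HTTP response header, which works for non-HTML resources such as PDFs.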

Essential Robots.txt Rules for Every Site

Block Internal Search Results

Internal search result pages create near-infinite URL combinations that waste crawl budget and risk being flagged as thin or duplicate content:

Disallow: /search
Disallow: /search?

Block Faceted Navigation

E-commerce sites with filters for color, size, price range, and other attributes generate thousands of parameter-based URLs. Block these filtered paths:

Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&color=

Block Admin and Login Areas

Crawlers have no business visiting your admin panel, login pages, or account management areas:

Disallow: /admin/
Disallow: /login
Disallow: /account/
Disallow: /wp-admin/

Block Cart and Checkout Pages

Shopping cart and checkout pages are session-specific and have no SEO value:

Disallow: /cart
Disallow: /checkout

Block Duplicate Content Paths

If your CMS generates multiple URL patterns for the same content (e.g., tag archives, date-based archives), block the duplicates:

Disallow: /tag/
Disallow: /author/

Include Your Sitemap Reference

Always include a sitemap directive at the bottom of your robots.txt:

Sitemap: https://yourdomain.com/sitemap-index.xml

Common Robots.txt Mistakes to Avoid

Blocking CSS and JavaScript Files

Years ago, it was common practice to block CSS and JS files from crawlers. Today, this is harmful because Google needs to render your pages fully to assess their quality. If Googlebot cannot access your CSS and JS, it cannot properly evaluate your Core Web Vitals or understand your page layout.

Using Disallow: / Without Realizing the Impact

A single Disallow: / directive under User-agent: * blocks your entire site from all crawlers. This is sometimes added during development and accidentally left in place after launch. Always audit your robots.txt after site migrations or redesigns.
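
The dangerous pattern looks deceptively harmless. These two lines remove your entire site from every well-behaved crawler:

```text
User-agent: *
Disallow: /
```

If you need to keep a staging site out of search, prefer HTTP authentication or a noindex header instead, since a robots.txt block alone still allows the URLs themselves to be indexed.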

Conflicting Allow and Disallow Rules

When multiple rules apply to the same URL path, search engines use the most specific matching rule. However, relying on this behavior creates confusion and maintenance headaches. Keep your rules clear and non-overlapping.
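
As an illustration of specificity-based matching (the paths here are hypothetical), Google resolves the pair below in favor of the longer, more specific Allow rule, so /blog/annual-report remains crawlable even though /blog/ is blocked:

```text
User-agent: *
Disallow: /blog/
Allow: /blog/annual-report
```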

Forgetting to Update After Site Changes

Your robots.txt should evolve with your site. New URL patterns, deprecated sections, and structural changes all require robots.txt updates. Include robots.txt review in your regular technical SEO audit process.

Robots.txt for Different Platforms

WordPress

WordPress sites commonly need to block:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /tag/
Disallow: /author/
Disallow: /?s=
Disallow: /search/

Avoid blocking /wp-includes/ or /wp-content/: they contain the CSS and JavaScript files Googlebot needs to render your pages.

Shopify

Shopify generates a default robots.txt that is generally well-configured, but you can customize it through the robots.txt.liquid template file. Common additions include blocking collection filter pages and variant URLs.

Next.js and Single-Page Applications

JavaScript-heavy sites need special attention. Ensure you do not block the JavaScript bundles that Googlebot needs to render your pages. Read our full guide on JavaScript SEO for more details.

Testing Your Robots.txt

Google Search Console URL Inspection

Use the URL Inspection tool to verify whether specific URLs are blocked by your robots.txt. This is the most authoritative test because it uses Google’s actual parser.

Robots.txt Tester Tools

Several online tools can validate your robots.txt syntax and test specific URLs against your rules. Run these tests after every change.
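
You can also script these checks. Python's standard-library urllib.robotparser implements the original robots exclusion spec (no wildcard support), which is enough to smoke-test plain path rules. The rules and URLs below are examples, not your live file:

```python
from urllib.robotparser import RobotFileParser

# Example rules to test against (substitute your live file's contents)
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Check a few representative URLs against the parsed rules
for url in (
    "https://example.com/",
    "https://example.com/admin/settings",
    "https://example.com/search?q=shoes",
):
    verdict = "allowed" if rp.can_fetch("Googlebot", url) else "blocked"
    print(url, "->", verdict)
```

Because the parser ignores wildcard syntax, verify any `*` or `$` patterns with Google's own URL Inspection tool rather than this script.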

Automated Monitoring

Set up automated monitoring to alert you if your robots.txt changes unexpectedly. An accidental edit during deployment could block critical pages from being crawled. Auditite monitors your robots.txt and alerts you to any changes that could impact SEO.

Advanced Robots.txt Techniques

Crawl-Delay for Non-Google Bots

While Googlebot ignores the Crawl-delay directive, other bots like Bingbot and Yandex respect it. If secondary crawlers are overloading your server, add a crawl delay:

User-agent: Bingbot
Crawl-delay: 5

Pattern Matching with Wildcards

Both * (matches any sequence of characters) and $ (matches end of URL) can be used in robots.txt paths:

  • Disallow: /*.pdf$ — Blocks all PDF files
  • Disallow: /*?sessionid= — Blocks URLs with session parameters
  • Disallow: /category/*/feed/ — Blocks feeds under category paths
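
The wildcard matching can be sketched in a few lines of Python by translating a robots pattern into a regular expression. This is an illustrative approximation of the matching rules, not Google's exact matcher, and the function name is ours:

```python
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Approximate robots.txt pattern matching: * matches any
    sequence of characters, $ anchors the end of the URL, and
    patterns are anchored to the start of the path."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.match(regex, path) is not None

print(robots_pattern_matches("/*.pdf$", "/files/report.pdf"))      # True
print(robots_pattern_matches("/*.pdf$", "/files/report.pdf?v=2"))  # False
print(robots_pattern_matches("/*?sessionid=", "/page?sessionid=abc"))  # True
```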

Bot-Specific Rules

You can create different rules for different crawlers. This is useful when you want one bot, such as Googlebot, to crawl a section that another, such as Bingbot, should skip:

User-agent: Googlebot
Allow: /special-section/

User-agent: Bingbot
Disallow: /special-section/

Robots.txt Audit Checklist

Run through this checklist during every technical SEO audit:

  • File is accessible at https://yourdomain.com/robots.txt
  • No critical pages are blocked (homepage, product pages, blog posts)
  • CSS and JS files are not blocked
  • Low-value pages are properly blocked (search, filters, admin)
  • Sitemap URL is included
  • No conflicting directives exist
  • Syntax is valid and properly formatted
  • File matches current site structure (no outdated rules)
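
A few of these checks are easy to automate. Here is a minimal sketch in Python that inspects a robots.txt payload for a sitemap line and a site-wide disallow; the file contents are a stand-in, and in practice you would fetch your live file over HTTP:

```python
# Hypothetical robots.txt contents to audit
robots = """\
User-agent: *
Disallow: /admin/
Sitemap: https://yourdomain.com/sitemap-index.xml
"""

lines = [line.strip().lower() for line in robots.splitlines()]

# Checklist: sitemap URL is included
has_sitemap = any(line.startswith("sitemap:") for line in lines)

# Checklist: no site-wide Disallow: / left over from development
blocks_everything = "disallow: /" in lines

print("sitemap listed:", has_sitemap)
print("site-wide disallow:", blocks_everything)
```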

Key Takeaways

Your robots.txt file is a powerful but blunt instrument for controlling crawler access. Use it strategically to protect your crawl budget, but remember that it is not a substitute for noindex directives, canonical tags, or server-side access controls. Audit it regularly, test every change, and always keep it aligned with your current site architecture.
