Robots.txt Best Practices Guide with Auditite
Master robots.txt configuration to control crawl access, protect sensitive pages, and optimize crawl budget allocation.
Overview
The robots.txt file controls which parts of your site search engine crawlers can access. A misconfigured robots.txt can block critical pages from indexing or waste crawl budget on low-value URLs. This guide covers best practices for every common scenario.
Step 1: Audit Your Current Robots.txt
- Access your robots.txt at yourdomain.com/robots.txt.
- Test it using Google Search Console's robots.txt report (the standalone robots.txt Tester has been retired).
- Verify no critical pages or resources are accidentally blocked.
- Check that CSS and JavaScript files needed for rendering are not blocked.
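The checks above can be scripted with Python's standard-library robots.txt parser. A minimal sketch, using an inline example file so it runs offline (in practice you would point the parser at your live file with `set_url` and `read`); note that `urllib.robotparser` implements the basic exclusion rules, not Google's wildcard extensions, so it is best suited to simple prefix rules:

```python
from urllib.robotparser import RobotFileParser

# Inline example; for a live audit use:
#   rp = RobotFileParser("https://yourdomain.com/robots.txt"); rp.read()
robots_txt = """\
User-agent: *
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Critical pages and render-critical resources that must stay crawlable
for url in ("https://yourdomain.com/", "https://yourdomain.com/assets/app.css"):
    print(url, rp.can_fetch("Googlebot", url))
```

Running this against your real file for every page template catches accidental blocks before Google does.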
Step 2: Understand the Syntax
Basic Directives
| Directive | Purpose |
|---|---|
| User-agent | Specifies which crawler the rules apply to |
| Disallow | Blocks the specified path from crawling |
| Allow | Overrides a Disallow for a more specific path |
| Sitemap | Points to your XML sitemap location |
| Crawl-delay | Requests a delay between requests (not honored by Google) |
Rules of Precedence
- More specific paths override less specific paths.
- Allow takes precedence over Disallow when path lengths are equal.
- Rules are case-sensitive for paths.
- Wildcards (*) match any sequence of characters.
- The $ character indicates the end of a URL.
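The precedence rules above can be sketched as a small matcher. This is an illustrative model of Google's documented longest-match semantics (with Allow winning ties), not a full robots.txt parser:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Match a robots.txt path pattern, translating * and $ into a regex."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # $ anchors the pattern to the end of the URL
    return re.match(regex, path) is not None

def is_allowed(rules, path: str) -> bool:
    """Longest matching pattern wins; Allow beats Disallow on a tie."""
    best_directive, best_pattern = "allow", ""  # no matching rule means allowed
    for directive, pattern in rules:
        if rule_matches(pattern, path):
            longer = len(pattern) > len(best_pattern)
            tie = len(pattern) == len(best_pattern) and directive == "allow"
            if longer or tie:
                best_directive, best_pattern = directive, pattern
    return best_directive == "allow"

rules = [("disallow", "/private/"), ("allow", "/private/public-page/")]
print(is_allowed(rules, "/private/secret"))        # False: only Disallow matches
print(is_allowed(rules, "/private/public-page/"))  # True: the longer Allow wins
```

Working through a few of your own rules this way is a quick sanity check before relying on wildcard patterns in production.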
Step 3: Common Configuration Patterns
Block Admin and Internal Pages
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /my-account/
Disallow: /internal/
Block Search Results and Filters
Disallow: /search
Disallow: /*?sort=
Disallow: /*?filter=
Disallow: /*&page=
Allow Specific Resources Within Blocked Directories
Disallow: /private/
Allow: /private/public-page/
Block Specific File Types
Disallow: /*.pdf$
Disallow: /*.doc$
Step 4: Crawl Budget Optimization
- Block URL parameters that create duplicate content (sorting, session IDs, tracking parameters).
- Block internal search result pages — these rarely provide SEO value and can generate infinite URLs.
- Block paginated filter combinations that you handle with canonical tags.
- Do not block pages you want to noindex — use meta robots instead. Blocking with robots.txt prevents Google from seeing the noindex directive.
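One way to decide which parameters to block first is to tally them from a crawl export or server logs. A minimal sketch; the URL list and parameter names here are illustrative placeholders:

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse

# Illustrative sample; in practice, load URLs from your crawl export or logs
urls = [
    "https://yourdomain.com/products",
    "https://yourdomain.com/products?sort=price",
    "https://yourdomain.com/products?sort=price&page=2",
    "https://yourdomain.com/search?q=shoes",
]

param_counts = Counter()
for url in urls:
    for param in parse_qs(urlparse(url).query):
        param_counts[param] += 1

# The most frequent parameters are usually the biggest crawl-budget drains
print(param_counts.most_common())
```

Parameters that dominate the tally and only produce duplicates (sorting, tracking) are the strongest candidates for a Disallow pattern.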
Step 5: Common Mistakes to Avoid
Blocking CSS and JavaScript
Google needs to render your pages to evaluate them. Blocking CSS or JS files in robots.txt prevents rendering and can hurt rankings.
Using Robots.txt for Deindexing
Robots.txt prevents crawling, not indexing. If a page has external links pointing to it, Google may still index it without crawling it. Use noindex meta tags for deindexing.
Overly Broad Disallow Rules
A Disallow: / blocks your entire site. A Disallow: /blog blocks both /blog/ and /blogging-tips/. Always include trailing slashes for directories.
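The prefix behavior is easy to reproduce: a robots.txt path matches any URL path that begins with it. A quick illustration:

```python
# Robots.txt paths are prefixes: "/blog" matches more than you may intend
rule = "/blog"
paths = ["/blog/", "/blog/post-1", "/blogging-tips/", "/about/"]
blocked = [p for p in paths if p.startswith(rule)]
print(blocked)  # the directory rule "/blog/" would have spared "/blogging-tips/"
```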
Forgetting the Sitemap Directive
Always include your sitemap location at the bottom of robots.txt:
Sitemap: https://yourdomain.com/sitemap.xml
Step 6: Testing and Monitoring
- After any robots.txt change, review the updated file in Google Search Console's robots.txt report.
- Verify your most important pages are not blocked using the URL Inspection tool.
- Monitor Google Search Console’s crawl stats for changes in crawl rate after robots.txt updates.
- Use Auditite to audit your robots.txt configuration alongside your full site crawl.
- Keep a version history of your robots.txt changes so you can roll back if issues arise.
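These checks can be automated so they run on every deploy. A minimal sketch using Python's stdlib parser; the robots.txt content, domain, and path list are placeholders, and in CI you would fetch the staged file instead:

```python
from urllib.robotparser import RobotFileParser

# Placeholder file; in CI, fetch the robots.txt about to be deployed
robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
"""

# Pages that must never be blocked (hypothetical list for your site)
MUST_BE_CRAWLABLE = ["/", "/blog/", "/products/"]

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

failures = [p for p in MUST_BE_CRAWLABLE
            if not rp.can_fetch("Googlebot", f"https://yourdomain.com{p}")]
print("blocked critical pages:", failures)  # an empty list means the deploy is safe
```

Failing the build when this list is non-empty turns the manual URL Inspection step into a regression test.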
Robots.txt for Multiple Environments
Ensure your staging and development environments block all crawlers:
User-agent: *
Disallow: /
Remove this block when promoting an environment to production. Launching with a blanket Disallow: / still in place is one of the most common SEO deployment mistakes.
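One way to avoid shipping the staging block by accident is to generate robots.txt from the environment rather than maintaining two files. A hypothetical sketch (the environment variable name and production rules are assumptions for illustration):

```python
import os

def robots_txt(env: str) -> str:
    """Serve a blanket Disallow everywhere except production."""
    if env != "production":
        return "User-agent: *\nDisallow: /\n"
    # Real rules only ever go live in production
    return (
        "User-agent: *\n"
        "Disallow: /admin/\n"
        "Sitemap: https://yourdomain.com/sitemap.xml\n"
    )

# Hypothetical APP_ENV variable set by your deployment pipeline
print(robots_txt(os.environ.get("APP_ENV", "development")))
```

With this approach there is no file to remember to remove; the environment itself decides which rules are served.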
Related playbooks
Canonical URL Guide: Automated SEO Workflow
Master canonical tags to prevent duplicate content issues, consolidate link equity, and control which URLs appear in search.
Crawl Budget Optimization Playbook with Auditite
Maximize search engine crawl efficiency by directing crawl budget to your most valuable pages and reducing waste.
HTTPS Migration Checklist with Auditite
Complete checklist for migrating from HTTP to HTTPS without losing search rankings, traffic, or link equity. Step-by-step guidance.