Guide to the robots.txt File

BeoHosting Team · November 3, 2025 · 8 min read

What is the robots.txt file

robots.txt is a simple text file located in the root of your site that tells web crawlers (search engine bots) which parts of the site they may crawl and which to skip. Every major search engine, including Google, Bing, and Yahoo, checks this file before it starts crawling your site.

robots.txt is not a security mechanism - it is a recommendation, not a prohibition. Well-behaved bots will respect it, but malicious ones will not. To restrict access to sensitive pages, use passwords or server-side authentication. robots.txt is an SEO tool that helps search engines index your site more efficiently.

Where robots.txt is located

The robots.txt file must be in the domain root, available at:

https://yourdomain.com/robots.txt

A file at any other path will not be recognized by crawlers. Every subdomain needs its own robots.txt - the file at yourdomain.com does not apply to blog.yourdomain.com.
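
As a small illustration of the rule above, the following Python sketch derives the robots.txt address that applies to a given page URL. The helper name robots_txt_url is made up for this example; it only shows that the file always sits at the root of the exact host being crawled.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url (hypothetical helper)."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://yourdomain.com/blog/post?id=1"))
print(robots_txt_url("https://blog.yourdomain.com/2025/hello/"))
```

The two calls produce different addresses because the main domain and the blog subdomain are separate hosts, each with its own file.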

Basic syntax

robots.txt uses simple syntax with only a few directives. Each block starts with a User-agent line that defines which bot the rules apply to.

User-agent

User-agent: * - rules apply to all bots.
User-agent: Googlebot - rules apply only to Google's bot.
User-agent: Bingbot - rules apply only to Bing's bot.

Disallow

Disallow: /admin/ - blocks access to the /admin/ directory and everything in it.
Disallow: /private.html - blocks access to a specific page.
Disallow: / - blocks access to the entire site (careful!).
Disallow: (empty) - allows access to everything (default behavior).

Allow

Allow: /admin/public/ - explicitly allows access to a subdirectory that would otherwise be blocked by a Disallow rule.
Allow is used to create exceptions to Disallow rules.
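
To see how these three directives interact, here is a minimal sketch using Python's standard urllib.robotparser module. The /drafts/ path and the bot name SomeBot are assumptions made for the example. Note that Python's parser applies the first matching rule in a group, while Google applies the most specific one, so the Allow line is listed first here to get the same result under both interpretations.

```python
from urllib.robotparser import RobotFileParser

# Example file: one group for all bots, one group only for Googlebot.
ROBOTS_TXT = """\
User-agent: *
Allow: /admin/public/
Disallow: /admin/
Disallow: /private.html

User-agent: Googlebot
Disallow: /drafts/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent: str, path: str) -> bool:
    """Check whether the named bot may fetch the path on yourdomain.com."""
    return parser.can_fetch(agent, "https://yourdomain.com" + path)


for agent, path in [
    ("SomeBot", "/admin/settings"),
    ("SomeBot", "/admin/public/help.html"),
    ("SomeBot", "/private.html"),
    ("Googlebot", "/drafts/new-post"),
    ("Googlebot", "/admin/settings"),
]:
    print(agent, path, allowed(agent, path))
```

The last check is worth noticing: a bot that has its own User-agent group follows only that group, so in this file Googlebot is not bound by the /admin/ rule written for all other bots.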

Common directives and examples

Basic robots.txt for a WordPress site

Here is a recommended robots.txt for WordPress sites that blocks unnecessary sections while allowing indexing of important content:

User-agent: * - applies to all bots
Disallow: /wp-admin/ - the admin panel should not be indexed
Allow: /wp-admin/admin-ajax.php - but the AJAX endpoint is needed for some themes and plugins to work
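Disallow: /wp-includes/ - WordPress system files (leave this out if your pages load scripts or styles from here)
Disallow: /wp-content/plugins/ - plugin files (leave this out if plugins serve CSS or JavaScript that your pages need to render)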
Disallow: /wp-json/ - REST API (optional, depending on needs)
Disallow: /?s= - search pages (thin content)
Disallow: /author/ - author archives (prevents duplicate content)
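
The Allow exception for admin-ajax.php works because crawlers such as Googlebot apply the most specific (longest) matching rule, with Allow winning a tie. The sketch below is a simplified model of that precedence for the rules listed above, not a full robots.txt parser: it uses plain prefix matching and ignores wildcards, and the function name is_allowed is an assumption of this example.

```python
# The rules from the WordPress example, in the order they appear in the file.
RULES = [
    ("disallow", "/wp-admin/"),
    ("allow", "/wp-admin/admin-ajax.php"),
    ("disallow", "/wp-includes/"),
    ("disallow", "/wp-content/plugins/"),
    ("disallow", "/wp-json/"),
    ("disallow", "/?s="),
    ("disallow", "/author/"),
]


def is_allowed(path: str) -> bool:
    """Longest matching rule wins; Allow wins when lengths are equal."""
    matches = [
        (len(prefix), kind == "allow")
        for kind, prefix in RULES
        if path.startswith(prefix)
    ]
    if not matches:
        return True  # no rule matches, so crawling is allowed by default
    # Tuples compare by length first, then by the allow flag (True > False).
    return max(matches)[1]


for path in [
    "/wp-admin/options.php",
    "/wp-admin/admin-ajax.php",
    "/?s=hosting",
    "/blog/hello-world/",
]:
    print(path, is_allowed(path))
```

Only the admin-ajax.php URL escapes the /wp-admin/ block, because its Allow rule is longer than the Disallow rule that also matches it.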

Blocking specific file types

Disallow: /*.pdf$ - blocks indexing of PDF files.
Disallow: /*.xml$ - blocks XML files. Careful: this pattern also matches /sitemap.xml, so add an explicit Allow rule for your sitemap if you use it.

Blocking specific bots

Some sites want to block AI crawlers that collect data for model training:
User-agent: GPTBot - OpenAI's bot
Disallow: /
User-agent: anthropic-ai - Anthropic's bot
Disallow: /

Wildcards

robots.txt supports a limited set of wildcard characters:

* (asterisk): Matches any sequence of characters. Example: Disallow: /*.php blocks all URLs containing .php.
$ (dollar): Marks the end of the URL. Example: Disallow: /*.php$ blocks only URLs ending in .php (not .php?parameter=value).

These wildcards are specific to robots.txt and are not standard regex. Use them carefully because they can have unexpected effects.
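
To make the behavior of the two wildcards concrete, the following Python sketch translates a robots.txt path pattern into a regular expression and tests a few URLs against it. It is an approximation for illustration, under the assumption that patterns match from the start of the URL path; the function names are invented for this example and real crawlers may differ in edge cases.

```python
import re


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern: str, url_path: str) -> bool:
    """Return True if the pattern matches the path, starting at its beginning."""
    return pattern_to_regex(pattern).match(url_path) is not None


for pattern, url_path in [
    ("/*.php", "/index.php?parameter=value"),
    ("/*.php$", "/index.php"),
    ("/*.php$", "/index.php?parameter=value"),
    ("/*.xml$", "/sitemap.xml"),
]:
    print(pattern, url_path, matches(pattern, url_path))
```

The last pair shows one of the unexpected effects mentioned above: a rule meant for arbitrary XML files also matches the sitemap.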

Testing robots.txt

Before publishing robots.txt to a production site, always test it to avoid accidentally blocking important content.

Testing tools

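Google Search Console: Under "Settings" → "Crawling" → "robots.txt" you can see which robots.txt files Google found, when it last fetched them, and any warnings or errors. To check whether a specific URL is blocked, use the URL Inspection tool.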
Bing Webmaster Tools: Similar functionality for the Bing search engine.
Online validators: Tools like robots-txt.com or technicalseo.com/tools/robots-txt/ check syntax and warn about errors.
Screaming Frog: A desktop SEO tool that can simulate crawling and show which pages are blocked by robots.txt.

Common mistakes

Blocking CSS/JS files: Google must access CSS and JavaScript to render pages properly. Do not block these resources.
Disallow: / for all bots - this blocks the entire site from indexing. A common mistake during site migrations.
Whitespace in paths: Paths must be exact, with no extra spaces.
File size: Google reads only the first 500 KiB of robots.txt and ignores any rules beyond that limit. Keep the file short and clear.
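
The mistakes above can be caught with a simple pre-publish check. The Python sketch below is a rough lint pass, not a complete validator: the function name lint_robots_txt and its warning messages are assumptions of this example, and the CSS/JS check only looks at rules that end in .css or .js.

```python
MAX_BYTES = 500 * 1024  # size limit mentioned above


def lint_robots_txt(text: str) -> list:
    """Return a list of warnings for a few common robots.txt mistakes."""
    warnings = []
    if len(text.encode("utf-8")) > MAX_BYTES:
        warnings.append("file is larger than 500 KiB")

    agents = []       # user agents of the group currently being read
    in_rules = False  # True once the current group has at least one rule
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if " " in value:
                warnings.append(f"line {number}: path contains a space")
            if field == "disallow":
                if value == "/" and "*" in agents:
                    warnings.append(f"line {number}: whole site blocked for all bots")
                if value.lower().endswith((".css", ".js", ".css$", ".js$")):
                    warnings.append(f"line {number}: CSS or JS files blocked")
    return warnings


print(lint_robots_txt("User-agent: *\nDisallow: /\n"))
```

A check like this does not replace the tools listed earlier, but it can run automatically before each deployment.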

robots.txt and sitemap

robots.txt and the sitemap are complementary - robots.txt tells bots what not to crawl, and the sitemap tells them which pages you want crawled and indexed.

Add the sitemap location at the end of the robots.txt file: Sitemap: https://yourdomain.com/sitemap.xml
This helps search engines find your sitemap even before you add it in Search Console.
You can list multiple sitemaps if you have them (e.g. for posts, pages, and products).
The sitemap URL must be an absolute URL that includes the protocol (https://).
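
As an illustration of the points above, this sketch reads the Sitemap lines from a robots.txt file with Python's standard urllib.robotparser (the site_maps() method requires Python 3.8 or newer) and checks that each one is an absolute URL. The two sitemap file names are placeholders invented for the example.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Example file with two sitemaps listed at the end.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: https://yourdomain.com/sitemap-posts.xml
Sitemap: https://yourdomain.com/sitemap-pages.xml
"""


def is_absolute(url: str) -> bool:
    """True if the URL has an http(s) scheme and a host name."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

sitemaps = parser.site_maps() or []  # site_maps() returns None if none are listed
for url in sitemaps:
    print(url, "absolute" if is_absolute(url) else "NOT absolute")
```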

Important note: a page that is listed in the sitemap but blocked in robots.txt will not be crawled, because robots.txt takes priority. If you want a page crawled and indexed, it must not be blocked in robots.txt.

robots.txt vs meta robots tag

In addition to robots.txt, there is a meta robots tag placed in the HTML of individual pages. These two mechanisms complement each other:

robots.txt: Blocks crawling of the page. The bot does not visit the page and does not read its content.
meta noindex: Allows the bot to visit the page but tells it not to include the page in the search index.
If you want a page out of Google results, use meta noindex. If you want the bot not to access the page at all (e.g. to save crawl budget), use robots.txt.
Caution: if robots.txt blocks a page, Google cannot see the meta noindex tag on it. In rare cases, Google may index a blocked page based on external links.
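
For reference, the meta robots tag is an HTML element such as <meta name="robots" content="noindex"> in the page head. The Python sketch below shows one way to check whether a page carries it, using the standard html.parser module. The class and function names are assumptions of this example, and it ignores other ways of sending the same signal.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # The content attribute is a comma-separated list of directives.
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a meta robots tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(has_noindex(page))
```

A crawler can only run a check like this on pages it is allowed to fetch, which is why a noindex tag has no effect on a page that robots.txt blocks.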

Conclusion

robots.txt is a small but powerful file that can significantly affect your site's SEO. A properly configured robots.txt helps search engines index your site more efficiently, saves crawl budget, and prevents indexing of unnecessary content. For most WordPress sites, the recommended robots.txt with blocked wp-admin, wp-includes, and search pages will be enough. Always test the file before pushing to production and check it regularly in Google Search Console.