Technical SEO

How to Write robots.txt
Basic Syntax and Avoiding Configuration Mistakes

From the basic syntax of robots.txt to avoiding configuration mistakes that affect SEO, plus framework-specific considerations: a practical guide to properly controlling search engine crawling.

7 min read · 2026-05-15

robots.txt is a text file that tells search engine crawlers "do not crawl this path." It is placed in the root directory of your site (e.g., https://example.com/robots.txt) to control what gets crawled and to specify the location of your XML sitemap.

What Is robots.txt and Why Is It Necessary?

robots.txt (Robots Exclusion Protocol) is a standard mechanism for website administrators to tell search engine crawlers whether crawling is allowed or denied. It was proposed in 1994 and is now supported by all major search engines, including Google, Bing, and Yahoo!.

There are three main reasons why robots.txt is necessary:

  • Crawl Budget Optimization: Suppress crawling of paths that do not require indexing, such as admin pages, search results pages, and parameterized URLs, and concentrate crawl resources on important pages
  • Reducing Server Load: On large sites, crawler access can put noticeable load on the server. You can reduce this by blocking unnecessary paths
  • XML Sitemap Notification: By including the Sitemap directive in robots.txt, you can tell search engines the location of your sitemap

What robots.txt Cannot Do

robots.txt only controls crawling and does not have the ability to control indexing. A common misconception is that "Disallowed pages will not appear in search results," but if the page is linked from external sources, only the URL may remain in the index. To ensure exclusion from the index, use the noindex meta tag.
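In concrete terms, that means leaving the page crawlable and adding the following tag to its <head>:

<meta name="robots" content="noindex">

For non-HTML resources such as PDFs, the same instruction can be sent as an X-Robots-Tag: noindex HTTP response header instead.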

Basic Syntax of robots.txt

robots.txt is built around four main directives. Simply place it as a plain text file in the root of your site and it will take effect.

1. User-agent (Specifying Target Crawler)

Specifies which crawler the rule applies to. * (asterisk) means all crawlers. You can also specify individual crawlers such as Googlebot.

# Applies to all crawlers
User-agent: *
# Applies only to Googlebot
User-agent: Googlebot

2. Disallow (Specifying Blocked Paths)

Specify paths you do not want crawlers to access. Paths are evaluated by prefix matching, so writing /admin/ will block all URLs under /admin/.

User-agent: *
Disallow: /admin/
Disallow: /search
Disallow: /tmp/

3. Allow (Specifying Permitted Paths)

Use this when you want to allow crawling of only a specific subpath within paths broadly blocked by Disallow. Google prioritizes the most specific (longest) path by design.

User-agent: *
# Block everything under /private/
Disallow: /private/
# But allow only this page
Allow: /private/public-page

4. Sitemap (Specifying XML Sitemap)

Enter the full URL of your XML sitemap. Write it outside User-agent blocks. If you have multiple sitemaps, you can specify them on multiple lines.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml

Practical robots.txt Examples

Here are typical patterns of robots.txt used on real websites. Combine the necessary rules depending on the size and structure of your site.

Example for a Typical Website

# Rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /search
Disallow: /api/
Disallow: /*?sort=
Disallow: /*?page=
# Sitemap
Sitemap: https://example.com/sitemap.xml

Allow Crawling All Pages (Minimal Configuration)

User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml

Leaving the Disallow value empty means "no blocking." For small sites with no particular paths to block, this minimal configuration, which effectively only declares the Sitemap, is the simplest approach.

Common Configuration Mistakes and Their SEO Impact

Misconfiguring robots.txt can have a serious impact on your entire site's SEO. The following mistakes frequently occur in practice.

Blocking the Entire Site with Disallow: /

This is the most dangerous mistake: Disallow: / is set in a development or staging environment and then deployed to production as-is. The entire site disappears from search results, and if discovery is delayed, recovery can take weeks to months.
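For reference, the configuration that triggers this accident is only two lines, which is exactly why it slips into production unnoticed:

User-agent: *
Disallow: /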

Blocking Pages You Want to Noindex with robots.txt

This mistake is based on the misunderstanding that "I should Disallow because I want to exclude it from the index." When blocked by robots.txt, crawlers cannot read the noindex tag, so the page may remain in the index. This point is clearly noted in Google's official documentation.

Blocking CSS and JavaScript Files

Google needs to crawl CSS and JavaScript to render pages and evaluate content. Blocking them with robots.txt prevents Googlebot from properly rendering your pages, negatively affecting content evaluation and Core Web Vitals measurements.
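If asset directories are already being blocked, one way to repair it, relying on Google's support for * wildcards and its longest-match precedence (the /assets/ path here is only an example), is to re-allow the renderable file types:

User-agent: *
Disallow: /assets/
# Re-allow CSS and JavaScript inside the blocked directory
Allow: /assets/*.css
Allow: /assets/*.js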

Errors in Sitemap Directive

Typical errors are using a relative path (/sitemap.xml) for the Sitemap URL, or specifying http for an HTTPS site. Sitemap URLs must be written as complete absolute URLs.
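A quick before-and-after using the placeholder domain from the earlier examples:

# NG: relative path, and wrong scheme for an HTTPS site
Sitemap: /sitemap.xml
Sitemap: http://example.com/sitemap.xml
# OK: complete absolute URL
Sitemap: https://example.com/sitemap.xml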

Incorrect robots.txt File Path

robots.txt must be placed directly under the domain root (https://example.com/robots.txt). Search engines will not recognize it if placed in a subdirectory. Also, save the file with UTF-8 encoding.

Required Check Before Production Deployment

If Disallow: / is set in robots.txt in the staging environment, be sure to remove it during production deployment. Adding a robots.txt validation step to your CI/CD pipeline is the most reliable measure.
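As a sketch of such a check, assuming the built file lives at public/robots.txt and the pipeline exposes a DEPLOY_ENV variable (both names are placeholders to adapt), a small Node/TypeScript step run before deployment could look like this:

// check-robots.ts - fail the deployment if robots.txt blocks the whole site
import { readFileSync } from "node:fs";

const robots = readFileSync("public/robots.txt", "utf-8");

// A bare "Disallow: /" blocks every path for the matched user agent
const blocksEverything = robots
  .split("\n")
  .some((line) => line.trim().toLowerCase() === "disallow: /");

if (process.env.DEPLOY_ENV === "production" && blocksEverything) {
  console.error("robots.txt blocks the entire site; aborting production deployment");
  process.exit(1);
}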

Framework-Specific Considerations

Each major framework and CMS handles robots.txt differently. Without understanding framework-specific behavior, you may end up with unintended crawl control.

WordPress

WordPress automatically generates a virtual robots.txt by default, which includes blocking wp-admin/ and specifying the Sitemap. To customize, place an actual file in the root directory or manage it with a plugin such as Yoast SEO.
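For reference, the virtual file WordPress serves typically looks like the following, though the exact contents depend on your WordPress version and active plugins:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/wp-sitemap.xml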

Note: Enabling the "Discourage search engines from indexing this site" setting (Settings → Reading) automatically adds Disallow: / to robots.txt. It is often enabled during development and then forgotten.

Next.js

In Next.js (App Router), you can dynamically generate robots.txt using app/robots.ts. By switching between production and staging using environment variables, you can prevent deployment accidents.
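A minimal sketch of that approach (the NEXT_PUBLIC_SITE_ENV variable name and the disallowed paths are assumptions to adapt; the MetadataRoute.Robots type is provided by Next.js):

// app/robots.ts - robots.txt is generated from this file per environment
import type { MetadataRoute } from "next";

export default function robots(): MetadataRoute.Robots {
  // Assumed environment variable; block all crawling outside production
  if (process.env.NEXT_PUBLIC_SITE_ENV !== "production") {
    return { rules: { userAgent: "*", disallow: "/" } };
  }

  return {
    rules: { userAgent: "*", disallow: ["/admin/", "/api/"] },
    sitemap: "https://example.com/sitemap.xml",
  };
}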

Note: You can also place a static file at public/robots.txt, but a static file cannot be switched per environment, so dynamic generation is recommended.

Static Sites (HTML/Hugo/Gatsby, etc.)

Place the robots.txt file directly in the root directory. If there is no mechanism to auto-generate it at build time, manual maintenance is required. The best practice is to include it in your build script so you don't forget to update robots.txt when your site structure changes.

Diagnose robots.txt and Sitemap Settings with a Tool

Even if you understand the syntax of robots.txt, verifying that it is correctly configured on your own site is a separate challenge. The following cases are particularly easy to overlook through manual inspection.

  • Whether robots.txt is served at the correct URL
  • Whether the XML sitemap is correctly specified in robots.txt
  • Whether the entire site is being blocked unintentionally
  • Whether there are problems with the XML sitemap syntax or settings

Bulk Check with CodeQuest.work SEO

With CodeQuest.work SEO, simply enter a URL to get a free diagnosis of 45 items including robots.txt verification, XML sitemap settings, meta tags, structured data, and Core Web Vitals. Use it for regular checks, as it allows you to understand the overall health of your technical SEO, not just the robots.txt settings.

Check Your Site's robots.txt and Technical SEO

Free diagnosis of 45 items including robots.txt verification, XML sitemap, meta tags, structured data, and Core Web Vitals. Fix code is automatically generated for areas that need improvement.


Written by

今井政和

SEO Director / Frontend Developer

SEO Director with 20+ years of web industry experience. Creator of CodeQuest.work SEO and the official WordPress plugin "ORECTIC SEO CHECK." Author of a book on web strategy inspired by Edo-era merchant principles.

@imai_director

FAQ

Does not having a robots.txt hurt SEO?
No direct penalty — search engines will treat your entire site as crawlable. However, without robots.txt, you risk wasting crawl budget on unnecessary pages (admin panels, search result pages, duplicate parameter URLs). For larger sites, blocking irrelevant paths via robots.txt improves crawl efficiency.
What is the difference between robots.txt and noindex?
robots.txt blocks crawler access entirely. noindex (via meta robots tag) allows crawling but prevents indexing. A key difference: pages blocked by robots.txt can still appear in the index (URL only) if external links point to them. To reliably exclude pages from the index, use noindex and don't block the page in robots.txt.
Can robots.txt remove a page from Google's index?
No. Disallow in robots.txt only restricts crawling — it's not a de-indexing instruction. Google's official documentation explicitly states that blocked pages may still appear in the index. To remove pages from the index, use noindex meta tags or X-Robots-Tag HTTP headers.
How long does it take for robots.txt changes to take effect?
Google typically refreshes its robots.txt cache within 24 hours, though it may take several days. You can check when your robots.txt was last fetched in Google Search Console's robots.txt report, and for urgent updates you can request a recrawl from there.
Where do I configure robots.txt in WordPress?
WordPress auto-generates a virtual robots.txt by default (with Disallow for wp-admin/ and wp-includes/, plus Sitemap). To customize it, place an actual robots.txt file in the root directory — it overrides the virtual one. Plugins like Yoast SEO or All in One SEO also let you edit it via the admin GUI.