robots.txt is a text file that tells search engine crawlers "do not crawl this path." It is placed in the root directory of your site (e.g., https://example.com/robots.txt) to control what gets crawled and to specify the location of your XML sitemap.
What Is robots.txt and Why Is It Necessary?
robots.txt implements the Robots Exclusion Protocol, a standard mechanism for website administrators to tell search engine crawlers whether crawling is allowed or denied. Proposed in 1994 and standardized as RFC 9309 in 2022, it is supported by all major search engines, including Google, Bing, and Yahoo!.
There are three main reasons why robots.txt is necessary:
- Crawl Budget Optimization — Suppress crawling of paths that do not require indexing, such as admin pages, search results pages, and parameterized URLs, and concentrate crawl resources on important pages
- Reducing Server Load — On large sites, crawler traffic can place a real load on the server. Blocking unnecessary paths reduces that load
- XML Sitemap Notification — By including the Sitemap directive in robots.txt, you can tell search engines the location of your sitemap
What robots.txt Cannot Do
robots.txt only controls crawling; it has no ability to control indexing. A common misconception is that "Disallowed pages will not appear in search results," but if the page is linked from external sources, the URL alone may still remain in the index. To ensure exclusion from the index, use the noindex meta tag.
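For reference, the noindex signal itself lives in the page's HTML (or in an X-Robots-Tag HTTP header), not in robots.txt:

```
<meta name="robots" content="noindex">
```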
Basic Syntax of robots.txt
robots.txt is built around four main directives. Simply place it as a plain text file in the root of your site and it takes effect.
User-agent (Specifying Target Crawler)
Specifies which crawler the rule applies to. * (asterisk) means all crawlers. You can also specify individual crawlers such as Googlebot.
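For example, the following applies one rule set to every crawler and a separate one to Googlebot (the paths are placeholders):

```
# All crawlers
User-agent: *
Disallow: /private/

# Googlebot only
User-agent: Googlebot
Disallow: /drafts/
```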
Disallow (Specifying Blocked Paths)
Specify paths you do not want crawlers to access. Paths are evaluated by prefix matching, so writing /admin/ will block all URLs under /admin/.
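For example, with a placeholder path:

```
User-agent: *
# Prefix match: blocks /admin/, /admin/login, /admin/users/123, and so on
Disallow: /admin/
```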
Allow (Specifying Permitted Paths)
Use this when you want to allow crawling of only a specific subpath within paths broadly blocked by Disallow. Google prioritizes the most specific (longest) path by design.
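A sketch of the pattern, with hypothetical paths:

```
User-agent: *
Disallow: /private/
# The longer (more specific) rule wins, so this one page stays crawlable
Allow: /private/terms.html
```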
Sitemap (Specifying XML Sitemap)
Enter the full URL of your XML sitemap. Write it outside User-agent blocks. If you have multiple sitemaps, you can specify them on multiple lines.
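For example, with placeholder URLs:

```
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml
```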
Practical robots.txt Examples
Here are typical patterns of robots.txt used on real websites. Combine the necessary rules depending on the size and structure of your site.
Example for a Typical Website
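For instance, a site that wants to block its admin area and internal search results while keeping everything else crawlable might use the following (the paths and domain are placeholders):

```
User-agent: *
Disallow: /admin/
Disallow: /search/

Sitemap: https://example.com/sitemap.xml
```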
Allow Crawling All Pages (Minimal Configuration)
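The allow-all pattern looks like this (the sitemap URL is a placeholder):

```
User-agent: *
Disallow:

Sitemap: https://example.com/sitemap.xml
```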
Leaving the Disallow value empty means "no blocking." For small sites with no particular paths to block, simply specifying only the Sitemap in this minimal configuration is the simplest approach.
Common Configuration Mistakes and Their SEO Impact
Misconfiguring robots.txt can have a serious impact on your entire site's SEO. The following mistakes frequently occur in practice.
Blocking the Entire Site with Disallow: /
This is the most dangerous mistake: Disallow: / is set in a development or staging environment and then deployed to production as-is. The entire site disappears from search results, and if discovery is delayed, recovery can take weeks to months.
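The configuration in question is just two lines:

```
# Never ship this to production: it blocks the entire site
User-agent: *
Disallow: /
```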
Blocking Pages You Want to Noindex with robots.txt
This mistake is based on the misunderstanding that "I should Disallow because I want to exclude it from the index." When blocked by robots.txt, crawlers cannot read the noindex tag, so the page may remain in the index. This point is clearly noted in Google's official documentation.
Blocking CSS and JavaScript Files
Google needs to crawl CSS and JavaScript to render pages and evaluate content. Blocking them with robots.txt prevents Googlebot from properly rendering your pages, negatively affecting content evaluation and Core Web Vitals measurements.
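If an asset directory must stay partially blocked, one workable approach is to re-allow the render-critical file types using the * and $ wildcards that Google and Bing support (/assets/ is a placeholder):

```
User-agent: *
Disallow: /assets/
# Longest-match rule: these more specific Allow lines win for CSS/JS
Allow: /assets/*.css$
Allow: /assets/*.js$
```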
Errors in Sitemap Directive
Typical errors are using a relative path (/sitemap.xml) for the Sitemap URL or specifying http for an HTTPS site. Sitemap URLs must be written as complete absolute URLs.
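Wrong and correct forms side by side (the domain is a placeholder):

```
# Wrong: relative path
Sitemap: /sitemap.xml

# Wrong: http scheme on an HTTPS site
Sitemap: http://example.com/sitemap.xml

# Correct: absolute URL
Sitemap: https://example.com/sitemap.xml
```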
Incorrect robots.txt File Path
robots.txt must be placed directly under the domain root (https://example.com/robots.txt). Search engines will not recognize it if placed in a subdirectory. Also, save the file with UTF-8 encoding.
Required Check Before Production Deployment
If Disallow: / is set in robots.txt in the staging environment, be sure to remove it during production deployment. Adding a robots.txt validation step to your CI/CD pipeline is the most reliable measure.
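A minimal sketch of such a check, written as a Node script (the URL and the exact rule to reject are assumptions; adapt both to your pipeline):

```typescript
// check-robots.ts - fail the pipeline if robots.txt blocks the whole site.
// Requires Node 18+ for the global fetch API.

const ROBOTS_URL = 'https://example.com/robots.txt'; // placeholder URL

async function main(): Promise<void> {
  const res = await fetch(ROBOTS_URL);
  if (!res.ok) {
    throw new Error(`robots.txt not reachable: HTTP ${res.status}`);
  }
  const body = await res.text();

  // Reject any line that blocks the entire site.
  const blocksEverything = body
    .split('\n')
    .map((line) => line.trim().toLowerCase())
    .some((line) => line === 'disallow: /');

  if (blocksEverything) {
    throw new Error('robots.txt contains "Disallow: /" - aborting deploy');
  }
  console.log('robots.txt check passed');
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```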
Framework-Specific Considerations
Each major framework and CMS handles robots.txt differently. Without understanding framework-specific behavior, you may end up with unintended crawl control.
WordPress
WordPress automatically generates a virtual robots.txt by default, which includes blocking wp-admin/ and specifying the Sitemap. To customize, place an actual file in the root directory or manage it with a plugin such as Yoast SEO.
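For reference, the virtual file generated by recent WordPress versions looks roughly like this (the domain is a placeholder; the Sitemap line appears when the core sitemap feature is active):

```
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/wp-sitemap.xml
```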
Note: Enabling the "Discourage search engines from indexing this site" setting (Settings → Reading) automatically adds Disallow: / to the virtual robots.txt. A common failure is enabling this during development and forgetting to turn it off before launch.
Next.js
In Next.js (App Router), you can dynamically generate robots.txt using app/robots.ts. By switching between production and staging using environment variables, you can prevent deployment accidents.
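A minimal sketch of app/robots.ts (the environment variable name and domain are assumptions; adjust them to your deployment setup):

```typescript
// app/robots.ts - Next.js App Router serves /robots.txt from this file.
import type { MetadataRoute } from 'next';

export default function robots(): MetadataRoute.Robots {
  // NEXT_PUBLIC_ENV is a placeholder; use whatever flag your host provides.
  const isProduction = process.env.NEXT_PUBLIC_ENV === 'production';

  if (!isProduction) {
    // Staging and preview deployments: block all crawling.
    return { rules: { userAgent: '*', disallow: '/' } };
  }

  return {
    rules: { userAgent: '*', allow: '/', disallow: '/admin/' },
    sitemap: 'https://example.com/sitemap.xml',
  };
}
```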
Note: You can also place a static file at public/robots.txt, but a static file cannot switch between environments, so dynamic generation is recommended.
Static Sites (HTML/Hugo/Gatsby, etc.)
Place the robots.txt file directly in the root directory. If nothing auto-generates it at build time, manual maintenance is required. The best practice is to generate it from your build script so robots.txt stays in sync when your site structure changes.
Diagnose robots.txt and Sitemap Settings with a Tool
Even if you understand the syntax of robots.txt, verifying that it is correctly configured on your own site is a separate challenge. The following cases are particularly easy to overlook through manual inspection.
- Whether robots.txt is served at the correct URL
- Whether the XML sitemap is correctly specified in robots.txt
- Whether the entire site is being blocked unintentionally
- Whether there are problems with the XML sitemap syntax or settings
Bulk Check with CodeQuest.work SEO
With CodeQuest.work SEO, simply enter a URL to get a free diagnosis of 45 items including robots.txt verification, XML sitemap settings, meta tags, structured data, and Core Web Vitals. Use it for regular checks, as it allows you to understand the overall health of your technical SEO, not just the robots.txt settings.
