Every webmaster knows that there are certain parts of a website that you don't want crawled or indexed. The robots.txt file gives you the opportunity to specify these sections and communicate them to search engine crawlers. In this article, we show the most common errors that can occur when creating a robots.txt file, how to avoid them, and how to monitor your robots.txt file.
There are many reasons why website operators may want to exclude certain parts of a website from the search engine index, for example if pages are hidden behind a login, are archived, or if you want to test pages of a website before they are published. "A Standard for Robot Exclusion" was released in 1994 to make this possible. This protocol establishes that, before starting to crawl, a search engine crawler should first look for a robots.txt file in the root directory and read the instructions in the file.
Many errors can occur when creating the robots.txt file, such as syntax errors when a statement is not written correctly, or errors resulting from the unintentional blocking of a directory.
Here are some of the most common robots.txt errors:
Mistake #1: Using incorrect syntax
robots.txt is a simple text file and can easily be created using a text editor. An entry in the robots.txt file always consists of two parts: the first part specifies the user agent to which the instruction applies (e.g. Googlebot), and the second part contains directives, such as "Disallow", listing all the subpages that should not be crawled. For the instructions in the robots.txt file to take effect, the correct syntax must be used, as shown below.
User-agent: Googlebot
Disallow: /example_directory/
In the example above, the Google crawler is prohibited from crawling /example_directory/. If you want this to apply to all crawlers, use the following code in your robots.txt file:
User-agent: *
Disallow: /example_directory/
The asterisk (also known as a wildcard) acts as a placeholder for all crawlers. Similarly, you can use a single forward slash (/) to exclude the entire website from crawling (for example, for a test version before it goes into production):
User-agent: *
Disallow: /
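If you want to sanity-check how a crawler interprets these rules, Python's built-in urllib.robotparser module implements the Robot Exclusion standard. A minimal sketch, reusing the example paths above (the Bingbot name only stands in for "any other crawler"):

from urllib.robotparser import RobotFileParser

# The first example above: only Googlebot is restricted.
rules = RobotFileParser()
rules.parse([
    "User-agent: Googlebot",
    "Disallow: /example_directory/",
])

page = "http://www.your-website.com/example_directory/page.html"
print(rules.can_fetch("Googlebot", page))  # False: Googlebot is blocked
print(rules.can_fetch("Bingbot", page))    # True: the rule names Googlebot only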
Mistake #2: Blocking a path prefix instead of a directory (forgetting the trailing "/")
When excluding a directory from crawling, always remember to add the trailing slash at the end of the directory name. For example,
Disallow: /directory
does not only block /directory/, but also /directory-one.html, because the rule matches every URL whose path begins with /directory.
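The same urllib.robotparser module can demonstrate this prefix-matching behavior. A minimal sketch, with /directory as a hypothetical path:

from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /directory",  # note: no trailing slash
])

# The rule matches every URL whose path begins with /directory:
print(rules.can_fetch("*", "http://www.your-website.com/directory/"))          # False
print(rules.can_fetch("*", "http://www.your-website.com/directory-one.html"))  # False

With "Disallow: /directory/" instead, the second URL would remain crawlable.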
If you want to exclude multiple directories or pages from crawling, add a separate Disallow line for each one. Putting multiple paths on the same line usually leads to unwanted errors.
User-agent: googlebot
Disallow: /example-directory/
Disallow: /example-directory-2/
Disallow: /example-file.html
Mistake #3: Unintentionally blocking directories
Before the robots.txt file is uploaded to the website root directory, you should always check that its syntax is correct. Even the smallest mistake could result in the crawler ignoring the instructions in the file and crawling pages that shouldn't be indexed. Always make sure that directories that are not to be crawled are listed after the Disallow: directive.
Whenever the page structure of your website changes, for example due to a redesign, you should also check the robots.txt file for errors.
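One way to make this check repeatable is to test a list of URLs that must stay crawlable against the live file. A minimal sketch, assuming hypothetical paths that you would replace with your own:

from urllib.robotparser import RobotFileParser

# Hypothetical URLs that must never be blocked; adjust to your own site.
MUST_STAY_CRAWLABLE = [
    "http://www.your-website.com/",
    "http://www.your-website.com/products/",
]

rules = RobotFileParser("http://www.your-website.com/robots.txt")
rules.read()  # fetch and parse the live file

for url in MUST_STAY_CRAWLABLE:
    if not rules.can_fetch("Googlebot", url):
        print(f"WARNING: {url} is blocked for Googlebot")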
Mistake #4: The robots.txt file is not saved in the root directory
The most common error associated with the robots.txt file is failing to save it in the website's root directory. Subdirectories are generally ignored, because user agents only look for the robots.txt file in the root directory.
The correct URL for a website's robots.txt file must have the following format:
http://www.your-website.com/robots.txt
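A quick way to verify that the file is actually reachable at the root is to request exactly this URL. A minimal sketch:

import urllib.error
import urllib.request

# Crawlers request robots.txt only from the root, so this exact URL must respond.
url = "http://www.your-website.com/robots.txt"
try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        print(resp.status, resp.headers.get("Content-Type"))  # expect 200 and text/plain
except urllib.error.HTTPError as err:
    print(f"robots.txt is not reachable at the root: HTTP {err.code}")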
Mistake #5: Disallowing pages that redirect
If pages blocked in your robots.txt file redirect to other pages, the crawler may not recognize the redirects. In the worst case, this could cause the page to still appear in search results, but under an incorrect URL. Additionally, the Google Analytics data for your project may be distorted.
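You can detect this combination by checking, for each old URL that redirects, whether it is also blocked. A minimal sketch, with old-page.html as a hypothetical redirecting URL:

import urllib.request
from urllib.robotparser import RobotFileParser

rules = RobotFileParser("http://www.your-website.com/robots.txt")
rules.read()

old_url = "http://www.your-website.com/old-page.html"  # hypothetical 301 source

with urllib.request.urlopen(old_url) as resp:
    redirects = resp.geturl() != old_url  # urlopen follows redirects

if redirects and not rules.can_fetch("Googlebot", old_url):
    print("Problem: the URL redirects, but crawlers are blocked and never see it")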
Hint: robots.txt versus noindex
It is important to note that excluding pages in the robots.txt file does not necessarily mean they will not be indexed: a URL blocked in robots.txt can still end up in the index, for example if it is linked from an external page. The robots.txt file only gives you control over crawler access. However, because the bot is prohibited from crawling the page, the following often appears in place of the meta description:
"A description for this result is not available due to this site's robots.txt file."
Figure 4: Example snippet of a page blocked via the robots.txt file but still indexed
As you can see, a single link to the page is enough for it to be indexed, even if the URL is set to "Disallow" in the robots.txt file. Likewise, using the <meta name="robots" content="noindex"> tag may not prevent indexing in this case, because the crawler was never able to read this part of the code due to the Disallow rule in the robots.txt file.
To prevent certain URLs from appearing in the Google index, use the <meta name="robots" content="noindex"> tag on those pages, but still allow the crawler to access them.
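To verify this setup, you can check that a page both serves a noindex robots meta tag and remains accessible to crawlers. A minimal sketch using only the standard library, with old-page.html again as a hypothetical URL:

import urllib.request
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

class RobotsMetaFinder(HTMLParser):
    # Collects the content of a <meta name="robots" ...> tag, if present.
    def __init__(self):
        super().__init__()
        self.robots_content = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.robots_content = attrs.get("content") or ""

page_url = "http://www.your-website.com/old-page.html"

rules = RobotFileParser("http://www.your-website.com/robots.txt")
rules.read()

finder = RobotsMetaFinder()
with urllib.request.urlopen(page_url) as resp:
    finder.feed(resp.read().decode("utf-8", errors="replace"))

if finder.robots_content and "noindex" in finder.robots_content.lower():
    if rules.can_fetch("Googlebot", page_url):
        print("OK: noindex is set and crawlers are allowed to read it")
    else:
        print("Problem: noindex is set, but robots.txt prevents crawlers from seeing it")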
Conclusions
We have briefly examined the main robots.txt errors, which in some cases can significantly compromise the visibility and ranking of your website, in the most serious cases leading to its complete disappearance from the SERPs.
Even if you think you will never have such problems with robots.txt because you know how it works and would never improvise, keep in mind that errors in the robots.txt file are sometimes the result of oversights in the configuration of a CMS such as WordPress, or even of malware attacks or sabotage aimed at making your site lose indexing and ranking.
The best advice we can give you is to monitor the robots.txt file constantly, at least on a weekly basis, and to check its syntax and behavior whenever you notice warning signs such as a sudden drop in traffic or pages disappearing from the search engine results pages (SERPs).
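As a starting point, here is a minimal monitoring sketch that compares the live file against a known-good local copy and can be scheduled (for example with cron) to run weekly; robots_baseline.txt is a hypothetical filename:

import urllib.request
from pathlib import Path

ROBOTS_URL = "http://www.your-website.com/robots.txt"
BASELINE = Path("robots_baseline.txt")  # known-good copy kept under your control

current = urllib.request.urlopen(ROBOTS_URL, timeout=10).read()

if not BASELINE.exists():
    BASELINE.write_bytes(current)  # first run: store the baseline
elif current != BASELINE.read_bytes():
    # Any unexpected change may point to a CMS misconfiguration or tampering.
    print("ALERT: robots.txt has changed - review it immediately")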