Robots.txt: the most common mistakes to avoid


The robots.txt file is one of the most important elements for the correct management of a website and its SEO optimization. It is a directive that allows us to communicate with search engine spiders in order to manage crawling effectively, provide a precise location for the sitemap and limit access to resources that would waste crawl budget.

Giving the right indications to search engine crawlers, and avoiding improper or incorrect use of these directives, increases the chances that our site will please search engines and offer users a complete and satisfying experience.

It is therefore useful to clarify how robots.txt should be used and which mistakes to avoid.

First of all, it is essential to understand that the robots.txt file defines a list of pages and directories that spiders CANNOT crawl. The indications we can give to spiders therefore concern only what they must not do within our site, through the disallow directive, not what they can do. If there is no need to prevent search engines from crawling certain pages of our site, the robots.txt file should not be used.
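A minimal sketch of such a file, with /private-area/ as a placeholder path, looks like this:

User-agent: *
Disallow: /private-area/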

This point is extremely important, as many errors in using the robots.txt file originate from a misunderstanding of what it can actually do.

Below is a list of the mistakes to avoid.

1. Disallow a URL in robots.txt to prevent it from appearing in search results

This is one of the most common errors. Blocking a URL with a robots.txt disallow does not prevent indexing. If the pages we have blocked in robots.txt are linked from other websites or shared on social channels, search engines can still index them and show them in the SERPs without a title and a snippet. When you want to block the indexing of one or more pages, it is much better to use the noindex tag. Even in this case, however, you must be careful: the two commands should never be used together. If we apply both the robots.txt disallow and the noindex tag to the same page, we end up in the unfortunate situation where the spiders cannot read the indexing block, because crawling has also been blocked.
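For reference, the noindex directive is normally declared in the HTML head of the page (or sent via the X-Robots-Tag HTTP header); a minimal example:

<meta name="robots" content="noindex">

As long as the page remains crawlable, spiders can read this tag and drop the page from the index.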

2. Apply a robots.txt disallow to remove pages that no longer exist from the search results

The same situation can arise if we apply a robots.txt disallow to pages that have been removed from our website. Preventing spiders from crawling a page does not prevent it from remaining indexed. In this case it is much better to return a 410 status code, allowing search engines to verify that the resource has actually been deleted and to exclude it from the search results.
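As a sketch, assuming an Apache server and a placeholder path, the 410 can be returned with a single mod_alias rule in the .htaccess file:

Redirect gone /old-page/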

3. Apply a robots.txt disallow to a resource whose URL we have changed

Since the robots.txt file does not allow us to de-index a page, it is a serious mistake to use it on a URL that redirects (status code 301 or 302, or a meta refresh), because search engines cannot read the redirect. In this case too, search engines will show the result in the SERP, but with the wrong URL.
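As a sketch, assuming an Apache server and placeholder URLs, the redirect would be declared like this, and the old path must remain crawlable (not disallowed) so that spiders can actually follow it:

Redirect 301 /old-url/ https://www.example.com/new-url/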

In general, this rule applies in all those cases in which we apply to a page an instruction that a robots.txt disallow would prevent spiders from reading, such as status codes, meta tags or HTTP headers.

4. Compose robots.txt files that are too complex and elaborate

Robots.txt files must be simple and precise. The standard limits the size of the robots.txt file to 500 KB; any text beyond that limit is ignored, so there is no point in over-complicating it.

5. Use the robots.txt file to hide sensitive information

The robots.txt file is a public resource, accessible to anyone with a minimum of experience, so it is pointless to use it to hide confidential pages or pages containing user data. In this case it is better to use other systems, such as protecting the pages with credentials.
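As a sketch, assuming an Apache server and a placeholder path for the password file, a directory can be protected with basic authentication via .htaccess:

AuthType Basic
AuthName "Restricted area"
AuthUserFile /path/to/.htpasswd
Require valid-user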

6. Use the WordPress robots.txt to block the wp-content folder

Knowing how to use the robots.txt file is also very important for WordPress users. The classic robots.txt for a WordPress site has always been the following:

User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/

Unlike a few years ago, search engines are now able to read and interpret JavaScript code and CSS, i.e. the graphic formatting of pages. If you prevent access to WordPress folders such as /wp-content/ or /wp-includes/, Google may not have access to resources that are essential to render the page correctly. Both folders often contain the files used by the active theme, and applying an overly restrictive robots.txt to them can be a serious mistake, also in terms of indexing.
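As a sketch, one commonly used, less restrictive configuration for WordPress blocks only the admin area while keeping theme CSS and JavaScript crawlable:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php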

To edit my robots.txt file without making mistakes, I use the File Editor of the Premium Seo Pack plugin for WordPress.

7. Write the robots.txt with the wrong syntax

When compiling the robots.txt file you must also pay attention to the syntax. The URL of the robots.txt file is case-sensitive, i.e. it distinguishes between uppercase and lowercase letters, which is why errors can arise if you name the file ROBOTS.TXT, or if you write the URLs inside it with the wrong combination of uppercase and lowercase letters.
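As a sketch, with a placeholder path, the following rule matches only the capitalized URL; the lowercase version of the same path would still be crawled:

User-agent: *
# blocks /Private/ but not /private/
Disallow: /Private/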

8. Crawl-delay

The crawl-delay directive sets the number of seconds that bots must wait before crawling the website again. It is a very useful directive for preventing server overload; however, it is worth knowing that Google's spiders ignore this parameter.
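As a sketch, a ten-second delay for Bing's crawler, which has historically honored the directive, would look like this:

User-agent: Bingbot
Crawl-delay: 10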

At this point it is easy to understand how incorrect use of the robots.txt file can lead to errors that worsen the experience of our users, by blocking the crawling of useful resources and pages that should instead be fully accessible to search engines.

Understanding how to use the robots.txt file gives us a powerful tool from an SEO perspective, because when handled well it allows us to avoid overloading the server and, above all, to limit access to areas or contents of our site that offer no added value to our users.

So always pay attention to how you use robots.txt and remember not to overdo it: if the site is not huge, making heavy use of the disallow command makes no sense. If you have doubts or want to share your experience with robots.txt errors, write in the comments.



Gianluca Gentile