Indexed despite robots.txt blocking: what to do?
Indexed despite being blocked by robots.txt: why, and what can be done?
In this article, I’ll discuss a topic that concerns many website owners: why might a page be indexed even though it is blocked in the robots.txt file? You may have noticed that some pages on your site appear in Google search results, even though you explicitly asked search engine crawlers not to crawl them. Don’t worry: I’ll explain the possible reasons for this and, most importantly, how to fix it.
What is a robots.txt file, and what is it used for?
Before we get into the details, let me remind you what a robots.txt file is. It is a text file that you place in the root directory of your website, and it lets you control which parts of your site search engine crawlers may access. For example, you can prevent robots from crawling specific pages, but it’s important to understand that this does not guarantee those pages will be excluded from the index.

Let's say your robots.txt file contains two rules: User-agent: * followed by Disallow: /admin/.
In this case, you ask Google and other search engines not to crawl anything under /admin/, but that doesn't necessarily mean these pages won't be indexed if other conditions are met.
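To see what such a rule does and does not cover, here is a minimal sketch using Python's standard urllib.robotparser module. The example.com URLs and the helper name is_crawl_allowed are assumptions made for illustration; the check only answers "may this crawler fetch this URL?" and says nothing about whether the URL ends up in the index.

```python
from urllib.robotparser import RobotFileParser

# The rules described above: every crawler is asked to stay out of /admin/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder domain, not taken from the article.
admin_allowed = is_crawl_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/admin/")
blog_allowed = is_crawl_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/blog/")
print(admin_allowed, blog_allowed)  # False True
```

The result describes crawling permission only. A URL for which this returns False can still be indexed if a search engine learns about it another way.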
Why might your page be indexed even though it's blocked?
You're probably wondering why a page on your site keeps showing up in search results, even though you explicitly asked Google to block it via the robots.txt file. There are several reasons for this, which I will explain in detail below.
Search engines can still index without crawling
The robots.txt file is designed to prevent a page from being crawled, but it does not prevent indexing. Google can still index a page if it is linked to via a backlink. In other words, even if you block the page from being crawled, if another site links to that page, Google may still add it to its index. This is an important point, because you shouldn’t rely solely on your robots.txt configuration to control indexing.
The presence of «noindex» tags
A «noindex» tag in a page's HTML code tells Google not to index the page, but it only works if the crawler is allowed to fetch the page and read the tag. If the page is also blocked in the robots.txt file, Google never sees the tag, so the page may still be indexed. And if you haven’t set the tag at all, a crawlable page can be indexed once it has been crawled, which can cause confusion.
Here is an example of a «noindex» tag, placed in the <head> of the page: <meta name="robots" content="noindex">
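To check whether a page actually carries this tag, a short script can help. The sketch below uses Python's standard html.parser module; the class name RobotsMetaParser, the helper has_noindex, and the sample page are assumptions for illustration, and it only looks at meta tags named "robots" (crawler-specific names such as "googlebot" are not handled).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            # The content attribute may hold several comma-separated directives.
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page used only to exercise the check.
SAMPLE_PAGE = """<html><head>
<meta name="robots" content="noindex">
<title>Admin</title>
</head><body>Private area</body></html>"""

print(has_noindex(SAMPLE_PAGE))  # True
```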
External links can bypass the robots.txt file
As mentioned above, external backlinks may allow Google to index a page blocked by robots.txt. If other sites link to the page in question, Google can find and index it directly from these links, even without having visited the page itself.
It is therefore essential to check the pages that link to your site. Sometimes, links from external sites can undermine your efforts to control indexing.
Indexing via JavaScript files or other technologies
Google has made significant progress in indexing dynamic content, particularly content generated with JavaScript. Google's crawler can execute JavaScript and render the resulting page, so links that only appear after rendering can still be discovered. If some of the pages on your site are reachable through such links, Google may learn their URLs and add them to the index, even if the pages themselves are blocked in the robots.txt file and are never crawled directly.
What can be done to prevent indexing of pages blocked by robots.txt?
There are several ways to solve this problem and prevent pages blocked via robots.txt from being indexed. Let's take a look at these solutions.
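1. Add a «noindex» tag
One of the first things to do is to add the «noindex» tag to the pages you don't want to appear in search results. You add it directly to the page's HTML code, inside the <head> element: <meta name="robots" content="noindex">.
This is an effective method because it tells Google: «Even if you crawl this page, don't index it.» For the tag to be read, the page must not be blocked in the robots.txt file, so remove the Disallow rule for that page once the tag is in place.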
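Because a crawler has to fetch a page before it can read a tag inside it, it is worth checking that the two settings do not cancel each other out. The following sketch assumes a hypothetical helper named noindex_can_take_effect and placeholder example.com URLs; it combines the robots.txt check from urllib.robotparser with a flag saying whether the page carries the tag.

```python
from urllib.robotparser import RobotFileParser


def noindex_can_take_effect(robots_txt: str, url: str, page_has_noindex: bool,
                            user_agent: str = "Googlebot") -> bool:
    """Return True only if the crawler may fetch the URL and the page carries noindex."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    crawlable = parser.can_fetch(user_agent, url)
    # A blocked page hides its own noindex tag from the crawler.
    return crawlable and page_has_noindex


# Two illustrative configurations (assumptions, not taken from a real site).
BLOCKING_RULES = "User-agent: *\nDisallow: /admin/\n"
OPEN_RULES = "User-agent: *\nAllow: /\n"

URL = "https://example.com/admin/"
print(noindex_can_take_effect(BLOCKING_RULES, URL, page_has_noindex=True))  # False
print(noindex_can_take_effect(OPEN_RULES, URL, page_has_noindex=True))      # True
```

A False result for a page that does carry the tag is a sign that the robots.txt rule should be lifted for that URL.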
2. Use the «X-Robots-Tag» HTTP header
If the resource is not an HTML page (such as a PDF, image, or video), you can use the «X-Robots-Tag» HTTP header to tell Google not to index it.
For example, for a PDF file, sending the following HTTP response header will prevent indexing: X-Robots-Tag: noindex
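One way to send that header, sketched here as a small WSGI middleware in Python, is to add it to every response whose path ends in .pdf. The names add_noindex_for_pdfs and demo_app are hypothetical, and the sketch assumes your site runs behind a WSGI server; on other stacks the same header is usually set in the web server configuration instead.

```python
def add_noindex_for_pdfs(app):
    """Wrap a WSGI app so responses for .pdf paths carry X-Robots-Tag: noindex."""

    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            if path.lower().endswith(".pdf"):
                # Append the header without mutating the caller's list.
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Stand-in application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "application/octet-stream")])
    return [b"placeholder body"]


application = add_noindex_for_pdfs(demo_app)
```

As with the meta tag, the header is only seen if the crawler is allowed to request the file, so the file must not be disallowed in robots.txt.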
3. Disavow unwanted backlinks
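If your page is indexed through external backlinks, you can try to have these links removed, or disavow them with Google's disavow links tool, which works with your Search Console property. Disavowing asks Google to ignore those links when it assesses your site; it does not guarantee that the page stays out of the index, so combine it with a «noindex» directive.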
4. Review your internal links and make sure they do not point to pages that should be private
If you have pages blocked by robots.txt, make sure you don’t link to those pages via your internal links. An internal link can lead Google to discover a blocked page and index its URL without crawling it. Therefore, avoid linking to pages that should not be indexed.
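This kind of review can be partly automated. The sketch below, which relies only on Python's standard library, lists the same-site links on a page that point to URLs disallowed by robots.txt. The function name internal_links_to_blocked_pages, the sample HTML, and the example.com domain are assumptions for illustration; a real audit would run this over every page of the site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links_to_blocked_pages(page_url, html, robots_txt, user_agent="Googlebot"):
    """Return same-site link targets on the page that robots.txt disallows."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    collector = LinkCollector()
    collector.feed(html)
    site = urlparse(page_url).netloc
    flagged = []
    for href in collector.hrefs:
        target = urljoin(page_url, href)  # resolve relative links
        if urlparse(target).netloc == site and not rules.can_fetch(user_agent, target):
            flagged.append(target)
    return flagged


# Hypothetical inputs used only to exercise the function.
ROBOTS_TXT = "User-agent: *\nDisallow: /admin/\n"
PAGE_HTML = (
    '<a href="/admin/login">Admin</a> '
    '<a href="/blog/">Blog</a> '
    '<a href="https://other.example/admin/">External</a>'
)

flagged = internal_links_to_blocked_pages("https://example.com/", PAGE_HTML, ROBOTS_TXT)
print(flagged)  # ['https://example.com/admin/login']
```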
5. Use Google Search Console to remove the page
If a page has already been indexed despite your efforts to block it, you can use the Removals tool in Google Search Console to request that it be hidden from search results. The request is usually processed quickly, but the removal is temporary (about six months), so pair it with a «noindex» directive to keep the page out of the index for good.
How can we avoid this kind of problem in the future?
To prevent this type of situation from happening again, here are some best practices:
Check your robots.txt files regularly and make sure they are properly configured. Conduct regular audits of your site.
Use SEO tools such as Google Search Console, Ahrefs, or Screaming Frog to check whether your pages have been indexed.
Review your backlink and internal linking strategy, making sure not to link to sensitive pages that should not be indexed.
