Background
I have a server that runs staging sites for many client websites as individual subdomains, e.g. client1.stagingserver.com, client2.stagingserver.com, client3.stagingserver.com, etc.

Each site contains both the production server's robots.txt file and a special robots-staging.txt file whose purpose is to prevent the staging site from being indexed.

The way it works is that if the domain the website is running on is a subdomain of [stagingserver.com], an Apache mod_rewrite rule in the site's .htaccess file "substitutes" robots-staging.txt for robots.txt.
More info.
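For anyone unfamiliar with this kind of substitution, here's a minimal sketch of what such a rule might look like. This is an illustration rather than the exact rule from the setup above; it assumes Apache with mod_rewrite enabled and the stagingserver.com naming described earlier:
Code:
# Serve robots-staging.txt whenever robots.txt is requested on a staging subdomain
RewriteEngine On
RewriteCond %{HTTP_HOST} \.stagingserver\.com$ [NC]
RewriteRule ^robots\.txt$ robots-staging.txt [L]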

So (in theory) if a crawler finds [client1.stagingserver.com] and tries to read robots.txt, it'll be served the robots-staging.txt file instead, which reads:
Code:
User-agent: *
Disallow: /
Noindex: /

But if a crawler finds the exact same website running on [client1.com], it gets the normal robots.txt file, which reads:
Code:
User-agent: *
Disallow: /.git
Disallow: /cgi-bin
Disallow: /config

Sitemap: https://[client1.com]/sitemap.xml

The Problem
Google seems to somehow be ignoring the rewrite rule.

Humans are definitely getting the robots-staging.txt file's contents when requesting robots.txt from staging, but several client websites have had their staging sites indexed by Google anyway and are now duplicate content. (Canonicals for the URLs are defined by the WordPress site's WP_SITEURL.)

Anyone have any ideas why Google would be indexing these sites? Or a solution?

Yes, we could instead set the staging sites to "Do not index" within WordPress' general settings. But that setting could easily be accidentally cloned from Staging to Production, which would be pretty disastrous.
We need a "set it and forget it" solution that's immune to memory lapses, which is why we've gone this auto-substitution route.
 
Can you not just put the robots.txt file in the staging directory directly? I assume you'd then run into the same issue of having to update the file when you push staging to live.

Not sure of the answer, but does it take into account any www or non-www versions?

Does this page help: Correctly redirect bot requests to static version of a website
 
1) Google seems to deindex slowly. Did you accidentally allow them to be indexed, and are you now waiting for them to drop from the index?

2) Have you tried using a header for no index?
 
does it take into account any www or non-www versions?

Yeah, no issues there.


Nope. Afraid not.

Did you accidentally allow them to be indexed and you're now waiting for them to drop from the index?

Unclear, TBH, but yes, we're currently waiting for them to be dropped after requesting removals.

2) Have you tried using a header for no index?

Unsure how to do that where it'd apply only to Staging and not to Production. Ideas?
 
That may be the problem. Maybe Google is honoring the noindex, only it just now showed up. So they're dropping it from the index. I can tell you from experience that takes a long time sometimes. Months.

Yeah, I'm not sure how to do it either. If it's in a subdirectory you can do it. I did it a while back but now I've forgotten how.
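For what it's worth, one way this could be done entirely in .htaccess is with a conditional X-Robots-Tag header. This is a sketch under assumptions rather than something from the thread: it assumes Apache with mod_setenvif and mod_headers enabled, and it reuses the same stagingserver.com host check as the robots.txt rewrite, so the header would only ever be sent on staging:
Code:
# Flag requests whose Host header is a staging subdomain (assumed stagingserver.com naming)
SetEnvIfNoCase Host \.stagingserver\.com$ IS_STAGING
# Send the noindex header only when that flag is set, so production is untouched
Header set X-Robots-Tag "noindex, nofollow" env=IS_STAGING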
 
