Background
I have a server that runs staging sites for many client websites as individual subdomains, e.g. client1.stagingserver.com, client2.stagingserver.com, client3.stagingserver.com, etc.

Each site contains both the production server's robots.txt file and a special robots-staging.txt file whose purpose is to prevent the staging site from being indexed.

The way it works is that if the domain the website is running on is a subdomain of [stagingserver.com], an Apache mod_rewrite rule in the site's .htaccess file "substitutes" robots-staging.txt for robots.txt.
More info.
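For anyone unfamiliar with this kind of substitution, here's a minimal sketch of what such a rule might look like. This is an illustration rather than the exact rule from the setup above; it assumes Apache with mod_rewrite enabled and the stagingserver.com naming described earlier:
Code:
# Serve robots-staging.txt whenever robots.txt is requested on a staging subdomain
RewriteEngine On
RewriteCond %{HTTP_HOST} \.stagingserver\.com$ [NC]
RewriteRule ^robots\.txt$ robots-staging.txt [L]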

So (in theory) if a crawler finds [client1.stagingserver.com] and tries to read robots.txt, it'll be served the robots-staging.txt file instead, which reads:
Code:
User-agent: *
Disallow: /
Noindex: /

But if a crawler finds the exact same website running on [client1.com], it gets the normal robots.txt file, which reads:
Code:
User-agent: *
Disallow: /.git
Disallow: /cgi-bin
Disallow: /config

Sitemap: https://[client1.com]/sitemap.xml

The Problem
Google seems to somehow be ignoring the rewrite rule.

Humans are definitely getting the robots-staging.txt file's contents when requesting robots.txt from staging, but several client websites have had their staging sites indexed by Google anyway and are now duplicate content. (Canonicals for the URLs are defined by the WordPress site's WP_SITEURL.)

Anyone have any ideas why Google would be indexing these sites? Or a solution?

Yes, we could instead set the staging sites to "Do not index" within WordPress' general settings. But that setting could easily be accidentally cloned from Staging to Production, which would be pretty disastrous.
We need a "set it and forget it" solution that's immune to memory lapses, which is why we've gone this auto-substitution route.
 
Can you not just put the robots.txt file in the staging directory directly? I assume you'd then run into the same issue of having to update the file when you push staging to live.

Not sure of the answer, but does it take into account any www or non-www versions?

Does this page help: Correctly redirect bot requests to static version of a website
 
1) Google seems to deindex slowly. Did you accidentally allow them to be indexed, and are you now waiting for them to drop from the index?

2) Have you tried using a header for no index?
 
does it take into account any www or non-www versions?

Yeah, no issues there.


Nope. Afraid not.

Did you accidentally allow them to be indexed and you're now waiting for them to drop from the index?

Unclear, TBH, but yes, we're currently waiting for them to be dropped after requesting removals.

2) Have you tried using a header for no index?

Unsure how to do that where it'd apply only to Staging and not to Production. Ideas?
 
That may be the problem. Maybe Google is honoring the noindex, only it just now showed up. So they're dropping it from the index. I can tell you from experience that takes a long time sometimes. Months.

Yeah, I'm not sure how to do it either. If it's in a subdirectory you can do it. I did it a while back but now I've forgotten how.
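For what it's worth, one way this could be done entirely in .htaccess is with a conditional X-Robots-Tag header. This is a sketch under assumptions rather than something from the thread: it assumes Apache with mod_setenvif and mod_headers enabled, and it reuses the same stagingserver.com host check as the robots.txt rewrite, so the header would only ever be sent on staging:
Code:
# Flag requests whose Host header is a staging subdomain (assumed stagingserver.com naming)
SetEnvIfNoCase Host \.stagingserver\.com$ IS_STAGING
# Send the noindex header only when that flag is set, so production is untouched
Header set X-Robots-Tag "noindex, nofollow" env=IS_STAGING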
 
