j_holtslander
Member
Background
I have a server that runs many staging sites for many client websites as individual subdomains. eg: client1.stagingserver.com, client2.stagingserver.com, client3.stagingserver.com, etc.
Each site contains both the production server's robots.txt file and a special robots-staging.txt file whose purpose is to prevent the staging sites from being indexed. The way it works is that if the domain the website is running on is a subdomain of [stagingserver.com], then the robots-staging.txt file is "substituted" for robots.txt by a mod_rewrite rule for Apache in the site's .htaccess file. More info.
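For context, the rule is along these lines (a minimal sketch, assuming the staging domain is stagingserver.com; the exact conditions and flags in our actual .htaccess may differ):
Code:
# If the request comes in on any *.stagingserver.com host,
# serve robots-staging.txt whenever robots.txt is requested.
RewriteEngine On
RewriteCond %{HTTP_HOST} \.stagingserver\.com$ [NC]
RewriteRule ^robots\.txt$ /robots-staging.txt [L]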
So (in theory) if a crawler finds [client1.stagingserver.com] and tries to read robots.txt, it will be served the robots-staging.txt file instead, which reads:
Code:
User-agent: *
Disallow: /
Noindex: /
But if a crawler finds the exact same website running on [client1.com], it gets the robots.txt file as normal, which reads:
Code:
User-agent: *
Disallow: /.git
Disallow: /cgi-bin
Disallow: /config
Sitemap: https://[client1.com]/sitemap.xml
The Problem
Google seems to somehow be ignoring the rewrite rule.
Humans are definitely getting the robots-staging.txt file's contents when requesting robots.txt from staging, but several client websites have had their staging sites indexed by Google anyway and are now duplicate content. (Canonicals for the URLs are defined by the WordPress site's WP_SITEURL.)
Anyone have any ideas why Google would be indexing these sites? Or a solution?
Yes, we could instead set the staging sites to "Do not index" within WordPress's general settings for the staging site. But that setting could easily be accidentally cloned from Staging to Production, which would be pretty disastrous.
We need a "set it and forget it" solution that's immune to memory lapses, which is why we've gone this auto-substitution route.