Only three things are certain in life: death, taxes and your test site getting indexed by Google.
Rarely do you come across a new site launch without, at some point, realising the staging server has been left open for bots to crawl and index.
It’s not necessarily the end of the world if a search engine indexes a test site, as it’s fairly easy to resolve. But if you are running a test environment long term to develop new functionality alongside a live site, you need to ensure it is protected correctly as early as possible – both to avoid duplicate content issues and to ensure real-life humans don’t visit and interact with it (i.e. try to buy something).
I am a former developer, and probably made these mistakes myself more than once, but back then I didn’t have an SEO being a pain in my arse all the time pointing these things out (back then, old-school brochure-cum-web designers who didn’t understand the limitations of tables and inline CSS were the pain in my arse).
The following techniques are all tried and tested methods that I’ve used to identify these issues in the wild. To protect the identity of my clients and their developers, I’ve taken the selfless decision to set up a couple of test sites using my own website content in order to illustrate what you need to do, those being:
Though by the time you read this, I will have followed my own advice and taken these down – I need all the visibility I can get, and the last thing I need is indexed test sites holding me back.
1) Google Search Console (GSC) Domain Property
One of the great things about the new GSC is that you can set up domain properties which gives you key insights across all subdomains associated with your website – on both HTTP and HTTPS. To set this up, simply select the domain option when adding a property (you also need to carry out the potentially not so straightforward task of adding a TXT record to your domain’s DNS):
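For reference, the verification entry Google asks for is a plain TXT record at the root of the domain. In zone-file form it looks something like the sketch below, where the domain and token are placeholders – your actual token comes from the GSC verification screen:

```text
example.com.   3600   IN   TXT   "google-site-verification=YOUR_TOKEN_HERE"
```

Most DNS control panels just ask for the host (leave blank or use @) and the value, so you rarely need to write the record out by hand.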
There are a whole host of reasons why a domain property is useful; in this case, it’s that if your test site is set up on a subdomain and is generating impressions and clicks in search, you can spot this from within the “Performance” section by filtering or ordering your pages:
In addition, you should also check the “Coverage” section – in some cases, Google will index your content:
In other cases, it will spot that you have duplicate content in place and kindly refrain from indexing it, in which case you will find it within the section “Duplicate, Google chose different canonical than user”:
Even if this is the case, you should still endeavour to ensure it’s not crawled moving forward.
2) Check Google SERPs Using Link Clump
If you don’t have access to GSC domain properties, or any access to GSC (if not, why not?) then you can check the SERPs to see if any test URLs have made their way into the index.
This is also a handy technique when pitching for new business – what better way to win over a potential client than to show that their internal or external development team are dicing with search visibility death by allowing this to happen in the first place, and that you’re here to save the day?
The steps are as follows:
i) Install the Link Clump Google Chrome Extension, which allows you to copy and paste multiple URLs from a page to somewhere more useful, like Excel.
ii) Amend your Link Clump settings as follows:
The most important setting to note is the Action “copied to clipboard” – the last thing you want is for up to a hundred URLs to open at once.
iii) Go to your favourite (or local) Google TLD, click “settings” which you should see at the bottom right of the page, and select “search settings” where you can set your “results per page” to 100.
iv) Return to the Google home page and use the “site:” query operator with your domain appended. If you use www or similar, remove it – so the command would be as follows: site:yourdomain.com
You will be presented with a sample of up to 300 URLs currently indexed by Google across all the subdomains. Whilst you could manually review each result to spot rogue sites:
I find it far quicker and easier to right click and drag all the way to the bottom of the page. You will know if Link Clump is working as you will see the following occur to denote links are being selected and copied:
Repeat this across SERP pages 2 and 3 if available, and once all the URLs are pasted into Excel, sort A–Z to easily identify your indexed content across all relevant subdomains.
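If Excel feels heavy-handed, the same grouping can be done in a few lines of Python – an illustrative snippet of my own, not part of Link Clump, that reduces a pasted list of SERP URLs to the distinct hostnames so rogue subdomains stand out:

```python
# Group a list of URLs (e.g. pasted from Link Clump) by hostname
# so any unexpected subdomains are easy to spot.
from urllib.parse import urlparse

def hostnames(urls):
    """Return the distinct hostnames found in a list of URLs, sorted A-Z."""
    return sorted({urlparse(u).hostname for u in urls})

# Example with placeholder URLs:
# hostnames([
#     "https://test.example.com/a",
#     "https://www.example.com/",
#     "https://test.example.com/b",
# ])
# -> ["test.example.com", "www.example.com"]
```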
3) Search For Text Unique To Your Site
The above methods work if your test site is hosted on a subdomain of the same domain as your live website. However, if your test site is located elsewhere, e.g. test.webdevcompany.com, they won’t work. In that case, this method or the following ones might.
Find some content you believe is unique to your website – in my case I’ve gone with the strapline of: “Enhance Your Website’s Organic Visibility And Traffic” – then search for this within quotation marks. If a test site containing this content has been indexed, this search should reveal it:
As you can see, the home pages on the main site, test sub domain and separate test domain all appear. You may also inadvertently spot a competitor who has ripped off your content. Some would take that as a compliment, others would issue DMCAs – it’s up to you, but the last thing you want is someone outranking you with your own copy.
4) Crawl The Site Using Screaming Frog
I presume you’re into SEO and therefore use Screaming Frog. If either answer is no, then well done for making it this far into this article (let me guess – you’re a developer who has dropped a bollock and is looking to cover your arse before anyone else finds out?).
If you don’t have it, download it here.
Within the Basic Settings, tick “Crawl All Subdomains”. You can also tick “Follow Internal ‘nofollow’” as some test environments may have this in place.
Once the crawl is complete, peruse the list to see if there are any internal links in place to test sites. I came across this recently where a new Drupal site had gone live but with all internal links within the blog posts pointing to a beta subdomain:
You can then click on each test URL and click on InLinks at the bottom to find the offending internal link from the live to test site. In this case, I amended the Contact Us link on the sitemap to point to the test URL:
Once spotted, amend and re-crawl until there are no more internal links taking visitors elsewhere. If you are using WordPress, use a search/replace plugin to find all test URLs and replace them with the live ones.
5) Check Google Analytics Hostnames
If your test site has the same Google Analytics account’s tracking code installed as your live site, you will be able to spot this within GA if you go to a section such as “Behavior” -> “Site Content” -> “All Pages” and select “Hostname” as a secondary dimension:
Further to this, you can also then filter the data further by excluding from the report all visits to the main domain, which will leave all other instances in the list. In addition to test sites, you may also uncover GA Spam being triggered on a 3rd party site:
There are pros and cons to having the same GA tracking ID running on both your live and test environments, but personally, I see no reason to have separate accounts and instead would create multiple views within your one account. For the live site, set up a filter to only include traffic to the live hostname, and vice versa for the test site.
How To Remove and Prevent Your Test Site From Getting Indexed
So you’ve discovered your test site in the index using one of the techniques above, or, you want to make sure it doesn’t happen in the first place. The following will all help with this:
1) Remove URLs via GSC
If your site is indexed, whether it’s generating traffic or not, it’s best to get it removed. To do this, you can use the “Remove URLs” section from the “old” GSC.
Note, this will not work at domain property level, as these aren’t catered for in the old GSC. Instead, you need to set up a property for the individual test domain.
Once set up, “Go To The Old Version” and go to “Google Index” -> “Remove URLs”. From here, select “Temporarily Hide” and enter a single forward slash as the URL you wish to block, which will submit your entire site for removal:
This will remove your site from the SERPs for 90 days; to ensure it doesn’t return, you must take further steps. One of the following will suffice (and should be carried out regardless of whether you are able to remove via GSC):
2) Set robots tag to noindex on test site
Ask your developers to ensure that when running on the test domain, each page across the site generates a robots noindex tag:
<meta name="robots" content="noindex" />
If your site is WordPress, you can set this via “Settings” -> “Reading” and selecting “Discourage search engines from indexing this site”:
Whatever code or settings you use to prevent the test site from being indexed, you must ensure this is not migrated to the live site when new content or functionality is made live. Test site settings going live are one of the most common and most sure-fire ways to mess up your live site’s visibility.
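One way to sanity-check a page after a deploy is a small script that looks for the noindex signal in the page source or headers. This is an illustrative helper of my own, not part of any tool mentioned above, and it assumes the meta tag is written with the name attribute before content, as in the snippet above:

```python
# Check whether a page carries a robots noindex signal, either in a
# <meta name="robots"> tag or in an X-Robots-Tag response header.
import re

NOINDEX_META = re.compile(
    r'<meta[^>]+name=["\']robots["\'][^>]+content=["\'][^"\']*noindex',
    re.IGNORECASE,
)

def has_noindex(html, x_robots_tag=""):
    """True if the HTML body or the X-Robots-Tag header value says noindex."""
    if "noindex" in x_robots_tag.lower():
        return True
    return bool(NOINDEX_META.search(html))
```

In practice you would feed it the response body and the X-Robots-Tag header from a urllib.request.urlopen() call against a staging URL – run it against the live site too, to confirm the answer there is False.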
3) Password Protect Your Test Site
From your web control panel or via the server, password protect the directory in which your test site resides. There are numerous ways to do this – the best bet is to ask your hosting company or developers to configure this, or, there are plenty of good resources out there that will show you how to do this, such as:
Once blocked, you should see an alert box when trying to access your test site:
This will prevent search engines from crawling and indexing the site.
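For Apache servers, the classic approach is HTTP Basic Auth via .htaccess – a minimal sketch, assuming mod_authn_file is available; the file path and realm name are placeholders:

```apache
# .htaccess in the test site's web root (Apache):
# prompt for a username/password before serving anything
AuthType Basic
AuthName "Staging - keep out"
AuthUserFile /home/example/.htpasswd
Require valid-user
```

Generate the credentials file on the server with `htpasswd -c /home/example/.htpasswd username` (the `-c` flag creates the file; omit it when adding further users).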
4) Delete site and return page status 410
If you no longer have need for your test site, you can simply delete it. When search engines try to visit pages that are no longer live, they will see the pages are deleted. By default, a broken page will return status 404 (“Not Found”) – whilst this will get the site de-indexed in time, it will take a while, as there will be follow-up visits to see if the broken page has returned.
Instead, set the status to 410 (“Permanently Gone”) which will return the following message:
To do this across an entire domain, delete the site and leave the .htaccess file in place with the following command:
Redirect 410 /
This will ensure the site gets de-indexed at the first time of asking (or at least quicker than a 404).
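If the test server runs Nginx rather than Apache, there is no .htaccess; the equivalent – a sketch assuming the test site has its own server block, with a placeholder hostname – is:

```nginx
# Nginx: return 410 Gone for every request to the test hostname
server {
    listen 80;
    server_name test.example.com;
    return 410;
}
```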
5) Block via robots.txt
You can block the site from being crawled by implementing the following commands in the test site’s robots.txt file:
User-agent: *
Disallow: /
This will prevent bots from crawling the site. Note: if your test site is currently indexed and you have gone down the route of adding noindex tags, do not add the robots.txt disallow until all pages have been de-indexed. If you add it before then, it will prevent the pages from being crawled and the robots tag from being detected, so the pages will remain indexed.
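You can confirm a robots.txt body really does block all crawling with Python’s standard-library parser – an illustrative check of my own, not a tool from this article, with a placeholder URL:

```python
# Verify that a robots.txt body forbids all user-agents from
# crawling, using the stdlib robots.txt parser.
from urllib.robotparser import RobotFileParser

def blocks_everything(robots_txt, url="https://test.example.com/page"):
    """True if the rules forbid any user-agent from fetching the given URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("*", url)
```

Running it against "User-agent: *\nDisallow: /" returns True, while an empty Disallow rule (which allows everything) returns False.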
And that’s it – I hope the above will be enough for you to find, de-index and prevent your test site from being crawled ever again.
I cannot stress this enough – if you decide to implement robots meta tags or robots.txt which disallow all bots from crawling and indexing your test site, make sure when you put your test site live that you do not carry these configurations over to the live site, as you will risk losing your organic visibility altogether.
And we’ve all been there, right?